AWS re:INVENT 2017 上發表的 S3 / Glacier select「S3 Select and Glacier Select – Retrieving Subsets of Objects」,這個和 Athena 用 SQL Query 的用法很像。
從圖看到過去你有 1GB 的 Object 就要整個 1GB Object 下載後再取 Content 出來。
從 example 可以看到只取某個 Object 的 content 出來就好
import boto3 from s3select import ResponseHandler class PrintingResponseHandler(ResponseHandler): def handle_records(self, record_data): print(record_data.decode('utf-8')) handler = PrintingResponseHandler() s3 = boto3.client('s3') response = s3.select_object_content( Bucket="super-secret-reinvent-stuff", Key="stuff.csv", SelectRequest={ 'ExpressionType': 'SQL', 'Expression': 'SELECT s._1 FROM S3Object AS s'', 'InputSerialization': { 'CompressionType': 'NONE', 'CSV': { 'FileHeaderInfo': 'IGNORE', 'RecordDelimiter': '\n', 'FieldDelimiter': ',', } }, 'OutputSerialization': { 'CSV': { 'RecordDelimiter': '\n', 'FieldDelimiter': ',', } } } ) handler.handle_response(response['Body'])
如果 Query 下的夠準,可以節省非常多的成本,效率也大幅提昇 (AWS 號稱提昇 400%)
用 S3 跑了 35s,用了 S3 select 只花 6s
[hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=false -f query.sql "31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104" real 0m35.910s user 0m2.320s sys 0m0.124s [hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=true -f query.sql "31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104" real 0m6.566s user 0m2.136s sys 0m0.088s
Glacier select 跟 S3 select 的用法相同。
Glacier select 價格跟你的 SQL Query 技巧有很大的關係。
- GB of Data Scanned
- GB of Data Returned
- Select Requests
你可以加錢讓取出的速度再快一點
Pricing for each dimension is determined by the speed at which you want your results returned: expedited (1-5 minutes), standard (3-5 hours), and bulk (5-12 hours).
後面還預告了 Athena 在 2018 年跟 Glacier select 會有一些服務上的結合 ?