AWS re:INVENT 2017 上發表的 S3 / Glacier select「S3 Select and Glacier Select – Retrieving Subsets of Objects」,這個和 Athena 用 SQL Query 的用法很像。
從圖看到過去你有 1GB 的 Object 就要整個 1GB Object 下載後再取 Content 出來。

從 example 可以看到只取某個 Object 的 content 出來就好
import boto3
from s3select import ResponseHandler
class PrintingResponseHandler(ResponseHandler):
def handle_records(self, record_data):
print(record_data.decode('utf-8'))
handler = PrintingResponseHandler()
s3 = boto3.client('s3')
response = s3.select_object_content(
Bucket="super-secret-reinvent-stuff",
Key="stuff.csv",
SelectRequest={
'ExpressionType': 'SQL',
'Expression': 'SELECT s._1 FROM S3Object AS s'',
'InputSerialization': {
'CompressionType': 'NONE',
'CSV': {
'FileHeaderInfo': 'IGNORE',
'RecordDelimiter': '\n',
'FieldDelimiter': ',',
}
},
'OutputSerialization': {
'CSV': {
'RecordDelimiter': '\n',
'FieldDelimiter': ',',
}
}
}
)
handler.handle_response(response['Body'])
如果 Query 下的夠準,可以節省非常多的成本,效率也大幅提昇 (AWS 號稱提昇 400%)
用 S3 跑了 35s,用了 S3 select 只花 6s
[hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=false -f query.sql "31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104" real 0m35.910s user 0m2.320s sys 0m0.124s [hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=true -f query.sql "31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104" real 0m6.566s user 0m2.136s sys 0m0.088s
Glacier select 跟 S3 select 的用法相同。
Glacier select 價格跟你的 SQL Query 技巧有很大的關係。
- GB of Data Scanned
- GB of Data Returned
- Select Requests
你可以加錢讓取出的速度再快一點
Pricing for each dimension is determined by the speed at which you want your results returned: expedited (1-5 minutes), standard (3-5 hours), and bulk (5-12 hours).
後面還預告了 Athena 在 2018 年跟 Glacier select 會有一些服務上的結合 ?
