AWS re:INVENT 2017 新功能 S3 / Glacier select 也能用 SQL Query 了

2017-12-02 AWS, AWS re:invent

AWS re:INVENT 2017 上發表的 S3 / Glacier select「S3 Select and Glacier Select – Retrieving Subsets of Objects」,這個和 Athena 用 SQL Query 的用法很像。

 

從圖看到過去你有 1GB 的 Object 就要整個 1GB Object 下載後再取 Content 出來。

 

從 example 可以看到只取某個 Object 的 content 出來就好

import boto3
from s3select import ResponseHandler

class PrintingResponseHandler(ResponseHandler):
    def handle_records(self, record_data):
        print(record_data.decode('utf-8'))

handler = PrintingResponseHandler()
s3 = boto3.client('s3')
response = s3.select_object_content(
    Bucket="super-secret-reinvent-stuff",
    Key="stuff.csv",
    SelectRequest={
        'ExpressionType': 'SQL',
     'Expression': 'SELECT s._1 FROM S3Object AS s'',
        'InputSerialization': {
            'CompressionType': 'NONE',
            'CSV': {
                'FileHeaderInfo': 'IGNORE',
                'RecordDelimiter': '\n',
                'FieldDelimiter': ',',
            }
        },
        'OutputSerialization': {
            'CSV': {
                'RecordDelimiter': '\n',
                'FieldDelimiter': ',',
            }
        }
    }
)
handler.handle_response(response['Body'])

如果 Query 下的夠準,可以節省非常多的成本,效率也大幅提昇 (AWS 號稱提昇 400%)

 

用 S3 跑了 35s,用了 S3 select 只花 6s

[hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=false -f query.sql
"31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104"
 
real  0m35.910s
user  0m2.320s
sys   0m0.124s
[hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=true -f query.sql
"31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104"
 
real  0m6.566s
user  0m2.136s
sys   0m0.088s

 

Glacier select 跟 S3 select 的用法相同。

 

Glacier select 價格跟你的 SQL Query 技巧有很大的關係。

  • GB of Data Scanned
  • GB of Data Returned
  • Select Requests

你可以加錢讓取出的速度再快一點

Pricing for each dimension is determined by the speed at which you want your results returned: expedited (1-5 minutes), standard (3-5 hours), and bulk (5-12 hours).

 

後面還預告了 Athena 在 2018 年跟 Glacier select 會有一些服務上的結合 ?

 

給 Mr. 沙先生一點建議

彙整

分類

展開全部 | 收合全部

License

訂閱 Mr. 沙先生 的文章

輸入你的 email 用於訂閱