先把文件放在最前面「Querying Amazon VPC Flow Logs」,在這篇記錄 TOP10 常見的 Query 語法
在這篇文章的必要條件是已經開啟 VPC Flow Logs,參考「VPC Flow Logs – Publishing to Amazon S3」
Environment
- Region:us-east-1
- Account Id (範例):0123456789
- VPC Flow logs Bucket:vpcflow-logs-WETX02MSQ8213l
- Bucket Object 結構:s3://vpcflow-logs-WETX02MSQ8213l/AWSLogs/0123456789/vpcflowlogs/us-east-1/2020/08/06/*.log.gz
s3://<your_bucket>/AWSLogs/<your_account_id>/vpcflowlogs/<your_region>/YYYY/MM/dd/*.log.gz
Athena
Athena 可以透過 SQL 語法直接查詢 S3 中的資料,非常方便
- 建立 VPC Flow logs Table
建立 Table 不一定要新建 Database,可以直接在 default 這個資料庫建立與 S3 關聯的 Table
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
version int,
account string,
interfaceid string,
sourceaddress string,
destinationaddress string,
sourceport int,
destinationport int,
protocol int,
numpackets int,
numbytes bigint,
starttime int,
endtime int,
action string,
logstatus string
)
PARTITIONED BY (`date` date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://vpcflow-logs-WETX02MSQ8213l/AWSLogs/0123456789/vpcflowlogs/us-east-1'
TBLPROPERTIES ("skip.header.line.count"="1");
如果都沒有改過則會是 Default Schema,也可以 Custom Schema 讓顯示的資訊更多
- 建立 Table Partition,按照日期切分以寫文章的日期 2020-08-06 為例
ALTER TABLE vpc_flow_logs
ADD PARTITION (`date`='2020-08-06')
location 's3://vpcflow-logs-WETX02MSQ8213l/AWSLogs/0123456789/vpcflowlogs/us-east-1/2020/08/06';
先用簡單的 SELECT 撈一下確認資料正常
SELECT * FROM default.vpc_flow_logs LIMIT 10;
TOP 10 VPC Flow logs Query
範例幾個很常見的 TOP 10 Query 示範
- TOP 10
Packet
count ofInterface-Id
要查哪一個 Interface-Id 浪費了大量 VPC traffic 費用時很好用。
SELECT SUM(numpackets) AS
packetcount,
interfaceid
FROM raw
GROUP BY interfaceid
ORDER BY packetcount DESC
LIMIT 10;
- TOP 10 按照
Source -> Dest
將Protocol
封包數量統計
SELECT
action,
sourceaddress,
destinationaddress,
sum( CASE WHEN protocol = 6 THEN numpackets ELSE 0 END ) AS tcp,
sum( CASE WHEN protocol = 17 THEN numpackets ELSE 0 END ) AS udp,
sum( CASE WHEN protocol = 2 THEN numpackets ELSE 0 END ) AS icmp,
sum( numpackets ) AS total
FROM raw
GROUP BY action, sourceaddress, destinationaddress
ORDER BY action DESC, sourceaddress, destinationaddress
LIMIT 10;
- TOP 10 按照
Source -> Dest
將traffic bytes
做統計
可以找到不正常的 traffic 是誰在傳輸
SELECT
sourceaddress,
destinationaddress,
sum(numbytes) as traffic_in_bytes
FROM raw
WHERE action = 'ACCEPT'
GROUP BY sourceaddress, destinationaddress
ORDER BY traffic_in_bytes DESC
LIMIT 10;