The biggest design difference between Amazon EMR on EKS and Amazon EMR is that components are split down to the job level. Turning Amazon EMR into a multi-tenant setup is not easy; authorization and auditing in particular are challenging. Amazon EMR on EKS, on the other hand, offers a finer-grained IRSA-based approach that restricts each user to their own IAM role, which gives enterprises more flexible options for managing permissions. This post shows how to set that up.
Before starting, here is the target scenario to validate against:
- Two or more data engineers/scientists submit PySpark jobs to this Amazon EMR on EKS cluster
- Every PySpark job has its own execution role
- A data engineer/scientist owns one or more PySpark jobs and needs permission to start those jobs
- Data engineers/scientists cannot use each other's execution roles
- The following must be traceable:
  - Who submitted a job
  - Which Amazon S3 buckets a job accessed; jobs must not be able to access unauthorized S3 buckets
The architecture shown above captures the idea: identities must map all the way from the user to the execution role, and the accessible scope must be narrowed along the way.
The document "Using job execution roles with Amazon EMR on EKS" describes the approach: use an IAM condition to restrict which execution roles can be used. The emr-containers:ExecutionRoleArn condition key takes an array of execution role ARNs as an allow list.
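As a mental model, the condition behaves like a simple allow-list lookup. The TypeScript sketch below mimics that check; it is only a simplified model of IAM's evaluation, and the role ARNs are placeholders:

```typescript
// Simplified model of the IAM StringEquals condition on
// emr-containers:ExecutionRoleArn: a StartJobRun request is allowed
// only when the requested execution role ARN is in the allow list.
function canStartJobRun(requestedRoleArn: string, allowedRoleArns: string[]): boolean {
  return allowedRoleArns.includes(requestedRoleArn);
}

const allowList = [
  'arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1',
];

console.log(canStartJobRun('arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1', allowList)); // true
console.log(canStartJobRun('arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole2', allowList)); // false
```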
Execution role
The execution role uses IRSA (IAM Roles for Service Accounts) to grant the PySpark pods access to AWS services:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::AWS_ACCOUNT_ID:oidc-provider/OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:emr-containers:emr-on-eks-job-execution-role",
          "OIDC_PROVIDER:aud": "sts.amazonaws.com"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::AWS_ACCOUNT_ID:oidc-provider/OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:NAMESPACE:emr-containers-sa-*-*-AWS_ACCOUNT_ID-BASE36_ENCODED_ROLE_NAME"
        }
      }
    }
  ]
}
The IAM conditions pin down the source OIDC provider, namespace, and service account. The suffix is the role name encoded in Base36 (an encoding, not encryption), so each execution role maps 1:1 to a Kubernetes service account. A trust policy with the Base36-encoded role name can be generated with the aws-cli aws emr-containers command:
$ aws emr-containers update-role-trust-policy \
--cluster-name cluster \
--namespace namespace \
--role-name iam_role_name_for_job_execution
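To get a feel for how the service-account suffix is matched, here is a rough TypeScript sketch. The exact encoding scheme EMR on EKS uses internally is an implementation detail, so base36Encode below is only an assumed illustration of "Base36-encode a role name", paired with a simplified StringLike wildcard matcher:

```typescript
// Illustration only: treat the role name's UTF-8 bytes as one big integer
// and render it in radix 36. The real suffix scheme is internal to EMR on EKS;
// this just shows that the suffix is an encoding, not encryption.
function base36Encode(roleName: string): string {
  const bytes = Buffer.from(roleName, 'utf8');
  let n = 0n;
  for (let i = 0; i < bytes.length; i++) {
    n = n * 256n + BigInt(bytes[i]);
  }
  return n.toString(36);
}

// Simplified IAM StringLike matcher: '*' matches any run of characters.
function stringLike(pattern: string, value: string): boolean {
  const re = new RegExp(
    '^' +
    pattern
      .split('*')
      .map(s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
      .join('.*') +
    '$'
  );
  return re.test(value);
}

const suffix = base36Encode('AmazonEMRContainersJobExecutionRole1');
const sub = `system:serviceaccount:emr-ns:emr-containers-sa-spark-driver-01234567890-${suffix}`;
console.log(stringLike('system:serviceaccount:emr-ns:emr-containers-sa-*-*-01234567890-*', sub)); // true
```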
- arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1
- arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole2
Create the two IAM roles above with that trust policy as a comparison pair. At this point the pods can already access AWS services with their respective execution roles, but how do we restrict users to only specific execution roles?
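Under the hood, IRSA ties a pod to its role through an annotation on the Kubernetes service account. A minimal sketch of such a manifest follows; the names are illustrative only, since EMR on EKS creates and annotates these service accounts itself:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  # Illustrative name; the real one ends with the Base36-encoded role name.
  name: emr-containers-sa-spark-driver-01234567890-abc123
  namespace: emr-ns
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1
```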
Data engineer IAM Role/User
In addition to allowing the emr-containers:StartJobRun API, the data engineer's IAM policy must include a condition restricting which execution roles may be used:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "emr-containers:StartJobRun",
      "Resource": "arn:aws:emr-containers:REGION:AWS_ACCOUNT_ID:/virtualclusters/VIRTUAL_CLUSTER_ID",
      "Condition": {
        "StringEquals": {
          "emr-containers:ExecutionRoleArn": [
            "execution_role_arn_1"
          ]
        }
      }
    }
  ]
}
Testing
Use the IAM Role/User created above to run a PySpark job with AmazonEMRContainersJobExecutionRole1 and verify that it runs normally:
$ aws emr-containers start-job-run \
--execution-role-arn arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1 \
...
{
  "id": "VIRTUAL_CLUSTER_JOB_ID",
  "name": "job-run",
  "arn": "VIRTUAL_CLUSTER_JOB_ARN",
  "virtualClusterId": "VIRTUAL_CLUSTER_ID"
}
Because the IAM policy condition above allows execution_role_arn_1, the job is submitted successfully.
Setting the execution role to AmazonEMRContainersJobExecutionRole2 instead yields an AccessDeniedException, because this IAM User/Role has no IAM condition authorizing AmazonEMRContainersJobExecutionRole2.
$ aws emr-containers start-job-run \
--execution-role-arn arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole2 \
...
An error occurred (AccessDeniedException) when calling the StartJobRun operation: User is not authorized to perform: emr-containers:StartJobRun on resource: VIRTUAL_CLUSTER_JOB_ID
Tracing user access with data events in AWS CloudTrail
Authorization usually comes hand in hand with auditing, and every use of an execution role is captured in AWS CloudTrail. On Amazon EMR on EKS the most common need is Amazon S3 auditing. To enable Amazon S3 object-level logging, see "Logging data events for trails", or create the trail with the following AWS CDK code.
// Assuming AWS CDK v2 imports
import * as cloudtrail from 'aws-cdk-lib/aws-cloudtrail';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as s3 from 'aws-cdk-lib/aws-s3';

const bucket = new s3.Bucket(this, 'cloudtrail-bucket', {
  bucketName: 'cloudtrail-logs-' + this.region + '-' + this.account,
})
const targetBucket = s3.Bucket.fromBucketName(this, 'existing-bucket',
  'aws-emr-on-eks-' + this.region + '-' + this.account
);
const trail = new cloudtrail.Trail(this, 'cloudtrail', {
  trailName: 'CdkTrailStack',
  bucket,
  sendToCloudWatchLogs: true,
  cloudWatchLogsRetention: logs.RetentionDays.ONE_MONTH
})
trail.addS3EventSelector([{ bucket: targetBucket }], {
  readWriteType: cloudtrail.ReadWriteType.ALL,
})
CloudTrail natively supports shipping logs to CloudWatch Logs. The following CloudWatch Logs Insights query finds the Amazon S3 access records of the PySpark job we just ran with AmazonEMRContainersJobExecutionRole1 (provided your PySpark job actually touches Amazon S3):
filter eventSource="s3.amazonaws.com" and userIdentity.sessionContext.sessionIssuer.arn="arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1"
| fields @timestamp, eventName, requestParameters.bucketName, requestParameters.key, @message
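The same filter can also be reproduced client-side over exported events. In the sketch below, the field names follow the CloudTrail record format, while the sample events are fabricated minimal stand-ins for illustration:

```typescript
// Minimal shape of the CloudTrail fields used by the Logs Insights query above.
interface TrailEvent {
  eventSource: string;
  eventName: string;
  userIdentity: { sessionContext?: { sessionIssuer?: { arn?: string } } };
  requestParameters?: { bucketName?: string; key?: string };
}

// Client-side equivalent of the query: keep S3 data events whose
// session was issued by the given execution role.
function filterS3EventsByRole(events: TrailEvent[], roleArn: string): TrailEvent[] {
  return events.filter(e =>
    e.eventSource === 's3.amazonaws.com' &&
    e.userIdentity.sessionContext?.sessionIssuer?.arn === roleArn
  );
}

// Fabricated stand-in events, not real CloudTrail output.
const sample: TrailEvent[] = [
  {
    eventSource: 's3.amazonaws.com',
    eventName: 'GetObject',
    userIdentity: { sessionContext: { sessionIssuer: { arn: 'arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1' } } },
    requestParameters: { bucketName: 'aws-emr-on-eks-data', key: 'input/part-0000' },
  },
  {
    eventSource: 'sts.amazonaws.com',
    eventName: 'AssumeRoleWithWebIdentity',
    userIdentity: {},
  },
];

console.log(filterS3EventsByRole(sample, 'arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1').length); // 1
```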
Summary
After several rounds with Amazon EMR on EKS, my initial take was that asking data engineers to embrace Hadoop is already a big challenge, and piling the Kubernetes beast on top would be too much. But besides AWS Fargate taking away much of the operational burden, this setup really does solve a lot of big-data authorization and auditing problems.