Mr. 沙先生

How to control/trace user access with AWS EMR on EKS

A major design difference between Amazon EMR on EKS and Amazon EMR is that components are split down to the job level. Carving classic Amazon EMR into a multi-tenant setup is very hard, and authorization and auditing are especially challenging. Amazon EMR on EKS, by contrast, offers a finer-grained IRSA-based mechanism so that each user can only use their own IAM role, which gives enterprises more flexibility in managing user permissions. In this post I'll show how to set that up.

Before starting, here is the target scenario used to validate the requirement:

The idea is roughly what the architecture above shows: beyond mapping everything from the user down to the execution role, we also need to narrow the scope of what each user can access.

The document "Using job execution roles with Amazon EMR on EKS" describes the approach: use an IAM Condition to restrict which execution roles can be accessed. The emr-containers:ExecutionRoleArn condition key takes an array of execution role ARNs as an allowlist.

Execution role

The execution role uses IRSA (IAM Roles for Service Accounts) to grant the PySpark pods access to AWS services:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::AWS_ACCOUNT_ID:oidc-provider/OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:emr-containers:emr-on-eks-job-execution-role",
          "OIDC_PROVIDER:aud": "sts.amazonaws.com"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::AWS_ACCOUNT_ID:oidc-provider/OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:NAMESPACE:emr-containers-sa-*-*-AWS_ACCOUNT_ID-BASE36_ENCODED_ROLE_NAME"
        }
      }
    }
  ]
}

The IAM Condition pins down the source OIDC provider, namespace, and service account. The suffix is the role name encoded in Base36 (an encoding, not encryption), so each execution role maps 1:1 to a Kubernetes service account. This Base36 + role-name part of the trust policy can be generated with the aws emr-containers CLI:

$ aws emr-containers update-role-trust-policy \
       --cluster-name cluster \
       --namespace namespace \
       --role-name iam_role_name_for_job_execution
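For intuition about what that Base36 suffix looks like, the sketch below shows a generic Base36 encoding of a role name's bytes in Python. This is purely illustrative: AWS does not publicly document the exact input it encodes, so in practice always let update-role-trust-policy generate the value for you.

```python
# Illustrative only: generic Base36 encoding of a role name's bytes.
# The exact transformation AWS applies is an assumption here; use
# `aws emr-containers update-role-trust-policy` for the real value.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36_encode(data: bytes) -> str:
    # Treat the bytes as one big integer, then repeatedly divide by 36,
    # collecting remainders as Base36 digits (least significant first).
    n = int.from_bytes(data, "big")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

print(base36_encode(b"iam_role_name_for_job_execution"))
```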

Create two IAM roles with the trust policy above so we have a control group. At this point each pod can already access AWS services with its own execution role, but how do we restrict which execution roles a given user may run?

Data engineer IAM Role/User

Besides allowing the emr-containers:StartJobRun API, the data engineer's IAM policy must also add a Condition restricting which execution roles may be used:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "emr-containers:StartJobRun",
      "Resource": "arn:aws:emr-containers:REGION:AWS_ACCOUNT_ID:/virtualclusters/VIRTUAL_CLUSTER_ID",
      "Condition": {
        "StringEquals": {
          "emr-containers:ExecutionRoleArn": [
            "execution_role_arn_1"
          ]
        }
      }
    }
  ]
}
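Conceptually, this condition is just an allowlist check. The hypothetical sketch below mimics how IAM evaluates StringEquals on emr-containers:ExecutionRoleArn (the names are illustrative; the real evaluation happens inside IAM):

```python
# Hypothetical sketch of the allowlist check IAM performs for
# StringEquals on emr-containers:ExecutionRoleArn.
ALLOWED_EXECUTION_ROLES = {
    "arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1",
}

def may_start_job_run(execution_role_arn: str) -> bool:
    # StringEquals: exact, case-sensitive match against the allowlist.
    return execution_role_arn in ALLOWED_EXECUTION_ROLES

print(may_start_job_run("arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1"))  # True
print(may_start_job_run("arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole2"))  # False
```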
 

Testing

Use the IAM role/user created above to run a PySpark job with AmazonEMRContainersJobExecutionRole1 and check that it runs normally:

$ aws emr-containers start-job-run \
    --execution-role-arn arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1 \
    ...


{
    "id": "VIRTUAL_CLUSTER_JOB_ID",
    "name": "job-run",
    "arn": "VIRTUAL_CLUSTER_JOB_ARN",
    "virtualClusterId": "VIRTUAL_CLUSTER_ID"
}

Because the IAM policy Condition above allows execution_role_arn_1, the job is submitted successfully.

If the execution role is set to AmazonEMRContainersJobExecutionRole2 instead, you get an AccessDeniedException, because this IAM user/role's IAM Condition does not include AmazonEMRContainersJobExecutionRole2:

$ aws emr-containers start-job-run \
    --execution-role-arn arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole2 \
    ...

An error occurred (AccessDeniedException) when calling the StartJobRun operation: User is not authorized to perform: emr-containers:StartJobRun on resource: VIRTUAL_CLUSTER_JOB_ID

Tracing user access with data events in AWS CloudTrail

Authorization usually comes hand in hand with auditing, and every execution role's activity is captured by AWS CloudTrail. With Amazon EMR on EKS the most common need is Amazon S3 auditing; to enable Amazon S3 object-level logging, see "Logging data events for trails", or create the trail with the AWS CDK as follows.

// aws-cdk-lib v2 imports (the constructs below live inside a Stack class)
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudtrail from 'aws-cdk-lib/aws-cloudtrail';
import * as logs from 'aws-cdk-lib/aws-logs';

// Bucket that receives the CloudTrail logs themselves
const bucket = new s3.Bucket(this, 'cloudtrail-bucket', {
    bucketName: 'cloudtrail-logs-' + this.region + '-' + this.account,
});

// Existing data bucket used by the PySpark jobs, which we want to audit
const targetBucket = s3.Bucket.fromBucketName(this, 'existing-bucket',
    'aws-emr-on-eks-' + this.region + '-' + this.account
);

const trail = new cloudtrail.Trail(this, 'cloudtrail', {
    trailName: 'CdkTrailStack',
    bucket,
    sendToCloudWatchLogs: true,
    cloudWatchLogsRetention: logs.RetentionDays.ONE_MONTH,
});

// Record object-level (data event) reads and writes on the target bucket
trail.addS3EventSelector([{ bucket: targetBucket }], {
    readWriteType: cloudtrail.ReadWriteType.ALL,
});

CloudTrail natively supports shipping logs to CloudWatch Logs. The following CloudWatch Logs Insights query looks up the Amazon S3 access records of the PySpark job we just ran with AmazonEMRContainersJobExecutionRole1 (assuming your PySpark job actually touches Amazon S3):

filter eventSource="s3.amazonaws.com" and userIdentity.sessionContext.sessionIssuer.arn="arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1"
    | fields @timestamp, eventName, requestParameters.bucketName, requestParameters.key, @message 
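The same filter can be reproduced offline. Assuming you have CloudTrail events exported as JSON, a sketch like the one below mirrors the query above; the field names follow the CloudTrail record format, but the sample events themselves are made up for illustration:

```python
# Sketch: filter CloudTrail records the same way the Logs Insights query does.
# The sample events below are fabricated for illustration.
ROLE_ARN = "arn:aws:iam::01234567890:role/AmazonEMRContainersJobExecutionRole1"

events = [
    {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "userIdentity": {"sessionContext": {"sessionIssuer": {"arn": ROLE_ARN}}},
        "requestParameters": {"bucketName": "aws-emr-on-eks-us-east-1-01234567890",
                              "key": "input/data.csv"},
    },
    {
        "eventSource": "sts.amazonaws.com",
        "eventName": "AssumeRoleWithWebIdentity",
        "userIdentity": {},
        "requestParameters": {},
    },
]

def issuer_arn(event: dict) -> str:
    # Safely walk userIdentity.sessionContext.sessionIssuer.arn
    return (
        event.get("userIdentity", {})
        .get("sessionContext", {})
        .get("sessionIssuer", {})
        .get("arn", "")
    )

# Keep only S3 data events issued under the chosen execution role
s3_access = [
    (e["eventName"],
     e["requestParameters"].get("bucketName"),
     e["requestParameters"].get("key"))
    for e in events
    if e["eventSource"] == "s3.amazonaws.com" and issuer_arn(e) == ROLE_ARN
]
print(s3_access)
```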

Summary

After several rounds with Amazon EMR on EKS, I initially felt that asking data engineers to embrace Hadoop was already a big challenge, and piling the Kubernetes behemoth on top would be too much. But AWS Fargate removes a lot of the operational burden, and for big-data authorization & auditing this setup genuinely solves many problems.
