Privacera Platform master publication

Spark on EKS: Privacera plugin in Spark on EKS

This section describes how to use Privacera Manager to generate the setup script and the Spark custom configuration for SSL, which you then use to install the Privacera plugin in Spark on an EKS cluster.

Prerequisites

Ensure the following prerequisites are met:

Configuration
  1. SSH to the instance as USER.

  2. Run the following commands.

    cd ~/privacera/privacera-manager
    cp config/sample-vars/vars.spark-standalone.yml config/custom-vars/
    vi config/custom-vars/vars.spark-standalone.yml
    
  3. Edit the following properties. For details and a description of each property, refer to Configuration properties below.

    SPARK_STANDALONE_ENABLE: "true"
    SPARK_ENV_TYPE: "<PLEASE_CHANGE>"
    SPARK_HOME: "<PLEASE_CHANGE>"
    SPARK_USER_HOME: "<PLEASE_CHANGE>"
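Before running the update, it can help to confirm that no <PLEASE_CHANGE> placeholder is left in the copied file. The following is a minimal sketch and not part of Privacera Manager; the helper name find_placeholders and the VARS_FILE override are hypothetical.

```shell
# Hypothetical helper: print every line that still contains the placeholder.
# Returns 0 when at least one placeholder is found, non-zero otherwise.
find_placeholders() {
  grep -n -F '<PLEASE_CHANGE>' "$1"
}

# Default to the file copied in step 2; VARS_FILE is an assumed override.
vars_file="${VARS_FILE:-$HOME/privacera/privacera-manager/config/custom-vars/vars.spark-standalone.yml}"

if [ -f "$vars_file" ]; then
  if find_placeholders "$vars_file"; then
    echo "Placeholders remain in $vars_file" >&2
  else
    echo "No placeholders left in $vars_file"
  fi
fi
```

The check only looks for the literal placeholder text; it does not validate that the values you entered are correct.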
  4. Run the following commands:

    cd ~/privacera/privacera-manager
    ./privacera-manager.sh update
    

    After the update is complete, the Spark custom configuration for SSL (spark_custom_conf.zip) is generated in ~/privacera/privacera-manager/output/spark-standalone.

  5. Create the Spark Docker image.

    1. Run the following command to export PRIVACERA_BASE_DOWNLOAD_URL:

      export PRIVACERA_BASE_DOWNLOAD_URL=<PRIVACERA_BASE_DOWNLOAD_URL>
      
    2. Create a folder.

      mkdir -p ~/privacera-spark-plugin
      cd ~/privacera-spark-plugin
      
    3. Download the package using wget and extract it.

      wget ${PRIVACERA_BASE_DOWNLOAD_URL}/spark-plugin/k8s-spark-pkg.tar.gz -O k8s-spark-pkg.tar.gz
      tar xzf k8s-spark-pkg.tar.gz
      rm -r k8s-spark-pkg.tar.gz
      
    4. Copy the spark_custom_conf.zip file from the Privacera Manager output folder into the files folder.

      cp ~/privacera/privacera-manager/output/spark-standalone/spark_custom_conf.zip files/spark_custom_conf.zip
      
    5. Build either the OLAC Docker image or the FGAC Docker image.

      OLAC

      To build the OLAC Docker image, use the following command:

      ./build_image.sh ${PRIVACERA_BASE_DOWNLOAD_URL} OLAC
      

      FGAC

      To build the FGAC Docker image, use the following command:

      ./build_image.sh ${PRIVACERA_BASE_DOWNLOAD_URL} FGAC
      
  6. Test the Spark Docker image.

    1. Create an S3 bucket ${S3_BUCKET} for sample testing.

    2. Download the sample data from the following link and upload it to the bucket at s3://${S3_BUCKET}/customer_data.

      wget https://privacera-demo.s3.amazonaws.com/data/uploads/customer_data_clear/customer_data_without_header.csv
      
    3. Start the Docker container in interactive mode.

      IMAGE=privacera-spark-plugin:latest
      docker run --rm -i -t ${IMAGE} bash
      
    4. Start spark-shell inside the Docker container.

      JWT_TOKEN="<PLEASE_CHANGE>"
      cd /opt/privacera/spark/bin
      ./spark-shell \
      --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
      --conf "spark.hadoop.privacera.jwt.oauth.enable=true"
    5. Run the following command to read the S3 file. Replace ${S3_BUCKET} with the name of your bucket, because spark-shell does not expand shell variables:

      val df = spark.read.csv("s3a://${S3_BUCKET}/customer_data/customer_data_without_header.csv")
    6. Exit the Docker shell.

      exit
  7. Publish the Spark Docker image to your Docker registry.

    • For HUB, HUB_USERNAME, and HUB_PASSWORD, use the Docker Hub URL and your login credentials.

    • ENV_TAG is user-defined and depends on your deployment environment, such as development, production, or test. For example, use ENV_TAG=dev for a development environment.

    HUB=<PLEASE_CHANGE>
    HUB_USERNAME=<PLEASE_CHANGE>
    HUB_PASSWORD=<PLEASE_CHANGE>
    ENV_TAG=<PLEASE_CHANGE>
    DEST_IMAGE=${HUB}/privacera-spark-plugin:${ENV_TAG}
    SOURCE_IMAGE=privacera-spark-plugin:latest
    docker login -u ${HUB_USERNAME} -p ${HUB_PASSWORD} ${HUB}
    docker tag ${SOURCE_IMAGE} ${DEST_IMAGE}
    docker push ${DEST_IMAGE}
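Passing the password with -p places it on the command line. As an alternative sketch, docker login can read the password from standard input with --password-stdin. The function names dest_image and publish_image are illustrative and not part of the Privacera package; the registry host in the example call is the sample host used on this page.

```shell
# Build the destination tag from the registry host and the environment tag.
dest_image() {
  printf '%s/privacera-spark-plugin:%s\n' "$1" "$2"
}

# Log in, tag, and push. The password is piped to docker login on stdin,
# so it is not passed as a command-line argument.
# Expects HUB_PASSWORD to be set in the environment.
publish_image() {
  hub="$1"; user="$2"; env_tag="$3"
  dest="$(dest_image "$hub" "$env_tag")"
  printf '%s' "$HUB_PASSWORD" | docker login -u "$user" --password-stdin "$hub" &&
    docker tag privacera-spark-plugin:latest "$dest" &&
    docker push "$dest"
}

# Example call, not executed here:
#   publish_image "$HUB" "$HUB_USERNAME" "$ENV_TAG"

# Show the tag that would be pushed for a development environment.
dest_image "myhub.docker.com" "dev"   # prints myhub.docker.com/privacera-spark-plugin:dev
```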
  8. Deploy the Spark plugin on the EKS cluster.

    1. SSH to the EKS cluster where you want to deploy Spark.

    2. Run the following command to export PRIVACERA_BASE_DOWNLOAD_URL:

      export PRIVACERA_BASE_DOWNLOAD_URL=<PRIVACERA_BASE_DOWNLOAD_URL>
      
    3. Create a folder.

      mkdir ~/privacera-spark-plugin
      cd ~/privacera-spark-plugin
      
    4. Download the package using wget and extract it.

      wget ${PRIVACERA_BASE_DOWNLOAD_URL}/plugin/spark/k8s-spark-deploy.tar.gz -O k8s-spark-deploy.tar.gz
      tar xzf k8s-spark-deploy.tar.gz
      rm -r k8s-spark-deploy.tar.gz
      cd k8s-spark-deploy/
      
    5. Open the penv.sh file and set the values of the following properties:

      | Property | Description | Example |
      |---|---|---|
      | SPARK_NAME_SPACE | Kubernetes namespace | privacera-spark-plugin-test |
      | SPARK_PLUGIN_ROLE_BINDING | Spark role binding | privacera-sa-spark-plugin-role-binding |
      | SPARK_PLUGIN_SERVICE_ACCOUNT | Spark service account | privacera-sa-spark-plugin |
      | SPARK_PLUGN_ROLE | Spark service account role | privacera-sa-spark-plugin-role |
      | SPARK_PLUGIN_APP_NAME | Spark service account role | privacera-sa-spark-plugin-role |
      | SPARK_PLUGIN_IMAGE | Docker image with hub | myhub.docker.com/privacera-spark-plugin:prod-olac |
      | SPARK_DOCKER_PULL_SECRET | Secret for docker-registry | spark-plugin-docker-hub |

    6. Run the following commands to back up the Kubernetes deployment .yml files and replace the property values in them:

      mkdir -p backup
      cp *.yml backup/
      ./replace.sh
      
    7. Run the following command to create Kubernetes resources:

      kubectl apply -f namespace.yml
      kubectl apply -f service-account.yml
      kubectl apply -f role.yml
      kubectl apply -f role-binding.yml
      
    8. Run the following command to create the secret for the Docker registry:

      kubectl create secret docker-registry spark-plugin-docker-hub --docker-server=<PLEASE_CHANGE> --docker-username=<PLEASE_CHANGE> --docker-password='<PLEASE_CHANGE>' --namespace=<PLEASE_CHANGE>
      
    9. Run the following command to deploy a sample Spark application:

      Note

      This is a sample file used for deployment. Depending on your use case, you can create your own Spark deployment file and deploy a Docker image.

      kubectl apply -f privacera-spark-examples.yml -n ${SPARK_NAME_SPACE}

      This deploys the Spark application in a Kubernetes pod with the Privacera plugin and keeps the pod running, so that you can use it in interactive mode.

Configuration properties

| Property | Description | Example |
|---|---|---|
| SPARK_STANDALONE_ENABLE | Enables generation of the setup script and configs for Spark standalone plugin installation. | true |
| SPARK_ENV_TYPE | The environment type. It can be any user-defined type. For example, if you're working in an environment that runs locally, set the type to local; for a production environment, set it to prod. | local |
| SPARK_HOME | Home path of your Spark installation. | ~/privacera/spark/spark-3.1.1-bin-hadoop3.2 |
| SPARK_USER_HOME | User home directory of your Spark installation. | /home/ec2-user |
| SPARK_STANDALONE_RANGER_IS_FALLBACK_SUPPORTED | Enables or disables the fallback to the privacera_files and privacera_hive services, which determines whether the user is allowed or denied access to the resource files. To enable the fallback, set to true; to disable, set to false. | true |
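Taken together, the example values listed above would give a custom vars file like the one the following sketch writes to a scratch file. The helper name write_example_vars is hypothetical, and the values are the examples from this page rather than recommendations. Including SPARK_STANDALONE_RANGER_IS_FALLBACK_SUPPORTED in the same file is an assumption.

```shell
# Hypothetical helper: write the example property values to the given file.
# The quoted 'EOF' delimiter keeps the shell from expanding anything inside.
write_example_vars() {
  cat > "$1" <<'EOF'
SPARK_STANDALONE_ENABLE: "true"
SPARK_ENV_TYPE: "local"
SPARK_HOME: "~/privacera/spark/spark-3.1.1-bin-hadoop3.2"
SPARK_USER_HOME: "/home/ec2-user"
SPARK_STANDALONE_RANGER_IS_FALLBACK_SUPPORTED: "true"
EOF
}

# Write to a scratch file and print it, leaving the real vars file untouched.
scratch="$(mktemp)"
write_example_vars "$scratch"
cat "$scratch"
rm -f "$scratch"
```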

Validation
  1. Get all the resources.

    kubectl get all -n ${SPARK_NAME_SPACE}

    Copy the POD ID. You will need it for the spark-master connection.

  2. Get the cluster info.

    kubectl cluster-info
    

    Copy the Kubernetes control plane URL from the output. You will need it for the spark-shell command. For example: https://xxxxxxxxxxxxxxxxxxxxxxx.yl4.us-east-1.eks.amazonaws.com

    When using the URL for the EKS_SERVER property in step 4, prefix the property value with k8s://. The following is an example of the property:

    EKS_SERVER="k8s://https://xxxxxxxxxxxxxxxxxxxxxxx.yl4.us-east-1.eks.amazonaws.com"
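The prefixing rule can be expressed as a small helper. This is a minimal sketch; the function name to_spark_master and the sample URL are hypothetical. The function leaves the value unchanged if it already starts with k8s://.

```shell
# Hypothetical helper: turn a control plane URL into a Spark master URL.
to_spark_master() {
  case "$1" in
    k8s://*) printf '%s\n' "$1" ;;        # already prefixed, keep as is
    *)       printf 'k8s://%s\n' "$1" ;;  # add the k8s:// prefix
  esac
}

# Sample URL for illustration only; use the one from kubectl cluster-info.
EKS_SERVER="$(to_spark_master "https://example.us-east-1.eks.amazonaws.com")"
echo "$EKS_SERVER"   # prints k8s://https://example.us-east-1.eks.amazonaws.com
```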
  3. Open a shell in the pod, using the POD ID copied in step 1.

    kubectl -n ${SPARK_NAME_SPACE} exec -it <POD_ID> -- bash
    
  4. Set the following properties:

    SPARK_NAME_SPACE="<PLEASE_CHANGE>"
    SPARK_PLUGIN_SERVICE_ACCOUNT="<PLEASE_CHANGE>"
    SPARK_PLUGIN_IMAGE="<PLEASE_CHANGE>"
    SPARK_DOCKER_PULL_SECRET="spark-plugin-docker-hub"
    EKS_SERVER="<PLEASE_CHANGE>"
    JWT_TOKEN="<PLEASE_CHANGE>"
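Because the spark-shell and spark-submit commands depend on every one of these variables, a quick check before launching can save a failed start. The following is a sketch under the assumption that the variables are set in the current shell; the helper name missing_vars is hypothetical.

```shell
# Hypothetical helper: print the names of variables that are unset, empty,
# or still hold the <PLEASE_CHANGE> placeholder.
missing_vars() {
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    # The names are fixed by the caller, so eval is used only on known input.
    eval "value=\${$name:-}"
    if [ -z "$value" ] || [ "$value" = "<PLEASE_CHANGE>" ]; then
      printf '%s\n' "$name"
    fi
  done
}

missing="$(missing_vars SPARK_NAME_SPACE SPARK_PLUGIN_SERVICE_ACCOUNT \
  SPARK_PLUGIN_IMAGE SPARK_DOCKER_PULL_SECRET EKS_SERVER JWT_TOKEN)"

if [ -n "$missing" ]; then
  echo "Still to be set:"
  echo "$missing"
fi
```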
  5. Run the following commands to open spark-shell. The command includes all the settings required to open spark-shell.

    cd /opt/privacera/spark/bin
    ./spark-shell --master ${EKS_SERVER} \
    --deploy-mode client \
    --conf spark.kubernetes.authenticate.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
    --conf spark.kubernetes.namespace=${SPARK_NAME_SPACE} \
    --conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
    --conf spark.kubernetes.container.image=${SPARK_PLUGIN_IMAGE} \
    --conf spark.kubernetes.container.image.pullPolicy=Always \
    --conf spark.kubernetes.container.image.pullSecrets=${SPARK_DOCKER_PULL_SECRET} \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
    --conf "spark.hadoop.privacera.jwt.oauth.enable=true" \
    --conf spark.driver.bindAddress='0.0.0.0' \
    --conf spark.driver.host=$SPARK_PLUGIN_POD_IP \
    --conf spark.port.maxRetries=4 \
    --conf spark.kubernetes.driver.pod.name=$SPARK_PLUGIN_POD_NAME
  6. Run the following command to use spark-submit with JWT authentication.

    ./spark-submit \
    --master ${EKS_SERVER} \
    --name spark-cloud-new \
    --deploy-mode cluster \
    --conf spark.kubernetes.authenticate.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
    --conf spark.kubernetes.namespace=${SPARK_NAME_SPACE} \
    --conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
    --conf spark.kubernetes.container.image=${SPARK_PLUGIN_IMAGE} \
    --conf spark.kubernetes.container.image.pullPolicy=Always \
    --conf spark.kubernetes.container.image.pullSecrets=${SPARK_DOCKER_PULL_SECRET} \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
    --conf spark.driver.bindAddress='0.0.0.0' \
    --conf spark.driver.host=$SPARK_PLUGIN_POD_IP \
    --conf spark.port.maxRetries=4 \
    --conf spark.kubernetes.driver.pod.name=$SPARK_PLUGIN_POD_NAME \
    --class com.privacera.spark.poc.SparkSample \
    <your-code-jar/file>
    
  7. To check read access on the S3 file, run the following commands in the open spark-shell:

    val df = spark.read.csv("s3a://${S3_BUCKET}/customer_data/customer_data_without_header.csv")
    df.show()
  8. To check write access to S3, run the following command in the open spark-shell:

    df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/output/k8s/sample/csv")
  9. Check the audit logs in Privacera Portal.

  10. To verify the spark-shell setup, open another SSH connection to the Kubernetes cluster and run the following command to check the running pods:

    kubectl get pods -n ${SPARK_NAME_SPACE}

    You will see the Spark executor pods, whose names end in -exec-x. For example, spark-shell-xxxxxxxxxxxxxxxx-exec-1 and spark-shell-xxxxxxxxxxxxxxxx-exec-2.
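To count the executor pods rather than read the list by eye, the pod names can be filtered on the -exec-N suffix. This is a minimal sketch; the helper name count_executors is hypothetical, and the sample pod names stand in for real kubectl output.

```shell
# Hypothetical helper: count lines on stdin that end in -exec-<number>.
# grep -c prints the number of matching lines; "|| true" keeps a count
# of zero from being reported as a failure.
count_executors() {
  grep -c -e '-exec-[0-9][0-9]*$' || true
}

# Against a live cluster this would be:
#   kubectl get pods -n "${SPARK_NAME_SPACE}" -o name | count_executors

# Sample input for illustration.
printf '%s\n' \
  'pod/privacera-spark-examples-abc' \
  'pod/spark-shell-0123456789abcdef-exec-1' \
  'pod/spark-shell-0123456789abcdef-exec-2' | count_executors   # prints 2
```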