Databricks in GCP

Databricks Spark Plug-in (Python/SQL)

These instructions guide the installation of the Privacera Spark plug-in in GCP Databricks.

Prerequisite

Ensure the following prerequisite is met:

  • All the Privacera core (default) services should be installed and running.

Configuration

  1. Run the following commands.

    cd ~/privacera/privacera-manager
    cp config/sample-vars/vars.databricks.plugin.yml config/custom-vars/
    vi config/custom-vars/vars.databricks.plugin.yml
    
  2. Set DATABRICKS_MANAGE_INIT_SCRIPT to "false", because the init script will be uploaded manually to Google Cloud Storage (GCS) in a later step.

    DATABRICKS_MANAGE_INIT_SCRIPT: "false"
    
  3. Run the following commands.

    cd ~/privacera/privacera-manager
    ./privacera-manager.sh update
    

    After the update completes, the init script (ranger_enable.sh) and the Privacera custom configuration for SSL (privacera_custom_conf.zip) are generated in ~/privacera/privacera-manager/output/databricks.
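
Before moving on to the upload, it can help to confirm that both generated files are actually present. The following is a minimal sketch, not part of Privacera Manager: check_plugin_output is a hypothetical helper name, and the only assumption is that the two file names match the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Privacera Manager).
# Returns 0 if both generated files exist in the given directory,
# otherwise prints each missing file to stderr and returns 1.
check_plugin_output() {
  local dir="$1" missing=0 f
  for f in ranger_enable.sh privacera_custom_conf.zip; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_plugin_output ~/privacera/privacera-manager/output/databricks && echo "ready to upload"` prints the message only when both files were generated.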

Manage Init Script and Spark Configurations

  1. Upload the init script and Spark configurations to the GCS bucket.

    1. Get the GCS bucket that is mounted to the Databricks File System (DBFS).

      In the Databricks UI, click an existing cluster, click Driver Logs, and then click the log4j-active.log file.

    2. Download the file and open it.

    3. To find the GCS bucket, search the file for gs://databricks-xxxxxxxx/xxxxxxxxx/, where databricks-xxxxxxxx is the bucket name. For example: databricks-1558328210275731.

    4. Log in to the GCP console, and navigate to the GCS bucket.

    5. In the GCS bucket, create a folder named privacera/<ENV_NAME>, where <ENV_NAME> is the value set for the DEPLOYMENT_ENV_NAME variable in the vars.privacera.yml file. See Environment Setup.

    6. Upload ranger_enable.sh and privacera_custom_conf.zip to the privacera/<ENV_NAME> folder in the GCS bucket.

  2. In the Databricks UI:

    1. Open the target cluster or create a new cluster.

    2. Open the Cluster dialog and go to Edit mode.

    3. Open Advanced Options, then open the Init Scripts tab. Paste the following file path as the init script location, replacing <ENV_NAME> with the environment name used for the GCS folder, and then save (confirm) this configuration.

      dbfs:/privacera/<ENV_NAME>/ranger_enable.sh
      
    4. Open Advanced Options, then open the Spark tab. Add the following content to the Spark Config edit box:

      spark.driver.extraJavaOptions -javaagent:/databricks/jars/privacera-agent.jar
      spark.databricks.isv.product privacera
      spark.databricks.pyspark.enableProcessIsolation false
      
    5. Save (Confirm) this configuration.

    6. Start (or Restart) the selected Databricks Cluster.
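
The bucket lookup and upload in step 1 can also be done from a command line. The following is a rough sketch under stated assumptions: find_dbfs_bucket, gcs_target, and upload_plugin_files are hypothetical helper names, the log file has already been downloaded locally, the folder layout is privacera/<ENV_NAME> inside the bucket as described above, and the upload function assumes the gsutil CLI is installed and authenticated for the GCP project.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; none of these ship with Privacera or Databricks.

# Print the first bucket name found as gs://databricks-... in a log file.
find_dbfs_bucket() {
  grep -oE 'gs://databricks-[^/[:space:]]+' "$1" | head -n 1 | sed 's|^gs://||'
}

# Build the GCS destination URI for a bucket name and environment name.
gcs_target() {
  printf 'gs://%s/privacera/%s/\n' "$1" "$2"
}

# Copy both generated files to the destination (assumes gsutil is available).
upload_plugin_files() {
  local out_dir="$1" bucket="$2" env_name="$3"
  gsutil cp "$out_dir/ranger_enable.sh" "$out_dir/privacera_custom_conf.zip" \
    "$(gcs_target "$bucket" "$env_name")"
}
```

A typical call would be `upload_plugin_files ~/privacera/privacera-manager/output/databricks "$(find_dbfs_bucket log4j-active.log)" <ENV_NAME>`, with <ENV_NAME> replaced by your DEPLOYMENT_ENV_NAME value. Review the bucket name printed by find_dbfs_bucket before uploading, since the first match in the log is only a heuristic.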

Validation

To help you evaluate Privacera with Databricks, Privacera provides a set of Privacera Manager demo notebooks. You can download them from the Privacera S3 repository with a browser or with wget on the command line. Use the notebook/SQL sequence that matches your cluster.

  1. Download using your browser by clicking the correct file for your cluster below:

    https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql

    If AWS S3 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql

    If ADLS Gen2 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql

    Or, if you are working from a Linux command line, use the wget command to download the files:

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql -O PrivaceraSparkPlugin.sql

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql -O PrivaceraSparkPluginS3.sql

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql -O PrivaceraSparkPluginADLS.sql

  2. Import the Databricks notebook:

    • Log in to the Databricks console.
    • Select Workspace -> Users -> your user.
    • Click the drop-down.
    • Click Import and choose the downloaded file.
  3. Follow the suggested steps in the text of the notebook to exercise and validate Privacera with Databricks.
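
If you script the download in step 1, the choice of notebook can be made explicit. The following is a small sketch: notebook_url is a hypothetical helper, and the labels default, s3, and adls are illustrative names chosen here, not Privacera terms. The URLs are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a storage label to the matching demo notebook URL.
# The labels (default, s3, adls) are illustrative and not defined by Privacera.
notebook_url() {
  local base='https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks'
  case "$1" in
    default) echo "$base/PrivaceraSparkPlugin.sql" ;;
    s3)      echo "$base/PrivaceraSparkPluginS3.sql" ;;
    adls)    echo "$base/PrivaceraSparkPluginADLS.sql" ;;
    *)       echo "unknown storage type: $1" >&2; return 1 ;;
  esac
}
```

For example, `url="$(notebook_url s3)" && wget "$url" -O "${url##*/}"` downloads the S3 variant and saves it under its original file name.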