# Databricks Spark Plug-in (Python/SQL)
These instructions guide the installation of the Privacera Spark plug-in in GCP Databricks.
## Prerequisite

Ensure the following prerequisite is met:

- All the Privacera core (default) services should be installed and running.
## Configuration

- Run the following commands.

  ```
  cd ~/privacera/privacera-manager
  cp config/sample-vars/vars.databricks.plugin.yml config/custom-vars/
  vi config/custom-vars/vars.databricks.plugin.yml
  ```

- Set `DATABRICKS_MANAGE_INIT_SCRIPT` to `"false"`, since the init script will be uploaded manually to GCP Cloud Storage in the steps below.

  ```
  DATABRICKS_MANAGE_INIT_SCRIPT: "false"
  ```

- Run the following commands.

  ```
  cd ~/privacera/privacera-manager
  ./privacera-manager.sh update
  ```

  After the update completes, the init script (`ranger_enable.sh`) and the Privacera custom configuration for SSL (`privacera_custom_conf.zip`) will be generated at `~/privacera/privacera-manager/output/databricks`.
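The `vi` edit above can also be done non-interactively. A minimal sketch using `sed` on a scratch file; on a real Privacera Manager host you would point `VARS_FILE` at `config/custom-vars/vars.databricks.plugin.yml` instead (the scratch file and its contents here are purely illustrative):

```shell
# Sketch: set DATABRICKS_MANAGE_INIT_SCRIPT without opening an editor.
# VARS_FILE points at a scratch copy here; on a real install, point it at
# config/custom-vars/vars.databricks.plugin.yml instead.
VARS_FILE="$(mktemp)"
echo 'DATABRICKS_MANAGE_INIT_SCRIPT: "true"' > "$VARS_FILE"   # illustrative content

# Rewrite the variable to "false" in place.
sed -i 's/^DATABRICKS_MANAGE_INIT_SCRIPT:.*/DATABRICKS_MANAGE_INIT_SCRIPT: "false"/' "$VARS_FILE"

cat "$VARS_FILE"   # prints: DATABRICKS_MANAGE_INIT_SCRIPT: "false"
```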
## Manage init Script and Spark Configurations
- Upload the init script and Spark configurations to the GCS bucket:

  - Get the GCS bucket that is mounted to the Databricks File System (DBFS):

    - In the Databricks UI, click an existing cluster, click Driver Logs, and then click the `log4j-active.log` file.
    - Download the file and open it.
    - To get the GCS bucket, search for `gs://databricks-xxxxxxxx/xxxxxxxxx/`, where `databricks-xxxxxxxx` is the bucket name. For example: `databricks-1558328210275731`.
  - Log in to the GCP console and navigate to the GCS bucket.
  - In the GCS bucket, create a folder `privacera/<ENV_NAME>`, where `<ENV_NAME>` is the value set for the `DEPLOYMENT_ENV_NAME` variable in the `vars.privacera.yml` file. See Environment Setup.
  - Upload `ranger_enable.sh` and `privacera_custom_conf.zip` to the location `privacera/<ENV_NAME>` in the GCS bucket.
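The upload can also be done with the `gsutil` CLI instead of the GCP console. A minimal sketch, assuming a hypothetical bucket name and environment name (substitute your own); the `echo` keeps it a dry run, so remove it to actually copy the files:

```shell
# Hypothetical values -- replace with your bucket and DEPLOYMENT_ENV_NAME.
BUCKET="databricks-1558328210275731"
ENV_NAME="privacera-gcp"
SRC_DIR="$HOME/privacera/privacera-manager/output/databricks"
DEST="gs://${BUCKET}/privacera/${ENV_NAME}/"

# Dry run: print the gsutil commands instead of executing them.
for f in ranger_enable.sh privacera_custom_conf.zip; do
  echo gsutil cp "${SRC_DIR}/${f}" "${DEST}"
done
```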
- In the Databricks UI:

  - Open the target cluster or create a new cluster.
  - Open the cluster dialog and go to Edit mode.
  - Open Advanced Options and select the Init Scripts tab. Enter (paste) the following file path as the init script location, then save (confirm) this configuration.

    ```
    dbfs:/privacera/<ENV_NAME>/ranger_enable.sh
    ```

  - Open Advanced Options and select the Spark tab. Add the following content to the Spark Config edit box:

    ```
    spark.driver.extraJavaOptions -javaagent:/databricks/jars/privacera-agent.jar
    spark.databricks.isv.product privacera
    spark.databricks.pyspark.enableProcessIsolation false
    ```

  - Save (confirm) this configuration.
  - Start (or restart) the selected Databricks cluster.
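The same init-script path and Spark settings can also be applied programmatically through the Databricks Clusters API (`POST /api/2.0/clusters/edit`). A hedged sketch of only the relevant payload fields; the cluster id and environment name are hypothetical placeholders, and a real edit call must also carry the cluster's full existing definition (spark_version, node type, and so on):

```shell
# Hypothetical sketch: the clusters/edit payload fragment carrying the
# init script and Spark config from the steps above. CLUSTER_ID and
# ENV_NAME are placeholders -- take them from your own workspace.
ENV_NAME="privacera-gcp"
CLUSTER_ID="0000-000000-example"

PAYLOAD=$(cat <<EOF
{
  "cluster_id": "${CLUSTER_ID}",
  "init_scripts": [
    { "dbfs": { "destination": "dbfs:/privacera/${ENV_NAME}/ranger_enable.sh" } }
  ],
  "spark_conf": {
    "spark.driver.extraJavaOptions": "-javaagent:/databricks/jars/privacera-agent.jar",
    "spark.databricks.isv.product": "privacera",
    "spark.databricks.pyspark.enableProcessIsolation": "false"
  }
}
EOF
)
echo "$PAYLOAD"
```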
## Validation
To help evaluate the use of Privacera with Databricks, Privacera provides a set of Privacera Manager demo notebooks. These can be downloaded from the Privacera S3 repository using either your browser or a command-line tool such as `wget`. Use the notebook/SQL sequence that matches your cluster.

- Download using your browser (click the correct file for your cluster below):

  - https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql
  - If AWS S3 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql
  - If ADLS Gen2 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql

  Or, from a Linux command line, use `wget` to download:

  ```
  wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql -O PrivaceraSparkPlugin.sql
  wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql -O PrivaceraSparkPluginS3.sql
  wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql -O PrivaceraSparkPluginADLS.sql
  ```
- Import the Databricks notebook:

  - Log in to the Databricks console.
  - Select Workspace > Users > your user.
  - Click the drop-down.
  - Click Import and choose the downloaded file.
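If the (legacy) Databricks CLI is configured against your workspace, the import can be scripted instead of clicked through the UI. A hedged sketch; the target user path is a hypothetical placeholder, and the `echo` keeps it a dry run (remove it to execute):

```shell
# Hypothetical sketch using the legacy Databricks CLI workspace import.
# Replace TARGET_PATH with your own user path; remove 'echo' to execute.
NOTEBOOK_FILE="PrivaceraSparkPlugin.sql"
TARGET_PATH="/Users/you@example.com/PrivaceraSparkPlugin"

echo databricks workspace import --language SQL --format SOURCE \
  "${NOTEBOOK_FILE}" "${TARGET_PATH}"
```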
- Follow the suggested steps in the text of the notebook to exercise and validate Privacera with Databricks.