AWS EMR
This topic shows how to configure AWS EMR with Privacera using Privacera Manager.
Configuration
-
SSH to the instance as USER.
-
Run the following commands.
cd ~/privacera/privacera-manager
cp config/sample-vars/vars.emr.yml config/custom-vars/
vi config/custom-vars/vars.emr.yml
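As an alternative to editing the file interactively in vi, properties can be set from a script. The following is a minimal sketch, assuming the vars file holds flat `KEY: "value"` lines; `set_var` is a hypothetical helper for illustration, not part of Privacera Manager.

```shell
# Hypothetical helper: set KEY to VALUE in a flat "KEY: value" vars file.
# Assumes one property per line with no trailing comments.
set_var() {
  file="$1"; key="$2"; value="$3"
  tmp=$(mktemp)
  if grep -q "^$key:" "$file"; then
    # Replace the existing line for this key, leave all other lines untouched.
    awk -v k="$key" -v v="$value" \
      'index($0, k ":") == 1 { print k ": \"" v "\""; next } { print }' \
      "$file" > "$tmp"
  else
    # Key not present yet: append it at the end of the file.
    cat "$file" > "$tmp"
    printf '%s: "%s"\n' "$key" "$value" >> "$tmp"
  fi
  # Write back in place so the original file permissions are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example usage (values taken from the Example column of the property table):
# set_var config/custom-vars/vars.emr.yml EMR_ENABLE true
# set_var config/custom-vars/vars.emr.yml EMR_CLUSTER_NAME Privacera-EMR
```

Review the resulting file before running the update, since the sketch does not validate values.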
-
Edit the following properties.
Property: EMR_ENABLE
Description: Enable EMR template creation.
Example: true

Property: EMR_CLUSTER_NAME
Description: Define a unique name for the EMR cluster.
Example: Privacera-EMR

Property: EMR_CREATE_SG
Description: Set this to true if you don't have existing security groups and want Privacera Manager to take care of adding security group creation steps in the EMR CF template.
Example: false

Property: EMR_MASTER_SG_ID
Description: If EMR_CREATE_SG is false, set this property. Security Group ID for EMR Master Node Group.
Example: sg-xxxxxxx

Property: EMR_SLAVE_SG_ID
Description: If EMR_CREATE_SG is false, set this property. Security Group ID for EMR Slave Node Group.
Example: sg-xxxxxxx

Property: EMR_SERVICE_ACCESS_SG_ID
Description: If EMR_CREATE_SG is false, set this property. Security Group ID for EMR ServiceAccessSecurity. Fill this property only if you are creating EMR in a Private Network.
Example: sg-xxxxxxx

Property: EMR_SG_VPC_ID
Description: If EMR_CREATE_SG is true, set this property. VPC ID in which you want to create the EMR Cluster.
Example: vpc-xxxxxxxxxxx

Property: EMR_MASTER_SG_NAME
Description: If EMR_CREATE_SG is true, set this property. Security Group Name for EMR Master Node Group. The security group name will be added to the emr-template.json.
Example: priv-master-sg

Property: EMR_SLAVE_SG_NAME
Description: If EMR_CREATE_SG is true, set this property. Security Group Name for EMR Slave Node Group. The security group name will be added to the emr-template.json.
Example: priv-slave-sg

Property: EMR_SERVICE_ACCESS_SG_NAME
Description: If EMR_CREATE_SG is true, set this property. Security Group Name for EMR ServiceAccessSecurity. The security group name will be added to the emr-template.json. Fill this property only if you are creating EMR in a Private Network.
Example: priv-private-sg

Property: EMR_SUBNET_ID
Description: Subnet ID.

Property: EMR_KEYPAIR
Description: An existing EC2 key pair to SSH into the master node of the cluster.
Example: privacera-test-pair

Property: EMR_EC2_MARKET_TYPE
Description: Set market type as SPOT or ON_DEMAND.
Example: SPOT

Property: EMR_EC2_INSTANCE_TYPE
Description: Set the instance type. Instances can be of different types such as m5.xlarge, r5.xlarge and so on.
Example: m5.large

Property: EMR_MASTER_NODE_COUNT
Description: Node count for Master. The number of nodes can be 1, 2 and so on.
Example: 1

Property: EMR_CORE_NODE_COUNT
Description: Node count for Core. The number of core nodes can be 1, 2 and so on.
Example: 1

Property: EMR_VERSION
Description: Version of EMR.
Example: emr-x.xx.x

Property: EMR_EC2_DOMAIN
Description: Domain used by the nodes. It depends on EMR Region, for example, ".ec2.internal" is for us-east-1.
Example: .ec2.internal

Property: EMR_USE_STS_REGIONAL_ENDPOINTS
Description: Set the property to enable/disable regional endpoints for S3 requests. Default value is false.
Example: true

Property: EMR_TERMINATION_PROTECT
Description: Set to enable/disable termination protection.
Example: true

Property: EMR_LOGS_PATH
Description: S3 location for storing EMR logs.
Example: s3://privacera-logs-bucket/

Property: EMR_KERBEROS_ENABLE
Description: Set to true if you want to enable kerberization on EMR.
Example: false

Property: EMR_KDC_ADMIN_PASSWORD
Description: If EMR_KERBEROS_ENABLE is true, set this property. The password used within the cluster for the kadmin service.

Property: EMR_CROSS_REALM_PASSWORD
Description: If EMR_KERBEROS_ENABLE is true, set this property. The cross-realm trust principal password, which must be identical across realms.

Property: EMR_SECURITY_CONFIG
Description: Name of the Security Configurations created for EMR. This can be a pre-created configuration, or Privacera Manager can generate a template through which you can create this configuration.

Property: EMR_KERB_TICKET_LIFETIME
Description: Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The period for which a Kerberos ticket issued by the cluster's KDC is valid. Cluster applications and services auto-renew tickets after they expire.
Example: EMR_KERB_TICKET_LIFETIME: 24

Property: EMR_KERB_REALM
Description: Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The Kerberos realm name for the other realm in the trust relationship.

Property: EMR_KERB_DOMAIN
Description: Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The domain name of the other realm in the trust relationship.

Property: EMR_KERB_ADMIN_SERVER
Description: Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The fully qualified domain name (FQDN) and an optional port for the Kerberos admin server in the other realm. If a port is not specified, 749 is used.

Property: EMR_KERB_KDC_SERVER
Description: Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The fully qualified domain name (FQDN) and an optional port for the KDC in the other realm. If a port is not specified, 88 is used.

Property: EMR_AWS_ACCT_ID
Description: AWS Account ID where EMR Cluster resides.
Example: 9999999

Property: EMR_DEFAULT_ROLE
Description: Default role attached to EMR Cluster for performing cluster-related activities. This should be a pre-created role.
Example: EMR_DefaultRole

Property: EMR_ROLE_FOR_CLUSTER_NODES
Description: The IAM Role will be attached to each node in the EMR Cluster. This should have only minimal permissions for downloading the privacera_cust_conf.zip and basic EMR capabilities. It can be an existing one; if not, you can use the IAM role CF template to generate it after the Privacera Manager update.
Example: restricted_node_role

Property: EMR_USE_SINGLE_ROLE_FOR_APPS
Description: If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Create a Single IAM Role that will be used by All EMR Applications.
Example: true

Property: EMR_ROLE_FOR_APPS
Description: If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by all EMR Apps.
Example: app_data_access_role

Property: EMR_ROLE_FOR_SPARK
Description: If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Create multiple IAM Roles to be used by specific applications. Set EMR_USE_SINGLE_ROLE_FOR_APPS to be false. IAM Role name which will be used by Spark Application (Dataserver) for data access.
Example: spark_data_access_role

Property: EMR_ROLE_FOR_HIVE
Description: If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by Hive Application for data access.
Example: hive_data_access_role

Property: EMR_ROLE_FOR_PRESTO
Description: If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by Presto Application for data access.
Example: presto_data_access_role

Property: EMR_HIVE_METASTORE
Description: Metastore type, for example "glue" or "hive" (for external hive-metastore).
Example: glue

Property: EMR_HIVE_METASTORE_PATH
Description: S3 location for hive metastore.
Example: s3://hive-warehouse

Property: EMR_HIVE_METASTORE_CONNECTION_URL
Description: If EMR_HIVE_METASTORE is hive, set this property. JDBC Connection URL for connecting to hive.
Example: jdbc:mysql://<jdbc-host>:3306/<hive-db-name>?createDatabaseIfNotExist=true

Property: EMR_HIVE_METASTORE_CONNECTION_DRIVER
Description: If EMR_HIVE_METASTORE is hive, set this property. JDBC Driver Name.
Example: org.mariadb.jdbc.Driver

Property: EMR_HIVE_METASTORE_CONNECTION_USERNAME
Description: If EMR_HIVE_METASTORE is hive, set this property. JDBC UserName.
Example: hive

Property: EMR_HIVE_METASTORE_CONNECTION_PASSWORD
Description: If EMR_HIVE_METASTORE is hive, set this property. JDBC Password.
Example: StRong@PassWord

Property: EMR_APP_SPARK_OLAC_ENABLE
Description: To install Spark application with Privacera plugin, set the property to true. OLAC is known as Object Level Access Control.
Note:
- Recommended when complete access control on the objects in AWS S3 is required.
- When the property is set to true, s3 and s3n protocols will not be supported on EMR clusters while running Spark queries.
Example: true

Property: EMR_APP_SPARK_FGAC_ENABLE
Description: To install Spark application with Privacera plugin, set the property to true. FGAC is known as Fine Grained Access Control for Table and Column.
Note: Recommended for compliance purposes, since the whole cluster will still have direct access to AWS S3 data.
Example: false

Property: EMR_APP_PRESTO_DB_ENABLE
Description: To install PrestoDB application with Privacera plugin, set the property to true. PrestoDB and Trino are mutually exclusive. Only one should be enabled at a time.
Example: false

Property: EMR_APP_PRESTO_SQL_ENABLE
Description: To install Trino application with Privacera plugin, set the property to true. PrestoDB and Trino are mutually exclusive. Only one should be enabled at a time.
Note: Trino is supported for EMR versions 6.1.0 and higher.
Note: If the EMR version is 6.4.0, setting this flag installs the Trino plugin.
Example: false

Property: EMR_APP_HIVE_ENABLE
Description: To install Hive application with Privacera plugin, set the property to true.
Example: true

Property: EMR_APP_ZEPPELIN_ENABLE
Description: To install Zeppelin application, set the property to true.
Example: true

Property: EMR_APP_LIVY_ENABLE
Description: To install Livy application, set the property to true.
Example: true

Property: EMR_CUST_CONF_ZIP_PATH
Description: The path where the privacera_cust_conf.zip file will be placed. Privacera Manager will generate a privacera_cust_conf.zip under the ~/privacera/privacera-manager/output/emr folder. This privacera_cust_conf.zip needs to be placed at an s3 or any https location from which the EMR cluster can download it.
Example: s3://privacera-artifacts/

Property: EMR_SPARK_ENABLE_VIEW_LEVEL_ACCESS_CONTROL
Description: Set the property to true to enable view-level column masking and row filter for SparkSQL. The property can be used only when you set EMR_APP_SPARK_FGAC_ENABLE to true. For usage, see the documentation on view-level access control in Spark.
Example: false

Property: EMR_RANGER_IS_FALLBACK_SUPPORTED
Description: Use the property to enable/disable the fallback behavior to the privacera_files and privacera_hive services. It determines whether access to the resource files is allowed or denied for the user. To enable the fallback, set to true; to disable, set to false.
Example: true

Property: EMR_SPARK_DELTA_LAKE_ENABLE
Description: Set this property to true to enable Delta Lake on EMR Spark.
Example: true

Property: EMR_SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL
Description: Download URL of Delta Lake core JAR. The Delta Lake core JAR depends on the Spark version, so you have to find the appropriate version for your EMR. See Delta Lake compatibility with Apache Spark. Get the appropriate Delta Lake core JAR download link and update the property. See Delta Core. For example, for Spark version 3.1.x, the download URL is https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar.
Example: https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar

Note
If your cluster was running while External Hive Metastore was down, and you are unable to connect to it, restart the following three servers.
sudo systemctl restart hive-hcatalog-server
sudo systemctl restart hive-server2
sudo systemctl restart presto-server
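Several properties in the table are required only when another property has a particular value (the security group, Kerberos, and Hive metastore groups). The following is a minimal sketch of a pre-update check for those dependencies, assuming the vars file holds flat `KEY: "value"` lines. The helper names `get_var`, `require_vars`, and `check_emr_vars` are hypothetical and not part of Privacera Manager.

```shell
# Hypothetical helper: read a top-level "KEY: value" entry from a flat vars file.
# Surrounding quotes are stripped; trailing comments are not handled.
get_var() {
  sed -n "s/^$2:[[:space:]]*//p" "$1" | head -n 1 | tr -d "\"'"
}

# Print every listed key that is unset or empty; return 1 if any is missing.
require_vars() {
  file="$1"; shift
  rc=0
  for key in "$@"; do
    if [ -z "$(get_var "$file" "$key")" ]; then
      echo "missing: $key"
      rc=1
    fi
  done
  return "$rc"
}

# Check the conditional requirements described in the property table.
# The ServiceAccessSecurity properties are skipped because they apply
# only to clusters in a private network.
check_emr_vars() {
  file="$1"
  status=0
  if [ "$(get_var "$file" EMR_CREATE_SG)" = "true" ]; then
    require_vars "$file" EMR_SG_VPC_ID EMR_MASTER_SG_NAME EMR_SLAVE_SG_NAME || status=1
  else
    require_vars "$file" EMR_MASTER_SG_ID EMR_SLAVE_SG_ID || status=1
  fi
  if [ "$(get_var "$file" EMR_KERBEROS_ENABLE)" = "true" ]; then
    require_vars "$file" EMR_KDC_ADMIN_PASSWORD EMR_CROSS_REALM_PASSWORD || status=1
  fi
  if [ "$(get_var "$file" EMR_HIVE_METASTORE)" = "hive" ]; then
    require_vars "$file" \
      EMR_HIVE_METASTORE_CONNECTION_URL EMR_HIVE_METASTORE_CONNECTION_DRIVER \
      EMR_HIVE_METASTORE_CONNECTION_USERNAME EMR_HIVE_METASTORE_CONNECTION_PASSWORD || status=1
  fi
  return "$status"
}

# Example usage:
# check_emr_vars ~/privacera/privacera-manager/config/custom-vars/vars.emr.yml
```

The check only reports empty or absent values; it does not verify that IDs, names, or URLs are valid.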
-
Run the following commands.
cd ~/privacera/privacera-manager
./privacera-manager.sh update
After the update is finished, all the CloudFormation JSON template files and privacera_cust_conf.zip will be available at the path ~/privacera/privacera-manager/output/emr.
-
Configure and run the following on the AWS instance where Privacera is installed.
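Before running the sub-steps, it can help to confirm that the update produced the files they rely on. The following is a minimal sketch; `check_emr_output` is a hypothetical helper, and treating the role and security configuration templates as optional is an assumption based on those steps being marked optional.

```shell
# Hypothetical helper: report which generated files exist in the output folder.
# Returns 1 if a file needed for cluster creation is missing.
check_emr_output() {
  dir="${1:-$HOME/privacera/privacera-manager/output/emr}"
  missing=0
  # Files needed by the cluster-creation step.
  for f in emr-template.json privacera_cust_conf.zip; do
    if [ -f "$dir/$f" ]; then
      echo "found: $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  # Templates used only by the optional steps (assumed optional here).
  for f in emr-roles-creation-template.json emr-security-config-template.json; do
    if [ -f "$dir/$f" ]; then
      echo "found (optional): $f"
    fi
  done
  return "$missing"
}

# Example usage:
# check_emr_output && cd ~/privacera/privacera-manager/output/emr
```

Running the `aws` commands from the output folder lets the `file://` template paths resolve without modification.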
-
(Optional) Create IAM roles using the emr-roles-creation-template.json template. Run the following command.
aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-role-creation --template-body file://emr-roles-creation-template.json --capabilities CAPABILITY_NAMED_IAM
Note
This will create IAM roles with minimal permissions. You can add bucket permissions to the respective IAM roles as per your requirements.
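One way to add bucket permissions to a generated role is an inline policy. The following is a minimal sketch that prints a read-only S3 policy for one bucket. The helper `bucket_read_policy`, the bucket name, and the policy name are hypothetical, and the actions shown are only an illustration of "bucket permissions"; adjust them to your requirements.

```shell
# Hypothetical helper: print a read-only S3 policy document for one bucket.
bucket_read_policy() {
  bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::$bucket", "arn:aws:s3:::$bucket/*"]
    }
  ]
}
EOF
}

# Example usage (bucket and policy names are placeholders; the role name
# is the EMR_ROLE_FOR_APPS example from the property table):
# bucket_read_policy my-data-bucket > policy.json
# aws iam put-role-policy --role-name app_data_access_role \
#   --policy-name privacera-bucket-read --policy-document file://policy.json
```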
-
(Optional) Create Security Configurations using the emr-security-config-template.json template. Run the following command.
aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-security-config-creation --template-body file://emr-security-config-template.json
Note
If you are upgrading from EMR version 6.3 or lower to version 6.4 or higher to use the Trino plug-in, you must re-create the EMR security configuration from the new template generated by Privacera Manager, because the new template adds the trino user.
-
Confirm the privacera_cust_conf.zip file has been copied to the location specified in EMR_CUST_CONF_ZIP_PATH.
-
Create EMR using the emr-template.json template. Run the following command.
aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-creation --template-body file://emr-template.json
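The create-stack command returns before the stack is built, so it can be useful to wait for completion and then read the final status. The following is a minimal sketch using the AWS CLI `cloudformation wait` and `describe-stacks` subcommands; the wrapper `wait_for_stack` is a hypothetical helper.

```shell
# Hypothetical helper: block until the stack is created, then print its status.
# The wait subcommand exits non-zero if creation fails or times out, in which
# case the status query is skipped.
wait_for_stack() {
  region="$1"; stack="$2"
  aws --region "$region" cloudformation wait stack-create-complete \
    --stack-name "$stack" &&
  aws --region "$region" cloudformation describe-stacks \
    --stack-name "$stack" --query 'Stacks[0].StackStatus' --output text
}

# Example usage:
# wait_for_stack <AWS-REGION> privacera-emr-creation
```

The same wrapper could be used for the optional role and security configuration stacks by passing their stack names.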
-
Note
-
For PrestoDB, secrets encryption of the Solr authentication password is not supported. However, the properties file where the password resides is accessible only to the presto service user, which limits its exposure.
-
If your cluster was running while External Hive Metastore was down, and you are unable to connect to it, restart the following three servers:
sudo systemctl restart hive-hcatalog-server
sudo systemctl restart hive-server2
sudo systemctl restart presto-server