Setup and Installation¶
Notebook Session Development¶
Follow these setup instructions to submit Spark jobs to Data Flow from an OCI Notebook Session.
PySpark Environment¶
To set up a PySpark environment, install one of the PySpark conda environments from the Environment Explorer.
Example -
odsc conda install -s pyspark30_p37_cpu_v5
Find information about the latest PySpark conda environment here.
Activate the conda environment and upgrade to the latest oracle-ads:
conda activate /home/datascience/conda/pyspark30_p37_cpu_v5
pip install "oracle-ads[data_science, data, opctl]" --upgrade
Configuring core-site.xml¶
When the conda environment is installed, a templated version of core-site.xml is also installed. You can update the core-site.xml file using an automated configuration or manually.
Authentication with Resource Principals
Authentication to Object Storage can be done with a resource principal.
For automated configuration, run the following command in a terminal:
odsc core-site config -a resource_principal
This command populates the file ~/spark_conf_dir/core-site.xml with the values needed to connect to Object Storage.
The following command line options are available:
-a, --authentication Authentication mode. Supports resource_principal and api_key (default).
-r, --region Name of the region.
-o, --overwrite Overwrite core-site.xml.
-O, --output Output path for core-site.xml.
-q, --quiet Suppress non-error output.
-h, --help Show help message and exit.
To manually configure the core-site.xml file, edit the file and specify these values:
fs.oci.client.hostname: The Object Storage endpoint for your region. See https://docs.oracle.com/iaas/api/#/en/objectstorage/20160918/ for available endpoints.
fs.oci.client.custom.authenticator: Set the value to com.oracle.bmc.hdfs.auth.ResourcePrincipalsCustomAuthenticator.
When using resource principals, these properties don’t need to be configured:
fs.oci.client.auth.tenantId
fs.oci.client.auth.userId
fs.oci.client.auth.fingerprint
fs.oci.client.auth.pemfilepath
The following example core-site.xml file illustrates using resource principals for authentication to access Object Storage:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.oci.client.hostname</name>
<value>https://objectstorage.us-ashburn-1.oraclecloud.com</value>
</property>
<property>
<name>fs.oci.client.custom.authenticator</name>
<value>com.oracle.bmc.hdfs.auth.ResourcePrincipalsCustomAuthenticator</value>
</property>
</configuration>
For details, see the HDFS Connector for Object Storage documentation on using resource principals for authentication.
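To confirm the configuration, a minimal PySpark sketch such as the following can read from Object Storage through the oci:// scheme. The bucket, namespace, and file path are placeholders you must replace:
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session in the notebook session.
spark = SparkSession.builder.appName("core-site-check").getOrCreate()

# Read a sample file from Object Storage; access is authorized through the
# resource principal configured in core-site.xml.
df = spark.read.csv("oci://<bucket>@<namespace>/path/to/data.csv", header=True)
df.show(5)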
Authentication with API Keys
When using authentication with API keys, the core-site.xml file can be updated in two ways: automated or manual configuration.
For automated configuration, use the odsc command line tool. With an OCI configuration file in place, you can run:
odsc core-site config -o
By default, this command uses the OCI configuration file stored in ~/.oci/config, automatically populates the core-site.xml file, and saves it to ~/spark_conf_dir/core-site.xml.
The following command line options are available:
-a, --authentication Authentication mode. Supports resource_principal and api_key (default).
-c, --configuration Path to the OCI configuration file.
-p, --profile Name of the profile.
-r, --region Name of the region.
-o, --overwrite Overwrite core-site.xml.
-O, --output Output path for core-site.xml.
-q, --quiet Suppress non-error output.
-h, --help Show help message and exit.
To manually configure the core-site.xml file, you must specify these parameters:
fs.oci.client.hostname: The address of Object Storage. For example, https://objectstorage.us-ashburn-1.oraclecloud.com. Replace us-ashburn-1 with the region you are in.
fs.oci.client.auth.tenantId: The OCID of your tenancy.
fs.oci.client.auth.userId: Your user OCID.
fs.oci.client.auth.fingerprint: The fingerprint for the key pair.
fs.oci.client.auth.pemfilepath: The fully qualified file name of the private key used for authentication.
The values of these parameters are found in the OCI configuration file.
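As an illustration, the mapping can be inspected with the oci Python SDK (a dependency of oracle-ads, so it is available in the environment). This sketch assumes the default profile in ~/.oci/config:
import oci

# Load the DEFAULT profile from ~/.oci/config. These are the values that
# odsc copies into the corresponding core-site.xml properties.
config = oci.config.from_file()
print(config["tenancy"])      # -> fs.oci.client.auth.tenantId
print(config["user"])         # -> fs.oci.client.auth.userId
print(config["fingerprint"])  # -> fs.oci.client.auth.fingerprint
print(config["key_file"])     # -> fs.oci.client.auth.pemfilepath
print(config["region"])       # -> used to derive fs.oci.client.hostname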
Local Development¶
Follow these setup instructions to submit Spark jobs to Data Flow from your local machine.
PySpark Environment¶
Prerequisite
You have completed Local Development Environment Setup
Use the ADS CLI to set up a PySpark conda environment. Currently, the ADS CLI only supports fetching conda packs published by you. If you haven't already published a conda pack, you can create one using the ADS CLI.
To install from your published environment source -
ads conda install oci://mybucket@mynamespace/path/to/pyspark/env
To create a conda pack for your local use -
cat <<EOF> pyspark.yaml
dependencies:
  - pyspark
  - pip
  - pip:
      - oracle-ads
name: pysparkenv
EOF
ads create -f pyspark.yaml
ads publish -s pysparkenv
Developing in Visual Studio Code¶
Prerequisites
Set up the Visual Studio Code development environment by following the steps from Local Development Environment Setup.
Install the PySpark conda environment: ads conda install <oci uri of pyspark conda environment>. Currently, you cannot access the service pack directly. Instead, publish a PySpark service pack to your Object Storage and use the URI for the pack in OCI Object Storage.
Once the development environment is set up, you can write your code and run it from the Visual Studio Code terminal.
core-site.xml is set up automatically when you install a PySpark conda pack.
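For example, a minimal script such as the following (the file name hello_spark.py is only illustrative) can be run with python hello_spark.py from the Visual Studio Code terminal once the conda pack is activated:
# hello_spark.py - a minimal local smoke test for the PySpark conda pack.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hello-spark").getOrCreate()

# Build a tiny in-memory DataFrame and display it to verify Spark runs locally.
df = spark.createDataFrame([(1, "spark"), (2, "dataflow")], ["id", "name"])
df.show()
spark.stop()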
Logging From Data Flow¶
If you are using the ADS Python SDK, to create and run a Data Flow application you must specify a compartment and a bucket for storing logs under the same compartment:
compartment_id = "<compartment_id>"
logs_bucket_uri = "<logs_bucket_uri>"
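The following sketch shows how these values can be used with the ads.jobs API. The shapes, executor count, and script URI are illustrative placeholders, and the exact builder methods may vary slightly between oracle-ads releases:
from ads.jobs import DataFlow, DataFlowRuntime, Job

compartment_id = "<compartment_id>"
logs_bucket_uri = "<logs_bucket_uri>"

# Infrastructure: the Data Flow application configuration.
infrastructure = (
    DataFlow()
    .with_compartment_id(compartment_id)
    .with_logs_bucket_uri(logs_bucket_uri)
    .with_driver_shape("VM.Standard2.1")
    .with_executor_shape("VM.Standard2.1")
    .with_num_executors(1)
)

# Runtime: the PySpark script to run, already staged in Object Storage.
runtime = DataFlowRuntime().with_script_uri(
    "oci://mybucket@mynamespace/path/to/script.py"
)

job = Job(name="my-dataflow-app", infrastructure=infrastructure, runtime=runtime)
job.create()       # creates the Data Flow application
run = job.run()    # starts a run
run.watch()        # streams logs written to the logs bucket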
Ensure that you set up the correct policies. For instance, for Data Flow to access the logs bucket, use a policy like:
ALLOW SERVICE dataflow TO READ objects IN tenancy WHERE target.bucket.name='dataflow-logs'
For more information, see the Data Flow documentation.