Getting Started

Key Steps

  1. Initialize the code workspace to prepare for distributed training

  2. Build container image

  3. Tag and push the image to OCIR (Oracle Cloud Infrastructure Registry)

  4. Define the cluster requirement using YAML Spec

  5. Start distributed training using ads opctl run -f <yaml file>

  6. Monitor the job using ads jobs watch <main job run id>
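
Step 4 above requires a YAML spec describing the cluster. The sketch below is illustrative only; the exact field names and values depend on your framework and tenancy, and the artifacts generated by the init command include a template with the precise schema. All OCIDs, image paths, and shape names here are placeholders.

```yaml
# Illustrative cluster spec sketch (placeholders throughout; consult the
# generated template for the exact schema for your framework)
kind: distributed
apiVersion: v1.0
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    apiVersion: v1.0
    spec:
      projectId: ocid1.datascienceproject.oc1..<unique_id>
      compartmentId: ocid1.compartment.oc1..<unique_id>
      subnetId: ocid1.subnet.oc1..<unique_id>
      logGroupId: ocid1.loggroup.oc1..<unique_id>
      shapeName: VM.Standard2.4
      blockStorageSize: 50
  cluster:
    kind: pytorch            # framework chosen at init time
    apiVersion: v1.0
    spec:
      image: <region-key>.ocir.io/<tenancy-namespace>/<repo>:<tag>
      workDir: oci://<bucket>@<namespace>/distributed
      main:
        name: main
        replicas: 1
      worker:
        name: worker
        replicas: 2
  runtime:
    kind: python
    apiVersion: v1.0
    spec:
      entryPoint: train.py
```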

Prepare container image for distributed workload

Prerequisite:

  1. Internet Connection

  2. ADS CLI is installed

  3. Docker engine

ads opctl distributed-training init --framework <framework choice>

To run a distributed workload on OCI Data Science Jobs, you need to prepare a container image with the source code that you want to run and the framework (Dask | Horovod | PyTorch) set up. OCI Data Science provides Dockerfiles and bootstrapping scripts to build framework-specific container images. This step creates a folder called oci_distributed_training in the current working directory. This folder contains all the artifacts required to set up and bootstrap the framework code. Refer to the README.md file for details on how to build the container image and push it to OCIR.
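
Building and pushing the image follows the standard Docker workflow. The commands below are a sketch only; the Dockerfile path, image name, region key, and tenancy namespace are placeholders, and the generated README.md gives the exact commands for your framework.

```shell
# Build the framework-specific image from the generated artifacts
# (paths and names are illustrative; see the generated README.md)
docker build -t distributed-training:latest -f oci_distributed_training/Dockerfile .

# Tag for OCIR using the <region-key>.ocir.io/<tenancy-namespace>/<repo>:<tag> format
docker tag distributed-training:latest iad.ocir.io/mytenancy/distributed-training:latest

# Log in to OCIR and push
docker login iad.ocir.io
docker push iad.ocir.io/mytenancy/distributed-training:latest
```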

Check Config File generated by the Main and the Worker Nodes

Prerequisite:

  1. A cluster in the In-Progress, Succeeded, or Failed state (Failed is supported only in some cases)

  2. Job OCID and the work dir of the cluster, or a YAML file that contains all the details displayed during cluster creation

ads opctl distributed-training show-config -f <cluster yaml file>

The main node generates MAIN_config.json, and each worker node generates WORKER_<job run ocid>_config.json. You may want to check the configuration details to find the IP address of the main node, which is useful for bringing up the Dask dashboard or for debugging.
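
If you have downloaded the generated config file from the work dir, a few lines of Python can pull out the main node's IP address. Note that the key name "OCI__MAIN_IP" below is an assumption for illustration; inspect the actual MAIN_config.json produced by your cluster for the exact field name.

```python
import json

def main_node_ip(config_path, key="OCI__MAIN_IP"):
    """Return the main node's IP address from a generated config file.

    The default key name is an assumption; check your MAIN_config.json
    for the field that actually holds the main node's address.
    """
    with open(config_path) as f:
        cfg = json.load(f)
    return cfg.get(key)
```

For example, `main_node_ip("MAIN_config.json")` would return the address to point a browser at for the Dask dashboard.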