1. Initialize the code workspace to prepare for distributed training.
2. Build the container image.
3. Tag and push the image to OCIR (Oracle Cloud Infrastructure Registry).
4. Define the cluster requirement using a YAML spec.
5. Start distributed training using: ads opctl run -f <yaml file>
6. Monitor the job using: ads jobs watch <main job run id>
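The cluster requirement YAML in step 4 is best started from the template generated by ads opctl distributed-training init. The sketch below is illustrative only: all OCIDs, the image URI, the shape, the work dir, and the framework choice are placeholder assumptions, and the authoritative field names come from the generated template.

```yaml
# Illustrative sketch of a distributed training cluster spec.
# Every OCID, URI, and size below is a placeholder - use the template
# generated by "ads opctl distributed-training init" as the real
# starting point.
kind: distributed
apiVersion: v1.0
spec:
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    apiVersion: v1.0
    spec:
      projectId: oci.xxxx.<project_ocid>
      compartmentId: oci.xxxx.<compartment_ocid>
      displayName: my-distributed-training
      logGroupId: oci.xxxx.<log_group_ocid>
      subnetId: oci.xxxx.<subnet_ocid>
      shapeName: VM.Standard2.4
      blockStorageSize: 50
  cluster:
    kind: PYTORCH               # or DASK / HOROVOD, per your framework choice
    apiVersion: v1.0
    spec:
      image: <region>.ocir.io/<tenancy>/<repo>:<tag>
      workDir: oci://<bucket>@<namespace>/<prefix>
      main:
        name: main
        replicas: 1
      worker:
        name: worker
        replicas: 2
  runtime:
    kind: python
    apiVersion: v1.0
    spec:
      entryPoint: train.py      # hypothetical training script name
```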
Prepare container image for distributed workload
Prerequisite: the ADS CLI is installed.

Initialize the code workspace by running:

ads opctl distributed-training init --framework <framework choice>
To run a distributed workload on OCI Data Science Jobs, you need to prepare a container image with the source code that you want to run and the framework (Dask | Horovod | PyTorch) setup.
OCI Data Science provides you with the Dockerfiles and bootstrapping scripts to build framework-specific container images. The init step creates a folder in the current working directory called oci_distributed_training, which contains all the artifacts required to set up and bootstrap the framework code. Refer to the README.md file in that folder for details on how to build and push the container image to OCIR.
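As a concrete sketch of the tag-and-push step, the snippet below assembles a fully qualified OCIR image path. Every value (region key, tenancy namespace, repository name, tag, and the local image name) is a placeholder assumption; the actual build and push commands for your framework are described in the generated README.md.

```shell
# Assemble the fully qualified OCIR image name from its parts.
# Every value below is a placeholder - substitute your own.
REGION_KEY=iad                 # region key, e.g. iad for us-ashburn-1
TENANCY_NAMESPACE=mytenancy    # your tenancy's namespace
REPO=distributed-training      # repository name in OCIR
TAG=v1
IMAGE="${REGION_KEY}.ocir.io/${TENANCY_NAMESPACE}/${REPO}:${TAG}"
echo "${IMAGE}"

# With the image built locally (hypothetical local tag), tag and push:
#   docker tag distributed-training:latest "${IMAGE}"
#   docker push "${IMAGE}"
```

Pushing requires logging in to the registry first (docker login <region-key>.ocir.io) with your tenancy namespace, username, and an auth token.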
Check Config Files Generated by the Main and Worker Nodes
Prerequisites: a cluster that is in the In-Progress, Succeeded, or Failed state (Failed is supported only in some cases), and either the work dir of the cluster or a YAML file that contains all the details displayed during cluster creation.
ads opctl distributed-training show-config -f <cluster yaml file>
The main node generates MAIN_config.json and the worker nodes generate WORKER_<job run ocid>_config.json. You may want to check the configuration details to find the IP address of the main node, which can be useful for bringing up the Dask dashboard or for debugging.
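For example, once MAIN_config.json has been copied locally, the main node's address can be pulled out with a few lines of Python. The "IP" key name here is an assumption for illustration; inspect your own generated file for the actual field names. The sketch writes a sample file first so that it is self-contained.

```python
import json

# Write a tiny sample so this sketch is self-contained; a real
# MAIN_config.json is produced by the main node in the cluster work dir.
# The "IP" key is an assumed field name - check your generated file.
with open("MAIN_config.json", "w") as f:
    json.dump({"IP": "10.0.0.2"}, f)

# Read the config back and extract the main node's address.
with open("MAIN_config.json") as f:
    cfg = json.load(f)

main_ip = cfg.get("IP")
print(f"Main node IP: {main_ip}")
```

With the address in hand you can, for example, reach the Dask scheduler dashboard, which listens on port 8787 by default, at http://<main node IP>:8787.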