===============
Getting Started
===============

+++++++++
Key Steps
+++++++++

1. Initialize the code workspace to prepare for distributed training
2. Build container image
3. Tag and Push the image to ``ocir``
4. Define the cluster requirement using :doc:`YAML Spec<../yaml_schema>`
5. Start distributed training using ``ads opctl run -f``
6. Monitor the job using ``ads jobs watch``

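
Steps 2 and 3 above can be sketched as a dry run that only prints the Docker commands instead of executing them. This is a minimal illustration, not the documented procedure: the helper names ``ocir_image`` and ``print_image_steps``, the region key, the namespace, the repository and the tag are all hypothetical placeholders, and the generic ``docker build -t ... .`` invocation is an assumption. The ``README.md`` in the generated folder remains the authoritative source for the exact build arguments.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the Docker commands for steps 2 and 3 without running them.
# Every value used here is a hypothetical placeholder, not taken from the ADS docs.

# Compose a full registry reference: <region-key>.ocir.io/<namespace>/<repo>:<tag>
ocir_image() {
  local region_key="$1" namespace="$2" repo="$3" tag="$4"
  printf '%s.ocir.io/%s/%s:%s\n' "$region_key" "$namespace" "$repo" "$tag"
}

# Print the build, tag, and push commands for a local image and its remote name.
print_image_steps() {
  local local_image="$1" remote_image="$2"
  echo "docker build -t ${local_image} ."
  echo "docker tag ${local_image} ${remote_image}"
  echo "docker push ${remote_image}"
}

# Example invocation with made-up values.
remote="$(ocir_image iad mytenancy distributed-training 1.0)"
print_image_steps distributed-training:1.0 "$remote"
```

Printing the commands first makes it easy to review the registry path before anything is pushed; once it looks right, the printed lines can be run by hand.
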
++++++++++++++++++++++++++++++++++++++++++++++++
Prepare container image for distributed workload
++++++++++++++++++++++++++++++++++++++++++++++++

**Prerequisite**:

1. Internet Connection
2. ADS cli is :doc:`installed<../../../cli/quickstart>`
3. Docker engine

.. code-block:: shell

  ads opctl distributed-training init --framework

To run a distributed workload on ``OCI Data Science Jobs``, you need to prepare a container image with the source code that you want to run and the framework (Dask|Horovod|PyTorch) setup. ``OCI Data Science`` provides you with the Dockerfiles and bootstrapping scripts to build framework-specific container images. This step creates a folder in the current working directory called ``oci_distributed_training``. This folder contains all the artifacts required to set up and bootstrap the framework code. Refer to the ``README.md`` file for more details on how to build and push the container image to ``ocir``.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Check Config File generated by the Main and the Worker Nodes
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

**Prerequisite**:

1. A cluster in the In-Progress, Succeeded or Failed state (supported only in some cases)
2. Job ``OCID`` and the ``work dir`` of the cluster, or a yaml file which contains all the details displayed during cluster creation.

.. code-block:: shell

  ads opctl distributed-training show-config -f

The main node generates ``MAIN_config.json`` and worker nodes generate ``WORKER__config.json``. You may want to check the configuration details to find the IP address of the Main node. This can be useful for bringing up the Dask dashboard or for debugging.
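
If you have a local copy of ``MAIN_config.json``, one rough way to spot the Main node's IP address is a plain text search for IPv4-looking strings. The sketch below makes no assumption about the file's layout, since the schema is not described here: the helper name ``find_ipv4`` is hypothetical, and the demo file, its key name and its address are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list the unique IPv4-looking strings found in a config file.
# This is a text search, not a JSON parse, so no key names are assumed.
find_ipv4() {
  grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" | sort -u
}

# Demo on a made-up file; the key name and the address are hypothetical.
demo="$(mktemp)"
printf '{"example_key": "10.0.0.12"}\n' > "$demo"
find_ipv4 "$demo"
rm -f "$demo"
```

Because the pattern matches any four dot-separated numbers, a value such as a four-part version string would also be reported, so treat the result as a list of candidates to check against the file itself.
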