==============
Configurations
==============

--------
Networks
--------

You need to use a private subnet for distributed training and configure the security list to allow traffic through specific ports for communication between nodes. The following default ports are used by the corresponding frameworks:

* `Dask`:

  * ``Scheduler Port``: **8786**.
  * ``Dashboard Port``: **8787**.
  * ``Worker Ports``: **Random by default**. Open a specific range of ports and provide that range in the startup options.
  * ``Nanny Process Ports``: **Random by default**. Open a specific range of ports and provide that range in the startup options.

* `PyTorch`: By default, ``PyTorch`` uses port **29400**.
* `Horovod`: Allow TCP traffic on all ports within the subnet.
* `TensorFlow`: Allow traffic from all source ports to one worker port (default: **12345**). If you change the port, provide the new value in the ``train.yaml`` config.

See also: Security Lists (OCI documentation).

------------
OCI Policies
------------

Several OCI policies are needed for distributed training.

.. admonition:: Policy subject
  :class: note

  In the following examples, ``group <your_user_group>`` is the subject of the policy. When starting the job from an OCI notebook session using resource principal, the subject should be a dynamic group, for example, ``dynamic-group <your_dynamic_group>``.

Distributed training uses OCI Container Registry to store the container image.

To push images to the container registry, the ``manage repos`` policy is needed, for example:

.. code-block::

  Allow group <your_user_group> to manage repos in compartment <your_compartment_name>

To pull images from the container registry for local testing, the ``read repos`` policy is needed, for example:

.. code-block::

  Allow group <your_user_group> to read repos in compartment <your_compartment_name>

You can also restrict the permission to a specific repository, for example:

.. code-block::

  Allow group <your_user_group> to read repos in compartment <your_compartment_name> where all { target.repo.name='<your_repo_name>' }

See also: Policies to Control Repository Access (OCI documentation).

To start distributed training jobs, the user needs access to multiple resources, including:

* ``read repos``
* ``manage data-science-jobs``
* ``manage data-science-job-runs``
* ``use virtual-network-family``
* ``manage log-groups``
* ``use log-content``
* ``read metrics``

For example:

.. code-block::

  Allow group <your_user_group> to manage data-science-jobs in compartment <your_compartment_name>
  Allow group <your_user_group> to manage data-science-job-runs in compartment <your_compartment_name>
  Allow group <your_user_group> to use virtual-network-family in compartment <your_compartment_name>
  Allow group <your_user_group> to manage log-groups in compartment <your_compartment_name>
  Allow group <your_user_group> to use logging-family in compartment <your_compartment_name>
  Allow group <your_user_group> to read metrics in compartment <your_compartment_name>

Policies are also needed for the job runs themselves, for example:

.. code-block::

  Allow dynamic-group distributed_training_job_runs to read repos in compartment <your_compartment_name>
  Allow dynamic-group distributed_training_job_runs to use data-science-family in compartment <your_compartment_name>
  Allow dynamic-group distributed_training_job_runs to use virtual-network-family in compartment <your_compartment_name>
  Allow dynamic-group distributed_training_job_runs to use log-groups in compartment <your_compartment_name>
  Allow dynamic-group distributed_training_job_runs to use logging-family in compartment <your_compartment_name>

See also: Data Science Policies (OCI documentation).

Distributed training uses OCI Object Storage to store artifacts and outputs. The bucket should be created before starting any distributed training. The ``manage objects`` policy is needed for users and job runs to read and write files in the bucket. The ``manage buckets`` policy is required for job runs to synchronize generated artifacts. For example:

.. code-block::

  Allow group <your_user_group> to manage objects in compartment <your_compartment_name> where all {target.bucket.name='<your_bucket_name>'}
  Allow dynamic-group distributed_training_job_runs to manage objects in compartment <your_compartment_name> where all {target.bucket.name='<your_bucket_name>'}
  Allow dynamic-group distributed_training_job_runs to manage buckets in compartment <your_compartment_name> where all {target.bucket.name='<your_bucket_name>'}

See also: Object Storage Policies (OCI documentation).

-------------
Policy Syntax
-------------

The overall syntax of a policy statement is as follows:

``Allow <subject> to <verb> <resource-type> in <location> where <conditions>``

See also: https://docs.oracle.com/en-us/iaas/Content/Identity/Concepts/policysyntax.htm

For ``<subject>``:

* If you are using API key authentication, ``<subject>`` should be the group your user belongs to. For example, ``group <your_user_group>``.
* If you are using resource principal or instance principal authentication, ``<subject>`` should be the dynamic group to which your OCI resource belongs. Here, the resource is where the API requests originate, which is usually a job run, a notebook session, or a compute instance. For example, ``dynamic-group <your_dynamic_group>``.

A dynamic group allows you to group OCI resources like job runs and notebook sessions. Distributed training runs on Data Science Jobs. For the training process to access resources, the job runs need to be defined as a dynamic group, which is then used as the ``<subject>`` in policies.

In the examples on this page, the ``distributed_training_job_runs`` dynamic group is defined as:

``all { resource.type='datasciencejobrun', resource.compartment.id='<your_compartment_ocid>' }``

The examples also assume that the user in ``group <your_user_group>`` prepares the docker image and starts the training job.

The ``<verb>`` determines the ability of the ``<subject>`` to work on the ``<resource-type>``. Four options are available: ``inspect``, ``read``, ``use`` and ``manage``.

The ``<resource-type>`` specifies the resources you would like to access. Distributed training uses the following OCI resources/services:

* Data Science Jobs. Resource Type: ``data-science-jobs`` and ``data-science-job-runs``
* Object Storage. Resource Type: ``buckets`` and ``objects``
* Container Registry. Resource Type: ``repos``

The ``<location>`` is usually the compartment or tenancy in which your resources (specified by ``<resource-type>``) reside.

* If you would like the ``<subject>`` to have access to all resources (specified by ``<resource-type>``) in the tenancy, you can use ``tenancy`` as ``<location>``.
* If you would like the ``<subject>`` to have access to resources in a specific compartment, you can use ``compartment <your_compartment_name>`` as ``<location>``.

The optional ``where <conditions>`` clause can be used to filter the resources specified in ``<resource-type>``.
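As a rough illustration of the policy syntax described above, the following Python sketch assembles statements from the subject, verb, resource type, location and optional conditions. The helper ``policy_statement`` is hypothetical, and the names ``distributed_training_job_runs``, ``your_compartment_name`` and ``your_bucket_name`` are placeholders chosen for illustration. The code only formats strings and does not call any OCI API.

```python
from typing import Optional

# The four verbs listed in the Policy Syntax section.
ALLOWED_VERBS = ("inspect", "read", "use", "manage")


def policy_statement(
    subject: str,
    verb: str,
    resource_type: str,
    location: str,
    conditions: Optional[str] = None,
) -> str:
    """Build 'Allow <subject> to <verb> <resource-type> in <location> [where <conditions>]'."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"verb must be one of {ALLOWED_VERBS}, got {verb!r}")
    statement = f"Allow {subject} to {verb} {resource_type} in {location}"
    if conditions:
        # The where clause is optional and narrows the matched resources.
        statement += f" where {conditions}"
    return statement


# Placeholder names, assumed for illustration only.
job_runs = "dynamic-group distributed_training_job_runs"
compartment = "compartment your_compartment_name"

statements = [
    policy_statement(job_runs, "read", "repos", compartment),
    policy_statement(job_runs, "use", "virtual-network-family", compartment),
    policy_statement(
        job_runs,
        "manage",
        "objects",
        compartment,
        conditions="all {target.bucket.name='your_bucket_name'}",
    ),
]

for statement in statements:
    print(statement)
```

Generating statements this way keeps the subject and location consistent across a set of policies, and rejecting unknown verbs catches typos such as ``user`` before the statement is pasted into the OCI console.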