PyTorch is an open source machine learning framework used for applications such as computer vision and natural language processing, primarily developed by Facebook’s AI Research lab. ADS supports running PyTorch’s native distributed training code (
DistributedDataParallel) with OCI Data Science Jobs. Provided you are following the official PyTorch distributed data parallel guidelines, no changes to your PyTorch code are required.
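As a rough illustration of code that follows those guidelines, here is a minimal sketch of a DistributedDataParallel script that relies only on environment variables for its setup. The helper `read_dist_env`, the placeholder model, and the random batch are hypothetical and exist only for this example. It assumes the launcher exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` alongside `LOCAL_RANK`, which are the names PyTorch's default environment-variable initialization reads.

```python
import os


def read_dist_env(env=os.environ):
    """Read the per-process rank settings exported by the launcher.

    Hypothetical helper for this sketch; the defaults describe a single process.
    """
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def main():
    # Imported here so the helper above stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    cfg = read_dist_env()
    use_cuda = torch.cuda.is_available()

    # With no init_method given, init_process_group() falls back to env://
    # and reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")

    if use_cuda:
        # Pin this process to the GPU that matches its local rank on the node.
        torch.cuda.set_device(cfg["local_rank"])
        device = torch.device("cuda", cfg["local_rank"])
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(10, 1).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[cfg["local_rank"]] if use_cuda else None)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(8, 10, device=device)  # placeholder batch
    targets = torch.randn(8, 1, device=device)

    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()  # DDP synchronizes gradients across processes during backward
    optimizer.step()

    dist.destroy_process_group()
```

To use the sketch, call `main()` at the end of the script and start it under a launcher that sets the variables above. Nothing in it hard-codes an address, port, or rank, so the same file can run unchanged on one process or on many.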
PyTorch distributed training requires initialization using the
torch.distributed.init_process_group() function. By default, this function uses environment variables to initialize communication for the training cluster. When using ADS to run PyTorch distributed training on OCI Data Science Jobs, the environment variables, including
LOCAL_RANK, are set automatically in the job runs. By default,
MASTER_PORT will be set to