Train PyTorch Models

Added in version 2.8.8.

The PyTorchDistributedRuntime is designed for training PyTorch models, including large language models (LLMs), with multiple GPUs across multiple nodes. If your training code is compatible with torchrun, DeepSpeed, or Accelerate, you can run it using OCI Data Science Jobs with zero code change. For multi-node training, ADS launches multiple job runs, each corresponding to one node.

See Distributed Data Parallel in PyTorch for a series of tutorials on PyTorch distributed training.

Prerequisite

You need oracle-ads>=2.8.8 to create a job with PyTorchDistributedRuntime.

You also need to specify a conda environment with PyTorch>=1.10 and oracle-ads>=2.6.8 for the job. See Conda Environment for details about specifying the conda environment for a job.

We recommend using the pytorch20_p39_gpu_v1 service conda environment and adding additional packages as needed.

You need to specify a subnet ID and allow ingress traffic within the subnet.

Torchrun Example

Here is an example of training a GPT model using source code taken directly from the official PyTorch examples GitHub repository. See the Training “Real-World” models with DDP tutorial for a walkthrough of the source code.

Python:
from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime

job = (
    Job(name="PyTorch DDP Job")
    .with_infrastructure(
        DataScienceJob()
        # Configure logging for getting the job run outputs.
        .with_log_group_id("<log_group_ocid>")
        # Log resource will be auto-generated if log ID is not specified.
        .with_log_id("<log_ocid>")
        # If you are in an OCI data science notebook session,
        # the following configurations are not required.
        # Configurations from the notebook session will be used as defaults.
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.GPU.A10.1")
        # Minimum/Default block storage size is 50 (GB).
        .with_block_storage_size(50)
    )
    .with_runtime(
        PyTorchDistributedRuntime()
        # Specify the service conda environment by slug name.
        .with_service_conda("pytorch20_p39_gpu_v1")
        .with_git(url="https://github.com/pytorch/examples.git", commit="d91085d2181bf6342ac7dafbeee6fc0a1f64dcec")
        .with_dependency("distributed/minGPT-ddp/requirements.txt")
        .with_inputs({
          "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt": "data/input.txt"
        })
        .with_output("data", "oci://bucket_name@namespace/path/to/dir")
        .with_command("torchrun distributed/minGPT-ddp/mingpt/main.py data_config.path=data/input.txt trainer_config.snapshot_path=data/snapshot.pt")
        .with_replica(2)
    )
)
YAML:

kind: job
apiVersion: v1.0
spec:
  name: PyTorch-MinGPT
  infrastructure:
    kind: infrastructure
    spec:
      blockStorageSize: 50
      compartmentId: "{{ compartment_id }}"
      logGroupId: "{{ log_group_id }}"
      logId: "{{ log_id }}"
      projectId: "{{ project_id }}"
      subnetId: "{{ subnet_id }}"
      shapeName: VM.GPU.A10.1
    type: dataScienceJob
  runtime:
    kind: runtime
    type: pyTorchDistributed
    spec:
      replicas: 2
      conda:
        type: service
        slug: pytorch20_p39_gpu_v1
      dependencies:
        pipRequirements: distributed/minGPT-ddp/requirements.txt
      inputs:
        "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt": "data/input.txt"
      outputDir: data
      outputUri: oci://bucket_name@namespace/path/to/dir
      git:
        url: https://github.com/pytorch/examples.git
        commit: d91085d2181bf6342ac7dafbeee6fc0a1f64dcec
      command: >-
        torchrun distributed/minGPT-ddp/mingpt/main.py
        data_config.path=data/input.txt
        trainer_config.snapshot_path=data/snapshot.pt

To create and start running the job:

Python:
# Create the job on OCI Data Science
job.create()
# Start a job run
run = job.run()
# Stream the job run outputs (from the first node)
run.watch()
CLI:

# Use the following command to start the job run
ads opctl run -f your_job.yaml

Source Code

The source code location can be specified as a Git repository, a local path, or a remote URI supported by fsspec.

You can use the with_git() method to specify the URL of a Git repository as the source code location. Optionally, you can specify the branch or commit to check out.

For a public repository, we recommend using the “http://” or “https://” URL. Authentication may be required for an SSH URL even if the repository is public.

To use a private repository, you must first save an SSH key to OCI Vault as a secret, and provide the secret_ocid when calling with_git(). For more information about creating and using secrets, see Managing Secret with Vault. For a repository on GitHub, you can set up a GitHub deploy key as the secret.

Git Version for Private Repository

Git version 2.3 or later is required to use a private repository.
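For example, a minimal sketch of a with_git() call for a public repository pinned to a commit, and one for a private repository authenticated with a secret from OCI Vault (the private repository URL and secret OCID are placeholders):

from ads.jobs import PyTorchDistributedRuntime

# Public repository, pinned to a specific commit.
runtime = PyTorchDistributedRuntime().with_git(
    url="https://github.com/pytorch/examples.git",
    commit="d91085d2181bf6342ac7dafbeee6fc0a1f64dcec",
)

# Private repository over SSH, authenticated with an SSH key
# stored in OCI Vault as a secret (placeholder OCID).
runtime = PyTorchDistributedRuntime().with_git(
    url="git@github.com:your_org/your_private_repo.git",
    branch="main",
    secret_ocid="<secret_ocid>",
)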

Alternatively, you can use the with_source() method to specify the source code as a local path or a remote URI supported by fsspec. For example, you can specify files on OCI object storage using a URI like oci://bucket@namespace/path/to/prefix. ADS uses the authentication method configured by ads.set_auth() to fetch the files and upload them as the job artifact. The source code can be a single file, a compressed file/archive (zip/tar), or a folder.
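As a sketch of both options (the local path and the Object Storage URI are placeholders):

from ads.jobs import PyTorchDistributedRuntime

# Source code from a local folder, uploaded as the job artifact.
runtime = PyTorchDistributedRuntime().with_source("path/to/local/source")

# Source code from OCI object storage, fetched using the
# authentication configured by ads.set_auth().
runtime = PyTorchDistributedRuntime().with_source(
    "oci://bucket_name@namespace/path/to/source.zip"
)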

Working Directory

The default working directory depends on how the source code is specified:

  • When the source code is specified as a Git repository URL, the default working directory is the root of the Git repository.
  • When the source code is a single file (script), the default working directory is the directory containing the file.
  • When the source code is specified as a local or remote directory, the default working directory is the directory containing the source code directory.

The working directory of your workload can be configured by with_working_dir(). See Python Runtime Working Directory for more details.
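For example, a sketch that runs the minGPT example with the repository subfolder as the working directory (assuming the repository layout of the example above, so relative paths in the command resolve against that subfolder):

runtime = (
    PyTorchDistributedRuntime()
    .with_service_conda("pytorch20_p39_gpu_v1")
    .with_git(url="https://github.com/pytorch/examples.git")
    # Use the subfolder, rather than the repository root,
    # as the working directory.
    .with_working_dir("distributed/minGPT-ddp")
    .with_command("torchrun mingpt/main.py")
)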

Input Data

You can specify the input (training) data for the job using the with_inputs() method, which takes a dictionary mapping each “source” to a “destination”. The “source” can be an OCI object storage URI, or an HTTP or FTP URL. The “destination” is the local path in the job run. If the “destination” is a relative path, it is resolved relative to the working directory.
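For example (the object storage URI is a placeholder):

runtime = PyTorchDistributedRuntime().with_inputs({
    # HTTP URL, copied to a path relative to the working directory.
    "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt": "data/input.txt",
    # OCI object storage URI, copied to another relative path.
    "oci://bucket_name@namespace/path/to/train.csv": "data/train.csv",
})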

Outputs

You can specify the output data to be copied to object storage by using the with_output() method. It takes an output path (output_path) in the job run and a remote URI (output_uri). Files in the output_path are copied to the remote output URI after the job run finishes successfully. Note that the output_path should be a path relative to the working directory.

The OCI object storage location can be specified in the format oci://bucket_name@namespace/path/to/dir. Make sure you configure the IAM policy to allow the job run dynamic group to use object storage.
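As a sketch, using positional arguments and a placeholder bucket:

runtime = PyTorchDistributedRuntime().with_output(
    # Output path relative to the working directory; files here are
    # copied after the job run finishes successfully.
    "data",
    # Remote OCI object storage URI to copy the outputs to.
    "oci://bucket_name@namespace/path/to/dir",
)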

Number of nodes

The with_replica() method specifies the number of nodes for the training job. For example, with_replica(2) in the example above launches two job runs, one per node.

Command

The command to start your workload is specified by using the with_command() method.

For torchrun, ADS sets --nnodes, --nproc_per_node, --rdzv_backend, and --rdzv_endpoint automatically. You do not need to specify them in the command unless you would like to override the values. The default rdzv_backend is c10d. The default port for rdzv_endpoint is 29400.
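For example, the following sketch overrides --nproc_per_node while letting ADS fill in the other arguments (the script path follows the minGPT example above):

runtime = PyTorchDistributedRuntime().with_command(
    # ADS sets --nnodes, --rdzv_backend and --rdzv_endpoint automatically;
    # --nproc_per_node is specified here to override the default.
    "torchrun --nproc_per_node 1 distributed/minGPT-ddp/mingpt/main.py"
)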

If your workload uses DeepSpeed, you also need to set use_deepspeed to True when specifying the command. For DeepSpeed, ADS generates the hostfile automatically and sets up the SSH configuration.
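A minimal sketch, assuming your workload is launched with the deepspeed command and uses a hypothetical train.py and ds_config.json shipped with your source code:

runtime = PyTorchDistributedRuntime().with_command(
    # Hypothetical entrypoint and DeepSpeed config file names.
    "deepspeed train.py --deepspeed --deepspeed_config ds_config.json",
    use_deepspeed=True,
)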

For accelerate launch, you can add your config YAML to the source code and specify it using the --config_file argument. In your config, use LOCAL_MACHINE as the compute environment. The same config file is used by all nodes in a multi-node workload. ADS sets --num_processes, --num_machines, --machine_rank, --main_process_ip, and --main_process_port automatically. For these arguments, ADS overrides the values from your config YAML. If you would like to use your own values, specify them as command arguments. The default main_process_port is 29400.
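For example, a sketch assuming a hypothetical train.py and a config file named accelerate_config.yaml in your source code:

runtime = PyTorchDistributedRuntime().with_command(
    # ADS sets --num_processes, --num_machines, --machine_rank,
    # --main_process_ip and --main_process_port automatically.
    "accelerate launch --config_file accelerate_config.yaml train.py"
)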

Additional dependencies

The with_dependency() method lets you specify additional dependencies to be installed into the conda environment before starting your workload:

  • pip_req specifies the path of a requirements.txt file in your source code.
  • pip_pkg specifies the packages to be installed, as a string.
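For example (the package names are placeholders):

# Install from a requirements.txt file in your source code.
runtime = PyTorchDistributedRuntime().with_dependency(
    pip_req="distributed/minGPT-ddp/requirements.txt"
)

# Alternatively, install specific packages.
runtime = PyTorchDistributedRuntime().with_dependency(
    pip_pkg="package_a package_b"
)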

Python Paths

The working directory is added to the Python paths automatically. You can call with_python_path() to add additional Python paths as needed. The paths should be relative to the working directory.
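For example, assuming your source code keeps importable modules in a hypothetical src folder:

# "src" is resolved relative to the working directory.
runtime = PyTorchDistributedRuntime().with_python_path("src")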