Training Large Language Models

Added in version 2.8.8.

Oracle Cloud Infrastructure (OCI) Data Science Jobs (Jobs) provides fully managed infrastructure for training large language models at scale. This page shows an example of fine-tuning the Llama 2 model. For details on the APIs, see Train PyTorch Models.

Distributed Training with OCI Data Science

You need to configure your networking and IAM policies. We recommend running the training on a private subnet. In this example, internet access (for example, through a NAT gateway) is needed to download the source code and the pre-trained model.

The llama-recipes repository contains example code to fine-tune the Llama 2 model. The example fine-tuning script supports both full-parameter fine-tuning and Parameter-Efficient Fine-Tuning (PEFT). With ADS, you can start the training job using the source code directly from GitHub, with no code changes.

Access the Pre-Trained Model

To fine-tune the model, you first need access to the pre-trained weights, which can be obtained from Meta or Hugging Face. In this example, we use an access token to download the pre-trained model from Hugging Face, by setting the HUGGING_FACE_HUB_TOKEN environment variable.
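A minimal sketch of that mechanism (the token value is a placeholder, and the commented-out download assumes you have been granted access to the gated model):

```python
import os

# Placeholder token; in the job definition this is supplied through the
# runtime's environment variables rather than hard-coded.
os.environ["HUGGING_FACE_HUB_TOKEN"] = "<access_token>"

# With the variable set, the Hugging Face Hub client authenticates
# automatically, so the training script can download the weights, e.g.:
# from huggingface_hub import snapshot_download
# snapshot_download(repo_id="meta-llama/Llama-2-7b-hf")

print(os.environ["HUGGING_FACE_HUB_TOKEN"])
```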

Fine-Tuning the Model

You can define the training job with the ADS Python API or with YAML. Here are examples of fine-tuning all parameters of the 7B model using FSDP.

  • Python
  • YAML
from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime

job = (
    Job(name="LLAMA2-Fine-Tuning")
    .with_infrastructure(
        DataScienceJob()
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_id>")
        .with_subnet_id("<subnet_id>")
        .with_log_group_id("<log_group_id>")
        .with_log_id("<log_id>")
        .with_shape_name("VM.GPU.A10.2")
        .with_block_storage_size(256)
    )
    .with_runtime(
        PyTorchDistributedRuntime()
        # Specify the service conda environment by slug name.
        .with_service_conda("pytorch20_p39_gpu_v2")
        # Fetch the llama-recipes source code at a pinned commit.
        .with_git(
            url="https://github.com/facebookresearch/llama-recipes.git",
            commit="1aecd00924738239f8d86f342b36bacad180d2b3",
        )
        # Install a pinned PyTorch build from the PyTorch package index.
        .with_dependency(
          pip_pkg=" ".join([
            "--extra-index-url https://download.pytorch.org/whl/cu118 torch==2.1.0",
          ])
        )
        .with_output("/home/datascience/outputs", "oci://bucket@namespace/outputs/$JOB_RUN_OCID")
        .with_command(" ".join([
          "torchrun examples/finetuning.py",
          "--enable_fsdp --pure_bf16",
          "--batch_size_training 1",
          "--model_name $MODEL_NAME",
          "--dist_checkpoint_root_folder /home/datascience/outputs",
          "--dist_checkpoint_folder fine-tuned",
        ]))
        .with_replica(2)
        .with_environment_variable(
            MODEL_NAME="meta-llama/Llama-2-7b-hf",
            HUGGING_FACE_HUB_TOKEN="<access_token>",
            LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib",
        )
    )
)

kind: job
apiVersion: v1.0
spec:
  name: LLAMA2-Fine-Tuning
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      blockStorageSize: 256
      compartmentId: "<compartment_ocid>"
      logGroupId: "<log_group_id>"
      logId: "<log_id>"
      projectId: "<project_id>"
      subnetId: "<subnet_id>"
      shapeName: VM.GPU.A10.2
  runtime:
    kind: runtime
    type: pyTorchDistributed
    spec:
      git:
        url: https://github.com/facebookresearch/llama-recipes.git
        commit: 1aecd00924738239f8d86f342b36bacad180d2b3
      command: >-
        torchrun examples/finetuning.py
        --enable_fsdp
        --pure_bf16
        --batch_size_training 1
        --model_name $MODEL_NAME
        --dist_checkpoint_root_folder /home/datascience/outputs
        --dist_checkpoint_folder fine-tuned
      replicas: 2
      conda:
        type: service
        slug: pytorch20_p39_gpu_v2
      dependencies:
        pipPackages: >-
          --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.1.0
      outputDir: /home/datascience/outputs
      outputUri: oci://bucket@namespace/outputs/$JOB_RUN_OCID
      env:
        - name: MODEL_NAME
          value: meta-llama/Llama-2-7b-hf
        - name: HUGGING_FACE_HUB_TOKEN
          value: "<access_token>"
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib

You can create and start the job run with a Python API call or with the ADS CLI.

To create and start running the job:

  • Python
  • CLI
# Create the job on OCI Data Science
job.create()
# Start a job run
run = job.run()
# Stream the job run outputs (from the first node)
run.watch()
# Use the following command to start the job run
ads opctl run -f your_job.yaml

The job run will:

  • Set up the PyTorch conda environment and install additional dependencies.

  • Fetch the source code from GitHub and checkout the specific commit.

  • Run the training script with the specified arguments; this includes downloading the model and dataset.

  • Save the outputs to OCI object storage once the training finishes.

Note that in the training command, there is no need to specify the number of nodes or the number of GPUs. ADS configures them automatically based on the replica count and shape you specify.
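As an illustration (a hypothetical helper, not an ADS API), the effective world size follows directly from the job definition: 2 replicas of the VM.GPU.A10.2 shape give 2 nodes with 2 GPUs each.

```python
# Hypothetical helper (not part of ADS): GPU counts for a few A10 shapes,
# used to show how the process count follows from the job definition.
GPUS_PER_NODE = {"VM.GPU.A10.1": 1, "VM.GPU.A10.2": 2, "BM.GPU.A10.4": 4}

def world_size(shape_name: str, replicas: int) -> int:
    """Total number of distributed training processes (one per GPU)."""
    return GPUS_PER_NODE[shape_name] * replicas

print(world_size("VM.GPU.A10.2", replicas=2))  # 2 nodes x 2 GPUs = 4
```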

The fine-tuning runs on the samsum dataset by default. You can also add your custom datasets.
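For reference, samsum is a dialogue-summarization dataset in which each record carries an id, a dialogue, and a reference summary. A record in that shape (the values below are illustrative; a custom dataset would need to be mapped into the prompt format expected by the fine-tuning script):

```python
# Illustrative record following the samsum schema (id, dialogue, summary).
sample = {
    "id": "13818513",
    "dialogue": "Amanda: I baked cookies. Do you want some?\nJerry: Sure!",
    "summary": "Amanda baked cookies and will bring Jerry some.",
}
print(sorted(sample))  # ['dialogue', 'id', 'summary']
```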

Once the fine-tuning is finished, the checkpoints will be saved to the OCI Object Storage bucket you specified. You can load the FSDP checkpoints for inference.

The same training script also supports Parameter-Efficient Fine-Tuning (PEFT). You can change the command to the following to run PEFT with LoRA. Note that for PEFT, the fine-tuned weights are stored in the location specified by --output_dir, while for full-parameter fine-tuning, the checkpoints are stored in the location specified by --dist_checkpoint_root_folder and --dist_checkpoint_folder.

torchrun examples/finetuning.py --enable_fsdp --use_peft --peft_method lora \
--pure_bf16 --batch_size_training 1 \
--model_name meta-llama/Llama-2-7b-hf --output_dir /home/datascience/outputs