Training Large Language Models#

New in version 2.8.8.

Oracle Cloud Infrastructure (OCI) Data Science Jobs (Jobs) provides fully managed infrastructure for training large language models at scale. This page shows an example of fine-tuning the Llama 2 model. For details on the APIs, see Train PyTorch Models.

Distributed Training with OCI Data Science

You need to configure your networking and IAM policies. We recommend running the training on a private subnet; in this example, internet access (for example, through a NAT gateway) is still needed to download the source code and the pre-trained model.
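
The exact policy statements depend on your tenancy. As a rough sketch, assuming a dynamic group that matches your job runs and placeholder names for the compartment and output bucket (all assumptions, not part of the example below), the policies might look like:

allow dynamic-group <job-run-dynamic-group> to manage data-science-family in compartment <compartment-name>
allow dynamic-group <job-run-dynamic-group> to use virtual-network-family in compartment <compartment-name>
allow dynamic-group <job-run-dynamic-group> to manage objects in compartment <compartment-name> where all {target.bucket.name='<bucket-name>'}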

The llama-recipes repository contains example code to fine-tune the Llama 2 models. The example fine-tuning script supports both full-parameter fine-tuning and Parameter-Efficient Fine-Tuning (PEFT). With ADS, you can start the training job using the source code directly from GitHub.

Access the Pre-Trained Model#

To fine-tune the model, you first need access to the pre-trained weights, which you can obtain from Meta or Hugging Face. In this example, we use an access token to download the pre-trained model from Hugging Face (by setting the HUGGING_FACE_HUB_TOKEN environment variable).
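
For illustration only, a minimal sketch of downloading the weights locally with the huggingface_hub package (an assumption here; the job definition below downloads the model inside the job run instead) might look like this:

import os
from huggingface_hub import snapshot_download

# The token can also be picked up automatically from the
# HUGGING_FACE_HUB_TOKEN environment variable.
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    token=os.environ.get("HUGGING_FACE_HUB_TOKEN"),
)
print(local_dir)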

Fine-Tuning the Model#

You can define the training job with the ADS Python API or YAML. Here are examples of fine-tuning the full parameters of the 7B model using FSDP.

  • Python
  • YAML
from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime

job = (
    Job(name="LLAMA2-Fine-Tuning")
    .with_infrastructure(
        DataScienceJob()
        .with_log_group_id("<log_group_ocid>")
        .with_log_id("<log_ocid>")
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.GPU.A10.2")
        .with_block_storage_size(256)
    )
    .with_runtime(
        PyTorchDistributedRuntime()
        # Specify the service conda environment by slug name.
        .with_service_conda("pytorch20_p39_gpu_v1")
        .with_git(
          url="https://github.com/facebookresearch/llama-recipes.git",
          commit="03faba661f079ee1ecaeb66deaa6bdec920a7bab"
        )
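        # Install additional pip packages on top of the service conda environment.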
        .with_dependency(
          pip_pkg=" ".join([
            "'accelerate>=0.21.0'",
            "appdirs",
            "loralib",
            "bitsandbytes==0.39.1",
            "black",
            "'black[jupyter]'",
            "datasets",
            "fire",
            "'git+https://github.com/huggingface/peft.git'",
            "'transformers>=4.31.0'",
            "sentencepiece",
            "py7zr",
            "scipy",
            "optimum"
          ])
        )
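        # Copy everything under this local directory to the OCI Object Storage URI once the job run finishes.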
        .with_output("/home/datascience/outputs", "oci://bucket@namespace/outputs/$JOB_RUN_OCID")
        .with_command(" ".join([
          "torchrun llama_finetuning.py",
          "--enable_fsdp",
          "--pure_bf16",
          "--batch_size_training 1",
          "--micro_batch_size 1",
          "--model_name $MODEL_NAME",
          "--dist_checkpoint_root_folder /home/datascience/outputs",
          "--dist_checkpoint_folder fine-tuned"
        ]))
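        # Run on 2 nodes (replicas); ADS configures the node and GPU counts for torchrun automatically.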
        .with_replica(2)
        .with_environment_variable(
          MODEL_NAME="meta-llama/Llama-2-7b-hf",
          HUGGING_FACE_HUB_TOKEN="<access_token>",
          LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib",
        )
    )
)
kind: job
apiVersion: v1.0
spec:
  name: LLAMA2-Fine-Tuning
  infrastructure:
    kind: infrastructure
    spec:
      blockStorageSize: 256
      compartmentId: "<compartment_ocid>"
      logGroupId: "<log_group_id>"
      logId: "<log_id>"
      projectId: "<project_id>"
      subnetId: "<subnet_id>"
      shapeName: VM.GPU.A10.2
    type: dataScienceJob
  runtime:
    kind: runtime
    type: pyTorchDistributed
    spec:
      git:
        url: https://github.com/facebookresearch/llama-recipes.git
        commit: 03faba661f079ee1ecaeb66deaa6bdec920a7bab
      command: >-
        torchrun llama_finetuning.py
        --enable_fsdp
        --pure_bf16
        --batch_size_training 1
        --micro_batch_size 1
        --model_name $MODEL_NAME
        --dist_checkpoint_root_folder /home/datascience/outputs
        --dist_checkpoint_folder fine-tuned
      replicas: 2
      conda:
        type: service
        slug: pytorch20_p39_gpu_v1
      dependencies:
        pipPackages: >-
          'accelerate>=0.21.0'
          appdirs
          loralib
          bitsandbytes==0.39.1
          black
          'black[jupyter]'
          datasets
          fire
          'git+https://github.com/huggingface/peft.git'
          'transformers>=4.31.0'
          sentencepiece
          py7zr
          scipy
          optimum
      outputDir: /home/datascience/outputs
      outputUri: oci://bucket@namespace/outputs/$JOB_RUN_OCID
      env:
        - name: MODEL_NAME
          value: meta-llama/Llama-2-7b-hf
        - name: HUGGING_FACE_HUB_TOKEN
          value: "<access_token>"
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib

You can create and start the job run with the Python API or the ADS CLI.

To create and start running the job:

  • Python
  • YAML
# Create the job on OCI Data Science
job.create()
# Start a job run
run = job.run()
# Stream the job run outputs (from the first node)
run.watch()
# Use the following command to start the job run
ads opctl run -f your_job.yaml
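
If you started the run from the CLI, you can also stream its logs afterwards with the ADS CLI, for example (substituting your own job run OCID):

ads opctl watch <job_run_ocid>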

The job run will:

  • Set up the PyTorch conda environment and install the additional pip dependencies.

  • Fetch the source code from GitHub and check out the specified commit.

  • Run the training script with the specified arguments, which include downloading the model and dataset.

  • Save the outputs to OCI Object Storage once the training finishes (see the sketch after this list for copying them back locally).
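
After the run completes, the checkpoints written under /home/datascience/outputs are available at the output URI. A minimal sketch of copying them back to your workstation, assuming the ocifs package is installed and OCI credentials are configured (both assumptions, not requirements of the example above):

import fsspec

# The "oci" protocol is provided by the ocifs package.
fs = fsspec.filesystem("oci")
fs.get(
    "bucket@namespace/outputs/<job_run_ocid>/",
    "local_outputs/",
    recursive=True,
)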

Note that in the training command, there is no need to specify the number of nodes or the number of GPUs. ADS configures them automatically based on the replica count and shape you specify.

The fine-tuning runs on the samsum dataset by default. You can also use your own custom datasets.

The same training script also supports Parameter-Efficient Fine-Tuning (PEFT). For PEFT with LoRA, change the command to the following:

torchrun llama_finetuning.py --enable_fsdp --use_peft --peft_method lora \
--pure_bf16 --batch_size_training 1 --micro_batch_size 1 \
--model_name /home/datascience/llama --output_dir /home/datascience/outputs