Training Large Language Models#
New in version 2.8.8.
Oracle Cloud Infrastructure (OCI) Data Science Jobs (Jobs) provides fully managed infrastructure for training large language models at scale. This page shows an example of fine-tuning the Llama 2 model. For details on the APIs, see Train PyTorch Models.
Distributed Training with OCI Data Science
You need to configure your networking and IAM policies. We recommend running the training on a private subnet. In this example, internet access is needed to download the source code and the pre-trained model.
The llama-recipes repository contains example code to fine-tune the Llama 2 model. The example fine-tuning scripts support both full-parameter fine-tuning and Parameter-Efficient Fine-Tuning (PEFT). With ADS, you can start the training job by taking the source code directly from GitHub.
Access the Pre-Trained Model#
To fine-tune the model, you will first need access to the pre-trained model. The pre-trained model can be obtained from Meta or HuggingFace. In this example, we use an access token to download the pre-trained model from HuggingFace, by setting the HUGGING_FACE_HUB_TOKEN environment variable.
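As a minimal local sketch (the token value is a placeholder), setting the variable before any download lets libraries such as huggingface_hub and transformers pick it up automatically:

```python
import os

# Placeholder value for illustration; substitute your real HuggingFace access token.
os.environ["HUGGING_FACE_HUB_TOKEN"] = "<access_token>"

# huggingface_hub and transformers read this variable automatically when
# downloading gated repositories such as meta-llama/Llama-2-7b-hf.
print("HUGGING_FACE_HUB_TOKEN" in os.environ)  # True
```

In the job definition below, the same variable is passed to the job run through with_environment_variable (Python API) or the env section (YAML).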
Fine-Tuning the Model#
You can define the training job with the ADS Python APIs or YAML. Here are examples of fine-tuning the full parameters of the 7B model using FSDP.
from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime

job = (
    Job(name="LLAMA2-Fine-Tuning")
    .with_infrastructure(
        DataScienceJob()
        .with_log_group_id("<log_group_ocid>")
        .with_log_id("<log_ocid>")
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.GPU.A10.1")
        .with_block_storage_size(256)
    )
    .with_runtime(
        PyTorchDistributedRuntime()
        # Specify the service conda environment by slug name.
        .with_service_conda("pytorch20_p39_gpu_v1")
        .with_git(
            url="https://github.com/facebookresearch/llama-recipes.git",
            commit="03faba661f079ee1ecaeb66deaa6bdec920a7bab"
        )
        .with_dependency(
            pip_pkg=" ".join([
                "'accelerate>=0.21.0'",
                "appdirs",
                "loralib",
                "bitsandbytes==0.39.1",
                "black",
                "'black[jupyter]'",
                "datasets",
                "fire",
                "'git+https://github.com/huggingface/peft.git'",
                "'transformers>=4.31.0'",
                "sentencepiece",
                "py7zr",
                "scipy",
                "optimum"
            ])
        )
        .with_output("/home/datascience/outputs", "oci://bucket@namespace/outputs/$JOB_RUN_OCID")
        .with_command(" ".join([
            "torchrun llama_finetuning.py",
            "--enable_fsdp",
            "--pure_bf16",
            "--batch_size_training 1",
            "--micro_batch_size 1",
            "--model_name $MODEL_NAME",
            "--dist_checkpoint_root_folder /home/datascience/outputs",
            "--dist_checkpoint_folder fine-tuned"
        ]))
        .with_replica(2)
        .with_environment_variable(
            MODEL_NAME="meta-llama/Llama-2-7b-hf",
            HUGGING_FACE_HUB_TOKEN="<access_token>",
            LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib",
        )
    )
)
kind: job
apiVersion: v1.0
spec:
  name: LLAMA2-Fine-Tuning
  infrastructure:
    kind: infrastructure
    spec:
      blockStorageSize: 256
      compartmentId: "<compartment_ocid>"
      logGroupId: "<log_group_id>"
      logId: "<log_id>"
      projectId: "<project_id>"
      subnetId: "<subnet_id>"
      shapeName: VM.GPU.A10.2
    type: dataScienceJob
  runtime:
    kind: runtime
    type: pyTorchDistributed
    spec:
      git:
        url: https://github.com/facebookresearch/llama-recipes.git
        commit: 03faba661f079ee1ecaeb66deaa6bdec920a7bab
      command: >-
        torchrun llama_finetuning.py
        --enable_fsdp
        --pure_bf16
        --batch_size_training 1
        --micro_batch_size 1
        --model_name $MODEL_NAME
        --dist_checkpoint_root_folder /home/datascience/outputs
        --dist_checkpoint_folder fine-tuned
      replicas: 2
      conda:
        type: service
        slug: pytorch20_p39_gpu_v1
      dependencies:
        pipPackages: >-
          'accelerate>=0.21.0'
          appdirs
          loralib
          bitsandbytes==0.39.1
          black
          'black[jupyter]'
          datasets
          fire
          'git+https://github.com/huggingface/peft.git'
          'transformers>=4.31.0'
          sentencepiece
          py7zr
          scipy
          optimum
      outputDir: /home/datascience/outputs
      outputUri: oci://bucket@namespace/outputs/$JOB_RUN_OCID
      env:
        - name: MODEL_NAME
          value: meta-llama/Llama-2-7b-hf
        - name: HUGGING_FACE_HUB_TOKEN
          value: "<access_token>"
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib
You can create and start the job run with API calls or the ADS CLI.
To create and start running the job:
# Create the job on OCI Data Science
job.create()
# Start a job run
run = job.run()
# Stream the job run outputs (from the first node)
run.watch()
# Alternatively, use the ADS CLI to create and start the job run from the YAML file
ads opctl run -f your_job.yaml
The job run will:
Set up the PyTorch conda environment and install additional dependencies.
Fetch the source code from GitHub and check out the specific commit.
Run the training script with the specified arguments, which include downloading the model and dataset.
Save the outputs to OCI Object Storage once the training finishes.
Note that in the training command, there is no need to specify the number of nodes or the number of GPUs. ADS automatically configures them based on the replica count and shape you specified.
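As an illustration of this mapping (the helper below is hypothetical, not ADS internals), the torchrun node and process counts follow directly from the replica count and the number of GPUs on the chosen shape:

```python
# Hypothetical sketch of how torchrun's launch arguments can be derived from
# the job configuration; ADS performs the equivalent computation for you.
def torchrun_args(replicas: int, gpus_per_node: int) -> str:
    """Derive the distributed launch arguments torchrun would receive."""
    return f"--nnodes {replicas} --nproc_per_node {gpus_per_node}"

# Two replicas on VM.GPU.A10.1 (one A10 GPU per node):
print(torchrun_args(2, 1))  # --nnodes 2 --nproc_per_node 1
```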
The fine-tuning runs on the samsum dataset by default. You can also add your custom datasets.
The same training script also supports Parameter-Efficient Fine-Tuning (PEFT). For PEFT with LoRA, you can change the command to the following:
torchrun llama_finetuning.py --enable_fsdp --use_peft --peft_method lora \
--pure_bf16 --batch_size_training 1 --micro_batch_size 1 \
--model_name /home/datascience/llama --output_dir /home/datascience/outputs
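To see why PEFT with LoRA trains far fewer parameters than full fine-tuning, here is a back-of-the-envelope sketch (the hidden size matches Llama 2 7B; the rank value is illustrative, not taken from the script):

```python
# For a d x d weight matrix, LoRA trains two low-rank factors of shapes
# (d, r) and (r, d) instead of updating all d * d entries.
d = 4096   # Llama 2 7B hidden size
r = 8      # illustrative LoRA rank

full_params = d * d          # entries updated by full fine-tuning
lora_params = 2 * d * r      # entries trained by a rank-r LoRA adapter

print(full_params, lora_params, full_params // lora_params)
# 16777216 65536 256
```

Per adapted matrix, the rank-8 adapter trains roughly 1/256 of the entries, which is why PEFT fits on much smaller GPU shapes.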