Training with OCI
Oracle Cloud Infrastructure (OCI) Data Science Jobs (Jobs) enables you to define and run repeatable machine learning tasks on fully managed infrastructure. Jobs provide on-demand compute resources for applications that perform tasks such as data preparation, model training, hyperparameter tuning, and batch inference.
Here is an example of training an RNN for word-level language modeling, using the source code directly from GitHub:
from ads.jobs import Job, DataScienceJob, GitPythonRuntime

job = (
    Job(name="Training RNN with PyTorch")
    .with_infrastructure(
        DataScienceJob()
        # Configure logging for getting the job run outputs.
        .with_log_group_id("<log_group_ocid>")
        # Log resource will be auto-generated if log ID is not specified.
        .with_log_id("<log_ocid>")
        # If you are in an OCI data science notebook session,
        # the following configurations are not required.
        # Configurations from the notebook session will be used as defaults.
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.Standard.E3.Flex")
        # Shape config details are applicable only for the flexible shapes.
        .with_shape_config_details(memory_in_gbs=16, ocpus=1)
        # Minimum/default block storage size is 50 (GB).
        .with_block_storage_size(50)
    )
    .with_runtime(
        GitPythonRuntime(skip_metadata_update=True)
        # Use service conda pack
        .with_service_conda("pytorch110_p38_gpu_v1")
        # Specify training source code from GitHub
        .with_source(url="https://github.com/pytorch/examples.git", branch="main")
        # Entrypoint is a relative path from the root of the Git repository.
        .with_entrypoint("word_language_model/main.py")
        # Pass the arguments as: "--epochs 5 --save model.pt --cuda"
        .with_argument(epochs=5, save="model.pt", cuda=None)
        # Set the working directory, which will also be added to PYTHONPATH.
        .with_working_dir("word_language_model")
        # Save the outputs to OCI object storage.
        # output_dir is relative to the working directory.
        .with_output(output_dir=".", output_uri="oci://bucket@namespace/prefix")
    )
)
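The same job can also be expressed in YAML. To obtain this representation from the job object defined above, serialize it with to_yaml() (available on ADS Job objects):

# Print the YAML representation of the job defined above.
print(job.to_yaml())

This produces a specification like the following: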
kind: job
spec:
  name: "Training RNN with PyTorch"
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      blockStorageSize: 50
      compartmentId: <compartment_ocid>
      jobInfrastructureType: STANDALONE
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      projectId: <project_ocid>
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      shapeName: VM.Standard.E3.Flex
      subnetId: <subnet_ocid>
  runtime:
    kind: runtime
    type: gitPython
    spec:
      args:
        - --epochs
        - '5'
        - --save
        - model.pt
        - --cuda
      branch: main
      conda:
        slug: pytorch110_p38_gpu_v1
        type: service
      entrypoint: word_language_model/main.py
      outputDir: .
      outputUri: oci://bucket@namespace/prefix
      skipMetadataUpdate: true
      url: https://github.com/pytorch/examples.git
      workingDir: word_language_model
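Conversely, a YAML specification saved to a file can be loaded back into a Job object. A minimal sketch, assuming the specification above is saved as your_job.yaml:

from ads.jobs import Job

# Load the job definition from a YAML file.
job = Job.from_yaml(uri="your_job.yaml")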
To create and start running the job:
# Create the job on OCI Data Science
job.create()
# Start a job run
run = job.run()
# Stream the job run outputs (for multi-node runs, logs come from the first node)
run.watch()
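Note that watch() blocks while streaming logs until the run reaches a terminal state. To check on a run without blocking, read its status instead; a minimal sketch using the run object returned by job.run():

# status reflects the lifecycle state of the job run,
# e.g. ACCEPTED, IN_PROGRESS, SUCCEEDED, or FAILED.
print(run.status)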
Alternatively, if the job is defined in a YAML file (for example, the your_job.yaml shown above), you can create the job and start a job run with the ADS CLI:

ads opctl run -f your_job.yaml
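The CLI can also stream the outputs of an existing run. A minimal sketch, where <job_run_ocid> is a placeholder for the OCID reported when the run starts:

# Stream the outputs of a job run by its OCID.
ads opctl watch <job_run_ocid>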
The job run will:

- Set up the PyTorch conda environment.
- Fetch the source code from GitHub.
- Run the training script with the specified arguments.
- Save the outputs to OCI Object Storage (see the sketch after this list for retrieving them).
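Once the run succeeds, the files written to output_dir (including model.pt) are copied to the output_uri location. A minimal sketch for retrieving the trained model, assuming the ocifs package is installed and that bucket, namespace, and prefix are replaced with the real values used in with_output():

import fsspec

# ocifs registers the "oci://" protocol with fsspec, so files saved by
# the job run can be read directly from OCI Object Storage.
with fsspec.open("oci://bucket@namespace/prefix/model.pt", mode="rb") as f:
    model_bytes = f.read()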
Following is example output from the job run:
2023-02-27 20:26:36 - Job Run ACCEPTED
2023-02-27 20:27:05 - Job Run ACCEPTED, Infrastructure provisioning.
2023-02-27 20:28:27 - Job Run ACCEPTED, Infrastructure provisioned.
2023-02-27 20:28:53 - Job Run ACCEPTED, Job run bootstrap starting.
2023-02-27 20:33:05 - Job Run ACCEPTED, Job run bootstrap complete. Artifact execution starting.
2023-02-27 20:33:08 - Job Run IN_PROGRESS, Job run artifact execution in progress.
2023-02-27 20:33:31 - | epoch 1 | 200/ 2983 batches | lr 20.00 | ms/batch 8.41 | loss 7.63 | ppl 2064.78
2023-02-27 20:33:32 - | epoch 1 | 400/ 2983 batches | lr 20.00 | ms/batch 8.23 | loss 6.86 | ppl 949.18
2023-02-27 20:33:34 - | epoch 1 | 600/ 2983 batches | lr 20.00 | ms/batch 8.21 | loss 6.47 | ppl 643.12
2023-02-27 20:33:36 - | epoch 1 | 800/ 2983 batches | lr 20.00 | ms/batch 8.22 | loss 6.29 | ppl 537.11
2023-02-27 20:33:37 - | epoch 1 | 1000/ 2983 batches | lr 20.00 | ms/batch 8.22 | loss 6.14 | ppl 462.61
2023-02-27 20:33:39 - | epoch 1 | 1200/ 2983 batches | lr 20.00 | ms/batch 8.21 | loss 6.05 | ppl 425.85
...
2023-02-27 20:35:41 - =========================================================================================
2023-02-27 20:35:41 - | End of training | test loss 4.96 | test ppl 142.94
2023-02-27 20:35:41 - =========================================================================================
...
For more details, see: