Training with OCI#

Oracle Cloud Infrastructure (OCI) Data Science Jobs (Jobs) enables you to define and run repeatable machine learning tasks on fully managed infrastructure. You can provision compute resources on demand and run applications that perform tasks such as data preparation, model training, hyperparameter tuning, and batch inference.

Here is an example of training an RNN for word-level language modeling, using the source code directly from GitHub.

The job can be defined in Python or, equivalently, in YAML.

Python:
from ads.jobs import Job, DataScienceJob, GitPythonRuntime

job = (
    Job(name="Training RNN with PyTorch")
    .with_infrastructure(
        DataScienceJob()
        # Configure logging for getting the job run outputs.
        .with_log_group_id("<log_group_ocid>")
        # Log resource will be auto-generated if log ID is not specified.
        .with_log_id("<log_ocid>")
        # If you are in an OCI data science notebook session,
        # the following configurations are not required.
        # Configurations from the notebook session will be used as defaults.
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.Standard.E3.Flex")
        # Shape config details are applicable only for the flexible shapes.
        .with_shape_config_details(memory_in_gbs=16, ocpus=1)
        # Minimum/Default block storage size is 50 (GB).
        .with_block_storage_size(50)
    )
    .with_runtime(
        GitPythonRuntime(skip_metadata_update=True)
        # Use service conda pack
        .with_service_conda("pytorch110_p38_gpu_v1")
        # Specify training source code from GitHub
        .with_source(url="https://github.com/pytorch/examples.git", branch="main")
        # Entrypoint is a relative path from the root of the Git repository
        .with_entrypoint("word_language_model/main.py")
        # Pass the arguments as: "--epochs 5 --save model.pt --cuda"
        .with_argument(epochs=5, save="model.pt", cuda=None)
        # Set working directory, which will also be added to PYTHONPATH
        .with_working_dir("word_language_model")
        # Save the output to OCI object storage
        # output_dir is relative to working directory
        .with_output(output_dir=".", output_uri="oci://bucket@namespace/prefix")
    )
)

YAML:

kind: job
spec:
  name: "Training RNN with PyTorch"
  infrastructure:
    kind: infrastructure
    type: dataScienceJob
    spec:
      blockStorageSize: 50
      compartmentId: <compartment_ocid>
      jobInfrastructureType: STANDALONE
      logGroupId: <log_group_ocid>
      logId: <log_ocid>
      projectId: <project_ocid>
      shapeConfigDetails:
        memoryInGBs: 16
        ocpus: 1
      shapeName: VM.Standard.E3.Flex
      subnetId: <subnet_ocid>
  runtime:
    kind: runtime
    type: gitPython
    spec:
      args:
      - --epochs
      - '5'
      - --save
      - model.pt
      - --cuda
      branch: main
      conda:
        slug: pytorch110_p38_gpu_v1
        type: service
      entrypoint: word_language_model/main.py
      outputDir: .
      outputUri: oci://bucket@namespace/prefix
      skipMetadataUpdate: true
      url: https://github.com/pytorch/examples.git
      workingDir: word_language_model
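
If you save the YAML definition to a file, it can be loaded back into a Job object. A minimal sketch, assuming the YAML above is stored as job.yaml (a hypothetical path):

from ads.jobs import Job

# Load the job definition from a local YAML file (the path is an assumption).
job = Job.from_yaml(uri="job.yaml")

Whichever way the job is defined, create and run it, then stream the logs: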
# Create the job on OCI Data Science
job.create()
# Start a job run
run = job.run()
# Stream the job run outputs
run.watch()
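
run.watch() blocks while it streams logs. As a non-blocking alternative, the run object exposes its lifecycle state, and a job created earlier can be reloaded from its OCID. A minimal sketch, assuming the status property and the Job.from_datascience_job() loader in ads.jobs:

# Check the run state without blocking (e.g. ACCEPTED, IN_PROGRESS, SUCCEEDED).
print(run.status)

# Reload a previously created job by its OCID (placeholder value).
job = Job.from_datascience_job("<job_ocid>")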

The job run will:

  • Set up the PyTorch conda environment

  • Fetch the source code from GitHub

  • Run the training script with the specified arguments

  • Save the outputs to OCI object storage (a sketch for retrieving them follows the example output below)

The following is example output from the job run:

2023-02-27 20:26:36 - Job Run ACCEPTED
2023-02-27 20:27:05 - Job Run ACCEPTED, Infrastructure provisioning.
2023-02-27 20:28:27 - Job Run ACCEPTED, Infrastructure provisioned.
2023-02-27 20:28:53 - Job Run ACCEPTED, Job run bootstrap starting.
2023-02-27 20:33:05 - Job Run ACCEPTED, Job run bootstrap complete. Artifact execution starting.
2023-02-27 20:33:08 - Job Run IN_PROGRESS, Job run artifact execution in progress.
2023-02-27 20:33:31 - | epoch   1 |   200/ 2983 batches | lr 20.00 | ms/batch  8.41 | loss  7.63 | ppl  2064.78
2023-02-27 20:33:32 - | epoch   1 |   400/ 2983 batches | lr 20.00 | ms/batch  8.23 | loss  6.86 | ppl   949.18
2023-02-27 20:33:34 - | epoch   1 |   600/ 2983 batches | lr 20.00 | ms/batch  8.21 | loss  6.47 | ppl   643.12
2023-02-27 20:33:36 - | epoch   1 |   800/ 2983 batches | lr 20.00 | ms/batch  8.22 | loss  6.29 | ppl   537.11
2023-02-27 20:33:37 - | epoch   1 |  1000/ 2983 batches | lr 20.00 | ms/batch  8.22 | loss  6.14 | ppl   462.61
2023-02-27 20:33:39 - | epoch   1 |  1200/ 2983 batches | lr 20.00 | ms/batch  8.21 | loss  6.05 | ppl   425.85
...
2023-02-27 20:35:41 - =========================================================================================
2023-02-27 20:35:41 - | End of training | test loss  4.96 | test ppl   142.94
2023-02-27 20:35:41 - =========================================================================================
...
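
Once the run succeeds, you can pull the saved artifacts back from Object Storage. A minimal sketch using ocifs (Oracle's fsspec implementation for OCI Object Storage), assuming it is installed and reusing the placeholder bucket, namespace, and prefix from the job definition:

import ocifs

# Authenticate with the default OCI config file;
# inside OCI, resource principals can be used instead.
fs = ocifs.OCIFileSystem()

# List the files the job uploaded, then download the trained model.
print(fs.ls("oci://bucket@namespace/prefix"))
fs.get("oci://bucket@namespace/prefix/model.pt", "model.pt")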

For more details, see the OCI Data Science Jobs documentation.