Large Language Model

Oracle ADS (Accelerated Data Science) opens the gateway to harnessing the full potential of large language models (LLMs) within Oracle Cloud Infrastructure (OCI). Meta’s latest offering, Llama 2, introduces a collection of pre-trained and fine-tuned generative text models, ranging from 7 to 70 billion parameters. These models represent a significant leap forward, being trained on 40% more tokens and boasting an extended context length of 4,000 tokens.

Throughout this documentation, we showcase two essential inference frameworks:

  • Text Generation Inference (TGI). A purpose-built solution for deploying and serving LLMs from Hugging Face, which we extend to meet the interface requirements of model deployment resources.

  • vLLM. An open-source, high-throughput, and memory-efficient inference and serving engine for LLMs from UC Berkeley.

While our primary focus is on the Llama 2 family, the methodology presented here can be applied to other LLMs as well.

Sample Code

For your convenience, we provide sample code and a complete walkthrough, available in the Oracle GitHub samples repository.

Prerequisites

Using the Llama 2 model requires user agreement acceptance on Meta’s website. Downloading the model from Hugging Face necessitates an account and agreement to the service terms. Ensure that the model’s license permits usage for your intended purposes.

Recommended Hardware

We recommend specific OCI shapes based on Nvidia A10 GPUs for deploying models. These shapes cater to both the 7-billion and 13-billion parameter models, with the latter utilizing quantization techniques to optimize GPU memory usage. OCI offers a variety of GPU options to suit your needs.

Deployment Approaches

You can use the following methods to deploy an LLM with OCI Data Science:

  • Online Method. This approach involves downloading the LLM directly from the hosting repository into the Data Science Model Deployment. It minimizes data copying, making it suitable for large models. However, it lacks governance and may not be ideal for production environments or fine-tuning scenarios.

  • Offline Method. In this method, you download the LLM from the hosting repository and save it to the Data Science Model Catalog. Deployment then occurs directly from the catalog, allowing for better control and governance of the model. A sketch of the download step follows this list.
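
For the offline method, a minimal sketch of the download step using the huggingface_hub library is shown below (the local directory and token placeholder are assumptions; adjust them to your environment):

# Minimal sketch: download the Llama 2 weights locally so they can be zipped
# and registered in the Model Catalog (offline method).
# Assumptions: you have accepted the Llama 2 license and hold a Hugging Face
# access token authorized for the gated repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="my_large_model",           # folder you will later zip and upload
    token="<your_huggingface_token>",     # placeholder
)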

Inference Container

We explore two inference options: Hugging Face’s Text Generation Inference (TGI) and vLLM from UC Berkeley. These containers are crucial for effective model deployment and are optimized to align with OCI Data Science model deployment requirements. You can find both the TGI and vLLM Docker files in our samples repository.

Creating the Model Deployment

The final step involves deploying the model and the inference container by creating a model deployment. Once deployed, the model is accessible via a predict URL, allowing HTTP-based model invocation.
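
Below is a minimal sketch of what HTTP-based invocation of the predict URL looks like (the URL placeholder, the TGI-style payload, and API-key authentication via ~/.oci/config are assumptions; adjust them to your deployment and container):

# Minimal sketch: call the model deployment predict URL over HTTPS.
import oci
import requests

# Placeholder: copy the predict URL from the model deployment details page.
predict_url = "<your_model_deployment_predict_url>"

# Sign the request with the default API-key profile from ~/.oci/config.
config = oci.config.from_file()
signer = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

# Assumption: TGI-style request body; vLLM containers expect a "prompt" field instead.
payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 50}}
response = requests.post(predict_url, json=payload, auth=signer)
print(response.json())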

Testing the Model

To validate your deployed model, a Gradio Chat app can be configured to use the predict URL. This app provides parameters such as max_tokens, temperature, and top_p for fine-tuning model responses. Check our blog to learn more about this.
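
A minimal sketch of such a Gradio chat app is shown below (this is not the blog’s exact code; the predict URL placeholder, the TGI-style payload, and the "generated_text" response field are assumptions):

# Minimal sketch: a Gradio chat UI that forwards prompts to the predict URL.
import gradio as gr
import oci
import requests

predict_url = "<your_model_deployment_predict_url>"  # placeholder

# Assumption: running inside an OCI notebook session; otherwise build an
# API-key signer from ~/.oci/config instead.
signer = oci.auth.signers.get_resource_principals_signer()

def chat(message, history, max_tokens, temperature, top_p):
    body = {
        "inputs": message,
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
    response = requests.post(predict_url, json=body, auth=signer)
    # Assumption: TGI-style response containing a "generated_text" field.
    return response.json()["generated_text"]

demo = gr.ChatInterface(
    fn=chat,
    additional_inputs=[
        gr.Slider(32, 1024, value=200, label="max_tokens"),
        gr.Slider(0.0, 2.0, value=0.7, label="temperature"),
        gr.Slider(0.0, 1.0, value=0.9, label="top_p"),
    ],
)
demo.launch()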

Train Model

Check Training Large Language Model to see how to train your large language model using Oracle Cloud Infrastructure (OCI) Data Science Jobs (Jobs).

Register Model

Once you’ve trained your LLM, we guide you through the process of registering it within OCI, enabling seamless access and management.

Zip all items in the model folder using the zip/tar utility, preferably with the command below, to avoid creating an extra folder hierarchy inside the zipped file.

zip my_large_model.zip * -0

Upload the zipped artifact to an Object Storage bucket in your tenancy. Tools like rclone can help speed up this upload. Refer here for instructions on using rclone with OCI.

Example of using oci-cli:

oci os object put -ns <namespace> -bn <bucket> --name <prefix>/my_large_model.zip --file my_large_model.zip

The next step is to create a model catalog item. Use DataScienceModel to register the large model in the Model Catalog.

import ads
from ads.model import DataScienceModel

# Authenticate using the notebook session's resource principal
ads.set_auth("resource_principal")

MODEL_DISPLAY_NAME = "My Large Model"
# Object Storage path of the zipped artifact uploaded in the previous step
ARTIFACT_PATH = "oci://<bucket>@<namespace>/<prefix>/my_large_model.zip"

# Register the zipped artifact in the Model Catalog
model = (DataScienceModel()
        .with_display_name(MODEL_DISPLAY_NAME)
        .with_artifact(ARTIFACT_PATH)
        .create(
            remove_existing_artifact=False
        ))
model_id = model.id

Deploy Model

The final step involves deploying your registered LLM for real-world applications. We walk you through deploying it in a custom container (Bring Your Own Container) within the OCI Data Science Service, leveraging advanced technologies for optimal performance.

You can define the model deployment with the ADS Python API or YAML. In the examples below, you will need to fill in the OCIDs of the resources required for the deployment, such as the project ID and compartment ID. Every configuration value containing <UNIQUE_ID> should be replaced with the corresponding ID from your tenancy, that is, the resources created in the previous steps.

Online Deployment

Prerequisites

Check the GitHub sample repository to see how to complete the prerequisites before the actual deployment.

  • Zips your Hugging Face user access token and registers it into the Model Catalog by following the instructions in the Register Model section on this page.

  • Creates logging in the OCI Logging Service for the model deployment (if you have already created one, you can skip this step).

  • Creates a subnet in Virtual Cloud Network for the model deployment.

  • Executes container build and push process to Oracle Cloud Container Registry.

  • You can now use the Bring Your Own Container Deployment in OCI Data Science to deploy the Llama2 model.

Set custom environment variables:

  • 7b llama2 - vllm
  • 7b llama2 - TGI
  • 13b llama2 - TGI
# 7b llama2 - vLLM
env_var = {
    "TOKEN_FILE": "/opt/ds/model/deployed_model/token",
    "PARAMS": "--model meta-llama/Llama-2-7b-chat-hf",
}

# 7b llama2 - TGI
env_var = {
    "TOKEN_FILE": "/opt/ds/model/deployed_model/token",
    "PARAMS": "--model-id meta-llama/Llama-2-7b-chat-hf --max-batch-prefill-tokens 1024",
}

# 13b llama2 - TGI
env_var = {
    "TOKEN_FILE": "/opt/ds/model/deployed_model/token",
    "PARAMS": "--model meta-llama/Llama-2-13b-chat-hf --max-batch-prefill-tokens 1024 --quantize bitsandbytes --max-batch-total-tokens 4096",
}

You can override additional vLLM/TGI bootstrapping configuration using the PARAMS environment variable. For configuration details, refer to the official vLLM and TGI documentation.
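
For example, a hedged sketch of extending PARAMS with extra vLLM engine flags (--dtype and --tensor-parallel-size are standard vLLM server arguments, but verify them against the vLLM version in your container):

# Example only: pass additional vLLM engine flags through PARAMS.
# Assumption: the container forwards PARAMS verbatim to the vLLM launcher.
env_var = {
    "TOKEN_FILE": "/opt/ds/model/deployed_model/token",
    "PARAMS": "--model meta-llama/Llama-2-7b-chat-hf --dtype half --tensor-parallel-size 2",
}

Create the model deployment: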

  • Python
  • TGI-YAML
  • vllm-YAML
from ads.model.deployment import ModelDeployment, ModelDeploymentInfrastructure, ModelDeploymentContainerRuntime

# configure model deployment infrastructure
infrastructure = (
    ModelDeploymentInfrastructure()
    .with_project_id("ocid1.datascienceproject.oc1.<UNIQUE_ID>")
    .with_compartment_id("ocid1.compartment.oc1..<UNIQUE_ID>")
    .with_shape_name("VM.GPU.A10.2")
    .with_bandwidth_mbps(10)
    .with_web_concurrency(10)
    .with_access_log(
        log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
        log_id="ocid1.log.oc1.<UNIQUE_ID>"
    )
    .with_predict_log(
        log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
        log_id="ocid1.log.oc1.<UNIQUE_ID>"
    )
    .with_subnet_id("ocid1.subnet.oc1.<UNIQUE_ID>")
)

# configure model deployment runtime
container_runtime = (
    ModelDeploymentContainerRuntime()
    .with_image("iad.ocir.io/<namespace>/<image>:<tag>")
    .with_server_port(5001)
    .with_health_check_port(5001)
    .with_env(env_var)
    .with_deployment_mode("HTTPS_ONLY")
    .with_model_uri("ocid1.datasciencemodel.oc1.<UNIQUE_ID>")
    .with_region("us-ashburn-1")
    .with_overwrite_existing_artifact(True)
    .with_remove_existing_artifact(True)
    .with_timeout(100)
)

# configure model deployment
deployment = (
    ModelDeployment()
    .with_display_name("Model Deployment Demo using ADS")
    .with_description("The model deployment description")
    .with_freeform_tags({"key1":"value1"})
    .with_infrastructure(infrastructure)
    .with_runtime(container_runtime)
)
kind: deployment
spec:
  displayName: LLama2-7b model deployment - tgi
  infrastructure:
    kind: infrastructure
    type: datascienceModelDeployment
    spec:
      compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
      projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
      accessLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      predictLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      shapeName: VM.GPU.A10.2
      replica: 1
      bandWidthMbps: 10
      webConcurrency: 10
      subnetId: ocid1.subnet.oc1.<UNIQUE_ID>
  runtime:
    kind: runtime
    type: container
    spec:
      modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
      image: <UNIQUE_ID>
      serverPort: 5001
      healthCheckPort: 5001
      env:
        TOKEN_FILE: "/opt/ds/model/deployed_model/token"
        PARAMS: "--model-id meta-llama/Llama-2-7b-chat-hf --max-batch-prefill-tokens 1024"
      region: us-ashburn-1
      overwriteExistingArtifact: True
      removeExistingArtifact: True
      timeout: 100
      deploymentMode: HTTPS_ONLY
kind: deployment
spec:
  displayName: LLama2-7b model deployment - vllm
  infrastructure:
    kind: infrastructure
    type: datascienceModelDeployment
    spec:
      compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
      projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
      accessLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      predictLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      shapeName: VM.GPU.A10.2
      replica: 1
      bandWidthMbps: 10
      webConcurrency: 10
      subnetId: ocid1.subnet.oc1.<UNIQUE_ID>
  runtime:
    kind: runtime
    type: container
    spec:
      modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
      image: <UNIQUE_ID>
      serverPort: 5001
      healthCheckPort: 5001
      env:
        PARAMS: "--model meta-llama/Llama-2-7b-chat-hf"
        HUGGINGFACE_HUB_CACHE: "/home/datascience/.cache"
        TOKEN_FILE: /opt/ds/model/deployed_model/token
        STORAGE_SIZE_IN_GB: "950"
        WEB_CONCURRENCY:  1
      region: us-ashburn-1
      overwriteExistingArtifact: True
      removeExistingArtifact: True
      timeout: 100
      deploymentMode: HTTPS_ONLY

Offline Deployment

Check the GitHub sample repository to see how to complete the prerequisites before the actual deployment.

  • Registers the zipped artifact into the Model Catalog by following the instructions in the Register Model section on this page.

  • Creates logging in the OCI Logging Service for the model deployment (if you have already created one, you can skip this step).

  • Executes container build and push process to Oracle Cloud Container Registry.

  • You can now use the Bring Your Own Container Deployment in OCI Data Science to deploy the Llama2 model.

Set custom environment variables:

  • 7b llama2 - vllm
  • 13b llama2 - vllm
  • 7b llama2 - TGI
  • 13b llama2 - TGI
# 7b llama2 - vLLM
env_var = {
    "PARAMS": "--model /opt/ds/model/deployed_model",
}

# 13b llama2 - vLLM
env_var = {
    "PARAMS": "--model /opt/ds/model/deployed_model",
    "TENSOR_PARALLELISM": 2,
}

# 7b llama2 - TGI
env_var = {
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/generate",
    "PARAMS": "--model /opt/ds/model/deployed_model --max-batch-prefill-tokens 1024",
}

# 13b llama2 - TGI
env_var = {
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/generate",
    "PARAMS": "--model /opt/ds/model/deployed_model --max-batch-prefill-tokens 1024 --quantize bitsandbytes --max-batch-total-tokens 4096",
}

You can override additional vLLM/TGI bootstrapping configuration using the PARAMS environment variable. For configuration details, refer to the official vLLM and TGI documentation.

Create the model deployment:

  • Python
  • TGI-YAML
  • vllm-YAML
from ads.model.deployment import ModelDeployment, ModelDeploymentInfrastructure, ModelDeploymentContainerRuntime

# configure model deployment infrastructure
infrastructure = (
    ModelDeploymentInfrastructure()
    .with_project_id("ocid1.datascienceproject.oc1.<UNIQUE_ID>")
    .with_compartment_id("ocid1.compartment.oc1..<UNIQUE_ID>")
    .with_shape_name("VM.GPU3.2")
    .with_bandwidth_mbps(10)
    .with_web_concurrency(10)
    .with_access_log(
        log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
        log_id="ocid1.log.oc1.<UNIQUE_ID>"
    )
    .with_predict_log(
        log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
        log_id="ocid1.log.oc1.<UNIQUE_ID>"
    )
)

# configure model deployment runtime
container_runtime = (
    ModelDeploymentContainerRuntime()
    .with_image("iad.ocir.io/<namespace>/<image>:<tag>")
    .with_server_port(5001)
    .with_health_check_port(5001)
    .with_env(env_var)
    .with_deployment_mode("HTTPS_ONLY")
    .with_model_uri("ocid1.datasciencemodel.oc1.<UNIQUE_ID>")
    .with_region("us-ashburn-1")
    .with_overwrite_existing_artifact(True)
    .with_remove_existing_artifact(True)
    .with_timeout(100)
)

# configure model deployment
deployment = (
    ModelDeployment()
    .with_display_name("Model Deployment Demo using ADS")
    .with_description("The model deployment description.")
    .with_freeform_tags({"key1":"value1"})
    .with_infrastructure(infrastructure)
    .with_runtime(container_runtime)
)
kind: deployment
spec:
  displayName: LLama2-7b model deployment - tgi
  infrastructure:
    kind: infrastructure
    type: datascienceModelDeployment
    spec:
      compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
      projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
      accessLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      predictLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      shapeName: VM.GPU.A10.2
      replica: 1
      bandWidthMbps: 10
      webConcurrency: 10
      subnetId: ocid1.subnet.oc1.<UNIQUE_ID>
  runtime:
    kind: runtime
    type: container
    spec:
      modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
      image: <UNIQUE_ID>
      serverPort: 5001
      healthCheckPort: 5001
      env:
        MODEL_DEPLOY_PREDICT_ENDPOINT: "/generate"
        PARAMS: "--model /opt/ds/model/deployed_model --max-batch-prefill-tokens 1024"
      region: us-ashburn-1
      overwriteExistingArtifact: True
      removeExistingArtifact: True
      timeout: 100
      deploymentMode: HTTPS_ONLY
kind: deployment
spec:
  displayName: LLama2-7b model deployment - vllm
  infrastructure:
    kind: infrastructure
    type: datascienceModelDeployment
    spec:
      compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
      projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
      accessLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      predictLog:
        logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
        logId: ocid1.log.oc1.<UNIQUE_ID>
      shapeName: VM.GPU.A10.2
      replica: 1
      bandWidthMbps: 10
      webConcurrency: 10
  runtime:
    kind: runtime
    type: container
    spec:
      modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
      image: <UNIQUE_ID>
      serverPort: 5001
      healthCheckPort: 5001
      env:
        PARAMS: "--model /opt/ds/model/deployed_model"
        TENSOR_PARALLELISM: 2
      region: us-ashburn-1
      overwriteExistingArtifact: True
      removeExistingArtifact: True
      timeout: 100
      deploymentMode: HTTPS_ONLY

You can deploy the model through an API call or the ADS CLI.

Make sure that you’ve also created and set up your API Auth Token to execute the commands below.

To create a model deployment:

  • Python
  • YAML
# Deploy model on container runtime
deployment.deploy()
# Use the following command to deploy model
ads opctl run -f ads-md-deploy-<framework>.yaml

Inference Model

Once the model is deployed and shown as Active, you can execute inference against it. You can run inference against the deployed model with the oci-cli from your OCI Data Science Notebook session or your local environment.

Run inference against the deployed model:

  • Python
  • TGI Inference by OCI CLI
  • vLLM Inference by OCI CLI
# For TGI
data = {
    "inputs": "Write a python program to randomly select item from a predefined list?",
    "parameters": {
      "max_new_tokens": 200
    }
  }

# For vLLM
data = {
    "prompt": "are you smart?",
    "use_beam_search": true,
    "n": 4,
    "temperature": 0
  }

deployment.predict(data=data)
oci raw-request \
    --http-method POST \
    --target-uri "<TGI_model_endpoint>" \
    --request-body '{
        "inputs": "Write a python program to randomly select item from a predefined list?",
        "parameters": {
        "max_new_tokens": 200
        }
    }' \
    --auth resource_principal
oci raw-request \
  --http-method POST \
  --target-uri "<vLLM_model_endpoint>" \
  --request-body '{
    "prompt": "are you smart?",
    "use_beam_search": true,
    "n": 4,
    "temperature": 0
  }' \
  --auth resource_principal