Large Language Model¶
Oracle ADS (Accelerated Data Science) opens the gateway to harnessing the full potential of large language models (LLMs) within Oracle Cloud Infrastructure (OCI). Meta’s latest offering, Llama 2, introduces a collection of pre-trained and fine-tuned generative text models, ranging from 7 to 70 billion parameters. These models represent a significant leap forward, being trained on 40% more tokens than their predecessors and boasting an extended context length of 4,000 tokens.
Throughout this documentation, we showcase two essential inference frameworks:
Text Generation Inference (TGI). A purpose-built solution for deploying and serving LLMs from Hugging Face, which we extend to meet the interface requirements of model deployment resources.
vLLM. An open-source, high-throughput, and memory-efficient inference and serving engine for LLMs from UC Berkeley.
While our primary focus is on the Llama 2 family, the methodology presented here can be applied to other LLMs as well.
Sample Code
For your convenience, we provide sample code and a complete walkthrough, available in the Oracle GitHub samples repository.
Prerequisites
Using the Llama 2 model requires user agreement acceptance on Meta’s website. Downloading the model from Hugging Face necessitates an account and agreement to the service terms. Ensure that the model’s license permits usage for your intended purposes.
Recommended Hardware
We recommend specific OCI shapes based on Nvidia A10 GPUs for deploying models. These shapes cater to both the 7-billion and 13-billion parameter models, with the latter utilizing quantization techniques to optimize GPU memory usage. OCI offers a variety of GPU options to suit your needs.
Deployment Approaches
You can use the following methods to deploy an LLM with OCI Data Science:
Online Method. This approach involves downloading the LLM directly from the hosting repository into the Data Science Model Deployment. It minimizes data copying, making it suitable for large models. However, it lacks governance and may not be ideal for production environments or fine-tuning scenarios.
Offline Method. In this method, you download the LLM model from the host repository and save it in the Data Science Model Catalog. Deployment then occurs directly from the catalog, allowing for better control and governance of the model.
Inference Container
We explore two inference options: Hugging Face’s Text Generation Inference (TGI) and vLLM from UC Berkeley. These containers are crucial for effective model deployment and are optimized to align with OCI Data Science model deployment requirements. You can find both the TGI and vLLM Docker files in our samples repository.
Creating the Model Deployment
The final step involves deploying the model and the inference container by creating a model deployment. Once deployed, the model is accessible via a predict URL, allowing HTTP-based model invocation.
Testing the Model
To validate your deployed model, a Gradio Chat app can be configured to use the predict URL. This app provides parameters such as max_tokens, temperature, and top_p for fine-tuning model responses. Check our blog to learn more about this.
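If you want to wire this up yourself, below is a minimal sketch of such a chat app: it posts the user message to the model deployment predict URL and returns the generated text. The endpoint URL is a placeholder, the payload follows the TGI-style inputs/parameters format shown later on this page, and the parameter values are illustrative defaults:

# Minimal sketch of a Gradio chat app calling the model deployment predict URL.
import gradio as gr
import oci
import requests

PREDICT_URL = "<your_model_deployment_predict_url>"  # placeholder

# Resource principal works inside OCI (e.g., a notebook session); use a
# config-file based signer when running locally.
signer = oci.auth.signers.get_resource_principals_signer()

def chat(message, history):
    # history is unused in this minimal sketch
    payload = {
        "inputs": message,
        "parameters": {"max_new_tokens": 200, "temperature": 0.7, "top_p": 0.9},
    }
    response = requests.post(PREDICT_URL, json=payload, auth=signer)
    response.raise_for_status()
    result = response.json()
    # TGI-style responses contain "generated_text"; adjust for other containers.
    return result.get("generated_text", str(result))

gr.ChatInterface(chat).launch()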
Train Model¶
Check Training Large Language Model to see how to train your large language model using Oracle Cloud Infrastructure (OCI) Data Science Jobs (Jobs).
Register Model¶
Once you’ve trained your LLM, we guide you through the process of registering it within OCI, enabling seamless access and management.
Zip all items of the folder using a zip/tar utility, preferably using the command below to avoid creating another folder hierarchy inside the zipped file.
zip my_large_model.zip * -0
Upload the zipped artifact to an Object Storage bucket in your tenancy. Tools like rclone can help speed up this upload. See here for using rclone with OCI.
Example of using oci-cli:
oci os object put -ns <namespace> -bn <bucket> --name <prefix>/my_large_model.zip --file my_large_model.zip
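If you prefer to stay in Python, a multipart upload with the OCI SDK's UploadManager is another option. This is a minimal sketch assuming config-file authentication and the same placeholder namespace, bucket, and prefix as above:

# Sketch: upload the zipped artifact with the OCI Python SDK's UploadManager,
# which splits large files into parallel multipart uploads.
import oci

config = oci.config.from_file()  # or use a resource principal signer
client = oci.object_storage.ObjectStorageClient(config)
upload_manager = oci.object_storage.UploadManager(client)

upload_manager.upload_file(
    namespace_name="<namespace>",
    bucket_name="<bucket>",
    object_name="<prefix>/my_large_model.zip",
    file_path="my_large_model.zip",
)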
The next step is to create a model catalog item. Use DataScienceModel to register the large model to the Model Catalog.
import ads
from ads.model import DataScienceModel
ads.set_auth("resource_principal")
MODEL_DISPLAY_NAME = "My Large Model"
ARTIFACT_PATH = "oci://<bucket>@<namespace>/<prefix>/my_large_model.zip"
model = (DataScienceModel()
.with_display_name(MODEL_DISPLAY_NAME)
.with_artifact(ARTIFACT_PATH)
.create(
remove_existing_artifact=False
))
model_id = model.id
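As an optional sanity check, you can reload the registered model from the Model Catalog by its OCID; this assumes the model_id variable from the snippet above:

# Optional check: fetch the registered model back from the Model Catalog.
from ads.model import DataScienceModel

registered_model = DataScienceModel.from_id(model_id)
print(registered_model)  # shows the model metadata, including the artifact location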
Deploy Model¶
The final step involves deploying your registered LLM for real-world applications. We walk you through deploying it in a custom container (Bring Your Own Container) within the OCI Data Science Service, leveraging advanced technologies for optimal performance.
You can define the model deployment with the ADS Python APIs or YAML. In the examples below, you will need to supply the OCIDs of the resources required for the deployment, such as the project ID, compartment ID, etc. Every configuration value containing <UNIQUE_ID> should be replaced with the corresponding ID from your tenancy, i.e., the resources created in the previous steps.
Online Deployment¶
Prerequisites
Check the GitHub Sample repository to see how to complete the prerequisites before the actual deployment.
Zip your Hugging Face user access token and register it in the Model Catalog by following the instructions in Register Model on this page (a minimal sketch follows this list).
Create logging in the OCI Logging Service for the model deployment (if you have already created it, you can skip this step).
Create a subnet in a Virtual Cloud Network for the model deployment.
Build the container and push it to the Oracle Cloud Container Registry.
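For the first item, the following is a minimal sketch of packaging the token; the file names and display name are illustrative assumptions. The resulting model OCID is what the deployment's model URI points to, so the container finds the token at /opt/ds/model/deployed_model/token:

# Sketch: package a Hugging Face access token as a Model Catalog artifact so the
# deployment mounts it at /opt/ds/model/deployed_model/token.
import zipfile

import ads
from ads.model import DataScienceModel

ads.set_auth("resource_principal")

# Write the token to a file named "token" (assumed name).
with open("token", "w") as f:
    f.write("<your_huggingface_user_access_token>")

with zipfile.ZipFile("token.zip", "w") as zf:
    zf.write("token")

token_model = (
    DataScienceModel()
    .with_display_name("Hugging Face token")
    .with_artifact("token.zip")
    .create()
)
print(token_model.id)  # use this OCID as the model URI of the deployment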
You can now use Bring Your Own Container Deployment in OCI Data Science to deploy the Llama 2 model.
Set custom environment variables:
# vLLM container: download and serve the 7B chat model from Hugging Face
env_var = {
    "TOKEN_FILE": "/opt/ds/model/deployed_model/token",
    "PARAMS": "--model meta-llama/Llama-2-7b-chat-hf",
}
# TGI container: serve the 7B chat model from Hugging Face
env_var = {
    "TOKEN_FILE": "/opt/ds/model/deployed_model/token",
    "PARAMS": "--model-id meta-llama/Llama-2-7b-chat-hf --max-batch-prefill-tokens 1024",
}
# TGI container: serve the 13B chat model with bitsandbytes quantization
env_var = {
    "TOKEN_FILE": "/opt/ds/model/deployed_model/token",
    "PARAMS": "--model meta-llama/Llama-2-13b-chat-hf --max-batch-prefill-tokens 1024 --quantize bitsandbytes --max-batch-total-tokens 4096"
}
You can override more vLLM/TGI bootstrapping configuration using the PARAMS environment variable. For details of the configurations, please refer to the official vLLM doc and TGI doc.
from ads.model.deployment import ModelDeployment, ModelDeploymentInfrastructure, ModelDeploymentContainerRuntime
# configure model deployment infrastructure
infrastructure = (
ModelDeploymentInfrastructure()
.with_project_id("ocid1.datascienceproject.oc1.<UNIQUE_ID>")
.with_compartment_id("ocid1.compartment.oc1..<UNIQUE_ID>")
.with_shape_name("VM.GPU.A10.2")
.with_bandwidth_mbps(10)
.with_web_concurrency(10)
.with_access_log(
log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
log_id="ocid1.log.oc1.<UNIQUE_ID>"
)
.with_predict_log(
log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
log_id="ocid1.log.oc1.<UNIQUE_ID>"
)
.with_subnet_id("ocid1.subnet.oc1.<UNIQUE_ID>")
)
# configure model deployment runtime
container_runtime = (
ModelDeploymentContainerRuntime()
.with_image("iad.ocir.io/<namespace>/<image>:<tag>")
.with_server_port(5001)
.with_health_check_port(5001)
.with_env(env_var)
.with_deployment_mode("HTTPS_ONLY")
.with_model_uri("ocid1.datasciencemodel.oc1.<UNIQUE_ID>")
.with_region("us-ashburn-1")
.with_overwrite_existing_artifact(True)
.with_remove_existing_artifact(True)
.with_timeout(100)
)
# configure model deployment
deployment = (
ModelDeployment()
.with_display_name("Model Deployment Demo using ADS")
.with_description("The model deployment description")
.with_freeform_tags({"key1":"value1"})
.with_infrastructure(infrastructure)
.with_runtime(container_runtime)
)
kind: deployment
spec:
displayName: LLama2-7b model deployment - tgi
infrastructure:
kind: infrastructure
type: datascienceModelDeployment
spec:
compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
accessLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
predictLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
shapeName: VM.GPU.A10.2
replica: 1
bandWidthMbps: 10
webConcurrency: 10
subnetId: ocid1.subnet.oc1.<UNIQUE_ID>
runtime:
kind: runtime
type: container
spec:
modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
image: <UNIQUE_ID>
serverPort: 5001
healthCheckPort: 5001
env:
TOKEN: "/opt/ds/model/deployed_model/token"
PARAMS: "--model-id meta-llama/Llama-2-7b-chat-hf --max-batch-prefill-tokens 1024"
region: us-ashburn-1
overwriteExistingArtifact: True
removeExistingArtifact: True
timeout: 100
deploymentMode: HTTPS_ONLY
kind: deployment
spec:
displayName: LLama2-7b model deployment - vllm
infrastructure:
kind: infrastructure
type: datascienceModelDeployment
spec:
compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
accessLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
predictLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
shapeName: VM.GPU.A10.2
replica: 1
bandWidthMbps: 10
webConcurrency: 10
subnetId: ocid1.subnet.oc1.<UNIQUE_ID>
runtime:
kind: runtime
type: container
spec:
modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
image: <UNIQUE_ID>
serverPort: 5001
healthCheckPort: 5001
env:
PARAMS: "--model meta-llama/Llama-2-7b-chat-hf"
HUGGINGFACE_HUB_CACHE: "/home/datascience/.cache"
TOKEN_FILE: /opt/ds/model/deployed_model/token
STORAGE_SIZE_IN_GB: "950"
WEB_CONCURRENCY: 1
region: us-ashburn-1
overwriteExistingArtifact: True
removeExistingArtifact: True
timeout: 100
deploymentMode: HTTPS_ONLY
Offline Deployment¶
Check the GitHub Sample repository to see how to complete the prerequisites before the actual deployment.
Register the zipped artifact in the Model Catalog by following the instructions in Register Model on this page.
Create logging in the OCI Logging Service for the model deployment (if you have already created it, you can skip this step).
Build the container and push it to the Oracle Cloud Container Registry.
You can now use Bring Your Own Container Deployment in OCI Data Science to deploy the Llama 2 model.
Set custom environment variables:
# vLLM container: serve the model from the local artifact path
env_var = {
    "PARAMS": "--model /opt/ds/model/deployed_model",
}
# vLLM container: serve the model with tensor parallelism across 2 GPUs
env_var = {
    "PARAMS": "--model /opt/ds/model/deployed_model",
    "TENSOR_PARALLELISM": 2,
}
# TGI container: serve the model from the local artifact path
env_var = {
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/generate",
    "PARAMS": "--model /opt/ds/model/deployed_model --max-batch-prefill-tokens 1024"
}
# TGI container: serve the model with bitsandbytes quantization
env_var = {
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/generate",
    "PARAMS": "--model /opt/ds/model/deployed_model --max-batch-prefill-tokens 1024 --quantize bitsandbytes --max-batch-total-tokens 4096"
}
You can override more vLLM/TGI bootstrapping configuration using the PARAMS environment variable. For details of the configurations, please refer to the official vLLM doc and TGI doc.
Create the Model Deployment:
from ads.model.deployment import ModelDeployment, ModelDeploymentInfrastructure, ModelDeploymentContainerRuntime
# configure model deployment infrastructure
infrastructure = (
ModelDeploymentInfrastructure()
.with_project_id("ocid1.datascienceproject.oc1.<UNIQUE_ID>")
.with_compartment_id("ocid1.compartment.oc1..<UNIQUE_ID>")
.with_shape_name("VM.GPU3.2")
.with_bandwidth_mbps(10)
.with_web_concurrency(10)
.with_access_log(
log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
log_id="ocid1.log.oc1.<UNIQUE_ID>"
)
.with_predict_log(
log_group_id="ocid1.loggroup.oc1.<UNIQUE_ID>",
log_id="ocid1.log.oc1.<UNIQUE_ID>"
)
)
# configure model deployment runtime
container_runtime = (
ModelDeploymentContainerRuntime()
.with_image("iad.ocir.io/<namespace>/<image>:<tag>")
.with_server_port(5001)
.with_health_check_port(5001)
.with_env(env_var)
.with_deployment_mode("HTTPS_ONLY")
.with_model_uri("ocid1.datasciencemodel.oc1.<UNIQUE_ID>")
.with_region("us-ashburn-1")
.with_overwrite_existing_artifact(True)
.with_remove_existing_artifact(True)
.with_timeout(100)
)
# configure model deployment
deployment = (
ModelDeployment()
.with_display_name("Model Deployment Demo using ADS")
.with_description("The model deployment description.")
.with_freeform_tags({"key1":"value1"})
.with_infrastructure(infrastructure)
.with_runtime(container_runtime)
)
kind: deployment
spec:
displayName: LLama2-7b model deployment - tgi
infrastructure:
kind: infrastructure
type: datascienceModelDeployment
spec:
compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
accessLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
predictLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
shapeName: VM.GPU.A10.2
replica: 1
bandWidthMbps: 10
webConcurrency: 10
subnetId: ocid1.subnet.oc1.<UNIQUE_ID>
runtime:
kind: runtime
type: container
spec:
modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
image: <UNIQUE_ID>
serverPort: 5001
healthCheckPort: 5001
env:
MODEL_DEPLOY_PREDICT_ENDPOINT: "/generate"
PARAMS: "--model /opt/ds/model/deployed_model --max-batch-prefill-tokens 1024"
region: us-ashburn-1
overwriteExistingArtifact: True
removeExistingArtifact: True
timeout: 100
deploymentMode: HTTPS_ONLY
kind: deployment
spec:
displayName: LLama2-7b model deployment - vllm
infrastructure:
kind: infrastructure
type: datascienceModelDeployment
spec:
compartmentId: ocid1.compartment.oc1..<UNIQUE_ID>
projectId: ocid1.datascienceproject.oc1.<UNIQUE_ID>
accessLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
predictLog:
logGroupId: ocid1.loggroup.oc1.<UNIQUE_ID>
logId: ocid1.log.oc1.<UNIQUE_ID>
shapeName: VM.GPU.A10.2
replica: 1
bandWidthMbps: 10
webConcurrency: 10
runtime:
kind: runtime
type: container
spec:
modelUri: ocid1.datasciencemodel.oc1.<UNIQUE_ID>
image: <UNIQUE_ID>
serverPort: 5001
healthCheckPort: 5001
env:
PARAMS: "--model /opt/ds/model/deployed_model"
TENSOR_PARALLELISM: 2
region: us-ashburn-1
overwriteExistingArtifact: True
removeExistingArtifact: True
timeout: 100
deploymentMode: HTTPS_ONLY
You can deploy the model through an API call or the ADS CLI.
Make sure that you’ve also created and set up your API Auth Token to execute the commands below.
To create a model deployment:
# Deploy model on container runtime
deployment.deploy()
# Use the following command to deploy the model
ads opctl run -f ads-md-deploy-<framework>.yaml
Inference Model¶
Once the model is deployed and shown as Active, you can execute inference against it. You can run inference against the deployed model with oci-cli from your OCI Data Science Notebook session or your local environment.
Run inference against the deployed model:
# For TGI
data = {
"inputs": "Write a python program to randomly select item from a predefined list?",
"parameters": {
"max_new_tokens": 200
}
}
# For vLLM
data = {
    "prompt": "are you smart?",
    "use_beam_search": True,
    "n": 4,
    "temperature": 0
}
deployment.predict(data=data)
oci raw-request \
--http-method POST \
--target-uri "<TGI_model_endpoint>" \
--request-body '{
"inputs": "Write a python program to randomly select item from a predefined list?",
"parameters": {
"max_new_tokens": 200
}
}' \
--auth resource_principal
oci raw-request \
--http-method POST \
--target-uri "<vLLM_model_endpoint>" \
--request-body '{
"prompt": "are you smart?",
"use_beam_search": true,
"n": 4,
"temperature": 0
}' \
--auth resource_principal
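Alternatively, you can send the same request from Python with the OCI SDK request signer instead of oci raw-request. This sketch assumes a resource principal (for example, inside a notebook session) and the TGI-style payload; swap in the vLLM payload as needed:

# Sketch: call the deployed model endpoint from Python with an OCI request signer.
import oci
import requests

endpoint = "<TGI_model_endpoint>"  # the model deployment predict URI
signer = oci.auth.signers.get_resource_principals_signer()  # or a config-file signer

body = {
    "inputs": "Write a python program to randomly select item from a predefined list?",
    "parameters": {"max_new_tokens": 200},
}

response = requests.post(endpoint, json=body, auth=signer)
response.raise_for_status()
print(response.json())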