Model Schema¶
The data schema provides a definition of the format and nature of the data that the model expects. It also defines the format of the data that the model inference returns. The .populate_schema() method accepts the parameters data_sample, X_sample, and y_sample. When these parameters are provided, the model artifact populates the input and output data schemas.
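For example, for a model artifact object named model_artifact (a hypothetical name; the complete workflow is shown under Generating Schema below), both schemas can be populated from sample data:

# A sketch: populate the input and output schemas from training samples
model_artifact.populate_schema(X_sample=X_train, y_sample=y_train)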
The .schema_input and .schema_output properties are Schema objects that define the schema of each input column and the output. The Schema object contains these fields:

- description: Description of the data in the column.
- domain: A data structure that defines the domain of the data, including the restrictions on the data and summary statistics of its distribution.
  - constraints: A data structure that is a list of expression objects that defines the constraints of the data.
    - expression: A string representation of an expression that can be evaluated by the language corresponding to the value provided in the language attribute. The default value for language is python.
      - expression: Required. Use the string.Template format for specifying the expression. $x is used to represent the variable.
      - language: The default value is python. Only python is supported.
  - stats: A set of summary statistics that defines the distribution of the data. These are determined using the feature type statistics as defined in ADS.
  - values: A description of the values of the data.
- dtype: Pandas data type.
- feature_type: The primary feature type as defined by ADS.
- name: Name of the column.
- required: Boolean value indicating if a value is always required.

The following example shows the schema definition for a single column:
- description: Number of matching socks in your dresser drawer.
domain:
constraints:
- expression: ($x <= 10) and ($x > 0)
language: python
- expression: $x in [2, 4, 6, 8, 10]
language: python
stats:
count: 465.0
lower_quartile: 3.2
mean: 6.3
median: 7.0
sample_maximum: 10.0
sample_minimum: 2.0
standard_deviation: 2.5
upper_quartile: 8.2
values: Natural even numbers that are less than or equal to 10.
dtype: int64
feature_type: EvenNatural10
name: sock_count
required: true
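Once populated, individual columns can be looked up by name on the Schema object and their fields read as attributes (a sketch, assuming a Schema object named schema that contains the sock_count column above):

schema["sock_count"].description                        # 'Number of matching socks in your dresser drawer.'
schema["sock_count"].dtype                              # 'int64'
schema["sock_count"].domain.constraints[0].expression   # '($x <= 10) and ($x > 0)'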
Schema Model¶
{
"description": {
"nullable": true,
"required": false,
"type": "string"
},
"domain": {
"nullable": true,
"required": false,
"schema": {
"constraints": {
"nullable": true,
"required": false,
"type": "list"
},
"stats": {
"nullable": true,
"required": false,
"type": "dict"
},
"values": {
"nullable": true,
"required": false,
"type": "string"
}
},
"type": "dict"
},
"dtype": {
"nullable": false,
"required": true,
"type": "string"
},
"feature_type": {
"nullable": true,
"required": false,
"type": "string"
},
"name": {
"nullable": false,
"required": true,
"type": [
"string",
"number"
]
},
"order": {
"nullable": true,
"required": false,
"type": "integer"
},
"required": {
"nullable": false,
"required": true,
"type": "boolean"
}
}
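For illustration, a Python dictionary describing a single attribute that conforms to this model might look like the following sketch, which reuses the sock_count values from the example above (the order value, the optional column position, is assumed here):

attribute = {
    "name": "sock_count",             # required; string or number
    "dtype": "int64",                 # required; pandas data type
    "required": True,                 # required; boolean
    "feature_type": "EvenNatural10",  # optional ADS feature type
    "order": 0,                       # optional integer column position
    "description": "Number of matching socks in your dresser drawer.",
    "domain": {                       # optional dict
        "values": "Natural even numbers that are less than or equal to 10.",
        "stats": {"count": 465.0, "mean": 6.3},
        "constraints": [
            {"expression": "($x <= 10) and ($x > 0)", "language": "python"},
        ],
    },
}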
Generating Schema¶
To automatically generate the schema from the training data, provide the X_sample and y_sample parameters while preparing the model artifact. For example:
import tempfile
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load dataset and Prepare train and test split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Train a LogisticRegression model
sklearn_estimator = LogisticRegression()
sklearn_estimator.fit(X_train, y_train)
# Instantiate ads.model.SklearnModel using the sklearn LogisticRegression model
sklearn_model = SklearnModel(estimator=sklearn_estimator, artifact_dir=tempfile.mkdtemp())
# Autogenerate score.py, pickled model, runtime.yaml, input_schema.json and output_schema.json
sklearn_model.prepare(inference_conda_env="dataexpl_p37_cpu_v3", X_sample=X_train, y_sample=y_train)
Calling .schema_input or .schema_output shows the schema in a YAML format.
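For example, evaluating the property in a notebook cell displays the output schema as YAML:

sklearn_model.schema_output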
Alternatively, you can check the output_schema.json file for the content of the schema_output:
from os import path

# path_to_artifact_dir is the artifact_dir used when preparing the model
with open(path.join(path_to_artifact_dir, "output_schema.json"), 'r') as f:
print(f.read())
{
"schema": [
{
"dtype": "int64",
"feature_type": "Integer",
"name": "class",
"domain": {
"values": "Integer",
"stats": {
"count": 465.0,
"mean": 0.5225806451612903,
"standard deviation": 0.5000278079030275,
"sample minimum": 0.0,
"lower quartile": 0.0,
"median": 1.0,
"upper quartile": 1.0,
"sample maximum": 1.0
},
"constraints": []
},
"required": true,
"description": "class"
}
]
}
Update the Schema¶
You can update the fields in the schema:
sklearn_model.schema_output[<class name>].description = 'target variable'
sklearn_model.schema_output[<class name>].feature_type = 'Category'
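For instance, if the output column is named class, as in the generated JSON above (an assumed name for illustration):

sklearn_model.schema_output["class"].description = 'target variable'
sklearn_model.schema_output["class"].feature_type = 'Category'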
You can specify a constraint for your data using Expression, and call evaluate to check if the data satisfies the constraint:
from ads.feature_engineering.schema import Expression

sklearn_model.schema_input['col01'].domain.constraints.append(Expression('($x < 20) and ($x > -20)'))
0 is between -20 and 20, so evaluate should return True:
sklearn_model.schema_input['col01'].domain.constraints[0].evaluate(x=0)
True
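Conversely, a value outside the range fails the constraint, so evaluate returns False:

sklearn_model.schema_input['col01'].domain.constraints[0].evaluate(x=100)
False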
You can directly populate the schema by calling populate_schema():
sklearn_model.model_artifact.populate_schema(X_sample=test.X, y_sample=test.y)
You can also load your schema from a JSON or YAML file:
cat <<'EOF' > schema.json
{
"schema": [
{
"dtype": "int64",
"feature_type": "Category",
"name": "class",
"domain": {
"values": "Category type.",
"stats": {
"count": 465.0,
"unique": 2},
"constraints": [
{"expression": "($x <= 1) and ($x >= 0)", "language": "python"},
{"expression": "$x in [0, 1]", "language": "python"}]},
"required": true,
"description": "target to predict."
}
]
}
EOF
from ads.feature_engineering.schema import Schema

sklearn_model.schema_output = Schema.from_file('schema.json')
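Because the schema can also be stored as YAML, the same schema can be written as a YAML file instead (a sketch; this assumes Schema.from_file detects the format from the file extension):

cat <<'EOF' > schema.yaml
schema:
- dtype: int64
  feature_type: Category
  name: class
  required: true
  description: target to predict.
  domain:
    values: Category type.
    stats:
      count: 465.0
      unique: 2
    constraints:
    - expression: ($x <= 1) and ($x >= 0)
      language: python
    - expression: $x in [0, 1]
      language: python
EOF

sklearn_model.schema_output = Schema.from_file('schema.yaml')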