YAML Schema¶

This page walks through the regression operator YAML based on the current implementation in ads/opctl/operator/lowcode/regression/schema.yaml and the corresponding runtime code.

Complete Example¶

kind: operator
type: regression
version: v1
spec:
  training_data:
    url: train.csv
  test_data:
    url: test.csv
  output_directory:
    url: results
  target_column: target
  column_types:
    event_date: date
    customer_id: categorical
  model: random_forest
  model_kwargs:
    tuning_n_trials: 10
    n_estimators: 300
  preprocessing:
    enabled: true
    steps:
      missing_value_imputation: true
      categorical_encoding: true
  metric: rmse
  training_predictions_filename: training_predictions.csv
  test_predictions_filename: test_predictions.csv
  training_metrics_filename: training_metrics.csv
  test_metrics_filename: test_metrics.csv
  global_explanation_filename: global_explanations.csv
  report_filename: report.html
  report_title: Regression Report
  generate_report: true
  generate_explanations: false

Top-Level Fields¶

kind: Must be operator.
type: Must be regression.
version: The current schema version is v1.
spec: Contains the operator-specific configuration.
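
For contrast with the complete example, a minimal configuration might look like the sketch below. It assumes that only the fields marked Required on this page (training_data and target_column) must be set and that everything else falls back to its default. The file name train.csv and the column name target are placeholders.

```yaml
kind: operator
type: regression
version: v1
spec:
  training_data:
    url: train.csv        # placeholder path
  target_column: target   # placeholder column name
```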

Data Inputs¶

`training_data`¶

Required. This is the dataset used for fitting the model.

Supported schema fields include:

url
data
sql
table_name
connect_args
format
columns
options
limit
vault_secret_id

For CLI-first workflows, url is the normal choice:

training_data:
  url: /path/to/train.csv

Or from Object Storage:

training_data:
  url: oci://bucket@namespace/regression/train.csv

Or from SQL:

training_data:
  sql: |
    SELECT feature_1, feature_2, target
    FROM DEMO.REGRESSION_TRAIN
  connect_args:
    wallet_dir: /home/datascience/oci_wallet
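
The remaining fields in the list above are not illustrated on this page. The sketch below is illustrative only: it assumes that format names the input file type, that columns is a list of column names to load, and that limit caps the number of rows read. These meanings are assumptions and should be checked against schema.yaml before use.

```yaml
training_data:
  url: /path/to/train.csv
  format: csv      # assumption: file type of the input
  columns:         # assumption: subset of columns to load
    - feature_1
    - feature_2
    - target
  limit: 1000      # assumption: maximum number of rows to read
```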

`test_data`¶

Optional. Use this when you want held-out evaluation.

Important:

The operator always validates that test_data contains the same feature columns as training_data.
test_metrics.csv and test_predictions.csv are written only when test_data includes the target column.
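
A held-out evaluation setup can be sketched as follows. The file names and the column name target are placeholders. The comments restate the two rules above rather than adding new behavior.

```yaml
spec:
  training_data:
    url: train.csv
  test_data:
    # Must contain the same feature columns as train.csv.
    # Include the target column to get test_metrics.csv and test_predictions.csv.
    url: test.csv
  target_column: target
```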

`output_directory`¶

Optional. Defaults to results.

The operator writes artifacts here. Local paths and oci:// paths are both supported.
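
For example, writing artifacts to Object Storage instead of the local results directory might look like this. The bucket, namespace, and prefix are placeholders that follow the same oci:// pattern used for training_data above.

```yaml
output_directory:
  url: oci://bucket@namespace/regression/results
```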

Target and Feature Typing¶

`target_column`¶

Required. This is the continuous value to predict.

`column_types`¶

Optional. Use this to override automatic type inference.

Supported values are:

numerical
categorical
date

Example:

column_types:
  sales_date: date
  zip_code: categorical
  revenue: numerical

If column_types is not provided, the operator infers feature types from the training data.

Preprocessing¶

The current preprocessing implementation supports:

numeric coercion for numeric-like strings
median imputation for numeric columns
mode imputation for categorical columns
one-hot encoding for categorical columns
date expansion into year, month, day, dayofweek, and dayofyear

Configuration:

preprocessing:
  enabled: true
  steps:
    missing_value_imputation: true
    categorical_encoding: true

Important cautions¶

If you disable categorical_encoding while string-valued categorical features are still present, the processed matrix cannot be converted to numeric form and training can fail.
If you disable preprocessing.enabled, do so only when your remaining features are already in a model-ready numeric form.
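
As a sketch of the first caution, the fragment below keeps imputation on but turns encoding off. It is only a reasonable choice under the assumption that every remaining feature is already numeric, for example because categorical columns were encoded before the data was handed to the operator.

```yaml
preprocessing:
  enabled: true
  steps:
    missing_value_imputation: true
    categorical_encoding: false   # safe only if no string categorical features remain
```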

Model Selection¶

`model`¶

Supported values:

auto
linear_regression
random_forest
knn
xgboost

`metric`¶

Supported values:

rmse
mae
mse
r2
mape

This metric controls:

explicit-model tuning
auto model selection

The metrics output files still include all five metrics regardless of which one you choose as the primary optimization metric.
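
For instance, the following fragment asks the operator to choose among candidate models using mae as the primary metric. The metrics files would still report all five metrics, as noted above.

```yaml
model: auto
metric: mae
```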

`model_kwargs`¶

This dictionary is passed to the explicit model implementation. It also accepts tuning_n_trials, which sets the number of tuning trials.

Example:

model: knn
model_kwargs:
  tuning_n_trials: 5
  n_neighbors: 11
  weights: distance

Current behavior:

Explicit models use Optuna tuning by default with 20 trials.
Setting model_kwargs.tuning_n_trials: 0 disables tuning and uses the current default estimator parameters plus any explicit overrides you provide.
auto currently compares candidate models using cross-validation and then retrains the selected model. It does not use user-supplied explicit-model model_kwargs during candidate comparison.
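
To illustrate the second point, this fragment disables tuning and fits a random forest with one explicit override. The value 300 is only an example, taken from the complete example at the top of this page.

```yaml
model: random_forest
model_kwargs:
  tuning_n_trials: 0   # skip tuning
  n_estimators: 300    # explicit override applied on top of the default parameters
```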

Output Files¶

The output filenames can be customized with:

training_predictions_filename
test_predictions_filename
training_metrics_filename
test_metrics_filename
global_explanation_filename
report_filename

The report title can be customized with:

report_title
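
A customized set of output names might look like the sketch below. All file names and the title are made-up placeholders. The fragment belongs under spec, alongside the other fields.

```yaml
training_predictions_filename: train_preds.csv
test_predictions_filename: holdout_preds.csv
training_metrics_filename: train_metrics.csv
test_metrics_filename: holdout_metrics.csv
global_explanation_filename: feature_importance.csv
report_filename: regression_report.html
report_title: Quarterly Revenue Regression
```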

Report and Explainability Flags¶

`generate_report`¶

Defaults to true. When enabled, the operator writes the HTML report (report.html unless report_filename is set).

`generate_explanations`¶

Defaults to false.

Current implementation details:

global_explanations.csv is generated only when generate_explanations is true.
When generate_explanations is true, the operator first tries model-derived importance for models that expose it.
If model-derived importance is unavailable, the operator attempts a SHAP-based fallback.
This SHAP fallback is most relevant to knn, which typically exposes no built-in feature importance.
If explainability is requested but cannot be produced, the run continues and the report explains that explainability was unavailable for that run.
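
Putting these flags together, a run that requests explanations for a knn model could be configured as sketched below. Based on the behavior described above, this is the case where the SHAP-based fallback would be expected to apply. The file name shown is the one used in the complete example.

```yaml
model: knn
generate_report: true
generate_explanations: true
global_explanation_filename: global_explanations.csv
```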

Deployment Configuration¶

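The operator also supports an optional save_and_deploy_to_md block for saving the trained model and configuring a model deployment: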

save_and_deploy_to_md:
  model_catalog_display_name: regression-model
  project_id: ocid1.datascienceproject.oc1..example
  compartment_id: ocid1.compartment.oc1..example
  model_deployment:
    display_name: regression-md
    initial_shape: VM.Standard.E4.Flex
    description: Regression model deployment
    log_group: ocid1.loggroup.oc1..example
    log_id: ocid1.log.oc1..example
    auto_scaling:
      minimum_instance: 1
      maximum_instance: 2
      scale_in_threshold: 10
      scale_out_threshold: 80
      scaling_metric: CPU_UTILIZATION
      cool_down_in_seconds: 600

When this block is present, the run also writes deployment metadata artifacts. See Productionize.
