YAML Schema

This page walks through the regression operator YAML. It reflects the current schema in ads/opctl/operator/lowcode/regression/schema.yaml and the corresponding runtime code.

Complete Example

kind: operator
type: regression
version: v1
spec:
  training_data:
    url: train.csv
  test_data:
    url: test.csv
  output_directory:
    url: results
  target_column: target
  column_types:
    event_date: date
    customer_id: categorical
  model: random_forest
  model_kwargs:
    tuning_n_trials: 10
    n_estimators: 300
  preprocessing:
    enabled: true
    steps:
      missing_value_imputation: true
      categorical_encoding: true
  metric: rmse
  training_predictions_filename: training_predictions.csv
  test_predictions_filename: test_predictions.csv
  training_metrics_filename: training_metrics.csv
  test_metrics_filename: test_metrics.csv
  global_explanation_filename: global_explanations.csv
  report_filename: report.html
  report_title: Regression Report
  generate_report: true
  generate_explanations: false

Top-Level Fields

  • kind: Must be operator.

  • type: Must be regression.

  • version: The current schema version is v1.

  • spec: Contains the operator-specific configuration.
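
As a rough sketch, the four top-level fields frame every configuration as shown here. The spec body lists only the two fields this page marks as required. Whether every other field has a usable default is an assumption, not something this page confirms.

```yaml
kind: operator        # must be operator
type: regression      # must be regression
version: v1           # current schema version
spec:
  # Only the fields marked Required on this page; everything else is
  # assumed to fall back to its default.
  training_data:
    url: train.csv
  target_column: target
```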

Data Inputs

training_data

Required. This is the dataset used for fitting the model.

Supported schema fields include:

  • url

  • data

  • sql

  • table_name

  • connect_args

  • format

  • columns

  • options

  • limit

  • vault_secret_id

For CLI-first workflows, url is the normal choice:

training_data:
  url: /path/to/train.csv

Or from Object Storage:

training_data:
  url: oci://bucket@namespace/regression/train.csv

Or from SQL:

training_data:
  sql: |
    SELECT feature_1, feature_2, target
    FROM DEMO.REGRESSION_TRAIN
  connect_args:
    wallet_dir: /home/datascience/oci_wallet

test_data

Optional. Provide this when you want to evaluate the model on held-out data.

Important:

  • Whenever test_data is provided, the operator validates that it contains the same feature columns as training_data.

  • test_metrics.csv and test_predictions.csv are written only when test_data includes the target column.
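
To illustrate the second point, a held-out evaluation setup might look like the following sketch. The file names are placeholders. test.csv is assumed to contain the same feature columns as train.csv plus the target column, so that the test metrics and test predictions files are produced.

```yaml
training_data:
  url: train.csv
test_data:
  # Assumed to include every feature column from train.csv and the target column.
  url: test.csv
target_column: target
```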

output_directory

Optional. Defaults to results.

The operator writes artifacts here. Local paths and oci:// paths are both supported.
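
For example, writing artifacts to Object Storage instead of a local folder could be sketched as follows. The bucket, namespace, and prefix are placeholders that follow the same oci:// pattern as the training_data example.

```yaml
output_directory:
  # Placeholder bucket, namespace, and prefix; replace with your own.
  url: oci://bucket@namespace/regression/results
```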

Target and Feature Typing

target_column

Required. The name of the column that holds the continuous value to predict.

column_types

Optional. Use this to override automatic type inference.

Supported values are:

  • numerical

  • categorical

  • date

Example:

column_types:
  sales_date: date
  zip_code: categorical
  revenue: numerical

If column_types is not provided, the operator infers feature types from the training data.

Preprocessing

The current preprocessing implementation supports:

  • numeric coercion for numeric-like strings

  • median imputation for numeric columns

  • mode imputation for categorical columns

  • one-hot encoding for categorical columns

  • date expansion into year, month, day, dayofweek, and dayofyear

Configuration:

preprocessing:
  enabled: true
  steps:
    missing_value_imputation: true
    categorical_encoding: true

Important cautions

  • If you disable categorical_encoding while string categorical features are still present, the processed matrix can no longer be converted to numeric form and training can fail.

  • Set preprocessing.enabled to false only when your features are already in a model-ready numeric form.
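
As a sketch of the second caution, turning preprocessing off entirely might look like this. It assumes every feature column in the training data is already numeric with no missing values. It also assumes that the steps block can be omitted when enabled is false, which this page does not confirm.

```yaml
preprocessing:
  # Assumes features are already numeric and complete, since the
  # operator's preprocessing is skipped.
  enabled: false
```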

Model Selection

model

Supported values:

  • auto

  • linear_regression

  • random_forest

  • knn

  • xgboost

metric

Supported values:

  • rmse

  • mae

  • mse

  • r2

  • mape

This metric controls:

  • explicit-model tuning

  • auto model selection

The metrics output files still include all five metrics regardless of which one you choose as the primary optimization metric.
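
For instance, letting the operator choose the model while optimizing mean absolute error could be configured as in this sketch, which uses only values listed above:

```yaml
model: auto     # compare candidate models instead of naming one
metric: mae     # primary metric used for auto model selection
```

The metrics files would still report rmse, mae, mse, r2, and mape.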

model_kwargs

This dictionary is passed to the explicit model implementation. It also accepts tuning_n_trials, which sets the number of tuning trials.

Example:

model: knn
model_kwargs:
  tuning_n_trials: 5
  n_neighbors: 11
  weights: distance

Current behavior:

  • Explicit models use Optuna tuning by default with 20 trials.

  • Setting model_kwargs.tuning_n_trials to 0 disables tuning. The model is then fit with the current default estimator parameters plus any explicit overrides you provide.

  • auto currently compares candidate models using cross-validation and then retrains the selected model. User-supplied model_kwargs intended for an explicit model are not applied during candidate comparison.
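
To illustrate the second point, a configuration that skips tuning and fits a random forest with a fixed parameter could look like the sketch below. The n_estimators value is taken from the complete example on this page.

```yaml
model: random_forest
model_kwargs:
  tuning_n_trials: 0   # disable Optuna tuning
  n_estimators: 300    # explicit override applied on top of the defaults
```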

Output Files

The output filenames can be customized with:

  • training_predictions_filename

  • test_predictions_filename

  • training_metrics_filename

  • test_metrics_filename

  • global_explanation_filename

  • report_filename

The report title can be customized with:

  • report_title

Report and Explainability Flags

generate_report

Defaults to true. When enabled, the operator writes the HTML report, named report.html unless report_filename is set.

generate_explanations

Defaults to false.

Current implementation details:

  • global_explanations.csv is generated only when generate_explanations is true.

  • When generate_explanations is true, the operator first tries model-derived importance for models that expose it.

  • If model-derived importance is unavailable, the operator attempts a SHAP-based fallback.

  • This SHAP fallback is most relevant to knn.

  • If explainability is requested but cannot be produced, the run continues and the report explains that explainability was unavailable for that run.
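
A minimal sketch that requests explanations for a knn model, the case where the SHAP-based fallback is most likely to apply, might look like this. The custom file name is illustrative.

```yaml
model: knn
generate_explanations: true
# Illustrative custom name for the explanations artifact.
global_explanation_filename: knn_global_explanations.csv
```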

Deployment Configuration

The operator also supports an optional save_and_deploy_to_md block:

save_and_deploy_to_md:
  model_catalog_display_name: regression-model
  project_id: ocid1.datascienceproject.oc1..example
  compartment_id: ocid1.compartment.oc1..example
  model_deployment:
    display_name: regression-md
    initial_shape: VM.Standard.E4.Flex
    description: Regression model deployment
    log_group: ocid1.loggroup.oc1..example
    log_id: ocid1.log.oc1..example
    auto_scaling:
      minimum_instance: 1
      maximum_instance: 2
      scale_in_threshold: 10
      scale_out_threshold: 80
      scaling_metric: CPU_UTILIZATION
      cool_down_in_seconds: 600

When this block is present, the run also writes deployment metadata artifacts. See Productionize.