Advanced Use Cases

Mixed Feature Types

The regression operator accepts tabular inputs that mix numeric, categorical, and date columns, and preprocesses each column type automatically.

Example:

kind: operator
type: regression
version: v1
spec:
  training_data:
    url: train.csv
  target_column: target
  model: random_forest
  column_types:
    event_date: date
    customer_id: categorical

With the current implementation:

  • numeric-like strings are coerced to numeric values

  • categorical columns are one-hot encoded

  • date columns are expanded to year, month, day, dayofweek, and dayofyear

This is useful when a CSV contains values such as:

  • numeric values stored as strings, such as a numeric_text column

  • identifier columns such as customer_id

  • date strings such as 2025-01-01
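The three transformations described in this section can be sketched in plain Python. This is only an illustration, not the operator's implementation: the helper names (coerce_numeric, one_hot, expand_date), the one-hot column naming scheme, and the Monday=0 convention for dayofweek are all assumptions.

```python
from datetime import date


def coerce_numeric(value: str):
    """Return a float when the string parses as a number, else the string itself."""
    try:
        return float(value)
    except ValueError:
        return value


def one_hot(column: str, values: list) -> list:
    """Build one indicator field per distinct category (naming is hypothetical)."""
    categories = sorted(set(values))
    return [{f"{column}_{c}": int(v == c) for c in categories} for v in values]


def expand_date(value: str) -> dict:
    """Split an ISO date string into the five calendar parts listed above."""
    d = date.fromisoformat(value)
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "dayofweek": d.weekday(),  # assumption: Monday is 0
        "dayofyear": d.timetuple().tm_yday,
    }


print(coerce_numeric("3.5"), coerce_numeric("abc"))
print(one_hot("customer_id", ["c1", "c2", "c1"]))
print(expand_date("2025-01-01"))
```

Note that one-hot encoding an identifier column with many distinct values produces one indicator per value, so the feature count grows with the number of distinct identifiers.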

Reading from Object Storage or SQL

Object Storage:

training_data:
  url: oci://bucket@namespace/regression/train.csv
test_data:
  url: oci://bucket@namespace/regression/test.csv

SQL:

training_data:
  sql: |
    SELECT x1, x2, x3, target
    FROM DEMO.REGRESSION_TRAIN
  connect_args:
    wallet_dir: /home/datascience/oci_wallet
test_data:
  sql: |
    SELECT x1, x2, x3, target
    FROM DEMO.REGRESSION_TEST
  connect_args:
    wallet_dir: /home/datascience/oci_wallet

Explicit Tuning Control

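Explicitly selected models (any model value other than auto) use Optuna-backed hyperparameter tuning by default.

To reduce runtime during development, disable tuning by setting tuning_n_trials to 0: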

model: linear_regression
model_kwargs:
  tuning_n_trials: 0

To keep tuning enabled but bounded:

model: xgboost
model_kwargs:
  tuning_n_trials: 5
  n_estimators: 300
  max_depth: 6

Current tuning behavior:

  • model_kwargs act as fixed overrides

  • the remaining model-specific parameters can still be explored by Optuna

  • the selected metric controls optimization direction
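The interaction between fixed overrides and tunable parameters can be pictured with the sketch below. Everything in it is hypothetical: the SEARCH_SPACE ranges, the plan_tuning helper, and the r2 metric name are illustrative assumptions, not the operator's actual search space or API. Only the "minimize" and "maximize" direction strings correspond to values Optuna accepts.

```python
# Hypothetical search space; the operator's real ranges are not documented here.
SEARCH_SPACE = {
    "n_estimators": (100, 500),
    "max_depth": (3, 10),
    "learning_rate": (0.01, 0.3),
}

# Assumption: error metrics such as rmse are minimized, scores such as r2 are maximized.
MAXIMIZED_METRICS = {"r2"}


def plan_tuning(model_kwargs: dict, metric: str) -> dict:
    """Separate fixed overrides from the parameters still left to the tuner."""
    fixed = dict(model_kwargs)
    n_trials = fixed.pop("tuning_n_trials", None)  # None: keep the operator default
    tunable = {k: v for k, v in SEARCH_SPACE.items() if k not in fixed}
    direction = "maximize" if metric.lower() in MAXIMIZED_METRICS else "minimize"
    return {
        "n_trials": n_trials,
        "fixed": fixed,
        "tunable": tunable,
        "direction": direction,
    }


plan = plan_tuning(
    {"tuning_n_trials": 5, "n_estimators": 300, "max_depth": 6}, metric="rmse"
)
print(plan)
```

With the xgboost example above, n_estimators and max_depth stay pinned at 300 and 6, and in this sketch only learning_rate would remain for the five trials to explore.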

Understanding auto

The auto model currently compares:

  • linear_regression

  • random_forest

  • knn

  • xgboost

It evaluates them with cross-validation on the training data using the configured metric and then retrains the selected model on the full training set.

Example:

model: auto
metric: rmse

Important current behavior:

  • auto uses default candidate configurations during candidate comparison

  • model_kwargs supplied for an explicit model are not applied during this selection stage
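The select-then-retrain procedure described in this section can be sketched as follows. To stay dependency-free, the sketch uses two toy stand-in candidates (a mean baseline and a one-feature least-squares line) rather than the four real candidates, and a simple interleaved 3-fold split. The fold scheme and all function names are assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean(xs, ys):
    """Toy candidate: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Toy candidate: one-feature ordinary least squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


CANDIDATES = {"mean_baseline": fit_mean, "line": fit_line}


def cv_score(fit, xs, ys, k=3):
    """Average RMSE over k interleaved folds of the training data."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        held = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([ys[i] for i in held], [model(xs[i]) for i in held]))
    return sum(scores) / k


def select_and_refit(xs, ys):
    """Pick the candidate with the lowest CV RMSE, then refit it on all rows."""
    scores = {name: cv_score(fit, xs, ys) for name, fit in CANDIDATES.items()}
    best = min(scores, key=scores.get)  # rmse: lower is better
    return best, CANDIDATES[best](xs, ys), scores


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x + 1 for x in xs]
best, model, scores = select_and_refit(xs, ys)
print(best, scores, model(10))
```

The final refit on the full training set is what lets the selected model use rows that were held out in each fold during comparison.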

Explainability by Model Family

linear_regression

Global explanations come from absolute coefficient values.
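As a rough illustration, ranking features by absolute coefficient looks like the sketch below. The coefficient values and the rank_by_abs_coefficient helper are made up for the example. Keep in mind that coefficient magnitudes are only comparable across features when the features are on similar scales.

```python
def rank_by_abs_coefficient(coefficients: dict) -> list:
    """Sort features by the magnitude of their coefficient, largest first."""
    ranked = [(name, abs(value)) for name, value in coefficients.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up coefficients for three features.
coefs = {"x1": 0.4, "x2": -2.5, "x3": 1.1}
print(rank_by_abs_coefficient(coefs))
```

Taking the absolute value means a strongly negative coefficient (x2 here) ranks above a weakly positive one.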

random_forest and xgboost

Global explanations come from model-derived feature importances.

knn

KNN does not expose built-in feature importance. If you want explainability output for KNN, enable SHAP-based explanations:

model: knn
generate_explanations: true

Also make sure that shap is installed in the runtime environment.
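One way to confirm that shap can be imported before launching a run is a quick check with the standard library. The shap_available helper is a hypothetical convenience, not part of the operator.

```python
import importlib.util


def shap_available() -> bool:
    """Return True when the shap package can be located in this environment."""
    return importlib.util.find_spec("shap") is not None


if not shap_available():
    print("shap is not installed in this environment")
```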

Held-Out Evaluation

When test_data contains the same feature columns and also includes the target column:

  • test_predictions.csv is written

  • test_metrics.csv is written

  • the HTML report includes a held-out evaluation section

Example:

test_data:
  url: test.csv

If the target column is not present in test_data, the operator still validates feature compatibility but does not generate held-out regression metrics.
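The branch on target presence can be sketched as below. The held_out_metrics helper and the particular metrics it returns (rmse, mae, r2) are assumptions chosen for illustration; the exact contents of test_metrics.csv are not specified here.

```python
import math


def held_out_metrics(rows, predictions, target="target"):
    """Return regression metrics, or None when the target column is absent."""
    if not rows or any(target not in row for row in rows):
        return None  # no ground truth, so no held-out metrics
    y = [float(row[target]) for row in rows]
    n = len(y)
    errors = [t - p for t, p in zip(y, predictions)]
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    r2 = 1 - (mse * n) / ss_tot if ss_tot else float("nan")
    return {
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": r2,
    }


labeled = [{"x1": 1, "target": 1.0}, {"x1": 2, "target": 3.0}]
unlabeled = [{"x1": 1}, {"x1": 2}]
print(held_out_metrics(labeled, [1.0, 2.0]))
print(held_out_metrics(unlabeled, [1.0, 2.0]))
```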