Advanced Use Cases¶
Mixed Feature Types¶
The regression operator automatically handles mixed tabular inputs.
Example:
kind: operator
type: regression
version: v1
spec:
training_data:
url: train.csv
target_column: target
model: random_forest
column_types:
event_date: date
customer_id: categorical
With the current implementation:
numeric-like strings are coerced to numeric values
categorical columns are one-hot encoded
date columns are expanded to
year,month,day,dayofweek, anddayofyear
This is useful when a CSV contains values such as:
numeric_textcolumns stored as stringsidentifier columns such as
customer_iddate strings such as
2025-01-01
Reading from Object Storage or SQL¶
Object Storage:
training_data:
url: oci://bucket@namespace/regression/train.csv
test_data:
url: oci://bucket@namespace/regression/test.csv
SQL:
training_data:
sql: |
SELECT x1, x2, x3, target
FROM DEMO.REGRESSION_TRAIN
connect_args:
wallet_dir: /home/datascience/oci_wallet
test_data:
sql: |
SELECT x1, x2, x3, target
FROM DEMO.REGRESSION_TEST
connect_args:
wallet_dir: /home/datascience/oci_wallet
Explicit Tuning Control¶
Explicit models use Optuna-backed tuning by default.
To reduce runtime for development:
model: linear_regression
model_kwargs:
tuning_n_trials: 0
To keep tuning enabled but bounded:
model: xgboost
model_kwargs:
tuning_n_trials: 5
n_estimators: 300
max_depth: 6
Current tuning behavior:
model_kwargsact as fixed overridesthe remaining model-specific parameters can still be explored by Optuna
the selected
metriccontrols optimization direction
Understanding auto¶
The auto model currently compares:
linear_regressionrandom_forestknnxgboost
It evaluates them with cross-validation on the training data using the configured metric and then retrains the selected model on the full training set.
Example:
model: auto
metric: rmse
Important current behavior:
autouses default candidate configurations during candidate comparisonuser-supplied explicit-model
model_kwargsare not used during this selection stage
Explainability by Model Family¶
linear_regression¶
Global explanations come from absolute coefficient values.
random_forest and xgboost¶
Global explanations come from model-derived feature importances.
knn¶
KNN does not expose built-in feature importance. If you want explainability output for KNN, enable SHAP-based explanations:
model: knn
generate_explanations: true
And make sure shap is installed in the runtime environment.
Held-Out Evaluation¶
When test_data contains the same feature columns and also includes the target column:
test_predictions.csvis writtentest_metrics.csvis writtenthe HTML report includes a held-out evaluation section
Example:
test_data:
url: test.csv
If the target column is not present in test_data, the operator still validates feature compatibility but does not generate held-out regression metrics.