==================
Advanced Use Cases
==================

Mixed Feature Types
-------------------

The regression operator automatically handles mixed tabular inputs.

Example:

.. code-block:: yaml

    kind: operator
    type: regression
    version: v1
    spec:
      training_data:
        url: train.csv
      target_column: target
      model: random_forest
      column_types:
        event_date: date
        customer_id: categorical

With the current implementation:

* numeric-like strings are coerced to numeric values
* categorical columns are one-hot encoded
* date columns are expanded to ``year``, ``month``, ``day``, ``dayofweek``, and ``dayofyear``

This is useful when a CSV contains values such as:

* ``numeric_text`` columns stored as strings
* identifier columns such as ``customer_id``
* date strings such as ``2025-01-01``

Reading from Object Storage or SQL
----------------------------------

Object Storage:

.. code-block:: yaml

    training_data:
      url: oci://bucket@namespace/regression/train.csv
    test_data:
      url: oci://bucket@namespace/regression/test.csv

SQL:

.. code-block:: yaml

    training_data:
      sql: |
        SELECT x1, x2, x3, target
        FROM DEMO.REGRESSION_TRAIN
      connect_args:
        wallet_dir: /home/datascience/oci_wallet
    test_data:
      sql: |
        SELECT x1, x2, x3, target
        FROM DEMO.REGRESSION_TEST
      connect_args:
        wallet_dir: /home/datascience/oci_wallet

Explicit Tuning Control
-----------------------

Explicit models use Optuna-backed tuning by default.

To reduce runtime for development:

.. code-block:: yaml

    model: linear_regression
    model_kwargs:
      tuning_n_trials: 0

To keep tuning enabled but bounded:

.. code-block:: yaml

    model: xgboost
    model_kwargs:
      tuning_n_trials: 5
      n_estimators: 300
      max_depth: 6

Current tuning behavior:

* ``model_kwargs`` act as fixed overrides
* the remaining model-specific parameters can still be explored by Optuna
* the selected ``metric`` controls optimization direction

Understanding ``auto``
----------------------

The ``auto`` model currently compares:

* ``linear_regression``
* ``random_forest``
* ``knn``
* ``xgboost``

It evaluates them with cross-validation on the training data using the configured ``metric`` and then retrains the selected model on the full training set.

Example:

.. code-block:: yaml

    model: auto
    metric: rmse

Important current behavior:

* ``auto`` uses default candidate configurations during candidate comparison
* user-supplied explicit-model ``model_kwargs`` are not used during this selection stage

Explainability by Model Family
------------------------------

``linear_regression``
~~~~~~~~~~~~~~~~~~~~~

Global explanations come from absolute coefficient values.

``random_forest`` and ``xgboost``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Global explanations come from model-derived feature importances.

``knn``
~~~~~~~

KNN does not expose built-in feature importance. If you want explainability output for KNN, enable SHAP-based explanations:

.. code-block:: yaml

    model: knn
    generate_explanations: true

And make sure ``shap`` is installed in the runtime environment.

Held-Out Evaluation
-------------------

When ``test_data`` contains the same feature columns and also includes the target column:

* ``test_predictions.csv`` is written
* ``test_metrics.csv`` is written
* the HTML report includes a held-out evaluation section

Example:

.. code-block:: yaml

    test_data:
      url: test.csv

If the target column is not present in ``test_data``, the operator still validates feature compatibility but does not generate held-out regression metrics.