===========
YAML Schema
===========

This page describes the YAML configuration of the regression operator. It is based on the current schema in ``ads/opctl/operator/lowcode/regression/schema.yaml`` and the corresponding runtime code.

Complete Example
----------------

.. code-block:: yaml

    kind: operator
    type: regression
    version: v1
    spec:
      training_data:
        url: train.csv
      test_data:
        url: test.csv
      output_directory:
        url: results
      target_column: target
      column_types:
        event_date: date
        customer_id: categorical
      model: random_forest
      model_kwargs:
        tuning_n_trials: 10
        n_estimators: 300
      preprocessing:
        enabled: true
        steps:
          missing_value_imputation: true
          categorical_encoding: true
      metric: rmse
      training_predictions_filename: training_predictions.csv
      test_predictions_filename: test_predictions.csv
      training_metrics_filename: training_metrics.csv
      test_metrics_filename: test_metrics.csv
      global_explanation_filename: global_explanations.csv
      report_filename: report.html
      report_title: Regression Report
      generate_report: true
      generate_explanations: false

Top-Level Fields
----------------

* ``kind``: Must be ``operator``.
* ``type``: Must be ``regression``.
* ``version``: The current schema version is ``v1``.
* ``spec``: Contains the operator-specific configuration.
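
As a rough sketch, a minimal configuration could combine these top-level fields with only the two ``spec`` entries that this page marks as required (``training_data`` and ``target_column``). This assumes the schema requires no other field. The file path and column name are placeholders.

.. code-block:: yaml

    kind: operator
    type: regression
    version: v1
    spec:
      training_data:
        url: train.csv        # placeholder path
      target_column: target   # placeholder column name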

Data Inputs
-----------

``training_data``
~~~~~~~~~~~~~~~~~

Required. This is the dataset used for fitting the model.

Supported schema fields include:

* ``url``
* ``data``
* ``sql``
* ``table_name``
* ``connect_args``
* ``format``
* ``columns``
* ``options``
* ``limit``
* ``vault_secret_id``

For CLI-first workflows, ``url`` is the normal choice:

.. code-block:: yaml

    training_data:
      url: /path/to/train.csv

Or from Object Storage:

.. code-block:: yaml

    training_data:
      url: oci://bucket@namespace/regression/train.csv

Or from SQL:

.. code-block:: yaml

    training_data:
      sql: |
        SELECT feature_1, feature_2, target
        FROM DEMO.REGRESSION_TRAIN
      connect_args:
        wallet_dir: /home/datascience/oci_wallet

``test_data``
~~~~~~~~~~~~~

Optional. Use this when you want held-out evaluation.

Important:

* The operator always validates that ``test_data`` contains the same feature columns as ``training_data``.
* The test metrics and test predictions files (``test_metrics.csv`` and ``test_predictions.csv`` by default) are written only when ``test_data`` includes the target column.
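
For example, a held-out evaluation setup might be sketched as follows. The file names are placeholders. The sketch assumes ``test.csv`` has the same feature columns as ``train.csv`` and also contains the ``target`` column, so that the test outputs can be written.

.. code-block:: yaml

    training_data:
      url: train.csv
    test_data:
      url: test.csv          # must share feature columns with train.csv
    target_column: target    # present in both files in this sketch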

``output_directory``
~~~~~~~~~~~~~~~~~~~~

Optional. Defaults to ``results``.

The operator writes artifacts here. Local paths and ``oci://`` paths are both supported.
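
A sketch of an Object Storage destination, reusing the ``url`` key from the complete example above. The bucket, namespace, and prefix are placeholders.

.. code-block:: yaml

    output_directory:
      url: oci://bucket@namespace/regression/results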

Target and Feature Typing
-------------------------

``target_column``
~~~~~~~~~~~~~~~~~

Required. This is the continuous value to predict.

``column_types``
~~~~~~~~~~~~~~~~

Optional. Use this to override automatic type inference.

Supported values are:

* ``numerical``
* ``categorical``
* ``date``

Example:

.. code-block:: yaml

    column_types:
      sales_date: date
      zip_code: categorical
      revenue: numerical

If ``column_types`` is not provided, the operator infers feature types from the training data.

Preprocessing
-------------

The current preprocessing implementation supports:

* numeric coercion for numeric-like strings
* median imputation for numeric columns
* mode imputation for categorical columns
* one-hot encoding for categorical columns
* date expansion into ``year``, ``month``, ``day``, ``dayofweek``, and ``dayofyear``

Configuration:

.. code-block:: yaml

    preprocessing:
      enabled: true
      steps:
        missing_value_imputation: true
        categorical_encoding: true

Important cautions
~~~~~~~~~~~~~~~~~~

* If you disable ``categorical_encoding`` while string categorical features are still present, the processed matrix can no longer be converted to numeric form and training can fail.
* Set ``preprocessing.enabled`` to ``false`` only when every feature is already in model-ready numeric form.
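
To illustrate the first caution, the following sketch keeps imputation but turns off one-hot encoding. It is only safe under the assumption that the training data has no string categorical features left.

.. code-block:: yaml

    preprocessing:
      enabled: true
      steps:
        missing_value_imputation: true
        categorical_encoding: false   # assumes no string categorical features remain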

Model Selection
---------------

``model``
~~~~~~~~~

Supported values:

* ``auto``
* ``linear_regression``
* ``random_forest``
* ``knn``
* ``xgboost``

``metric``
~~~~~~~~~~

Supported values:

* ``rmse``
* ``mae``
* ``mse``
* ``r2``
* ``mape``

This metric is the optimization objective for:

* hyperparameter tuning when a specific model is selected
* model selection when ``model`` is ``auto``

The metrics output files still include all five metrics regardless of which one you choose as the primary optimization metric.
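
As a short sketch, automatic model selection driven by mean absolute error could be requested as follows. Both values come from the lists above, and the keys sit under ``spec``.

.. code-block:: yaml

    model: auto
    metric: mae   # primary metric; all five metrics are still reported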

``model_kwargs``
~~~~~~~~~~~~~~~~

This dictionary is passed to the explicitly selected model implementation. It also accepts ``tuning_n_trials``, which sets the number of tuning trials.

Example:

.. code-block:: yaml

    model: knn
    model_kwargs:
      tuning_n_trials: 5
      n_neighbors: 11
      weights: distance

Current behavior:

* Explicit models use Optuna tuning by default with ``20`` trials.
* Setting ``model_kwargs.tuning_n_trials: 0`` disables tuning and uses the current default estimator parameters plus any explicit overrides you provide.
* ``auto`` currently compares candidate models using cross-validation and then retrains the selected model. It does not use user-supplied explicit-model ``model_kwargs`` during candidate comparison.
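
A sketch of the tuning-disabled case described above: with ``tuning_n_trials`` set to ``0``, the estimator would use its defaults plus the explicit override. The ``n_estimators`` value is taken from the complete example and is only illustrative.

.. code-block:: yaml

    model: random_forest
    model_kwargs:
      tuning_n_trials: 0   # skip Optuna tuning
      n_estimators: 300    # explicit override applied on top of the defaults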

Output Files
------------

The output filenames can be customized with:

* ``training_predictions_filename``
* ``test_predictions_filename``
* ``training_metrics_filename``
* ``test_metrics_filename``
* ``global_explanation_filename``
* ``report_filename``

The report title can be customized with:

* ``report_title``
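
For illustration, a few of these fields could be overridden as sketched below. The file names and title are hypothetical, and the fields that are left out are assumed to keep the names shown in the complete example.

.. code-block:: yaml

    output_directory:
      url: results
    training_predictions_filename: train_preds.csv   # hypothetical name
    test_predictions_filename: test_preds.csv        # hypothetical name
    report_filename: regression_report.html          # hypothetical name
    report_title: Quarterly Revenue Regression       # hypothetical title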

Report and Explainability Flags
-------------------------------

``generate_report``
~~~~~~~~~~~~~~~~~~~

Defaults to ``true``. When enabled, the operator writes the HTML report, named ``report.html`` unless ``report_filename`` is set.

``generate_explanations``
~~~~~~~~~~~~~~~~~~~~~~~~~

Defaults to ``false``.

Current implementation details:

* The global explanations file (``global_explanations.csv`` unless ``global_explanation_filename`` is set) is generated only when ``generate_explanations`` is ``true``.
* When ``generate_explanations`` is ``true``, the operator first tries model-derived importance for models that expose it.
* If model-derived importance is unavailable, the operator attempts a SHAP-based fallback.
* This SHAP fallback is most relevant to ``knn``.
* If explainability is requested but cannot be produced, the run continues and the report explains that explainability was unavailable for that run.
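
A sketch that turns on explanations for a ``knn`` model, the case where the SHAP fallback described above is most likely to apply. The filename shown is the one used in the complete example.

.. code-block:: yaml

    model: knn
    generate_report: true
    generate_explanations: true
    global_explanation_filename: global_explanations.csv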

Deployment Configuration
------------------------

The operator also supports an optional ``save_and_deploy_to_md`` block for saving and deploying the trained model:

.. code-block:: yaml

    save_and_deploy_to_md:
      model_catalog_display_name: regression-model
      project_id: ocid1.datascienceproject.oc1..example
      compartment_id: ocid1.compartment.oc1..example
      model_deployment:
        display_name: regression-md
        initial_shape: VM.Standard.E4.Flex
        description: Regression model deployment
        log_group: ocid1.loggroup.oc1..example
        log_id: ocid1.log.oc1..example
        auto_scaling:
          minimum_instance: 1
          maximum_instance: 2
          scale_in_threshold: 10
          scale_out_threshold: 80
          scaling_metric: CPU_UTILIZATION
          cool_down_in_seconds: 600

When this block is present, the run also writes deployment metadata artifacts. See :doc:`./productionize`.