ads.dataset package#

Submodules#

ads.dataset.classification_dataset module#

class ads.dataset.classification_dataset.BinaryClassificationDataset(df, sampled_df, target, target_type, shape, positive_class=None, **kwargs)[source]#

Bases: ClassificationDataset

Dataset for binary classification

set_positive_class(positive_class, missing_value=False)[source]#

Return new dataset with values in target column mapped to True or False in accordance with the specified positive label.

Parameters:
  • positive_class (same dtype as target) – The target label which should be identified as positive outcome from model.

  • missing_value (bool) – The value to which missing values in the target will be converted.

Returns:

dataset

Return type:

same type as the caller

Raises:

ValidationError – if the positive_class is not present in target

Examples

>>> ds = DatasetFactory.open("iris.csv")
>>> ds_with_target = ds.set_target('class')
>>> ds_with_pos_class = ds_with_target.set_positive_class('setosa')
class ads.dataset.classification_dataset.BinaryTextClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]#

Bases: BinaryClassificationDataset

Dataset for binary text classification

auto_transform()[source]#

Automatically chooses the most effective dataset transformation

select_best_features(score_func=None, k=12)[source]#

Automatically chooses the best features and removes the rest
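
Examples

A brief usage sketch; the file and column names are illustrative, and follow the convert_to_text_classification example shown later in this module.

>>> review_ds = DatasetFactory.open("review_data.csv", target="sentiment")
>>> ds_text = review_ds.convert_to_text_classification('reviews')
>>> ds_transformed = ds_text.auto_transform()
>>> ds_top_features = ds_text.select_best_features(k=5)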

class ads.dataset.classification_dataset.ClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]#

Bases: ADSDatasetWithTarget

Dataset for classification task

auto_transform(fix_imbalance: bool = True, correlation_threshold: float = 0.7, frac: float = 1.0, correlation_methods: str = 'pearson')[source]#

Return transformed dataset with several optimizations applied automatically. The optimizations include:

  • Dropping constant and primary key columns, which have no predictive quality,

  • Imputation, to fill in missing values in noisy data:

    • For continuous variables, fill with mean if less than 40% is missing, else drop,

    • For categorical variables, fill with most frequent if less than 40% is missing, else drop,

  • Dropping strongly co-correlated columns that tend to produce less generalizable models,

  • Balancing dataset using up or down sampling.

Parameters:
  • fix_imbalance (bool, defaults to True.) – Fix imbalance between classes in dataset. Used only for classification datasets.

  • correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive.) – The correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.

  • frac (float, defaults to 1.0. Range -> (0, 1].) – What fraction of the data should be used in the calculation?

  • correlation_methods (Union[list, str], defaults to 'pearson'.) –

    • ‘pearson’: Use Pearson’s Correlation between continuous features,

    • ’cramers v’: Use Cramer’s V correlations between categorical features,

    • ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,

    • ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].

    Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].

Returns:

transformed_dataset – The dataset after transformation

Return type:

ADSDatasetWithTarget

Examples

>>> ds_clean = ds.auto_transform(correlation_threshold=0.6)
convert_to_text_classification(text_column: str)[source]#

Builds a new dataset with the given text column as the only feature besides target.

Parameters:

text_column (str) – Feature name to use for text classification task

Returns:

ds – Dataset with one text feature and a classification target

Return type:

TextClassificationDataset

Examples

>>> review_ds = DatasetFactory.open("review_data.csv")
>>> ds_text_class = review_ds.convert_to_text_classification('reviews')
down_sample(sampler=None)[source]#

Fixes an imbalanced dataset by down-sampling.

Parameters:

sampler (An instance of SamplerMixin) – Should implement fit_resample(X,y) method. If None, does random down sampling.

Returns:

down_sampled_ds – A down-sampled dataset.

Return type:

ClassificationDataset

Examples

>>> ds = DatasetFactory.open("some_data.csv")
>>> ds_balanced_small = ds.down_sample()
up_sample(sampler='default')[source]#

Fixes an imbalanced dataset by up-sampling.

Parameters:
  • sampler (An instance of SamplerMixin) – Should implement fit_resample(X,y) method. If ‘default’, either SMOTE or random sampler will be used

  • fill_missing_type (a string) – Can either be ‘mean’, ‘mode’ or ‘median’.

Returns:

up_sampled_ds – an up-sampled dataset

Return type:

ClassificationDataset

Examples

>>> ds = DatasetFactory.open("some_data.csv")
>>> ds_balanced_large = ds.up_sample()
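
A sketch passing a custom sampler; this assumes the optional imbalanced-learn package is installed.

>>> from imblearn.over_sampling import SMOTE
>>> ds_balanced_large = ds.up_sample(sampler=SMOTE())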
class ads.dataset.classification_dataset.MultiClassClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]#

Bases: ClassificationDataset

Dataset for multi-class classification

class ads.dataset.classification_dataset.MultiClassTextClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]#

Bases: MultiClassClassificationDataset

Dataset for multi-class text classification

auto_transform()[source]#

Automatically chooses the most effective dataset transformation

select_best_features(score_func=None, k=12)[source]#

Automatically chooses the best features and removes the rest

ads.dataset.correlation module#

ads.dataset.correlation_plot module#

class ads.dataset.correlation_plot.BokehHeatMap(ds)[source]#

Bases: object

Generate a HeatMap or horizontal bar plot to compare features.

debug()[source]#

Return True if in debug mode, otherwise False.

flatten_corr_matrix(corr_matrix)[source]#

Flatten a correlation matrix into a pandas Dataframe.

Parameters:

corr_matrix (Pandas Dataframe) – The correlation matrix to be flattened.

Returns:

corr_flatten – The flattened correlation matrix.

Return type:

Pandas DataFrame

generate_heatmap(corr_matrix, title: str, msg: str, correlation_threshold: float)[source]#

Generate a heatmap from a correlation matrix.

Parameters:
  • corr_matrix (Pandas Dataframe) – The dataframe to be used for heatmap generation.

  • title (str) – title of the heatmap.

  • msg (str) – An additional msg to include in the plot.

  • correlation_threshold (float) – A float between 0 and 1 which is used for excluding correlations which are not intense enough from the plot.

Returns:

tab – A matplotlib Panel object which includes a plotted heatmap

Return type:

matplotlib Panel

generate_target_heatmap(corr_matrix, title: str, correlation_target: str, msg: str, correlation_threshold: float)[source]#

Generate a heatmap from a correlation matrix and its targets.

Parameters:
  • corr_matrix (Pandas Dataframe) – The dataframe to be used for heatmap generation.

  • title (str) – title of the heatmap.

  • correlation_target (str) – The target column name for computing correlations against.

  • msg (str) – An additional msg to include in the plot.

  • correlation_threshold (float) – A float between 0 and 1 which is used for excluding correlations which are not intense enough from the plot.

Returns:

tab – A matplotlib Panel object which includes a plotted heatmap.

Return type:

matplotlib Panel

plot_correlation_heatmap(ds, plot_type: str = 'heatmap', correlation_target: str = None, correlation_threshold=-1, correlation_methods: str = 'pearson', **kwargs)[source]#

Plots a correlation heatmap.

Parameters:
  • ds (Pandas Slice) – A data slice or file

  • plot_type (str Defaults to "heatmap") – The type of plot - “bar” is another option.

  • correlation_target (str, Defaults to None) – the target column for correlation calculations.

  • correlation_threshold (float, Defaults to -1) – the threshold for computing correlation heatmap elements.

  • correlation_methods (str, Defaults to "pearson") – the way to compute correlations, other options are “cramers v” and “correlation ratio”
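
Examples

A minimal sketch; ds is assumed to be a dataset prepared as described above.

>>> heat_map = BokehHeatMap(ds)
>>> heat_map.plot_correlation_heatmap(ds, plot_type="heatmap",
...         correlation_methods="pearson")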

plot_hbar(matrix, low: float = 1, high=1, title: str = None, tool_tips: list = None, column_name: str = None)[source]#

Plots a histogram bar-graph.

Parameters:
  • matrix (Pandas Dataframe) – The dataframe to be plotted.

  • low (float, Defaults to 1) – The color mapping value for “low” points.

  • high (float, Defaults to 1) – The color mapping value for “high” points.

  • title (str, Defaults to None) – The optional title of the heat map.

  • tool_tips (list of str, Defaults to None) – An optional list of tool tips to include with the plot.

  • column_name (str, Defaults to None) – The name of the column which is being plotted.

Returns:

fig – A matplotlib heatmap figure object.

Return type:

matplotlib Figure

plot_heat_map(matrix, xrange: list, yrange: list, low: float = 1, high=1, title: str = None, tool_tips: list = None)[source]#

Plots a matrix as a heatmap.

Parameters:
  • matrix (Pandas Dataframe) – The dataframe to be plotted.

  • xrange (List of floats) – The range of x values to plot.

  • yrange (List of floats) – The range of y values to plot.

  • low (float, Defaults to 1) – The color mapping value for “low” points.

  • high (float, Defaults to 1) – The color mapping value for “high” points.

  • title (str, Defaults to None) – The optional title of the heat map.

  • tool_tips (list of str, Defaults to None) – An optional list of tool tips to include with the plot.

Returns:

fig – A matplotlib heatmap figure object.

Return type:

matplotlib Figure

ads.dataset.correlation_plot.plot_correlation_heatmap(ds=None, **kwargs) None[source]#

Plots a correlation heatmap.

Parameters:

ds (Pandas Slice) – A data slice or file

ads.dataset.dask_series module#

ads.dataset.dataframe_transformer module#

class ads.dataset.dataframe_transformer.DataFrameTransformer(func_name, target_name, target_sample_val, args=None, kw_args=None)[source]#

Bases: TransformerMixin

A DataFrameTransformer object.

fit(df)[source]#

Takes in a DF and returns a fitted model

transform(df)[source]#

Takes in a DF and returns a transformed DF

ads.dataset.dataframe_transformer.expand_lambda_function(lambda_func)[source]#

Returns a lambda function after expansion.

ads.dataset.dataset module#

class ads.dataset.dataset.ADSDataset(df, sampled_df=None, shape=None, name='', description=None, type_discovery=True, types={}, metadata=None, progress=<ads.dataset.progress.DummyProgressBar object>, transformer_pipeline=None, interactive=False, **kwargs)[source]#

Bases: PandasDataset

An ADSDataset Object.

The ADSDataset object cannot be used for classification or regression problems until a target has been set using set_target. To see some rows in the data, use any of the usual Pandas functions like head(). There are also a variety of converters: to_dask, to_pandas, to_h2o, to_xgb, to_csv, to_parquet, to_json and to_hdf.

assign_column(column, arg)[source]#

Return new dataset with new column or values of the existing column mapped according to input correspondence.

Used for adding a new column or substituting each value in a column with another value, that may be derived from a function, a pandas.Series or a pandas.DataFrame.

Parameters:
  • column (str) – Name of the feature to update.

  • arg (function, dict, Series or DataFrame) – Mapping correspondence.

Returns:

dataset – a dataset with the specified column assigned.

Return type:

same type as the caller

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_same_size = ds.assign_column('target', lambda x: x > 15 if x is not None else None)
>>> ds_bigger = ds.assign_column('new_col', np.arange(ds.shape[0]))
astype(types)[source]#

Convert data type of features.

Parameters:

types (dict) – key is the existing feature name value is the data type to which the values of the feature should be converted. Valid data types: All numpy datatypes (Example: np.float64, np.int64, …) or one of categorical, continuous, ordinal or datetime.

Returns:

updated_dataset – an ADSDataset with new data types

Return type:

ADSDataset

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_reformatted = ds.astype({"target": "categorical"})
call(func, *args, sample_size=None, **kwargs)[source]#

Runs a custom function on dataframe

func will receive the pandas dataframe (which represents the dataset) as an argument named ‘df’ by default. This can be overridden by specifying the dataframe argument name in a tuple (func, dataframe_name).

Parameters:
  • func (Union[callable, tuple]) – Custom function that takes pandas dataframe as input Alternatively a (callable, data) tuple where data is a string indicating the keyword of callable that expects the dataframe name

  • args (iterable, optional) – Positional arguments passed into func

  • sample_size (int, Optional) – To use a sampled dataframe

  • kwargs (mapping, optional) – A dictionary of keyword arguments passed into func

Returns:

func – a plotting function that contains *args and **kwargs

Return type:

function

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("classification_data.csv"))
>>> def f1(df):
...     return df.sum(axis=0)
>>> sum_ds = ds.call(f1)
compute()[source]#
corr(correlation_methods: list | str = 'pearson', frac: float = 1.0, sample_size: float = 1.0, nan_threshold: float = 0.8, overwrite: bool | None = None, force_recompute: bool = False)[source]#

Compute pairwise correlation of numeric and categorical columns, output a matrix or a list of matrices computed using the correlation methods passed in.

Parameters:
  • correlation_methods (Union[list, str], default to 'pearson') –

    • ‘pearson’: Use Pearson’s Correlation between continuous features,

    • ’cramers v’: Use Cramer’s V correlations between categorical features,

    • ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,

    • ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].

    Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].

  • frac – Is deprecated and replaced by sample_size.

  • sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?

  • nan_threshold (float, default to 0.8, Range -> [0, 1]) – Only compute a correlation for a column when the proportion of missing (NaN) values in that column is less than or equal to nan_threshold.

  • overwrite – Is deprecated and replaced by force_recompute.

  • force_recompute (bool, default to be False) –

    • If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.

    • If True, it calculates the correlation matrix regardless whether there is cached result or not.

Returns:

correlation – The pairwise correlations as a matrix (DataFrame) or list of matrices

Return type:

Union[list, pandas.DataFrame]
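
Examples

A short usage sketch; the file name is illustrative.

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> corr_pearson = ds.corr()
>>> corr_all = ds.corr(correlation_methods='all', nan_threshold=0.5)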

property ddf#
df_read_functions = ['head', 'describe', '_get_numeric_data']#
drop_columns(columns)[source]#

Return new dataset with specified columns removed.

Parameters:

columns (str or list) – columns to drop.

Returns:

dataset – a dataset with specified columns dropped.

Return type:

same type as the caller

Raises:

ValidationError – If any of the feature names is not found in the dataset.

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_smaller = ds.drop_columns(['col1', 'col2'])
static from_dataframe(df, sampled_df=None, shape=None, name='', description=None, type_discovery=True, types={}, metadata=None, progress=<ads.dataset.progress.DummyProgressBar object>, transformer_pipeline=None, interactive=False, **kwargs) ADSDataset[source]#
get_recommendations(*args, **kwargs)[source]#

Returns a user-friendly error message asking the user to set the target variable before invoking this API.

Parameters:

kwargs

Returns:

raises NotImplementedError if a target value is not provided

Return type:

NotImplementedError

merge(data, **kwargs)[source]#

Merges this dataset with another ADSDataset or pandas dataframe.

Parameters:
  • data (Union[ADSDataset, pandas.DataFrame]) – Data to merge.

  • kwargs (dict, optional) – additional keyword arguments that would be passed to underlying dataframe’s merge API.

Examples

>>> import pandas as pd
>>> df1 = pd.read_csv("data1.csv")
>>> df2 = pd.read_csv("data2.csv")
>>> ds1 = ADSDataset.from_dataframe(df1)
>>> ds2 = ADSDataset.from_dataframe(df2)
>>> ds_12 = ds1.merge(ds2)
rename_columns(columns)[source]#

Returns a new dataset with altered column names.

dict values must be unique (1-to-1). Labels not contained in a dict will be left as-is. Extra labels listed don’t throw an error.

Parameters:

columns (dict-like or function or list of str) – dict to rename columns selectively, or list of names to rename all columns, or a function like str.upper

Returns:

dataset – A dataset with specified columns renamed.

Return type:

same type as the caller

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_renamed = ds.rename_columns({'col1': 'target'})
sample(frac=None, random_state=42)[source]#

Returns random sample of dataset.

Parameters:
  • frac (float, optional) – Fraction of axis items to return.

  • random_state (int or np.random.RandomState) – If int, we create a new RandomState with this as the seed. Otherwise, we draw from the passed RandomState.

Returns:

sampled_dataset – An ADSDataset which was randomly sampled.

Return type:

ADSDataset

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_sample = ds.sample()
set_description(description)[source]#

Sets description for the dataset.

Give your dataset a description.

Parameters:

description (str) – Description of the dataset.

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data1.csv"))
>>> ds_described = ds.set_description("dataset1 is from data1.csv")
set_name(name)[source]#

Sets name for the dataset.

This name will be used to filter the datasets returned by the ds.list() API. Calling this API is optional. By default, the name of the dataset is empty.

Parameters:

name (str) – Name of the dataset.

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data1.csv"))
>>> ds_renamed = ds.set_name("dataset1")
set_target(target, type_discovery=True, target_type=None)[source]#

Returns a dataset tagged based on the type of target.

Parameters:
  • target (str) – name of the feature to use as target.

  • type_discovery (bool) – This is set as True by default.

  • target_type (type) – If provided, then the target will be typed with the provided value.

Returns:

ds – tagged according to the type of the target column.

Return type:

ADSDataset

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("classification_data.csv"))
>>> ds_with_target = ds.set_target("target_class")
show_corr(frac: float = 1.0, sample_size: float = 1.0, nan_threshold: float = 0.8, overwrite: bool | None = None, force_recompute: bool = False, correlation_target: str | None = None, plot_type: str = 'heatmap', correlation_threshold: float = -1, correlation_methods='pearson', **kwargs)[source]#

Show a heatmap or barplot of the pairwise correlation of numeric and categorical columns. The output has three tabs: a heatmap or barplot of the correlation matrix of numeric columns vs numeric columns using the Pearson correlation method, categorical columns vs categorical columns using the Cramer's V method, and numeric vs categorical columns. NA/null values and columns in which more than 80% of the values are NA/null are excluded. By default, only the 'pearson' correlation is calculated and shown in the first tab. Set correlation_methods='all' to show all correlation charts.

Parameters:
  • frac (Is superseded by sample_size) –

  • sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?

  • nan_threshold (float, defaults to 0.8, Range -> [0, 1]) – In the default case, it will only calculate the correlation of the columns which have less than or equal to 80% missing values.

  • overwrite – Is deprecated and replaced by force_recompute.

  • force_recompute (bool, default to be False.) –

    • If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.

    • If True, it calculates the correlation matrix regardless whether there is cached result or not.

  • plot_type (str, default to "heatmap") – It can only be “heatmap” or “bar”. Note that if “bar” is chosen, correlation_target also has to be set and the bar chart will only show the correlation values of the pairs which have the target in them.

  • correlation_target (str, defaults to None) – It can be any column of type continuous, ordinal, categorical or zipcode. When correlation_target is set, only pairs that contain correlation_target are shown.

  • correlation_threshold (float, default to -1) – It can be any number between -1 and 1.

  • correlation_methods (Union[list, str], defaults to 'pearson') –

    • ‘pearson’: Use Pearson’s Correlation between continuous features,

    • ’cramers v’: Use Cramer’s V correlations between categorical features,

    • ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,

    • ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].

    Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].

Return type:

None
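
Examples

A usage sketch; the target column name is illustrative.

>>> ds.show_corr(correlation_methods='all')
>>> ds.show_corr(plot_type='bar', correlation_target='col_1')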

show_in_notebook(correlation_threshold=-1, selected_index=0, sample_size=0, visualize_features=True, correlation_methods='pearson', **kwargs)[source]#

Provide visualization of dataset.

  • Display feature distribution. The data table display will show a maximum of 8 digits,

  • Plot the correlation between the dataset features (as a heatmap) only when all the features are continuous or ordinal,

  • Display data head.

Parameters:
  • correlation_threshold (int, default -1) – The correlation threshold; only features with correlation values greater than or equal to the threshold are shown.

  • selected_index (int, str, default 0) – The displayed output is stacked into an accordion widget; use selected_index to force the display to open a specific element, using the (zero offset) index or any prefix string of the name (e.g., 'corr' for correlations).

  • sample_size (int, default 0) – The size (in rows) to sample for visualizations

  • visualize_features (bool, default True) – Controls whether feature visualizations are shown in the "Features" section. If not, only a summary of the numeric statistics is shown. The numeric statistics are always shown for wide (>64 features) datasets.

  • correlation_methods (Union[list, str], default to 'pearson') –

    • ‘pearson’: Use Pearson’s Correlation between continuous features,

    • ’cramers v’: Use Cramer’s V correlations between categorical features,

    • ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,

    • ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].

    Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].
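
Examples

A usage sketch based on the parameters documented above.

>>> ds.show_in_notebook(correlation_threshold=0.5, selected_index='corr',
...         correlation_methods='all')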

snapshot(snapshot_dir=None, name='', storage_options=None)[source]#

Snapshot the dataset with modifications made so far.

Optionally caller can invoke ds.set_name() before saving to identify the dataset uniquely at the time of using ds.list().

The snapshot can be reloaded by providing the URI returned by this API to DatasetFactory.open()

Parameters:
  • snapshot_dir (str, optional) – Directory path under which dataset snapshot will be created. Defaults to snapshots_dir set using DatasetFactory.set_default_storage().

  • name (str, optional, default: "") – Name to uniquely identify the snapshot using DatasetFactory.list_snapshots(). If not provided, an auto-generated name is used.

  • storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().

Returns:

p_str – the URI to access the snapshotted dataset.

Return type:

str

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_uri = ds.snapshot()
suggest_recommendations(*args, **kwargs)[source]#

Returns a user-friendly error message asking the user to set the target variable before invoking this API.

Parameters:

kwargs

Returns:

raises NotImplementedError if a target value is not provided

Return type:

NotImplementedError

to_avro(path, schema=None, storage_options=None, **kwargs)[source]#

Save data to Avro files. Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.

Parameters:
  • path (string) – Path to a target filename. May contain a * to denote many filenames.

  • schema (dict) – Avro schema dictionary, see below.

  • storage_options (dict, optional) – Parameters passed to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().

  • kwargs (dict, optional) – See https://fastavro.readthedocs.io/en/latest/writer.html

Notes

Avro schema is a complex dictionary describing the data, see https://avro.apache.org/docs/1.8.2/gettingstartedpython.html#Defining+a+schema and https://fastavro.readthedocs.io/en/latest/writer.html. Its structure is as follows:

{'name': 'Test',
'namespace': 'Test',
'doc': 'Descriptive text',
'type': 'record',
'fields': [
    {'name': 'a', 'type': 'int'},
]}

where the “name” field is required, but “namespace” and “doc” are optional descriptors; “type” must always be “record”. The list of fields should have an entry for every key of the input records, and the types are like the primitive, complex or logical types of the Avro spec (https://avro.apache.org/docs/1.8.2/spec.html).

Examples

>>> import pandas
>>> import fastavro
>>> with open("data.avro", "rb") as fp:
>>>     reader = fastavro.reader(fp)
>>>     records = [r for r in reader]
>>>     df = pandas.DataFrame.from_records(records)
>>> ds = ADSDataset.from_dataframe(df)
>>> ds.to_avro("my/path.avro")
to_csv(path, storage_options=None, **kwargs)[source]#

Save the materialized dataframe to csv file.

Parameters:
  • path (str) – Location to write to. If there is more than one partition in df, the path should include a glob character to expand into a set of file names, or provide a name_function= parameter. Supports protocol specifications such as "oci://", "s3://".

  • storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().

  • kwargs (dict, optional) –

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> [ds_link] = ds.to_csv("my/path.csv")
to_dask(filter=None, frac=None, npartitions=None, include_transformer_pipeline=False)[source]#

Returns a copy of the data as dask.dataframe.core.DataFrame, and a sklearn pipeline optionally that holds the transformations run so far on the data.

The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.

Parameters:
Returns:

  • dataframe (dask.dataframe.core.DataFrame) – if include_transformer_pipeline is False.

  • (data, transformer_pipeline) (tuple of dask.dataframe.core.DataFrame and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_dask = ds.to_dask()

Notes

See also http://docs.dask.org/en/latest/dataframe-api.html#dataframe and https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

to_dask_dataframe(filter=None, frac=None, npartitions=None, include_transformer_pipeline=False)[source]#
to_h2o(filter=None, frac=None, include_transformer_pipeline=False)[source]#

Returns a copy of the data as h2o.H2OFrame, and a sklearn pipeline optionally that holds the transformations run so far on the data.

The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.

Parameters:
Returns:

  • dataframe (h2o.H2OFrame) – if include_transformer_pipeline is False.

  • (data, transformer_pipeline) (tuple of h2o.H2OFrame and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_as_h2o = ds.to_h2o()

Notes

See also https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

to_h2o_dataframe(filter=None, frac=None, include_transformer_pipeline=False)[source]#
to_hdf(path: str, key: str, storage_options: dict | None = None, **kwargs) str[source]#

Save data to Hierarchical Data Format (HDF) files.

Parameters:
  • path (string) – Path to a target filename.

  • key (string) – Datapath within the files.

  • storage_options (dict, optional) – Parameters passed to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().

  • kwargs (dict, optional) –

Returns:

The filename of the HDF5 file created.

Return type:

str

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.to_hdf(path="my/path.h5", key="df")
to_json(path, storage_options=None, **kwargs)[source]#

Save data to JSON files.

Parameters:
  • path (str) – Location to write to. If there is more than one partition in df, the path should include a glob character to expand into a set of file names, or provide a name_function= parameter. Supports protocol specifications such as "oci://", "s3://".

  • storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().

  • kwargs (dict, optional) –

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.to_json("my/path.json")
to_pandas(filter=None, frac=None, include_transformer_pipeline=False)[source]#

Returns a copy of the data as pandas.DataFrame, and a sklearn pipeline optionally that holds the transformations run so far on the data.

The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.

Parameters:
Returns:

  • dataframe (pandas.DataFrame) – if include_transformer_pipeline is False.

  • (data, transformer_pipeline) (tuple of pandas.DataFrame and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_as_df = ds.to_pandas()

Notes

See also https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

to_pandas_dataframe(filter=None, frac=None, include_transformer_pipeline=False)[source]#
to_parquet(path, storage_options=None, **kwargs)[source]#

Save data to parquet file.

Parameters:
  • path (str) – Location to write to. If there is more than one partition in df, the path should include a glob character to expand into a set of file names, or provide a name_function= parameter. Supports protocol specifications such as "oci://", "s3://".

  • storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().

  • kwargs (dict, optional) –

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.to_parquet("my/path")
to_xgb(filter=None, frac=None, include_transformer_pipeline=False)[source]#

Returns a copy of the data as xgboost.DMatrix, and a sklearn pipeline optionally that holds the transformations run so far on the data.

The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.

Parameters:
Returns:

  • dataframe (xgboost.DMatrix) – if include_transformer_pipeline is False.

  • (data, transformer_pipeline) (tuple of xgboost.DMatrix and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.

Examples

>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> xgb_dmat = ds.to_xgb()

Notes

See also https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

to_xgb_dmatrix(filter=None, frac=None, include_transformer_pipeline=False)[source]#

ads.dataset.dataset_browser module#

class ads.dataset.dataset_browser.DatasetBrowser[source]#

Bases: ABC

static GitHub(user: str, repo: str, branch: str = 'master')[source]#

Returns a GitHubDataset

static filesystem(folder: str)[source]#

Returns a LocalFilesystemDataset.

filter_list(L, filter_pattern) List[str][source]#

Filters a list of dataset names.

static list(filter_pattern='*') List[str][source]#

Return a list of dataset browser strings.

abstract open(**kwargs)[source]#

Return new dataset for the given name.

Parameters:

name (str) – the name of the dataset to open.

Returns:

ds

Return type:

Dataset

Examples

>>> ds_browser = DatasetBrowser.sklearn()
>>> ds = ds_browser.open("iris")

static seaborn()[source]#

Returns a SeabornDataset.

static sklearn()[source]#

Returns a SklearnDataset.

static web(index_url: str)[source]#

Returns a WebDataset.
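
Examples

A sketch of the factory methods; the local folder path is illustrative.

>>> sklearn_browser = DatasetBrowser.sklearn()
>>> sklearn_browser.list()
>>> ds = sklearn_browser.open("iris")
>>> local_browser = DatasetBrowser.filesystem("/path/to/data/folder")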

class ads.dataset.dataset_browser.GitHubDatasets(user: str, repo: str, branch: str)[source]#

Bases: DatasetBrowser

list(filter_pattern: str = '.*') List[str][source]#

Return a list of dataset browser strings.

open(name: str, **kwargs)[source]#

Return new dataset for the given name.

Parameters:

name (str) – the name of the dataset to open.

Returns:

ds

Return type:

Dataset

Examples

>>> ds_browser = DatasetBrowser.sklearn()
>>> ds = ds_browser.open("iris")

class ads.dataset.dataset_browser.LocalFilesystemDatasets(folder: str)[source]#

Bases: DatasetBrowser

list(filter_pattern: str = '.*') List[str][source]#

Return a list of dataset browser strings.

open(name: str, **kwargs)[source]#

Return new dataset for the given name.

Parameters:

name (str) – the name of the dataset to open.

Returns:

ds

Return type:

Dataset

Examples

>>> ds_browser = DatasetBrowser.sklearn()
>>> ds = ds_browser.open("iris")

class ads.dataset.dataset_browser.SeabornDatasets[source]#

Bases: DatasetBrowser

list(filter_pattern: str = '.*') List[str][source]#

Return a list of dataset browser strings.

open(name: str, **kwargs)[source]#

Return new dataset for the given name.

Parameters:

name (str) – the name of the dataset to open.

Returns:

ds

Return type:

Dataset

Examples

>>> ds_browser = DatasetBrowser.sklearn()
>>> ds = ds_browser.open("iris")

class ads.dataset.dataset_browser.SklearnDatasets[source]#

Bases: DatasetBrowser

list(filter_pattern: str = '.*') List[str][source]#

Return a list of dataset browser strings.

open(name: str, **kwargs)[source]#

Return new dataset for the given name.

Parameters:

name (str) – the name of the dataset to open.

Returns:

ds

Return type:

Dataset

Examples

>>> ds_browser = DatasetBrowser.sklearn()
>>> ds = ds_browser.open("iris")

sklearn_datasets = ['breast_cancer', 'diabetes', 'iris', 'wine', 'digits']#
class ads.dataset.dataset_browser.WebDatasets(index_url: str)[source]#

Bases: DatasetBrowser

list(filter_pattern: str = '.*') List[str][source]#

Return a list of dataset browser strings.

open(name: str, **kwargs)[source]#

Return new dataset for the given name.

Parameters:

name (str) – the name of the dataset to open.

Returns:

ds

Return type:

Dataset

Examples

>>> ds_browser = DatasetBrowser.sklearn()
>>> ds = ds_browser.open("iris")

ads.dataset.dataset_with_target module#

class ads.dataset.dataset_with_target.ADSDatasetWithTarget(df, target, sampled_df=None, shape=None, target_type=None, sample_max_rows=-1, type_discovery=True, types={}, parent=None, name='', metadata=None, transformer_pipeline=None, description=None, progress=<ads.dataset.progress.DummyProgressBar object>, **kwargs)[source]#

Bases: ADSDataset

This class provides APIs for preparing a dataset for modeling.

auto_transform(correlation_threshold: float = 0.7, frac: float = 1.0, sample_size=1.0, correlation_methods: str | list = 'pearson')[source]#

Return transformed dataset with several optimizations applied automatically. The optimizations include:

  • Dropping constant and primary key columns, which have no predictive quality,

  • Imputation, to fill in missing values in noisy data:

    • For continuous variables, fill with mean if less than 40% is missing, else drop,

    • For categorical variables, fill with most frequent if less than 40% is missing, else drop,

  • Dropping strongly co-correlated columns that tend to produce less generalizable models.

Parameters:
  • correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive) – the correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.

  • frac (Is superseded by sample_size) –

  • sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?

  • correlation_methods (Union[list, str], defaults to 'pearson') –

    • ‘pearson’: Use Pearson’s Correlation between continuous features,

    • ’cramers v’: Use Cramer’s V correlations between categorical features,

    • ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,

    • ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].

    Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].

Returns:

transformed_dataset

Return type:

ADSDatasetWithTarget

Examples

>>> ds_clean = ds.auto_transform()
static from_dataframe(df: DataFrame, target: str, sampled_df: DataFrame | None = None, shape: Tuple[int, int] | None = None, target_type: TypedFeature | None = None, positive_class=None, **init_kwargs)[source]#
get_recommendations(correlation_methods: str = 'pearson', correlation_threshold: float = 0.7, frac: float = 1.0, sample_size: float = 1.0, overwrite: bool = None, force_recompute: bool = False, display_format: str = 'widget')[source]#

Generate recommendations for dataset optimization. This includes:

  • Identifying constant and primary key columns, which have no predictive quality,

  • Imputation, to fill in missing values in noisy data:

    • For continuous variables, fill with mean if less than 40% is missing, else drop,

    • For categorical variables, fill with most frequent if less than 40% is missing, else drop,

  • Identifying strongly co-correlated columns that tend to produce less generalizable models,

  • Automatically balancing dataset for classification problems using up or down sampling.

Parameters:
  • correlation_methods (Union[list, str], default to 'pearson') –

    • ‘pearson’: Use Pearson’s Correlation between continuous features,

    • ’cramers v’: Use Cramer’s V correlations between categorical features,

    • ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,

    • ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].

    Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].

  • correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive) – The correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.

  • frac (Is superseded by sample_size) –

  • sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?

  • overwrite – Is deprecated and replaced by force_recompute.

  • force_recompute (bool, default to be False) –

    • If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.

    • If True, it calculates the correlation matrix regardless whether there is cached result or not.

  • display_format (string, defaults to 'widget'.) – Should be either ‘widget’ or ‘table’. If ‘widget’, a GUI style interface is popped out; if ‘table’, a table of suggestions is shown.
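
Examples

A minimal sketch; the file and target names are illustrative.

>>> ds = DatasetFactory.open("data.csv", target="target")
>>> ds.get_recommendations(display_format='table')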

get_transformed_dataset()[source]#

Return the transformed dataset with the recommendations applied.

This method should be called after applying the recommendations using the Recommendation#show_in_notebook() API.
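
A short workflow sketch, assuming the recommendations have already been applied through the widget produced by get_recommendations().

>>> ds.get_recommendations()
>>> ds_transformed = ds.get_transformed_dataset()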

rename_columns(columns)[source]#

Returns a dataset with columns renamed.

select_best_features(score_func=None, k=12)[source]#

Return new dataset containing only the top k features.

Parameters:
  • k (int, default 12) – The top ‘k’ features to select.

  • score_func (function) – Scoring function to use to rank the features. This scoring function should take a 2d array X(features) and an array like y(target) and return a numeric score for each feature in the same order as X.

Notes

See also https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html and https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html

Examples

>>> ds = DatasetBrowser("sklearn").open("iris")
>>> ds_small = ds.select_best_features(k=2)
suggest_recommendations(correlation_methods: str | list = 'pearson', print_code: bool = True, correlation_threshold: float = 0.7, overwrite: bool | None = None, force_recompute: bool = False, frac: float = 1.0, sample_size: float = 1.0, **kwargs)[source]#

Returns a pandas dataframe with suggestions for dataset optimization. This includes:

  • Identifying constant and primary key columns, which have no predictive quality,

  • Imputation, to fill in missing values in noisy data:

    • For continuous variables, fill with mean if less than 40% is missing, else drop,

    • For categorical variables, fill with most frequent if less than 40% is missing, else drop,

  • Identifying strongly co-correlated columns that tend to produce less generalizable models,

  • Automatically balancing dataset for classification problems using up or down sampling.

Parameters:
  • correlation_methods (Union[list, str], default to 'pearson') –

    • ‘pearson’: Use Pearson’s Correlation between continuous features,

    • ’cramers v’: Use Cramer’s V correlations between categorical features,

    • ’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,

    • ’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].

    Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’]

  • print_code (bool, Defaults to True) – Print Python code for the suggested actions.

  • correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive) – the correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.

  • frac (Is superseded by sample_size) –

  • sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?

  • overwrite – Is deprecated and replaced by force_recompute.

  • force_recompute (bool, default to be False) –

    • If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.

    • If True, it calculates the correlation matrix regardless whether there is cached result or not.

Returns:

suggestion dataframe

Return type:

pandas.DataFrame

Examples

>>> suggestion_df = ds.suggest_recommendations(correlation_threshold=0.7)
train_test_split(test_size=0.1, random_state=42)[source]#

Splits dataset to train and test data.

Parameters:
  • test_size (Union[float, int], optional, default=0.1) –

  • random_state (Union[int, RandomState], optional, default=42) –

    • If int, random_state is the seed used by the random number generator;

    • If RandomState instance, random_state is the random number generator;

    • If None, the random number generator is the RandomState instance used by np.random.

Returns:

train_data, test_data – tuple of ADSData instances

Return type:

tuple

Examples

>>> ds = DatasetFactory.open("data.csv")
>>> train, test = ds.train_test_split()
train_validation_test_split(test_size=0.1, validation_size=0.1, random_state=42)[source]#

Splits dataset to train, validation and test data.

Parameters:
  • test_size (Union[float, int], optional, default=0.1) –

  • validation_size (Union[float, int], optional, default=0.1) –

  • random_state (Union[int, RandomState], optional, default=42) –

    • If int, random_state is the seed used by the random number generator;

    • If RandomState instance, random_state is the random number generator;

    • If None, the random number generator is the RandomState instance used by np.random.

Returns:

train_data, validation_data, test_data – tuple of ADSData instances

Return type:

tuple

Examples

>>> ds = DatasetFactory.open("data.csv")
>>> train, valid, test = ds.train_validation_test_split()
type_of_target()[source]#

Return the target type for the dataset.

Returns:

target_type – an object of TypedFeature

Return type:

TypedFeature

Examples

>>> ds = ds.set_target('target_class')
>>> assert(ds.type_of_target() == 'categorical')
visualize_transforms()[source]#

Render a representation of the dataset’s transform DAG.
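
A usage sketch; the file and target names are illustrative.

>>> ds = DatasetFactory.open("data.csv", target="class")
>>> ds_transformed = ds.auto_transform()
>>> ds_transformed.visualize_transforms()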

ads.dataset.exception module#

exception ads.dataset.exception.DatasetError(*args, **kwargs)[source]#

Bases: BaseException

Base class for dataset errors.

exception ads.dataset.exception.ValidationError(msg)[source]#

Bases: DatasetError

Handles validation errors in dataset.

ads.dataset.factory module#

class ads.dataset.factory.CustomFormatReaders[source]#

Bases: object

DEFAULT_SQL_ARRAYSIZE = 50000#
DEFAULT_SQL_CHUNKSIZE = 12007#
DEFAULT_SQL_CTU = False#
DEFAULT_SQL_MIL = 128#
static read_arff(path, **kwargs)[source]#
static read_avro(path: str, **kwargs) DataFrame[source]#
static read_html(path, html_table_index: int | None = None, **kwargs)[source]#
static read_json(path: str, **kwargs) DataFrame[source]#
static read_libsvm(path: str, **kwargs) DataFrame[source]#
static read_log(path, **kwargs)[source]#
classmethod read_sql(path: str, table: str | None = None, **kwargs) DataFrame[source]#
Parameters:
  • path – str This is the connection URL that gets passed to sqlalchemy’s create_engine method

  • table – str This is either the name of a table to select * from or a sql query to be run

  • kwargs

Returns:

pd.DataFrame
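
A minimal sketch; the connection URL and query are illustrative.

>>> df = CustomFormatReaders.read_sql("sqlite:///example.db",
...         table="SELECT * FROM my_table")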

static read_tsv(path: str, **kwargs) DataFrame[source]#
static read_xml(path: str, **kwargs) DataFrame[source]#

Load data from xml file.

Parameters:
  • path (str) – Path to XML file

  • storage_options (dict, optional) – Storage options passed to Pandas to read the file.

Returns:

dataframe

Return type:

pandas.DataFrame

class ads.dataset.factory.DatasetFactory[source]#

Bases: object

static download(remote_path, local_path, storage=None, overwrite=False)[source]#

Download a remote file or directory to local storage.

Parameters:
  • remote_path (str) – Supports protocols like oci, s3, also supports glob expressions

  • local_path (str) – Supports glob expressions

  • storage (dict) – Parameters passed on to the backend remote filesystem class.

  • overwrite (bool, default False) – If True, the method will overwrite any existing files in the local_path

Examples

>>> DatasetFactory.download("oci://Bucket/prefix/to/data/*.csv",
...         "/home/datascience/data/")
static from_dataframe(df, target: str | None = None, **kwargs)[source]#

Returns an object of ADSDatasetWithTarget or ADSDataset given a pandas.DataFrame

Parameters:
  • df (pandas.DataFrame) –

  • target (str) –

  • kwargs (dict) – See DatasetFactory.open() for supported kwargs

Returns:

dataset – according to the type of target

Return type:

an object of ADSDataset if target is not specified, otherwise an object of ADSDatasetWithTarget tagged according to the type of target

Examples

>>> df = pd.DataFrame(data)
>>> ds = DatasetFactory.from_dataframe(df)
classmethod infer_target_type(target, target_series, discover_target_type=True)[source]#
static list_snapshots(snapshot_dir=None, name='', storage_options=None, **kwargs)[source]#

Displays the URIs for dataset snapshots under the given directory path.

Parameters:
  • snapshot_dir (str) – Return all dataset snapshots created using ADSDataset.snapshot() within this directory. The path can contain protocols such as oci, s3.

  • name (str, optional) – The list of snapshots in the directory gets filtered by the name. Accepts glob expressions. default = “ads_”

  • storage_options (dict) – Parameters passed on to the backend filesystem class.

Example

>>> DatasetFactory.list_snapshots(snapshot_dir="oci://my_bucket/snapshots_dir",
...             name="ads_iris_")

Returns a list of all snapshots (recursively) saved to obj storage bucket “my_bucket” with prefix “/snapshots_dir/ads_iris_**” sorted by time created.

static open(source, target=None, format='infer', reader_fn: Callable = None, name: str = None, description='', npartitions: int = None, type_discovery=True, html_table_index=None, column_names='infer', sample_max_rows=10000, positive_class=None, transformer_pipeline=None, types={}, **kwargs)[source]#

Returns an object of ADSDataset or ADSDatasetWithTarget read from the given path

Deprecated since version 2.6.6: Deprecated in favor of using Pandas. Pandas supports reading from object storage directly. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/loading_data/connect.html

Parameters:
  • source (Union[str, pandas.DataFrame, h2o.DataFrame, pyspark.sql.dataframe.DataFrame]) – If str, URI for the dataset. The dataset could be read from local or network file system, hdfs, s3, gcs and optionally pyspark in pyspark conda env

  • target (str, optional) – Name of the target in dataset. If set an ADSDatasetWithTarget object is returned, otherwise an ADSDataset object is returned which can be used to understand the dataset through visualizations

  • format (str, default: infer) – Format of the dataset. Supported formats: CSV, TSV, Parquet, libsvm, JSON, XLS/XLSX (Excel), HDF5, SQL, XML, Apache server log files (clf, log), ARFF. By default, the format would be inferred from the ending of the dataset file path.

  • reader_fn (Callable, default: None) – The user may pass in their own custom reader function. It must accept (path, **kwarg) and return a pandas DataFrame

  • name (str, optional default: "") –

  • description (str, optional default: "") – Text describing the dataset

  • npartitions (int, deprecated) – Number of partitions to split the data By default this is set to the max number of cores supported by the backend compute accelerator

  • type_discovery (bool, default: True) – If false, the data types of the dataframe are used as such. By default, the dataframe columns are associated with the best suited data types. Associating the features with the discovered datatypes would impact visualizations and model prediction.

  • html_table_index (int, optional) – The index of the dataframe table in html content. This is used when the format of dataset is html

  • column_names ('infer', list of str or None, default: 'infer') – Supported only for CSV and TSV. List of column names to use. By default, column names are inferred from the first line of the file. If set to None, column names would be auto-generated instead of inferring from file. If the file already contains a column header, specify header=0 to ignore the existing column names.

  • sample_max_rows (int, default: 10000; use -1 to auto-calculate the sample size, use 0 (zero) for no sampling) – Sample size of the dataframe to use for visualization and optimization.

  • positive_class (Any, optional) – Label in target for binary classification problems which should be identified as positive for modeling. By default, the first unique value is considered as the positive label.

  • types (dict, optional) – Dictionary of <feature_name> : <data_type> to override the data type of features.

  • transformer_pipeline (datasets.pipeline.TransformerPipeline, optional) – A pipeline of transformations done outside the SDK that need to be applied at the time of scoring.

  • storage_options (dict, default: varies by source type) – Parameters passed on to the backend filesystem class.

  • sep (str) – Delimiting character for parsing the input file.

  • kwargs (additional keyword arguments that would be passed to underlying dataframe read API) – based on the format of the dataset

Returns:

  • dataset (An instance of ADSDataset)

  • (or)

  • dataset_with_target (An instance of ADSDatasetWithTarget)

Examples

>>> ds = DatasetFactory.open("/path/to/data.data", format='csv', delimiter=" ",
...          na_values="n/a", skipinitialspace=True)
>>> ds = DatasetFactory.open("/path/to/data.csv", target="col_1", prefix="col_",
...           skiprows=1, encoding="ISO-8859-1")
>>> ds = DatasetFactory.open("oci://bucket@namespace/path/to/data.tsv",
...         column_names=["col1", "col2", "col3"], header=0)
>>> ds = DatasetFactory.open("oci://bucket@namespace/path/to/data.csv",
...         storage_options={"config": "~/.oci/config",
...         "profile": "USER_2"}, delimiter = ';')
>>> ds = DatasetFactory.open("/path/to/data.parquet", engine='pyarrow',
...         types={"col1": "ordinal",
...                "col2": "categorical",
...                "col3" : "continuous",
...                "col4" : "float64"})
>>> ds = DatasetFactory.open(df, target="class", sample_max_rows=5000,
...          positive_class="yes")
>>> ds = DatasetFactory.open("s3://path/to/data.json.gz", format="json",
...         compression="gzip", orient="records")
static open_to_pandas(source: str, format: str | None = None, reader_fn: Callable | None = None, **kwargs) DataFrame[source]#
static set_default_storage(snapshots_dir=None, storage_options=None)[source]#

Set default storage directory and options.

Both snapshots_dir and storage_options can be overridden at the API scope.

Parameters:
  • snapshots_dir (str) – Path for the snapshots directory. Can contain protocols such as oci, s3

  • storage_options (dict, optional) – Parameters passed on to the backend filesystem class.
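
Examples

A sketch; the bucket path and config location are illustrative.

>>> DatasetFactory.set_default_storage(
...         snapshots_dir="oci://bucket@namespace/snapshots/",
...         storage_options={"config": "~/.oci/config"})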

static upload(local_file_or_dir, remote_file_or_dir, storage_options=None)[source]#

Upload local file or directory to remote storage

Parameters:
  • local_file_or_dir (str) – Supports glob expressions

  • remote_file_or_dir (str) – Supports protocols like oci, s3, also supports glob expressions

  • storage_options (dict) – Parameters passed on to the backend remote filesystem class.
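
Examples

A sketch; both paths are illustrative.

>>> DatasetFactory.upload("/home/datascience/data/*.csv",
...         "oci://Bucket/prefix/to/data/")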

ads.dataset.factory.get_format_reader(path: ElaboratedPath, **kwargs) Callable[source]#
ads.dataset.factory.load_dataset(path: ElaboratedPath, reader_fn: Callable, **kwargs) DataFrame[source]#

ads.dataset.feature_engineering_transformer module#

class ads.dataset.feature_engineering_transformer.FeatureEngineeringTransformer(feature_metadata=None)[source]#

Bases: TransformerMixin

fit(X, y=None)[source]#
fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

transform(df, progress=<ads.dataset.progress.DummyProgressBar object>, fit_transform=False)[source]#

ads.dataset.feature_selection module#

class ads.dataset.feature_selection.FeatureImportance(ds, score_func=None, n=None)[source]#

Bases: object

show_in_notebook(fig_size=(10, 10))[source]#

Shows selected features in the notebook with matplotlib.
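
Examples

A minimal usage sketch, assuming ds_with_target is a dataset on which a target has already been set:

>>> fi = FeatureImportance(ds_with_target)
>>> fi.show_in_notebook(fig_size=(10, 6))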

ads.dataset.forecasting_dataset module#

class ads.dataset.forecasting_dataset.ForecastingDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]#

Bases: ADSDatasetWithTarget

select_best_features(score_func=None, k=12)[source]#

Not yet implemented

ads.dataset.helper module#

class ads.dataset.helper.CustomFormatReaders[source]#

Bases: object

DEFAULT_SQL_ARRAYSIZE = 50000#
DEFAULT_SQL_CHUNKSIZE = 12007#
DEFAULT_SQL_CTU = False#
DEFAULT_SQL_MIL = 128#
static read_arff(path, **kwargs)[source]#
static read_avro(path: str, **kwargs) DataFrame[source]#
static read_html(path, html_table_index: int | None = None, **kwargs)[source]#
static read_json(path: str, **kwargs) DataFrame[source]#
static read_libsvm(path: str, **kwargs) DataFrame[source]#
static read_log(path, **kwargs)[source]#
classmethod read_sql(path: str, table: str | None = None, **kwargs) DataFrame[source]#
Parameters:
  • path (str) – The connection URL that gets passed to sqlalchemy’s create_engine method

  • table (str, optional) – Either the name of a table to select * from, or a sql query to be run

  • kwargs

Returns:

pd.DataFrame
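
Examples

A minimal usage sketch; the sqlite connection URL and table name below are placeholders:

>>> df = CustomFormatReaders.read_sql("sqlite:////tmp/example.db", table="my_table")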

static read_tsv(path: str, **kwargs) DataFrame[source]#
static read_xml(path: str, **kwargs) DataFrame[source]#

Load data from xml file.

Parameters:
  • path (str) – Path to XML file

  • storage_options (dict, optional) – Storage options passed to Pandas to read the file.

Returns:

dataframe

Return type:

pandas.DataFrame

class ads.dataset.helper.DatasetDefaults[source]#

Bases: object

sampling_confidence_interval = 1.0#
sampling_confidence_level = 95#
exception ads.dataset.helper.DatasetLoadException(exc_msg)[source]#

Bases: BaseException

class ads.dataset.helper.ElaboratedPath(source: str | List[str], format: str | None = None, name: str | None = None, **kwargs)[source]#

Bases: object

The ElaboratedPath class unifies all of the operations and information related to a path or path list. An elaborated path can accept any of the following as a valid source:

  • A single path

  • A glob pattern path

  • A directory

  • A list of paths (Note: all of these paths must be from the same filesystem AND have the same format)

  • A sqlalchemy connection url

Parameters:
  • source

  • format

  • kwargs

By the end of initialization, this class has its paths, format, and name resolved.

property format: str#
property name: str#
property num_paths: int#

Returns the number of paths resolved from the original glob, folder, or path; a value of 0 means no paths were found.

property paths: List[str]#

Returns:

a list of str – each element will be a valid path
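
Examples

A minimal construction sketch; the glob pattern below is a placeholder:

>>> ep = ElaboratedPath("/path/to/data_*.csv", format="csv")
>>> ep.num_paths   # number of files matched by the glob
>>> ep.paths       # the resolved list of file paths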

ads.dataset.helper.build_dataset(df: DataFrame, shape: Tuple[int, int], target: str | None = None, progress=None, **kwargs)[source]#
ads.dataset.helper.calculate_sample_size(population_size, min_size_to_sample, confidence_level=95, confidence_interval=1.0)[source]#
Find sample size for a population using Cochran’s Sample Size Formula, with default values for confidence_level (percentage, default: 95%) and confidence_interval (margin of error, percentage, default: 1%).

Supported confidence levels: 50%, 68%, 90%, 95%, and 99% only, because the Z-score is looked up from a table of common confidence levels.
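
Examples

A worked sketch of Cochran’s formula with the finite-population correction; p = 0.5 and the numbers below are illustrative assumptions, not this helper’s exact implementation:

>>> z, e, N = 1.96, 0.01, 1_000_000   # 95% confidence, 1% margin of error, population size
>>> n0 = (z ** 2) * 0.25 / (e ** 2)   # infinite-population sample size with p = 0.5
>>> n = n0 / (1 + (n0 - 1) / N)       # finite-population correction
>>> round(n)
9513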

ads.dataset.helper.concatenate(X, y)[source]#
ads.dataset.helper.convert_columns(df, feature_metadata=None, dtypes=None)[source]#
ads.dataset.helper.convert_to_html(plot)[source]#
ads.dataset.helper.deprecate_default_value(var, old_value, new_value, warning_msg, warning_type)[source]#
ads.dataset.helper.deprecate_variable(old_var, new_var, warning_msg, warning_type)[source]#
ads.dataset.helper.down_sample(df, target)[source]#

Fixes imbalanced dataset by down-sampling

Parameters:
  • df (pandas.DataFrame) –

  • target (name of the target column in df) –

Returns:

downsampled_df

Return type:

pandas.DataFrame
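
Examples

The idea behind down-sampling, sketched with plain pandas; the helper’s exact sampling strategy may differ:

>>> import pandas as pd
>>> df = pd.DataFrame({"feat": range(6), "label": ["a", "a", "a", "a", "b", "b"]})
>>> n_min = df["label"].value_counts().min()   # size of the minority class
>>> downsampled_df = df.groupby("label").sample(n=n_min, random_state=42)
>>> downsampled_df["label"].value_counts().tolist()
[2, 2]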

ads.dataset.helper.fix_column_names(X)[source]#
ads.dataset.helper.generate_sample(df: DataFrame, n: int, confidence_level: int = 95, confidence_interval: float = 1.0, **kwargs)[source]#
ads.dataset.helper.get_dataset(df: DataFrame, sampled_df: DataFrame, target: str, target_type: TypedFeature, shape: Tuple[int, int], positive_class=None, **init_kwargs)[source]#
ads.dataset.helper.get_dtype(feature_type, dtype)[source]#
ads.dataset.helper.get_feature_type(name, series)[source]#
ads.dataset.helper.get_fill_val(feature_types, column, action, constant='constant')[source]#
ads.dataset.helper.get_format_reader(path: ElaboratedPath, **kwargs) Callable[source]#
ads.dataset.helper.get_target_type(target, sampled_df, **init_kwargs)[source]#
ads.dataset.helper.infer_target_type(target, target_series, discover_target_type=True)[source]#
ads.dataset.helper.is_text_data(df, target=None)[source]#
ads.dataset.helper.load_dataset(path: ElaboratedPath, reader_fn: Callable, **kwargs) DataFrame[source]#
ads.dataset.helper.map_types(types)[source]#
ads.dataset.helper.open(source, target=None, format='infer', reader_fn: Callable | None = None, name: str | None = None, description='', npartitions: int | None = None, type_discovery=True, html_table_index=None, column_names='infer', sample_max_rows=10000, positive_class=None, transformer_pipeline=None, types={}, **kwargs)[source]#

Returns an object of ADSDataset or ADSDatasetWithTarget read from the given path

Deprecated since version 2.6.6: Deprecated in favor of using Pandas. Pandas supports reading from object storage directly. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/loading_data/connect.html

Parameters:
  • source (Union[str, pandas.DataFrame, h2o.DataFrame, pyspark.sql.dataframe.DataFrame]) – If str, URI for the dataset. The dataset could be read from local or network file system, hdfs, s3, gcs and optionally pyspark in pyspark conda env

  • target (str, optional) – Name of the target in dataset. If set an ADSDatasetWithTarget object is returned, otherwise an ADSDataset object is returned which can be used to understand the dataset through visualizations

  • format (str, default: infer) – Format of the dataset. Supported formats: CSV, TSV, Parquet, libsvm, JSON, XLS/XLSX (Excel), HDF5, SQL, XML, Apache server log files (clf, log), ARFF. By default, the format is inferred from the file extension of the dataset path.

  • reader_fn (Callable, default: None) – The user may pass in their own custom reader function. It must accept (path, **kwarg) and return a pandas DataFrame

  • name (str, optional default: "") –

  • description (str, optional default: "") – Text describing the dataset

  • npartitions (int, deprecated) – Number of partitions to split the data into. By default, this is set to the max number of cores supported by the backend compute accelerator.

  • type_discovery (bool, default: True) – If false, the data types of the dataframe are used as-is. By default, the dataframe columns are associated with the best-suited data types. Associating the features with the discovered data types would impact visualizations and model prediction.

  • html_table_index (int, optional) – The index of the dataframe table in html content. This is used when the format of dataset is html

  • column_names ('infer', list of str or None, default: 'infer') – Supported only for CSV and TSV. List of column names to use. By default, column names are inferred from the first line of the file. If set to None, column names would be auto-generated instead of inferring from file. If the file already contains a column header, specify header=0 to ignore the existing column names.

  • sample_max_rows (int, default: 10000; use -1 to auto-calculate the sample size, or 0 for no sampling) – Sample size of the dataframe to use for visualization and optimization.

  • positive_class (Any, optional) – Label in target for binary classification problems which should be identified as positive for modeling. By default, the first unique value is considered as the positive label.

  • types (dict, optional) – Dictionary of <feature_name> : <data_type> to override the data type of features.

  • transformer_pipeline (datasets.pipeline.TransformerPipeline, optional) – A pipeline of transformations performed outside the SDK that needs to be applied at the time of scoring.

  • storage_options (dict, default: varies by source type) – Parameters passed on to the backend filesystem class.

  • sep (str) – Delimiting character for parsing the input file.

  • kwargs – Additional keyword arguments passed to the underlying dataframe read API, based on the format of the dataset

Returns:

  • dataset (An instance of ADSDataset)

  • (or)

  • dataset_with_target (An instance of ADSDatasetWithTarget)
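
Examples

A sketch of supplying a custom reader_fn; my_reader and the pipe-delimited file below are hypothetical:

>>> import pandas as pd
>>> def my_reader(path, **kwargs):
...     return pd.read_csv(path, sep="|", **kwargs)   # any callable (path, **kwargs) -> pandas.DataFrame
>>> ds = open("/path/to/data.csv", reader_fn=my_reader, target="label")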

ads.dataset.helper.parse_apache_log_datetime(x)[source]#
Parses datetime with timezone formatted as:

[day/month/year:hour:minute:second zone]

Source: https://mmas.github.io/read-apache-access-log-pandas

Due to problems parsing the timezone (%z) with datetime.strptime, the timezone will be obtained using the pytz library.

Examples

>>> parse_datetime('13/Nov/2015:11:45:42 +0000')
datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)

ads.dataset.helper.parse_apache_log_str(x)[source]#

Returns the string delimited by two characters.

Source: https://mmas.github.io/read-apache-access-log-pandas

Examples

>>> parse_str('[my string]')
'my string'

ads.dataset.helper.rename_duplicate_cols(original_cols)[source]#
ads.dataset.helper.up_sample(df, target, sampler='default', feature_types=None)[source]#

Fixes imbalanced dataset by up-sampling

Parameters:
  • df (Union[pandas.DataFrame, dask.dataframe.core.DataFrame]) –

  • target (name of the target column in df) –

  • sampler (Should implement fit_resample(X,y) method) –

  • fillna (dict) – A dictionary that maps column names to fill values; only needed when the column has missing values

Returns:

upsampled_df

Return type:

Union[pandas.DataFrame, dask.dataframe.core.DataFrame]
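
Examples

Any sampler exposing fit_resample(X, y) can be supplied; RandomOverSampler from imbalanced-learn is used below as an illustrative assumption, not a requirement:

>>> from imblearn.over_sampling import RandomOverSampler
>>> upsampled_df = up_sample(df, target="label", sampler=RandomOverSampler(random_state=42))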

ads.dataset.helper.validate_kwargs(func: Callable, kwargs)[source]#
ads.dataset.helper.visualize_transformation(transformer_pipeline, text=None)[source]#
ads.dataset.helper.write_parquet(path, data, engine='fastparquet', metadata_dict=None, compression=None, storage_options=None)[source]#

Uses fastparquet to write the dataframe and custom metadata in parquet format

Parameters:
  • path (str) – Path to write to

  • data (pandas.DataFrame) –

  • engine (string) – Parquet engine to use; “fastparquet” by default

  • metadata_dict (Deprecated, will not pass through) –

  • compression ({'snappy', 'gzip', 'brotli', None}, default 'snappy') – Name of the compression to use

  • storage_options (dict, optional) – storage arguments required to read the path

Returns:

The file path the parquet was written to

Return type:

str
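
Examples

A minimal usage sketch, assuming df is a pandas.DataFrame already in memory and /tmp is writable:

>>> out_path = write_parquet("/tmp/data.parquet", data=df, compression="snappy")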

ads.dataset.label_encoder module#

class ads.dataset.label_encoder.DataFrameLabelEncoder[source]#

Bases: TransformerMixin

Label encoder for pandas.DataFrame and dask.dataframe.core.DataFrame.

label_encoders#

Holds the label encoder for each column.

Type:

defaultdict

Examples

>>> import pandas as pd
>>> from ads.dataset.label_encoder import DataFrameLabelEncoder
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> le = DataFrameLabelEncoder()
>>> le.fit_transform(X=df)

Initialize an instance of DataFrameLabelEncoder.

fit(X: pandas.DataFrame)[source]#

Fits a DataFrameLabelEncoder.

Parameters:

X (pandas.DataFrame) – Target values.

Returns:

self – Fitted label encoder.

Return type:

DataFrameLabelEncoder

transform(X: pandas.DataFrame)[source]#

Transforms a dataset using the DataFrameLabelEncoder.

Parameters:

X (pandas.DataFrame) – Target values.

Returns:

Labels as normalized encodings.

Return type:

pandas.DataFrame

ads.dataset.pipeline module#

class ads.dataset.pipeline.TransformerPipeline(steps)[source]#

Bases: Pipeline

add(transformer)[source]#

Add transformer to data transformation pipeline

Parameters:

transformer (Union[TransformerMixin, tuple(str, TransformerMixin)]) – If a tuple, it must be (name, transformer), where the transformer implements transform; see the example below.
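
Examples

A minimal sketch of adding a named sklearn transformer, assuming transformer_pipeline is an existing TransformerPipeline instance:

>>> from sklearn.preprocessing import StandardScaler
>>> transformer_pipeline.add(("scaler", StandardScaler()))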

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') TransformerPipeline#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

visualize()[source]#

ads.dataset.plot module#

class ads.dataset.plot.Plotting(df, feature_types, x, y=None, plot_type='infer', yscale=None)[source]#

Bases: object

select_best_plot()[source]#

Returns the best plot for a given dataset

show_in_notebook(**kwargs)[source]#

Visualizes the dataset by plotting the distribution of a feature or relationship between two features.

Parameters:
  • figsize (tuple) – Defines the size of the figure

ads.dataset.progress module#

class ads.dataset.progress.DummyProgressBar(*args, **kwargs)[source]#

Bases: ProgressBar

update(*args, **kwargs)[source]#

Updates the progress bar

class ads.dataset.progress.ProgressBar[source]#

Bases: object

abstract update(description)[source]#
class ads.dataset.progress.TqdmProgressBar(max_progress=100, description='Running', verbose=False)[source]#

Bases: ProgressBar

update(description=None, n=1)[source]#

Updates the progress bar

ads.dataset.recommendation module#

class ads.dataset.recommendation.Recommendation(ds, recommendation_transformer)[source]#

Bases: object

recommendation_type_labels = ['Constant Columns', 'Potential Primary Key Columns', 'Imputation', 'Multicollinear Columns', 'Identify positive label for target', 'Fix imbalance in dataset']#
recommendation_types = ['constant_column', 'primary_key', 'imputation', 'strong_correlation', 'positive_class', 'fix_imbalance']#
show_in_notebook()[source]#

ads.dataset.recommendation_transformer module#

class ads.dataset.recommendation_transformer.RecommendationTransformer(feature_metadata=None, correlation=None, target=None, is_balanced=False, target_type=None, feature_ranking=None, len=0, fix_imbalance=True, auto_transform=True, correlation_threshold=0.7)[source]#

Bases: TransformerMixin

fit(X)[source]#
fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

transform(X, progress=<ads.dataset.progress.DummyProgressBar object>, fit_transform=False, update_transformer_log=False)[source]#
transformer_log(action)[source]#

Local wrapper that both logs the action and records it in the actions_performed array.

ads.dataset.regression_dataset module#

class ads.dataset.regression_dataset.RegressionDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]#

Bases: ADSDatasetWithTarget

ads.dataset.sampled_dataset module#

class ads.dataset.sampled_dataset.PandasDataset(sampled_df, type_discovery=True, types={}, metadata=None, progress=<ads.dataset.progress.DummyProgressBar object>)[source]#

Bases: object

This class provides APIs that can work on a sampled dataset.

plot(x, y=None, plot_type='infer', yscale=None, verbose=True, sample_size=0)[source]#

Supports plotting feature distribution, and relationship between features.

Parameters:
  • x (str) – The name of the feature to plot

  • y (str, optional) – Name of the feature to plot against x

  • plot_type (str, default: infer) –

    Override the inferred plot type for certain combinations of the data types of x and y. By default, the best plot type is inferred based on x and y data types. Valid values:

    • box_plot - discrete feature vs continuous feature. Draw a box plot to show distributions with respect to categories,

    • scatter - continuous feature vs continuous feature. Draw a scatter plot with possibility of several semantic groupings.

  • yscale (str, optional) – One of {“linear”, “log”, “symlog”, “logit”}. The y axis scale type to apply. Can be used when either x or y is an ordinal feature.

  • verbose (bool, default True) – Displays Note/Tips if True
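
Examples

A usage sketch, assuming ds is a PandasDataset and that the returned plotting object exposes show_in_notebook() as on the Plotting class above:

>>> ds.plot("col1").show_in_notebook(figsize=(6, 6))
>>> ds.plot("col1", y="col2", plot_type="scatter").show_in_notebook()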

plot_gis_scatter(lon='longitude', lat='latitude', ax=None)[source]#

Supports plotting Choropleth maps

Parameters:
  • lon (str, default: "longitude") – The name of the feature to plot, usually the longitude

  • lat (str, default: "latitude") – The name of the feature to plot, usually the latitude

summary(feature_name=None)[source]#

Display the list of features and their data types. If a specific feature name is given, shows that feature’s meta_data.

Parameters:

feature_name (str, optional) – The name of the feature

Returns:

a dictionary that contains requested information

Return type:

dict

timeseries(date_col)[source]#

Supports any plotting operations where x=datetime.

Parameters:

date_col (str) – The name of the feature to plot

Returns:

a plotting object that contains a date column and dataframe

Return type:

func

ads.dataset.target module#

class ads.dataset.target.TargetVariable(sampled_ds, target, target_type)[source]#

Bases: object

This class provides target specific APIs.

is_balanced(skewness_threshold=0.5, class_imbalance_threshold=0.5)[source]#

Returns True if the target is balanced, False otherwise.

Returns:

is_balanced

Return type:

bool

show_in_notebook(feature_names=None)[source]#

Plot target distribution or target versus feature relation.

Parameters:

feature_names (list, Optional) – Plot target against a list of features. Display target distribution if feature_names is not provided.

ads.dataset.timeseries module#

class ads.dataset.timeseries.Timeseries(col_name, df, date_range=None, min=None, max=None)[source]#

Bases: object

plot(**kwargs)[source]#

Module contents#