ads.dataset package¶
Submodules¶
ads.dataset.classification_dataset module¶
- class ads.dataset.classification_dataset.BinaryClassificationDataset(df, sampled_df, target, target_type, shape, positive_class=None, **kwargs)[source]¶
Bases:
ClassificationDataset
Dataset for binary classification
- set_positive_class(positive_class, missing_value=False)[source]¶
Return new dataset with values in target column mapped to True or False in accordance with the specified positive label.
- Parameters:
positive_class (same dtype as target) – The target label which should be identified as positive outcome from model.
missing_value (bool) – missing values will be converted to this
- Returns:
dataset
- Return type:
same type as the caller
- Raises:
ValidationError – if the positive_class is not present in target
Examples
>>> ds = DatasetFactory.open("iris.csv")
>>> ds_with_target = ds.set_target('class')
>>> ds_with_pos_class = ds.set_positive_class('setosa')
- class ads.dataset.classification_dataset.BinaryTextClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]¶
Bases:
BinaryClassificationDataset
Dataset for binary text classification
- class ads.dataset.classification_dataset.ClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]¶
Bases:
ADSDatasetWithTarget
Dataset for classification task
- auto_transform(fix_imbalance: bool = True, correlation_threshold: float = 0.7, frac: float = 1.0, correlation_methods: str = 'pearson')[source]¶
Return transformed dataset with several optimizations applied automatically. The optimizations include:
Dropping constant and primary key columns, which have no predictive quality,
Imputation, to fill in missing values in noisy data:
For continuous variables, fill with mean if less than 40% is missing, else drop,
For categorical variables, fill with most frequent if less than 40% is missing, else drop,
Dropping strongly co-correlated columns that tend to produce less generalizable models,
Balancing dataset using up or down sampling.
- Parameters:
fix_imbalance (bool, defaults to True.) – Fix imbalance between classes in dataset. Used only for classification datasets.
correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive.) – The correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.
frac (float, defaults to 1.0. Range -> (0, 1].) – What fraction of the data should be used in the calculation?
correlation_methods (Union[list, str], defaults to 'pearson'.) –
‘pearson’: Use Pearson’s Correlation between continuous features,
’cramers v’: Use Cramer’s V correlations between categorical features,
’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].
- Returns:
transformed_dataset – The dataset after transformation
- Return type:
Examples
>>> ds_clean = ds.auto_transform(correlation_threshold=0.6)
- convert_to_text_classification(text_column: str)[source]¶
Builds a new dataset with the given text column as the only feature besides target.
- Parameters:
text_column (str) – Feature name to use for text classification task
- Returns:
ds – Dataset with one text feature and a classification target
- Return type:
TextClassificationDataset
Examples
>>> review_ds = DatasetFactory.open("review_data.csv")
>>> ds_text_class = review_ds.convert_to_text_classification('reviews')
- down_sample(sampler=None)[source]¶
Fixes an imbalanced dataset by down-sampling.
- Parameters:
sampler (An instance of SamplerMixin) – Should implement fit_resample(X,y) method. If None, does random down sampling.
- Returns:
down_sampled_ds – A down-sampled dataset.
- Return type:
Examples
>>> ds = DatasetFactory.open("some_data.csv") >>> ds_balanced_small = ds.down_sample()
- up_sample(sampler='default')[source]¶
Fixes imbalanced dataset by up-sampling
- Parameters:
sampler (An instance of SamplerMixin) – Should implement fit_resample(X,y) method. If ‘default’, either SMOTE or random sampler will be used
fill_missing_type (a string) – Can either be ‘mean’, ‘mode’ or ‘median’.
- Returns:
up_sampled_ds – an up-sampled dataset
- Return type:
Examples
>>> ds = DatasetFactory.open("some_data.csv") >>> ds_balanced_large = ds.up_sample()
- class ads.dataset.classification_dataset.MultiClassClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]¶
Bases:
ClassificationDataset
Dataset for multi-class classification
- class ads.dataset.classification_dataset.MultiClassTextClassificationDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]¶
Bases:
MultiClassClassificationDataset
Dataset for multi-class text classification
ads.dataset.correlation module¶
ads.dataset.correlation_plot module¶
- class ads.dataset.correlation_plot.BokehHeatMap(ds)[source]¶
Bases:
object
Generate a HeatMap or horizontal bar plot to compare features.
- flatten_corr_matrix(corr_matrix)[source]¶
Flatten a correlation matrix into a pandas Dataframe.
- Parameters:
corr_matrix (Pandas Dataframe) – The correlation matrix to be flattened.
- Returns:
corr_flatten – The flattened correlation matrix.
- Return type:
Pandas DataFrame
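Examples
A minimal sketch, assuming the constructor's ds argument accepts the underlying pandas DataFrame and that "data.csv" is a hypothetical file of numeric columns:
>>> import pandas as pd
>>> from ads.dataset.correlation_plot import BokehHeatMap
>>> df = pd.read_csv("data.csv")
>>> heat_map = BokehHeatMap(df)
>>> corr_flat = heat_map.flatten_corr_matrix(df.corr())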
- generate_heatmap(corr_matrix, title: str, msg: str, correlation_threshold: float)[source]¶
Generate a heatmap from a correlation matrix.
- Parameters:
corr_matrix (Pandas Dataframe) – The dataframe to be used for heatmap generation.
title (str) – title of the heatmap.
msg (str) – An additional msg to include in the plot.
correlation_threshold (float) – A float between 0 and 1 which is used for excluding correlations which are not intense enough from the plot.
- Returns:
tab – A matplotlib Panel object which includes a plotted heatmap
- Return type:
matplotlib Panel
- generate_target_heatmap(corr_matrix, title: str, correlation_target: str, msg: str, correlation_threshold: float)[source]¶
Generate a heatmap from a correlation matrix and its targets.
- Parameters:
corr_matrix (Pandas Dataframe) – The dataframe to be used for heatmap generation.
title (str) – title of the heatmap.
correlation_target (str) – The target column name for computing correlations against.
msg (str) – An additional msg to include in the plot.
correlation_threshold (float) – A float between 0 and 1 which is used for excluding correlations which are not intense enough from the plot.
- Returns:
tab – A matplotlib Panel object which includes a plotted heatmap.
- Return type:
matplotlib Panel
- plot_correlation_heatmap(ds, plot_type: str = 'heatmap', correlation_target: str = None, correlation_threshold=-1, correlation_methods: str = 'pearson', **kwargs)[source]¶
Plots a correlation heatmap.
- Parameters:
ds (Pandas Slice) – A data slice or file
plot_type (str Defaults to "heatmap") – The type of plot - “bar” is another option.
correlation_target (str, Defaults to None) – the target column for correlation calculations.
correlation_threshold (float, Defaults to -1) – the threshold for computing correlation heatmap elements.
correlation_methods (str, Defaults to "pearson") – the way to compute correlations, other options are “cramers v” and “correlation ratio”
- plot_hbar(matrix, low: float = 1, high=1, title: str = None, tool_tips: list = None, column_name: str = None)[source]¶
Plots a histogram bar-graph.
- Parameters:
matrix (Pandas Dataframe) – The dataframe to be plotted.
low (float, Defaults to 1) – The color mapping value for “low” points.
high (float, Defaults to 1) – The color mapping value for “high” points.
title (str, Defaults to None) – The optional title of the heat map.
tool_tips (list of str, Defaults to None) – An optional list of tool tips to include with the plot.
column_name (str, Defaults to None) – The name of the column which is being plotted.
- Returns:
fig – A matplotlib heatmap figure object.
- Return type:
matplotlib Figure
- plot_heat_map(matrix, xrange: list, yrange: list, low: float = 1, high=1, title: str = None, tool_tips: list = None)[source]¶
Plots a matrix as a heatmap.
- Parameters:
matrix (Pandas Dataframe) – The dataframe to be plotted.
xrange (List of floats) – The range of x values to plot.
yrange (List of floats) – The range of y values to plot.
low (float, Defaults to 1) – The color mapping value for “low” points.
high (float, Defaults to 1) – The color mapping value for “high” points.
title (str, Defaults to None) – The optional title of the heat map.
tool_tips (list of str, Defaults to None) – An optional list of tool tips to include with the plot.
- Returns:
fig – A matplotlib heatmap figure object.
- Return type:
matplotlib Figure
ads.dataset.dask_series module¶
ads.dataset.dataframe_transformer module¶
ads.dataset.dataset module¶
- class ads.dataset.dataset.ADSDataset(df, sampled_df=None, shape=None, name='', description=None, type_discovery=True, types={}, metadata=None, progress=<ads.dataset.progress.DummyProgressBar object>, transformer_pipeline=None, interactive=False, **kwargs)[source]¶
Bases:
PandasDataset
An ADSDataset Object.
The ADSDataset object cannot be used for classification or regression problems until a target has been set using set_target. To see some rows in the data use any of the usual Pandas functions like head(). There are also a variety of converters: to_dask, to_pandas, to_h2o, to_xgb, to_csv, to_parquet, to_json, and to_hdf.
- assign_column(column, arg)[source]¶
Return new dataset with new column or values of the existing column mapped according to input correspondence.
Used for adding a new column, or for substituting each value in a column with another value that may be derived from a function, a pandas.Series, or a pandas.DataFrame.
- Parameters:
- Returns:
dataset – a dataset with the specified column assigned.
- Return type:
same type as the caller
Examples
>>> import pandas as pd
>>> import numpy as np
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_same_size = ds.assign_column('target', lambda x: x > 15 if x is not None else None)
>>> ds_bigger = ds.assign_column('new_col', np.arange(ds.shape[0]))
- astype(types)[source]¶
Convert data type of features.
- Parameters:
types (dict) – key is the existing feature name value is the data type to which the values of the feature should be converted. Valid data types: All numpy datatypes (Example: np.float64, np.int64, …) or one of categorical, continuous, ordinal or datetime.
- Returns:
updated_dataset – an ADSDataset with new data types
- Return type:
ADSDataset
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_reformatted = ds.astype({"target": "categorical"})
- call(func, *args, sample_size=None, **kwargs)[source]¶
Runs a custom function on dataframe
func will receive the pandas dataframe (which represents the dataset) as an argument named ‘df’ by default. This can be overridden by specifying the dataframe argument name in a tuple (func, dataframe_name).
- Parameters:
func (Union[callable, tuple]) – Custom function that takes pandas dataframe as input Alternatively a (callable, data) tuple where data is a string indicating the keyword of callable that expects the dataframe name
args (iterable, optional) – Positional arguments passed into func
sample_size (int, Optional) – To use a sampled dataframe
kwargs (mapping, optional) – A dictionary of keyword arguments passed into func
- Returns:
func – a plotting function that contains *args and **kwargs
- Return type:
function
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("classification_data.csv"))
>>> def f1(df):
...     return df.sum(axis=0)
>>> sum_ds = ds.call(f1)
- corr(correlation_methods: list | str = 'pearson', frac: float = 1.0, sample_size: float = 1.0, nan_threshold: float = 0.8, overwrite: bool | None = None, force_recompute: bool = False)[source]¶
Compute pairwise correlation of numeric and categorical columns, output a matrix or a list of matrices computed using the correlation methods passed in.
- Parameters:
correlation_methods (Union[list, str], default to 'pearson') –
‘pearson’: Use Pearson’s Correlation between continuous features,
’cramers v’: Use Cramer’s V correlations between categorical features,
’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].
frac – Is deprecated and replaced by sample_size.
sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?
nan_threshold (float, default to 0.8, Range -> [0, 1]) – Only compute a correlation when the proportion of missing values in a column is less than or equal to nan_threshold.
overwrite – Is deprecated and replaced by force_recompute.
force_recompute (bool, default to be False) –
If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.
If True, it calculates the correlation matrix regardless whether there is cached result or not.
- Returns:
correlation – The pairwise correlations as a matrix (DataFrame) or list of matrices
- Return type:
Union[list, pandas.DataFrame]
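Examples
A minimal usage sketch ("data.csv" is a hypothetical file):
>>> import pandas as pd
>>> from ads.dataset.dataset import ADSDataset
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> corr_pearson = ds.corr()
>>> corr_all = ds.corr(correlation_methods="all", nan_threshold=0.5, force_recompute=True)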
- property ddf¶
- df_read_functions = ['head', 'describe', '_get_numeric_data']¶
- drop_columns(columns)[source]¶
Return new dataset with specified columns removed.
- Parameters:
- Returns:
dataset – a dataset with specified columns dropped.
- Return type:
same type as the caller
- Raises:
ValidationError – If any of the feature names is not found in the dataset.
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_smaller = ds.drop_columns(['col1', 'col2'])
- static from_dataframe(df, sampled_df=None, shape=None, name='', description=None, type_discovery=True, types={}, metadata=None, progress=<ads.dataset.progress.DummyProgressBar object>, transformer_pipeline=None, interactive=False, **kwargs) ADSDataset [source]¶
- get_recommendations(*args, **kwargs)[source]¶
Returns user-friendly error message to set target variable before invoking this API.
- Parameters:
kwargs
- Returns:
raises NotImplementedError, if target parameter value not provided
- Return type:
- merge(data, **kwargs)[source]¶
Merges this dataset with another ADSDataset or pandas dataframe.
- Parameters:
data (Union[ADSDataset, pandas.DataFrame]) – Data to merge.
kwargs (dict, optional) – additional keyword arguments that would be passed to underlying dataframe’s merge API.
Examples
>>> import pandas as pd
>>> df1 = pd.read_csv("data1.csv")
>>> df2 = pd.read_csv("data2.csv")
>>> ds1 = ADSDataset.from_dataframe(df1)
>>> ds2 = ADSDataset.from_dataframe(df2)
>>> ds_12 = ds1.merge(ds2)
- rename_columns(columns)[source]¶
Returns a new dataset with altered column names.
dict values must be unique (1-to-1). Labels not contained in a dict will be left as-is. Extra labels listed don’t throw an error.
- Parameters:
columns (dict-like or function or list of str) – dict to rename columns selectively, or list of names to rename all columns, or a function like str.upper
- Returns:
dataset – A dataset with specified columns renamed.
- Return type:
same type as the caller
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_renamed = ds.rename_columns({'col1': 'target'})
- sample(frac=None, random_state=42)[source]¶
Returns random sample of dataset.
- Parameters:
frac (float, optional) – Fraction of axis items to return.
random_state (int or np.random.RandomState) – If int, a new RandomState is created with this as the seed; otherwise, values are drawn from the passed RandomState.
- Returns:
sampled_dataset – An ADSDataset which was randomly sampled.
- Return type:
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_sample = ds.sample()
- set_description(description)[source]¶
Sets description for the dataset.
Give your dataset a description.
- Parameters:
description (str) – Description of the dataset.
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data1.csv"))
>>> ds_described = ds.set_description("dataset1 is from 'data1.csv'")
- set_name(name)[source]¶
Sets name for the dataset.
This name will be used to filter the datasets returned by ds.list() API. Calling this API is optional. By default name of the dataset is set to empty.
- Parameters:
name (str) – Name of the dataset.
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data1.csv"))
>>> ds_renamed = ds.set_name("dataset1")
- set_target(target, type_discovery=True, target_type=None)[source]¶
Returns a dataset tagged based on the type of target.
- Parameters:
- Returns:
ds – tagged according to the type of the target column.
- Return type:
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("classification_data.csv"))
>>> ds_with_target = ds.set_target("target_class")
- show_corr(frac: float = 1.0, sample_size: float = 1.0, nan_threshold: float = 0.8, overwrite: bool | None = None, force_recompute: bool = False, correlation_target: str | None = None, plot_type: str = 'heatmap', correlation_threshold: float = -1, correlation_methods='pearson', **kwargs)[source]¶
Show a heatmap or barplot of the pairwise correlation of numeric and categorical columns. The output has three tabs: numeric columns vs numeric columns using the Pearson method, categorical columns vs categorical columns using Cramer's V, and numeric vs categorical columns. NA/null values are excluded, as are columns with more than 80% NA/null values. By default, only 'pearson' correlation is calculated and shown in the first tab. Set correlation_methods='all' to show all correlation charts.
- Parameters:
frac (Is superseded by sample_size)
sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?
nan_threshold (float, defaults to 0.8, Range -> [0, 1]) – In the default case, it will only calculate the correlation of the columns which have less than or equal to 80% missing values.
overwrite – Is deprecated and replaced by force_recompute.
force_recompute (bool, default to be False.) –
If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.
If True, it calculates the correlation matrix regardless whether there is cached result or not.
plot_type (str, default to "heatmap") – It can only be “heatmap” or “bar”. Note that if “bar” is chosen, correlation_target also has to be set and the bar chart will only show the correlation values of the pairs which have the target in them.
correlation_target (str, defaults to None) – It can be any column of type continuous, ordinal, categorical or zipcode. When correlation_target is set, only pairs that contain correlation_target will be shown.
correlation_threshold (float, default to -1) – It can be any number between -1 and 1.
correlation_methods (Union[list, str], defaults to 'pearson') –
‘pearson’: Use Pearson’s Correlation between continuous features,
’cramers v’: Use Cramer’s V correlations between categorical features,
’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].
- Return type:
None
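Examples
A minimal usage sketch ("data.csv" and the column name "target_col" are hypothetical):
>>> import pandas as pd
>>> from ads.dataset.dataset import ADSDataset
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.show_corr(correlation_methods="all")
>>> ds.show_corr(plot_type="bar", correlation_target="target_col", correlation_threshold=0.5)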
- show_in_notebook(correlation_threshold=-1, selected_index=0, sample_size=0, visualize_features=True, correlation_methods='pearson', **kwargs)[source]¶
Provide visualization of dataset.
Display feature distribution. The data table display will show a maximum of 8 digits,
Plot the correlation between the dataset features (as a heatmap) only when all the features are continuous or ordinal,
Display data head.
- Parameters:
correlation_threshold (int, default -1) – The correlation threshold to select; only features with correlation values larger than or equal to the threshold are shown.
selected_index (int, str, default 0) – The displayed output is stacked into an accordion widget; use selected_index to force the display to open a specific element, using the (zero offset) index or any prefix string of the name (e.g., 'corr' for correlations).
sample_size (int, default 0) – The size (in rows) to sample for visualizations.
visualize_features (bool, default True) – For the "Features" section, controls whether feature visualizations are shown. If not, only a summary of the numeric statistics is shown. The numeric statistics are also always shown for wide (>64 features) datasets.
correlation_methods (Union[list, str], default to 'pearson') –
‘pearson’: Use Pearson’s Correlation between continuous features,
’cramers v’: Use Cramer’s V correlations between categorical features,
’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].
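Examples
A minimal usage sketch ("data.csv" is a hypothetical file):
>>> import pandas as pd
>>> from ads.dataset.dataset import ADSDataset
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.show_in_notebook()
>>> ds.show_in_notebook(correlation_threshold=0.5, selected_index="corr", correlation_methods="all")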
- snapshot(snapshot_dir=None, name='', storage_options=None)[source]¶
Snapshot the dataset with modifications made so far.
Optionally caller can invoke ds.set_name() before saving to identify the dataset uniquely at the time of using ds.list().
The snapshot can be reloaded by providing the URI returned by this API to DatasetFactory.open()
- Parameters:
snapshot_dir (str, optional) – Directory path under which dataset snapshot will be created. Defaults to snapshots_dir set using DatasetFactory.set_default_storage().
name (str, optional, default: "") – Name to uniquely identify the snapshot using DatasetFactory.list_snapshots(). If not provided, an auto-generated name is used.
storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().
- Returns:
p_str – the URI to access the snapshotted dataset.
- Return type:
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_uri = ds.snapshot()
- suggest_recommendations(*args, **kwargs)[source]¶
Returns user-friendly error message to set target variable before invoking this API.
- Parameters:
kwargs
- Returns:
raises NotImplementedError, if target parameter value not provided
- Return type:
- to_avro(path, schema=None, storage_options=None, **kwargs)[source]¶
Save data to Avro files. Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.
- Parameters:
path (string) – Path to a target filename. May contain a * to denote many filenames.
schema (dict) – Avro schema dictionary, see below.
storage_options (dict, optional) – Parameters passed to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().
kwargs (dict, optional) – See https://fastavro.readthedocs.io/en/latest/writer.html
Notes
Avro schema is a complex dictionary describing the data, see https://avro.apache.org/docs/1.8.2/gettingstartedpython.html#Defining+a+schema and https://fastavro.readthedocs.io/en/latest/writer.html. Its structure is as follows:
{'name': 'Test', 'namespace': 'Test', 'doc': 'Descriptive text', 'type': 'record', 'fields': [ {'name': 'a', 'type': 'int'}, ]}
where the “name” field is required, but “namespace” and “doc” are optional descriptors; “type” must always be “record”. The list of fields should have an entry for every key of the input records, and the types are like the primitive, complex or logical types of the Avro spec (https://avro.apache.org/docs/1.8.2/spec.html).
Examples
>>> import pandas
>>> import fastavro
>>> with open("data.avro", "rb") as fp:
...     reader = fastavro.reader(fp)
...     records = [r for r in reader]
>>> df = pandas.DataFrame.from_records(records)
>>> ds = ADSDataset.from_dataframe(df)
>>> ds.to_avro("my/path.avro")
- to_csv(path, storage_options=None, **kwargs)[source]¶
Save the materialized dataframe to csv file.
- Parameters:
path (str) – Location to write to. If there are more than one partitions in df, should include a glob character to expand into a set of file names, or provide a name_function=parameter. Supports protocol specifications such as “oci://”, “s3://”.
storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().
kwargs (dict, optional)
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> [ds_link] = ds.to_csv("my/path.csv")
- to_dask(filter=None, frac=None, npartitions=None, include_transformer_pipeline=False)[source]¶
Returns a copy of the data as dask.dataframe.core.DataFrame, and a sklearn pipeline optionally that holds the transformations run so far on the data.
The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.
- Parameters:
filter (str, optional) – The query string to filter the dataframe, for example ds.to_dask(filter="age > 50 and location == 'san francisco'"). See also https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
frac (float, optional) – fraction of original data to return.
include_transformer_pipeline (bool, default: False) – If True, (dataframe, transformer_pipeline) is returned as a tuple.
- Returns:
dataframe (dask.dataframe.core.DataFrame) – if include_transformer_pipeline is False.
(data, transformer_pipeline) (tuple of dask.dataframe.core.DataFrame and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_dask = ds.to_dask()
Notes
See also http://docs.dask.org/en/latest/dataframe-api.html#dataframe and https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
- to_dask_dataframe(filter=None, frac=None, npartitions=None, include_transformer_pipeline=False)[source]¶
- to_h2o(filter=None, frac=None, include_transformer_pipeline=False)[source]¶
Returns a copy of the data as h2o.H2OFrame, and a sklearn pipeline optionally that holds the transformations run so far on the data.
The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.
- Parameters:
filter (str, optional) – The query string to filter the dataframe, for example ds.to_h2o(filter="age > 50 and location == 'san francisco'"). See also https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
frac (float, optional) – fraction of original data to return.
include_transformer_pipeline (bool, default: False) – If True, (dataframe, transformer_pipeline) is returned as a tuple.
- Returns:
dataframe (h2o.H2OFrame) – if include_transformer_pipeline is False.
(data, transformer_pipeline) (tuple of h2o.H2OFrame and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_as_h2o = ds.to_h2o()
Notes
- to_hdf(path: str, key: str, storage_options: dict | None = None, **kwargs) str [source]¶
Save data to Hierarchical Data Format (HDF) files.
- Parameters:
- Returns:
The filename of the HDF5 file created.
- Return type:
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.to_hdf(path="my/path.h5", key="df")
- to_json(path, storage_options=None, **kwargs)[source]¶
Save data to JSON files.
- Parameters:
path (str) – Location to write to. If there are more than one partitions in df, should include a glob character to expand into a set of file names, or provide a name_function=parameter. Supports protocol specifications such as “oci://”, “s3://”.
storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().
kwargs (dict, optional)
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.to_json("my/path.json")
- to_pandas(filter=None, frac=None, include_transformer_pipeline=False)[source]¶
Returns a copy of the data as pandas.DataFrame, and a sklearn pipeline optionally that holds the transformations run so far on the data.
The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.
- Parameters:
filter (str, optional) – The query string to filter the dataframe, for example ds.to_pandas(filter="age > 50 and location == 'san francisco'"). See also https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
frac (float, optional) – fraction of original data to return.
include_transformer_pipeline (bool, default: False) – If True, (dataframe, transformer_pipeline) is returned as a tuple
- Returns:
dataframe (pandas.DataFrame) – if include_transformer_pipeline is False.
(data, transformer_pipeline) (tuple of pandas.DataFrame and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds_as_df = ds.to_pandas()
Notes
- to_parquet(path, storage_options=None, **kwargs)[source]¶
Save data to parquet file.
- Parameters:
path (str) – Location to write to. If there are more than one partitions in df, should include a glob character to expand into a set of file names, or provide a name_function=parameter. Supports protocol specifications such as “oci://”, “s3://”.
storage_options (dict, optional) – Parameters passed on to the backend filesystem class. Defaults to storage_options set using DatasetFactory.set_default_storage().
kwargs (dict, optional)
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.to_parquet("my/path")
- to_xgb(filter=None, frac=None, include_transformer_pipeline=False)[source]¶
Returns a copy of the data as xgboost.DMatrix, and a sklearn pipeline optionally that holds the transformations run so far on the data.
The pipeline returned can be updated with the transformations done offline and passed along with the dataframe to Dataset.open API if the transformations need to be reproduced at the time of scoring.
- Parameters:
filter (str, optional) – The query string to filter the dataframe, for example ds.to_xgb(filter="age > 50 and location == 'san francisco'"). See also https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
frac (float, optional) – fraction of original data to return.
include_transformer_pipeline (bool, default: False) – If True, (dataframe, transformer_pipeline) is returned as a tuple.
- Returns:
dataframe (xgboost.DMatrix) – if include_transformer_pipeline is False.
(data, transformer_pipeline) (tuple of xgboost.DMatrix and dataset.pipeline.TransformerPipeline) – if include_transformer_pipeline is True.
Examples
>>> import pandas as pd
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> xgb_dmat = ds.to_xgb()
Notes
ads.dataset.dataset_browser module¶
- class ads.dataset.dataset_browser.DatasetBrowser[source]¶
Bases:
ABC
- class ads.dataset.dataset_browser.GitHubDatasets(user: str, repo: str, branch: str)[source]¶
Bases:
DatasetBrowser
- class ads.dataset.dataset_browser.LocalFilesystemDatasets(folder: str)[source]¶
Bases:
DatasetBrowser
- class ads.dataset.dataset_browser.SeabornDatasets[source]¶
Bases:
DatasetBrowser
- class ads.dataset.dataset_browser.SklearnDatasets[source]¶
Bases:
DatasetBrowser
- open(name: str, **kwargs)[source]¶
Return new dataset for the given name.
- Parameters:
name (str) – the name of the dataset to open.
- Returns:
ds
- Return type:
Dataset
Examples
>>> ds_browser = DatasetBrowser("sklearn")
>>> ds = ds_browser.open("iris")
- sklearn_datasets = ['breast_cancer', 'diabetes', 'iris', 'wine', 'digits']¶
- class ads.dataset.dataset_browser.WebDatasets(index_url: str)[source]¶
Bases:
DatasetBrowser
ads.dataset.dataset_with_target module¶
- class ads.dataset.dataset_with_target.ADSDatasetWithTarget(df, target, sampled_df=None, shape=None, target_type=None, sample_max_rows=-1, type_discovery=True, types={}, parent=None, name='', metadata=None, transformer_pipeline=None, description=None, progress=<ads.dataset.progress.DummyProgressBar object>, **kwargs)[source]¶
Bases:
ADSDataset
This class provides APIs for preparing dataset for modeling.
- auto_transform(correlation_threshold: float = 0.7, frac: float = 1.0, sample_size=1.0, correlation_methods: str | list = 'pearson')[source]¶
Return transformed dataset with several optimizations applied automatically. The optimizations include:
Dropping constant and primary key columns, which have no predictive quality,
Imputation, to fill in missing values in noisy data:
For continuous variables, fill with mean if less than 40% is missing, else drop,
For categorical variables, fill with most frequent if less than 40% is missing, else drop,
Dropping strongly co-correlated columns that tend to produce less generalizable models.
- Parameters:
correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive) – the correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.
frac (Is superseded by sample_size)
sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?
correlation_methods (Union[list, str], defaults to 'pearson') –
‘pearson’: Use Pearson’s Correlation between continuous features,
’cramers v’: Use Cramer’s V correlations between categorical features,
’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].
- Returns:
transformed_dataset
- Return type:
Examples
>>> ds_clean = ds.auto_transform()
- static from_dataframe(df: DataFrame, target: str, sampled_df: DataFrame | None = None, shape: Tuple[int, int] | None = None, target_type: TypedFeature | None = None, positive_class=None, **init_kwargs)[source]¶
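Examples
A minimal usage sketch ("classification_data.csv" and the column name "target_class" are hypothetical):
>>> import pandas as pd
>>> from ads.dataset.dataset_with_target import ADSDatasetWithTarget
>>> ds = ADSDatasetWithTarget.from_dataframe(pd.read_csv("classification_data.csv"), target="target_class")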
- get_recommendations(correlation_methods: str = 'pearson', correlation_threshold: float = 0.7, frac: float = 1.0, sample_size: float = 1.0, overwrite: bool = None, force_recompute: bool = False, display_format: str = 'widget')[source]¶
Generate recommendations for dataset optimization. This includes:
Identifying constant and primary key columns, which have no predictive quality,
Imputation, to fill in missing values in noisy data:
For continuous variables, fill with mean if less than 40% is missing, else drop,
For categorical variables, fill with most frequent if less than 40% is missing, else drop,
Identifying strongly co-correlated columns that tend to produce less generalizable models,
Automatically balancing dataset for classification problems using up or down sampling.
- Parameters:
correlation_methods (Union[list, str], default to 'pearson') –
‘pearson’: Use Pearson’s Correlation between continuous features,
’cramers v’: Use Cramer’s V correlations between categorical features,
’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’].
correlation_threshold (float, defaults to 0.7. It must be between 0 and 1, inclusive) – The correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.
frac (Is superseded by sample_size)
sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?
overwrite – Is deprecated and replaced by force_recompute.
force_recompute (bool, default to be False) –
If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.
If True, it calculates the correlation matrix regardless whether there is cached result or not.
display_format (string, defaults to 'widget'.) – Should be either ‘widget’ or ‘table’. If ‘widget’, a GUI style interface is popped out; if ‘table’, a table of suggestions is shown.
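Examples
A minimal usage sketch (the file and target column names are hypothetical):
>>> import pandas as pd
>>> from ads.dataset.dataset_with_target import ADSDatasetWithTarget
>>> ds = ADSDatasetWithTarget.from_dataframe(pd.read_csv("classification_data.csv"), target="target_class")
>>> ds.get_recommendations(display_format="table")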
- get_transformed_dataset()[source]¶
Return the transformed dataset with the recommendations applied.
This method should be called after applying the recommendations using the Recommendation#show_in_notebook() API.
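Examples
A sketch of the intended workflow, assuming ds is an ADSDatasetWithTarget whose recommendations have been reviewed in the widget:
>>> ds.get_recommendations()              # review and apply the suggested actions in the widget
>>> ds_transformed = ds.get_transformed_dataset()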
- select_best_features(score_func=None, k=12)[source]¶
Return new dataset containing only the top k features.
- Parameters:
k (int, default 12) – The top ‘k’ features to select.
score_func (function) – Scoring function to use to rank the features. This scoring function should take a 2d array X(features) and an array like y(target) and return a numeric score for each feature in the same order as X.
Notes
See also https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html and https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html
Examples
>>> ds = DatasetBrowser("sklearn").open("iris")
>>> ds_small = ds.select_best_features(k=2)
- suggest_recommendations(correlation_methods: str | list = 'pearson', print_code: bool = True, correlation_threshold: float = 0.7, overwrite: bool | None = None, force_recompute: bool = False, frac: float = 1.0, sample_size: float = 1.0, **kwargs)[source]¶
Returns a pandas dataframe with suggestions for dataset optimization. This includes:
Identifying constant and primary key columns, which have no predictive quality,
Imputation, to fill in missing values in noisy data:
For continuous variables, fill with mean if less than 40% is missing, else drop,
For categorical variables, fill with most frequent if less than 40% is missing, else drop,
Identifying strongly co-correlated columns that tend to produce less generalizable models,
Automatically balancing dataset for classification problems using up or down sampling.
- Parameters:
correlation_methods (Union[list, str], default to 'pearson') –
‘pearson’: Use Pearson’s Correlation between continuous features,
’cramers v’: Use Cramer’s V correlations between categorical features,
’correlation ratio’: Use Correlation Ratio Correlation between categorical and continuous features,
’all’: Is equivalent to [‘pearson’, ‘cramers v’, ‘correlation ratio’].
Or a list containing any combination of these methods, for example, [‘pearson’, ‘cramers v’]
print_code (bool, Defaults to True) – Print Python code for the suggested actions.
correlation_threshold (float. Defaults to 0.7. It must be between 0 and 1, inclusive) – the correlation threshold where columns with correlation higher than the threshold will be considered as strongly co-correlated and recommended to be taken care of.
frac (Is superseded by sample_size)
sample_size (float, defaults to 1.0. Float, Range -> (0, 1]) – What fraction of the data should be used in the calculation?
overwrite – Is deprecated and replaced by force_recompute.
force_recompute (bool, default to be False) –
If False, it calculates the correlation matrix if there is no cached correlation matrix. Otherwise, it returns the cached correlation matrix.
If True, it calculates the correlation matrix regardless whether there is cached result or not.
- Returns:
suggestion dataframe
- Return type:
pandas.DataFrame
Examples
>>> suggestion_df = ds.suggest_recommendations(correlation_threshold=0.7)
- train_test_split(test_size=0.1, random_state=42)[source]¶
Splits dataset to train and test data.
- Parameters:
random_state (Union[int, RandomState], optional, default=None) –
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
- Returns:
train_data, test_data – tuple of ADSData instances
- Return type:
Examples
>>> ds = DatasetFactory.open("data.csv") >>> train, test = ds.train_test_split()
- train_validation_test_split(test_size=0.1, validation_size=0.1, random_state=42)[source]¶
Splits dataset to train, validation and test data.
- Parameters:
random_state (Union[int, RandomState], optional, default=None) –
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
- Returns:
train_data, validation_data, test_data – tuple of ADSData instances
- Return type:
Examples
>>> ds = DatasetFactory.open("data.csv") >>> train, valid, test = ds.train_validation_test_split()
ads.dataset.exception module¶
- exception ads.dataset.exception.DatasetError(*args, **kwargs)[source]¶
Bases:
BaseException
Base class for dataset errors.
- exception ads.dataset.exception.ValidationError(msg)[source]¶
Bases:
DatasetError
Handles validation errors in dataset.
ads.dataset.factory module¶
- class ads.dataset.factory.CustomFormatReaders[source]¶
Bases:
object
- DEFAULT_SQL_ARRAYSIZE = 50000¶
- DEFAULT_SQL_CHUNKSIZE = 12007¶
- DEFAULT_SQL_CTU = False¶
- DEFAULT_SQL_MIL = 128¶
- classmethod read_sql(path: str, table: str | None = None, **kwargs) DataFrame [source]¶
- Parameters:
path – str This is the connection URL that gets passed to sqlalchemy’s create_engine method
table – str This is either the name of a table to select * from or a sql query to be run
kwargs
- Returns:
pd.DataFrame
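Examples
A minimal sketch (the connection URL and query are hypothetical):
>>> from ads.dataset.factory import CustomFormatReaders
>>> df = CustomFormatReaders.read_sql("sqlite:///my_database.db",
...                                   table="SELECT * FROM my_table")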
- class ads.dataset.factory.DatasetFactory[source]¶
Bases:
object
- static download(remote_path, local_path, storage=None, overwrite=False)[source]¶
Download a remote file or directory to local storage.
- Parameters:
remote_path (str) – Supports protocols like oci, s3, also supports glob expressions
local_path (str) – Supports glob expressions
storage (dict) – Parameters passed on to the backend remote filesystem class.
overwrite (bool, default False) – If True, the method will overwrite any existing files in the local_path
Examples
>>> DatasetFactory.download("oci://Bucket/prefix/to/data/*.csv",
...                         "/home/datascience/data/")
- static from_dataframe(df, target: str | None = None, **kwargs)[source]¶
Returns an object of ADSDatasetWithTarget or ADSDataset given a pandas.DataFrame
- Parameters:
- Returns:
dataset – according to the type of target
- Return type:
an object of ADSDataset if target is not specified, otherwise an object of ADSDatasetWithTarget tagged
Examples
>>> df = pd.DataFrame(data)
>>> ds = DatasetFactory.from_dataframe(df)
- static list_snapshots(snapshot_dir=None, name='', storage_options=None, **kwargs)[source]¶
Displays the URIs for dataset snapshots under the given directory path.
- Parameters:
snapshot_dir (str) – Return all dataset snapshots created using ADSDataset.snapshot() within this directory. The path can contain protocols such as oci, s3.
name (str, optional) – The list of snapshots in the directory gets filtered by the name. Accepts glob expressions. default = “ads_”
storage_options (dict) – Parameters passed on to the backend filesystem class.
Example
>>> DatasetFactory.list_snapshots(snapshot_dir="oci://my_bucket/snapshots_dir",
...                               name="ads_iris_")
Returns a list of all snapshots (recursively) saved to obj storage bucket “my_bucket” with prefix “/snapshots_dir/ads_iris_**” sorted by time created.
- static open(source, target=None, format='infer', reader_fn: Callable = None, name: str = None, description='', npartitions: int = None, type_discovery=True, html_table_index=None, column_names='infer', sample_max_rows=10000, positive_class=None, transformer_pipeline=None, types={}, **kwargs)[source]¶
Returns an object of ADSDataset or ADSDatasetWithTarget read from the given path
Deprecated since version 2.6.6: “Deprecated in favor of using Pandas. Pandas supports reading from object storage directly. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/loading_data/connect.html”,
- Parameters:
source (Union[str, pandas.DataFrame, h2o.DataFrame, pyspark.sql.dataframe.DataFrame]) – If str, URI for the dataset. The dataset could be read from local or network file system, hdfs, s3, gcs and optionally pyspark in pyspark conda env
target (str, optional) – Name of the target in dataset. If set an ADSDatasetWithTarget object is returned, otherwise an ADSDataset object is returned which can be used to understand the dataset through visualizations
format (str, default: infer) – Format of the dataset. Supported formats: CSV, TSV, Parquet, libsvm, JSON, XLS/XLSX (Excel), HDF5, SQL, XML, Apache server log files (clf, log), ARFF. By default, the format would be inferred from the ending of the dataset file path.
reader_fn (Callable, default: None) – The user may pass in their own custom reader function. It must accept (path, **kwarg) and return a pandas DataFrame
name (str, optional default: "")
description (str, optional default: "") – Text describing the dataset
npartitions (int, deprecated) – Number of partitions to split the data By default this is set to the max number of cores supported by the backend compute accelerator
type_discovery (bool, default: True) – If False, the data types of the dataframe are used as such. By default, the dataframe columns are associated with the best suited data types. Associating the features with the discovered datatypes would impact visualizations and model prediction.
html_table_index (int, optional) – The index of the dataframe table in html content. This is used when the format of dataset is html
column_names ('infer', list of str or None, default: 'infer') – Supported only for CSV and TSV. List of column names to use. By default, column names are inferred from the first line of the file. If set to None, column names would be auto-generated instead of inferring from file. If the file already contains a column header, specify header=0 to ignore the existing column names.
sample_max_rows (int, default: 10000, use -1 to auto-calculate the sample size, use 0 (zero) for no sampling) – Sample size of the dataframe to use for visualization and optimization.
positive_class (Any, optional) – Label in target for binary classification problems which should be identified as positive for modeling. By default, the first unique value is considered as the positive label.
types (dict, optional) – Dictionary of <feature_name> : <data_type> to override the data type of features.
transformer_pipeline (datasets.pipeline.TransformerPipeline, optional) – A pipeline of transformations done outside the sdk and need to be applied at the time of scoring
storage_options (dict, default: varies by source type) – Parameters passed on to the backend filesystem class.
sep (str) – Delimiting character for parsing the input file.
kwargs (additional keyword arguments that would be passed to underlying dataframe read API) – based on the format of the dataset
- Returns:
dataset (An instance of ADSDataset)
(or)
dataset_with_target (An instance of ADSDatasetWithTarget)
Examples
>>> ds = DatasetFactory.open("/path/to/data.data", format='csv', delimiter=" ", ... na_values="n/a", skipinitialspace=True)
>>> ds = DatasetFactory.open("/path/to/data.csv", target="col_1", prefix="col_", ... skiprows=1, encoding="ISO-8859-1")
>>> ds = DatasetFactory.open("oci://bucket@namespace/path/to/data.tsv", ... column_names=["col1", "col2", "col3"], header=0)
>>> ds = DatasetFactory.open("oci://bucket@namespace/path/to/data.csv", ... storage_options={"config": "~/.oci/config", ... "profile": "USER_2"}, delimiter = ';')
>>> ds = DatasetFactory.open("/path/to/data.parquet", engine='pyarrow', ... types={"col1": "ordinal", ... "col2": "categorical", ... "col3" : "continuous", ... "col4" : "float64"})
>>> ds = DatasetFactory.open(df, target="class", sample_max_rows=5000, ... positive_class="yes")
>>> ds = DatasetFactory.open("s3://path/to/data.json.gz", format="json", ... compression="gzip", orient="records")
- static open_to_pandas(source: str, format: str | None = None, reader_fn: Callable | None = None, **kwargs) DataFrame [source]¶
- static set_default_storage(snapshots_dir=None, storage_options=None)[source]¶
Set default storage directory and options.
Both snapshots_dir and storage_options can be overridden at the API scope.
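Examples
A minimal sketch (the bucket, namespace, and profile names are hypothetical):
>>> DatasetFactory.set_default_storage(snapshots_dir="oci://my_bucket@my_namespace/snapshots_dir",
...                                    storage_options={"config": "~/.oci/config",
...                                                     "profile": "DEFAULT"})
Subsequent calls such as ds.snapshot() and DatasetFactory.list_snapshots() then default to this directory and these storage options.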
- ads.dataset.factory.get_format_reader(path: ElaboratedPath, **kwargs) Callable [source]¶
- ads.dataset.factory.load_dataset(path: ElaboratedPath, reader_fn: Callable, **kwargs) DataFrame [source]¶
ads.dataset.feature_engineering_transformer module¶
- class ads.dataset.feature_engineering_transformer.FeatureEngineeringTransformer(feature_metadata=None)[source]¶
Bases:
TransformerMixin
- fit_transform(X, y=None, **fit_params)[source]¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
ads.dataset.feature_selection module¶
ads.dataset.forecasting_dataset module¶
- class ads.dataset.forecasting_dataset.ForecastingDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]¶
Bases:
ADSDatasetWithTarget
ads.dataset.helper module¶
- class ads.dataset.helper.CustomFormatReaders[source]¶
Bases:
object
- DEFAULT_SQL_ARRAYSIZE = 50000¶
- DEFAULT_SQL_CHUNKSIZE = 12007¶
- DEFAULT_SQL_CTU = False¶
- DEFAULT_SQL_MIL = 128¶
- classmethod read_sql(path: str, table: str | None = None, **kwargs) DataFrame [source]¶
- Parameters:
path – str This is the connection URL that gets passed to sqlalchemy’s create_engine method
table – str This is either the name of a table to select * from or a sql query to be run
kwargs
- Returns:
pd.DataFrame
- class ads.dataset.helper.DatasetDefaults[source]¶
Bases:
object
- sampling_confidence_interval = 1.0¶
- sampling_confidence_level = 95¶
- exception ads.dataset.helper.DatasetLoadException(exc_msg)[source]¶
Bases:
BaseException
- class ads.dataset.helper.ElaboratedPath(source: str | List[str], format: str | None = None, name: str | None = None, **kwargs)[source]¶
Bases:
object
The ElaboratedPath class unifies all of the operations and information related to a path or path list. An elaborated path can accept any of the following as a valid source: a single path, a glob pattern path, a directory, a list of paths (note: all of these paths must be from the same filesystem AND have the same format), or a sqlalchemy connection url.
- Parameters:
source
format
kwargs
By the end of this method, this class needs to have paths, format, and name ready
- ads.dataset.helper.build_dataset(df: DataFrame, shape: Tuple[int, int], target: str | None = None, progress=None, **kwargs)[source]¶
- ads.dataset.helper.calculate_sample_size(population_size, min_size_to_sample, confidence_level=95, confidence_interval=1.0)[source]¶
Find the sample size for a population using Cochran's Sample Size Formula, with default values for confidence_level (percentage, default: 95%) and confidence_interval (margin of error, percentage, default: 1%).
SUPPORTED CONFIDENCE LEVELS: 50%, 68%, 90%, 95%, and 99% ONLY - this is because the Z-score is table based, and I’m only providing Z for common confidence levels.
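Examples
A minimal sketch (the population and minimum sample sizes are made-up values):
>>> from ads.dataset.helper import calculate_sample_size
>>> n = calculate_sample_size(population_size=1000000, min_size_to_sample=1000,
...                           confidence_level=95, confidence_interval=1.0)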
- ads.dataset.helper.deprecate_default_value(var, old_value, new_value, warning_msg, warning_type)[source]¶
- ads.dataset.helper.down_sample(df, target)[source]¶
Fixes imbalanced dataset by down-sampling
- Parameters:
df (pandas.DataFrame)
target (name of the target column in df)
- Returns:
downsampled_df
- Return type:
pandas.DataFrame
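Examples
A minimal sketch (the file and target column names are hypothetical):
>>> import pandas as pd
>>> from ads.dataset.helper import down_sample
>>> df = pd.read_csv("imbalanced_data.csv")
>>> downsampled_df = down_sample(df, target="target_class")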
- ads.dataset.helper.generate_sample(df: DataFrame, n: int, confidence_level: int = 95, confidence_interval: float = 1.0, **kwargs)[source]¶
- ads.dataset.helper.get_dataset(df: DataFrame, sampled_df: DataFrame, target: str, target_type: TypedFeature, shape: Tuple[int, int], positive_class=None, **init_kwargs)[source]¶
- ads.dataset.helper.get_format_reader(path: ElaboratedPath, **kwargs) Callable [source]¶
- ads.dataset.helper.load_dataset(path: ElaboratedPath, reader_fn: Callable, **kwargs) DataFrame [source]¶
- ads.dataset.helper.open(source, target=None, format='infer', reader_fn: Callable | None = None, name: str | None = None, description='', npartitions: int | None = None, type_discovery=True, html_table_index=None, column_names='infer', sample_max_rows=10000, positive_class=None, transformer_pipeline=None, types={}, **kwargs)[source]¶
Returns an object of ADSDataset or ADSDatasetWithTarget read from the given path
Deprecated since version 2.6.6: “Deprecated in favor of using Pandas. Pandas supports reading from object storage directly. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/loading_data/connect.html”,
- Parameters:
source (Union[str, pandas.DataFrame, h2o.DataFrame, pyspark.sql.dataframe.DataFrame]) – If str, URI for the dataset. The dataset could be read from local or network file system, hdfs, s3, gcs and optionally pyspark in pyspark conda env
target (str, optional) – Name of the target in dataset. If set an ADSDatasetWithTarget object is returned, otherwise an ADSDataset object is returned which can be used to understand the dataset through visualizations
format (str, default: infer) – Format of the dataset. Supported formats: CSV, TSV, Parquet, libsvm, JSON, XLS/XLSX (Excel), HDF5, SQL, XML, Apache server log files (clf, log), ARFF. By default, the format would be inferred from the ending of the dataset file path.
reader_fn (Callable, default: None) – The user may pass in their own custom reader function. It must accept (path, **kwarg) and return a pandas DataFrame
name (str, optional default: "")
description (str, optional default: "") – Text describing the dataset
npartitions (int, deprecated) – Number of partitions to split the data By default this is set to the max number of cores supported by the backend compute accelerator
type_discovery (bool, default: True) – If False, the data types of the dataframe are used as such. By default, the dataframe columns are associated with the best suited data types. Associating the features with the discovered datatypes would impact visualizations and model prediction.
html_table_index (int, optional) – The index of the dataframe table in html content. This is used when the format of dataset is html
column_names ('infer', list of str or None, default: 'infer') – Supported only for CSV and TSV. List of column names to use. By default, column names are inferred from the first line of the file. If set to None, column names would be auto-generated instead of inferring from file. If the file already contains a column header, specify header=0 to ignore the existing column names.
sample_max_rows (int, default: 10000, use -1 to auto-calculate the sample size, use 0 (zero) for no sampling) – Sample size of the dataframe to use for visualization and optimization.
positive_class (Any, optional) – Label in target for binary classification problems which should be identified as positive for modeling. By default, the first unique value is considered as the positive label.
types (dict, optional) – Dictionary of <feature_name> : <data_type> to override the data type of features.
transformer_pipeline (datasets.pipeline.TransformerPipeline, optional) – A pipeline of transformations done outside the sdk and need to be applied at the time of scoring
storage_options (dict, default: varies by source type) – Parameters passed on to the backend filesystem class.
sep (str) – Delimiting character for parsing the input file.
kwargs (additional keyword arguments that would be passed to underlying dataframe read API) – based on the format of the dataset
- Returns:
dataset (An instance of ADSDataset)
(or)
dataset_with_target (An instance of ADSDatasetWithTarget)
- ads.dataset.helper.parse_apache_log_datetime(x)[source]¶
- Parses datetime with timezone formatted as:
[day/month/year:hour:minute:second zone]
Source: https://mmas.github.io/read-apache-access-log-pandas
Example
>>> parse_apache_log_datetime('13/Nov/2015:11:45:42 +0000')
datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)
Due to problems parsing the timezone (%z) with datetime.strptime, the timezone will be obtained using the pytz library.
- ads.dataset.helper.parse_apache_log_str(x)[source]¶
Returns the string delimited by two characters.
Source: https://mmas.github.io/read-apache-access-log-pandas
Example
>>> parse_apache_log_str('[my string]')
'my string'
- ads.dataset.helper.up_sample(df, target, sampler='default', feature_types=None)[source]¶
Fixes imbalanced dataset by up-sampling
- Parameters:
df (Union[pandas.DataFrame, dask.dataframe.core.DataFrame])
target (name of the target column in df)
sampler (Should implement fit_resample(X,y) method)
fillna (a dictionary contains the column name as well as the fill value,) – only needed when the column has missing values
- Returns:
upsampled_df
- Return type:
Union[pandas.DataFrame, dask.dataframe.core.DataFrame]
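Examples
A minimal sketch (the file and target column names are hypothetical):
>>> import pandas as pd
>>> from ads.dataset.helper import up_sample
>>> df = pd.read_csv("imbalanced_data.csv")
>>> upsampled_df = up_sample(df, target="target_class", sampler="default")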
- ads.dataset.helper.write_parquet(path, data, engine='fastparquet', metadata_dict=None, compression=None, storage_options=None)[source]¶
Uses fast parquet to write dask dataframe and custom metadata in parquet format
- Parameters:
path (str) – Path to write to
data (pandas.DataFrame)
engine (string) – The parquet engine to use; "fastparquet" by default.
metadata_dict (Deprecated, will not pass through)
compression ({{'snappy', 'gzip', 'brotli', None}}, default 'snappy') – Name of the compression to use
storage_options (dict, optional) – storage arguments required to read the path
- Returns:
the file path the parquet was written to
- Return type:
str
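Examples
A minimal sketch ("data.csv" and the output path are hypothetical):
>>> import pandas as pd
>>> from ads.dataset.helper import write_parquet
>>> df = pd.read_csv("data.csv")
>>> out_path = write_parquet("my/path.parquet", df)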
ads.dataset.label_encoder module¶
- class ads.dataset.label_encoder.DataFrameLabelEncoder[source]¶
Bases:
TransformerMixin
Label encoder for pandas.DataFrame and dask.dataframe.core.DataFrame.
- label_encoders¶
Holds the label encoder for each column.
- Type:
defaultdict
Examples
>>> import pandas as pd
>>> from ads.dataset.label_encoder import DataFrameLabelEncoder
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> le = DataFrameLabelEncoder()
>>> le.fit_transform(X=df)
Initialize an instance of DataFrameLabelEncoder.
ads.dataset.pipeline module¶
- class ads.dataset.pipeline.TransformerPipeline(steps)[source]¶
Bases:
Pipeline
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') TransformerPipeline ¶
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
ads.dataset.plot module¶
ads.dataset.progress module¶
- class ads.dataset.progress.DummyProgressBar(*args, **kwargs)[source]¶
Bases:
ProgressBar
- class ads.dataset.progress.TqdmProgressBar(max_progress=100, description='Running', verbose=False)[source]¶
Bases:
ProgressBar
ads.dataset.recommendation module¶
- class ads.dataset.recommendation.Recommendation(ds, recommendation_transformer)[source]¶
Bases:
object
- recommendation_type_labels = ['Constant Columns', 'Potential Primary Key Columns', 'Imputation', 'Multicollinear Columns', 'Identify positive label for target', 'Fix imbalance in dataset']¶
- recommendation_types = ['constant_column', 'primary_key', 'imputation', 'strong_correlation', 'positive_class', 'fix_imbalance']¶
ads.dataset.recommendation_transformer module¶
- class ads.dataset.recommendation_transformer.RecommendationTransformer(feature_metadata=None, correlation=None, target=None, is_balanced=False, target_type=None, feature_ranking=None, len=0, fix_imbalance=True, auto_transform=True, correlation_threshold=0.7)[source]¶
Bases:
TransformerMixin
- fit_transform(X, y=None, **fit_params)[source]¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
ads.dataset.regression_dataset module¶
- class ads.dataset.regression_dataset.RegressionDataset(df, sampled_df, target, target_type, shape, **kwargs)[source]¶
Bases:
ADSDatasetWithTarget
ads.dataset.sampled_dataset module¶
- class ads.dataset.sampled_dataset.PandasDataset(sampled_df, type_discovery=True, types={}, metadata=None, progress=<ads.dataset.progress.DummyProgressBar object>)[source]¶
Bases:
object
This class provides APIs that can work on a sampled dataset.
- plot(x, y=None, plot_type='infer', yscale=None, verbose=True, sample_size=0)[source]¶
Supports plotting feature distribution, and relationship between features.
- Parameters:
x (str) – The name of the feature to plot
y (str, optional) – Name of the feature to plot against x
plot_type (str, default: infer) –
Override the inferred plot type for certain combinations of the data types of x and y. By default, the best plot type is inferred based on x and y data types. Valid values:
box_plot - discrete feature vs continuous feature. Draw a box plot to show distributions with respect to categories,
scatter - continuous feature vs continuous feature. Draw a scatter plot with possibility of several semantic groupings.
yscale (str, optional) – One of {“linear”, “log”, “symlog”, “logit”}. The y axis scale type to apply. Can be used when either x or y is an ordinal feature.
verbose (bool, default True) – Displays Note/Tips if True
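Examples
A minimal sketch (ADSDataset inherits these APIs from PandasDataset; "data.csv" and the column names are hypothetical):
>>> import pandas as pd
>>> from ads.dataset.dataset import ADSDataset
>>> ds = ADSDataset.from_dataframe(pd.read_csv("data.csv"))
>>> ds.plot("col_a")
>>> ds.plot("col_a", y="col_b", plot_type="scatter")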
- plot_gis_scatter(lon='longitude', lat='latitude', ax=None)[source]¶
Supports plotting Choropleth maps
ads.dataset.target module¶
- class ads.dataset.target.TargetVariable(sampled_ds, target, target_type)[source]¶
Bases:
object
This class provides target specific APIs.