ads.feature_engineering.accessor.mixin package

Submodules

ads.feature_engineering.accessor.mixin.correlation module

ads.feature_engineering.accessor.mixin.correlation.cat_vs_cat(df: DataFrame, normal_form: bool = True) DataFrame

Calculates the correlation between all pairs of categorical features.

ads.feature_engineering.accessor.mixin.correlation.cat_vs_cont(df: DataFrame, categorical_columns, continuous_columns, normal_form: bool = True) DataFrame

Calculates the correlation between every categorical feature and every continuous feature.

ads.feature_engineering.accessor.mixin.correlation.cont_vs_cont(df: DataFrame, normal_form: bool = True) DataFrame

Calculates the Pearson correlation between two columns of the DataFrame.

ads.feature_engineering.accessor.mixin.eda_mixin module

This exploratory data analysis (EDA) Mixin is used in the ADS accessor for the Pandas Dataframe. Its purpose-driven methods enable the data scientist to complete analysis on the dataframe.

From the accessor we have access to the pandas object the user is interacting with as well as corresponding lists of feature types per column.

class ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin

Bases: object

correlation_ratio() DataFrame

Generate a Correlation Ratio data frame for all categorical-continuous variable pairs.

Returns:

Correlation Ratio correlation data frame with the following 3 columns:
  1. Column 1 (name of the first categorical/continuous column)

  2. Column 2 (name of the second categorical/continuous column)

  3. Value (correlation value)

Return type:

pandas.DataFrame

Note

Pairs will be replicated. For example, for variables x and y, the result contains both (x,y) and (y,x) with the same correlation value, as well as (x,x) and (y,y) with value 1.0.
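
As a rough illustration of the statistic behind correlation_ratio(), the sketch below computes the textbook correlation ratio (eta) for one categorical and one continuous column using only the standard library. It is not the ADS implementation: the function name, the sample data, and the choice to report eta rather than eta squared are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def correlation_ratio(categories, values):
    """Return eta for one categorical and one continuous sequence (hypothetical helper)."""
    # Group the continuous values by their category label.
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)

    n = len(values)
    grand_mean = sum(values) / n

    # Between-group sum of squares, weighted by group size.
    between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    # Total sum of squares around the grand mean.
    total = sum((v - grand_mean) ** 2 for v in values)

    # A constant continuous column has no variance to explain.
    return sqrt(between / total) if total else 0.0


# Illustrative data only: the category separates the values almost perfectly.
sex = ["m", "m", "f", "f"]
fare = [10.0, 12.0, 30.0, 32.0]
eta = correlation_ratio(sex, fare)
```

A value near 1.0 means the category explains most of the variance in the continuous column; a value of 0.0 means the group means are identical.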

correlation_ratio_plot() Axes

Generate a heatmap of the Correlation Ratio correlation for all categorical-continuous variable pairs.

Returns:

Correlation Ratio correlation plot object that can be updated by the customer

Return type:

Plot object

cramersv() DataFrame

Generate a Cramer’s V correlation data frame for all categorical variable pairs.

Gives a warning for dropped non-categorical columns.

Returns:

Cramer’s V correlation data frame with the following 3 columns:
  1. Column 1 (name of the first categorical column)

  2. Column 2 (name of the second categorical column)

  3. Value (correlation value)

Return type:

pandas.DataFrame

Note

Pairs will be replicated. For example, for variables x and y, the result contains both (x,y) and (y,x) with the same correlation value, as well as (x,x) and (y,y) with value 1.0.
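
For readers unfamiliar with the statistic, the sketch below computes the uncorrected, textbook form of Cramer's V for two categorical columns with the standard library. It is an assumption-laden illustration rather than the ADS implementation, which may apply a bias correction; the function name and the sample data are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramer's V for two equal-length categorical sequences (hypothetical helper)."""
    n = len(xs)
    observed = Counter(zip(xs, ys))  # cell counts of the contingency table
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # Pearson chi-squared statistic over every cell, including empty ones.
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    # With a single level in either column the statistic is undefined; report 0.
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Illustrative data only.
perfect = cramers_v(["a", "a", "b", "b"], ["u", "u", "v", "v"])
independent = cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

The statistic is symmetric in its two arguments, which is why (x,y) and (y,x) carry the same value in the returned data frame.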

cramersv_plot() Axes

Generate a heatmap of the Cramer’s V correlation for all categorical variable pairs.

Gives a warning for dropped non-categorical columns.

Returns:

Cramer’s V correlation plot object that can be updated by the customer

Return type:

Plot object

feature_count() DataFrame

Counts the number of columns for each feature type and for each primary feature type. The Primary column gives the number of columns that have the feature type assigned as their primary type.

Returns:

Dataframe with the number of columns for each feature type and the number of columns for each primary feature type.

Return type:

pandas.DataFrame

Examples

>>> df.ads.feature_type
{'PassengerId': ['ordinal', 'category'],
'Survived': ['ordinal'],
'Pclass': ['ordinal'],
'Name': ['category'],
'Sex': ['category']}
>>> df.ads.feature_count()
    Feature Type        Count       Primary
0       category            3             2
1        ordinal            3             3
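
The Count and Primary figures in the example above can be reproduced with a few lines of standard-library Python. This is a sketch, not the ADS implementation, and it assumes that the first entry in each column's feature type list is its primary type, which is consistent with the output shown.

```python
from collections import Counter

# Feature types copied from the df.ads.feature_type example.
feature_types = {
    "PassengerId": ["ordinal", "category"],
    "Survived": ["ordinal"],
    "Pclass": ["ordinal"],
    "Name": ["category"],
    "Sex": ["category"],
}

# Count: every column that lists the type anywhere in its type list.
count = Counter(t for types in feature_types.values() for t in types)
# Primary: only the first type per column (assumed to be the primary one).
primary = Counter(types[0] for types in feature_types.values())

rows = [(t, count[t], primary[t]) for t in sorted(count)]
# rows == [("category", 3, 2), ("ordinal", 3, 3)]
```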

feature_plot() DataFrame

For every column in the dataframe, generate a list of summary plots based on the most relevant feature type.

Returns:

Dataframe with 2 columns:
  1. Column (feature name)

  2. Plot (plot object)

Return type:

pandas.DataFrame

feature_stat() DataFrame

Provides a Dataframe of summary statistics.

Returns feature statistics for each column using the FeatureType summary method.

Examples

>>> df = pd.read_csv('~/advanced-ds/tests/vor_datasets/vor_titanic.csv')
>>> df.ads.feature_stat().head()
        Column              Metric    Value
0  PassengerId               count  891.000
1  PassengerId                mean  446.000
2  PassengerId  standard deviation  257.354
3  PassengerId      sample minimum    1.000
4  PassengerId      lower quartile  223.500

Returns:

Dataframe with 3 columns: Column, Metric, Value

Return type:

pandas.DataFrame

pearson() DataFrame

Generate a Pearson correlation data frame for all continuous variable pairs.

Gives a warning for dropped non-numerical columns.

Returns:

Pearson correlation data frame with the following 3 columns:
  1. Column 1 (name of the first continuous column)

  2. Column 2 (name of the second continuous column)

  3. Value (correlation value)

Return type:

pandas.DataFrame

Note

Pairs will be replicated. For example for variables x and y, we’d have (x,y), (y,x) both with same correlation value. We’ll also have (x,x) and (y,y) with value 1.0.
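
To make the replicated-pairs layout concrete, the sketch below builds the three-column long format for every ordered pair of columns, using a hand-written Pearson coefficient and the standard library only. The helper names and the sample data are hypothetical, and this is not how ADS computes the result internally.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def long_format(columns):
    """Return (Column 1, Column 2, Value) rows for every ordered pair of columns."""
    return [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in product(columns, repeat=2)
    ]


# Illustrative data only.
data = {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 9.0]}
rows = long_format(data)  # pairs in order: (x,x), (x,y), (y,x), (y,y)
```

With two columns the result has four rows: the two diagonal pairs with value 1.0 and the two mirrored pairs sharing one value.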

pearson_plot() Axes

Generate a heatmap of the Pearson correlation for all continuous variable pairs.

Returns:

Pearson correlation plot object that can be updated by the customer

Return type:

Plot object

warning() DataFrame

Generates a data frame that lists feature specific warnings.

Returns:

The list of feature specific warnings.

Return type:

pandas.DataFrame

Examples

>>> df.ads.warning()
    Column    Feature Type         Warning               Message       Metric    Value
--------------------------------------------------------------------------------------
0      Age      continuous           Zeros      Age has 38 zeros        Count       38
1      Age      continuous           Zeros   Age has 12.2% zeros   Percentage    12.2%
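
The example output pairs each warning with two rows, one reporting a count and one reporting a percentage. The sketch below shows how rows of that shape could be derived for a zeros check. It is purely illustrative: the function name, the row layout as tuples, and the sample values are assumptions, not the ADS warning handler.

```python
def zeros_warning(name, values):
    """Return Count and Percentage rows when a column contains zeros (hypothetical helper)."""
    zeros = sum(1 for v in values if v == 0)
    if zeros == 0:
        return []  # nothing to warn about

    pct = f"{100 * zeros / len(values):.1f}%"
    # Each row: (Column, Warning, Message, Metric, Value)
    return [
        (name, "Zeros", f"{name} has {zeros} zeros", "Count", zeros),
        (name, "Zeros", f"{name} has {pct} zeros", "Percentage", pct),
    ]


# Illustrative data only: 3 of the 8 values are zero.
rows = zeros_warning("Age", [0, 22, 0, 35, 41, 0, 18, 29])
```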

ads.feature_engineering.accessor.mixin.eda_mixin_series module

This exploratory data analysis (EDA) Mixin is used in the ADS accessor for the Pandas Series. Its purpose-driven methods enable the data scientist to complete univariate analysis.

From the accessor we have access to the pandas object the user is interacting with as well as corresponding list of feature types.

class ads.feature_engineering.accessor.mixin.eda_mixin_series.EDAMixinSeries

Bases: object

feature_plot() Axes

For the series generate a summary plot based on the most relevant feature type.

Returns:

Plot object for the series based on the most relevant feature type.

Return type:

matplotlib.axes._subplots.AxesSubplot

feature_stat() DataFrame

Provides a Dataframe of summary statistics.

Returns feature statistics for the series using the FeatureType summary method.

Examples

>>> df = pd.read_csv('~/advanced-ds/tests/vor_datasets/vor_titanic.csv')
>>> df['Cabin'].ads.feature_stat()
    Metric    Value
0    count      891
1   unique      147
2  missing      687

Returns:

Dataframe with 2 columns and rows for different metric values

Return type:

pandas.DataFrame

warning() DataFrame

Generates a data frame that lists feature specific warnings.

Returns:

The list of feature specific warnings.

Return type:

pandas.DataFrame

Examples

>>> df["Age"].ads.warning()
  Feature Type       Warning               Message         Metric      Value
 ---------------------------------------------------------------------------
0   continuous         Zeros      Age has 38 zeros          Count         38
1   continuous         Zeros   Age has 12.2% zeros     Percentage      12.2%

ads.feature_engineering.accessor.mixin.feature_types_mixin module

The module that represents the ADS Feature Types Mixin class that extends Pandas Series and Dataframe accessors.

Classes

ADSFeatureTypesMixin

ADS Feature Types Mixin class that extends Pandas Series and Dataframe accessors.

class ads.feature_engineering.accessor.mixin.feature_types_mixin.ADSFeatureTypesMixin

Bases: object

ADS Feature Types Mixin class that extends Pandas Series and DataFrame accessors.

warning_registered(cls) pd.DataFrame

Lists registered warnings for registered feature types.

validator_registered(cls) pd.DataFrame

Lists registered validators for registered feature types.

help(self, prop: str = None) None

Help method that prints either a table of available properties or, given a property, returns its docstring.

help(prop: Optional[str] = None) None

Help method that prints either a table of available properties or, given an individual property, returns its docstring.

Parameters:

prop (str) – The name of the property.

Returns:

Nothing.

Return type:

None

validator_registered() DataFrame

Lists registered validators for registered feature types.

Returns:

The list of registered validators for registered feature types

Return type:

pandas.DataFrame

Examples

>>> df.ads.validator_registered()
         Column     Feature Type        Validator                 Condition                    Handler
------------------------------------------------------------------------------------------------------
0   PhoneNumber    phone_number   is_phone_number                        ()            default_handler
1   PhoneNumber    phone_number   is_phone_number    {'country_code': '+7'}   specific_country_handler
2    CreditCard    credit_card     is_credit_card                        ()            default_handler
>>> df['PhoneNumber'].ads.validator_registered()
    Feature Type            Validator                 Condition                     Handler
-------------------------------------------------------------------------------------------
0   phone_number      is_phone_number                        ()             default_handler
1   phone_number      is_phone_number    {'country_code': '+7'}    specific_country_handler

warning_registered() DataFrame

Lists registered warnings for all registered feature types.

Returns:

The list of registered warnings for registered feature types.

Return type:

pandas.DataFrame

Examples

>>> df.ads.warning_registered()
       Column    Feature Type             Warning                    Handler
   -------------------------------------------------------------------------
   0      Age      continuous               zeros              zeros_handler
   1      Age      continuous    high_cardinality   high_cardinality_handler
>>> df["Age"].ads.warning_registered()
       Feature Type             Warning                    Handler
   ---------------------------------------------------------------
   0     continuous               zeros              zeros_handler
   1     continuous    high_cardinality   high_cardinality_handler

ads.feature_engineering.accessor.mixin.utils module

Module contents