ads.feature_engineering.accessor.mixin package

Submodules

ads.feature_engineering.accessor.mixin.correlation module

ads.feature_engineering.accessor.mixin.correlation.cat_vs_cat(df: DataFrame, normal_form: bool = True) → DataFrame: Calculates the correlation between all pairs of categorical features.

ads.feature_engineering.accessor.mixin.correlation.cat_vs_cont(df: DataFrame, categorical_columns, continuous_columns, normal_form: bool = True) → DataFrame: Calculates the correlation of all pairs of categorical features and continuous features.

ads.feature_engineering.accessor.mixin.correlation.cont_vs_cont(df: DataFrame, normal_form: bool = True) → DataFrame: Calculates the Pearson correlation between two columns of the DataFrame.

ads.feature_engineering.accessor.mixin.eda_mixin module

This exploratory data analysis (EDA) mixin is used in the ADS accessor for the pandas DataFrame. Its purpose-driven methods enable data scientists to analyze the DataFrame.

From the accessor we have access to the pandas object the user is interacting with as well as corresponding lists of feature types per column.

class ads.feature_engineering.accessor.mixin.eda_mixin.EDAMixin

Bases: object

correlation_ratio() → DataFrame

Generate a Correlation Ratio data frame for all categorical-continuous variable pairs.

Returns:

pandas.DataFrame
Correlation Ratio correlation data frame with the following 3 columns –
1. Column 1 (name of the first categorical/continuous column)
2. Column 2 (name of the second categorical/continuous column)
3. Value (correlation value)

Note

Pairs will be replicated. For example for variables x and y, we would have (x,y), (y,x) both with same correlation value. We will also have (x,x) and (y,y) with value 1.0.
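
The correlation ratio (eta) for one categorical-continuous pair can be sketched in plain Python as follows. This is an illustrative sketch, not the ADS implementation: the helper name correlation_ratio is hypothetical, and ADS may differ in details such as handling of missing values or whether eta or eta squared is reported.

```python
from math import sqrt


def correlation_ratio(categories, values):
    """Return eta for one categorical and one continuous column (illustrative only)."""
    # Group the continuous values by their category label.
    groups = {}
    for cat, val in zip(categories, values):
        groups.setdefault(cat, []).append(val)

    overall_mean = sum(values) / len(values)

    # Between-group sum of squares: spread of the group means around the overall mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    # Total sum of squares: spread of all values around the overall mean.
    ss_total = sum((v - overall_mean) ** 2 for v in values)

    # A constant continuous column has no variance; return 0.0 in that case.
    return sqrt(ss_between / ss_total) if ss_total else 0.0


species = ["a", "a", "b", "b"]
weight = [1.0, 1.0, 3.0, 3.0]
eta = correlation_ratio(species, weight)  # category fully determines the value
```

A value near 1.0 means the category explains almost all of the variance in the continuous column; a value near 0.0 means the group means are nearly identical.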

correlation_ratio_plot() → Axes

Generate a heatmap of the Correlation Ratio correlation for all categorical-continuous variable pairs.

Returns:: Correlation Ratio correlation plot object that can be updated by the customer
Return type:: Plot object

cramersv() → DataFrame

Generate a Cramer’s V correlation data frame for all categorical variable pairs.

Gives a warning for dropped non-categorical columns.

Returns:

Cramer’s V correlation data frame with the following 3 columns:

Column 1 (name of the first categorical column)
Column 2 (name of the second categorical column)
Value (correlation value)

Return type:

pandas.DataFrame

Note

Pairs will be replicated. For example for variables x and y, we would have (x,y), (y,x) both with same correlation value. We will also have (x,x) and (y,y) with value 1.0.
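
For reference, the textbook Cramer’s V statistic for a single pair of categorical columns can be sketched with the standard library alone. The function name cramers_v is hypothetical, and this sketch applies no bias correction; the ADS implementation may compute the statistic differently.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Return the uncorrected Cramer's V for two categorical sequences (illustrative only)."""
    n = len(x)
    joint = Counter(zip(x, y))  # observed counts per (x, y) cell
    x_counts = Counter(x)       # row totals
    y_counts = Counter(y)       # column totals

    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for x_val, x_n in x_counts.items():
        for y_val, y_n in y_counts.items():
            expected = x_n * y_n / n
            observed = joint.get((x_val, y_val), 0)
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by the sample size and the smaller table dimension minus one.
    k = min(len(x_counts), len(y_counts)) - 1
    # The statistic is undefined when either column is constant; return 0.0 here.
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


sex = ["m", "m", "f", "f"]
survived = ["y", "y", "n", "n"]
v = cramers_v(sex, survived)  # perfectly associated columns
```

In this sketch, a column paired with itself yields 1.0 whenever it has at least two distinct values, and swapping the arguments does not change the result, which is consistent with the replicated pairs described in the note above.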

cramersv_plot() → Axes

Generate a heatmap of the Cramer’s V correlation for all categorical variable pairs.

Gives a warning for dropped non-categorical columns.

Returns:: Cramer’s V correlation plot object that can be updated by the customer
Return type:: Plot object

feature_count() → DataFrame

Counts the number of columns for each feature type and for each primary feature type. The Primary column gives the number of columns that have the feature type assigned as their primary feature type.

Returns:: DataFrame with the number of columns for each feature type and the number of columns for each primary feature type.
Return type:: pandas.DataFrame

Examples

>>> df.ads.feature_type
{'PassengerId': ['ordinal', 'category'],
'Survived': ['ordinal'],
'Pclass': ['ordinal'],
'Name': ['category'],
'Sex': ['category']}
>>> df.ads.feature_count()
    Feature Type        Count       Primary
0       category            3             2
1        ordinal            3             3
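
The counts in the example above can be reproduced by hand from the feature type mapping. The sketch below is not the ADS implementation; the helper name count_feature_types is hypothetical, and it assumes that the first entry in each list is the column's primary feature type, which is consistent with the output shown.

```python
from collections import Counter

# Feature types per column, copied from the example above.
feature_types = {
    "PassengerId": ["ordinal", "category"],
    "Survived": ["ordinal"],
    "Pclass": ["ordinal"],
    "Name": ["category"],
    "Sex": ["category"],
}


def count_feature_types(feature_types):
    """Return (feature type, count, primary count) rows sorted by feature type."""
    count = Counter()    # columns that carry the feature type at all
    primary = Counter()  # columns whose first (assumed primary) type is this one
    for types in feature_types.values():
        count.update(types)
        if types:
            primary[types[0]] += 1
    return [(name, count[name], primary[name]) for name in sorted(count)]


rows = count_feature_types(feature_types)
# rows == [("category", 3, 2), ("ordinal", 3, 3)]
```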

feature_plot() → DataFrame

Generates a list of summary plots, one for each column in the DataFrame, based on the column's most relevant feature type.

Returns:: Dataframe with 2 columns: 1. Column - feature name 2. Plot - plot object
Return type:: pandas.DataFrame

feature_stat() → DataFrame

Provides a DataFrame of summary statistics.

Returns feature statistics for each column using the FeatureType summary method.

Examples

>>> df = pd.read_csv('~/advanced-ds/tests/vor_datasets/vor_titanic.csv')
>>> df.ads.feature_stat().head()
        Column              Metric    Value
0  PassengerId               count  891.000
1  PassengerId                mean  446.000
2  PassengerId  standard deviation  257.354
3  PassengerId      sample minimum    1.000
4  PassengerId      lower quartile  223.500

Returns:: Dataframe with 3 columns: name, metric, value
Return type:: pandas.DataFrame

pearson() → DataFrame

Generate a Pearson correlation data frame for all continuous variable pairs.

Gives a warning for dropped non-numerical columns.

Returns:

pandas.DataFrame
Pearson correlation data frame with the following 3 columns –
1. Column 1 (name of the first continuous column)
2. Column 2 (name of the second continuous column)
3. Value (correlation value)

Note

Pairs will be replicated. For example for variables x and y, we’d have (x,y), (y,x) both with same correlation value. We’ll also have (x,x) and (y,y) with value 1.0.
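
The three-column layout with replicated pairs can be illustrated with a small standard-library sketch. The helper names pearson and pearson_long_form are hypothetical and the code is not the ADS implementation; it only shows how every ordered pair of continuous columns, including each column paired with itself, ends up as one row.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would raise ZeroDivisionError.
    return cov / sqrt(var_x * var_y)


def pearson_long_form(columns):
    """Return (column 1, column 2, value) rows for every ordered pair of columns."""
    return [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in product(columns, repeat=2)
    ]


columns = {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 9.0]}
rows = pearson_long_form(columns)
# Four rows: (x, x), (x, y), (y, x), (y, y)
```

With two columns, the result has four rows: the two diagonal rows carry the value 1.0, and the (x, y) and (y, x) rows carry the same correlation value.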

pearson_plot() → Axes

Generate a heatmap of the Pearson correlation for all continuous variable pairs.

Returns:: Pearson correlation plot object that can be updated by the customer
Return type:: Plot object

warning() → DataFrame

Generates a data frame that lists feature specific warnings.

Returns:: The list of feature specific warnings.
Return type:: pandas.DataFrame

Examples

>>> df.ads.warning()
    Column    Feature Type         Warning               Message       Metric    Value
--------------------------------------------------------------------------------------
0      Age      continuous           Zeros      Age has 38 zeros        Count       38
1      Age      continuous           Zeros   Age has 12.2% zeros   Percentage    12.2%

ads.feature_engineering.accessor.mixin.eda_mixin_series module

This exploratory data analysis (EDA) mixin is used in the ADS accessor for the pandas Series. Its purpose-driven methods enable data scientists to complete univariate analysis.

From the accessor we have access to the pandas object the user is interacting with as well as corresponding list of feature types.

class ads.feature_engineering.accessor.mixin.eda_mixin_series.EDAMixinSeries

Bases: object

feature_plot() → Axes

For the series generate a summary plot based on the most relevant feature type.

Returns:: Plot object for the series based on the most relevant feature type.
Return type:: matplotlib.axes._subplots.AxesSubplot

feature_stat() → DataFrame

Provides a DataFrame of summary statistics.

Returns feature statistics for the series using the FeatureType summary method.

Examples

>>> df = pd.read_csv('~/advanced-ds/tests/vor_datasets/vor_titanic.csv')
>>> df['Cabin'].ads.feature_stat()
    Metric  Value
0    count    891
1   unique    147
2  missing    687

Returns:: Dataframe with 2 columns and rows for different metric values
Return type:: pandas.DataFrame

warning() → DataFrame

Generates a data frame that lists feature specific warnings.

Returns:: The list of feature specific warnings.
Return type:: pandas.DataFrame

Examples

>>> df["Age"].ads.warning()
  Feature Type       Warning               Message         Metric      Value
 ---------------------------------------------------------------------------
0   continuous         Zeros      Age has 38 zeros          Count         38
1   continuous         Zeros   Age has 12.2% zeros     Percentage      12.2%

ads.feature_engineering.accessor.mixin.feature_types_mixin module

The module that represents the ADS Feature Types Mixin class that extends Pandas Series and Dataframe accessors.

Classes

ADSFeatureTypesMixin
ADS Feature Types Mixin class that extends Pandas Series and Dataframe accessors.

class ads.feature_engineering.accessor.mixin.feature_types_mixin.ADSFeatureTypesMixin

Bases: object

ADS Feature Types Mixin class that extends Pandas Series and DataFrame accessors.

warning_registered(cls) → pd.DataFrame: Lists registered warnings for registered feature types.

validator_registered(cls) → pd.DataFrame: Lists registered validators for registered feature types.

help(self, prop: str = None) → None: Help method that prints either a table of available properties or, given a property, that property's docstring.

help(prop: Optional[str] = None) → None

Help method that prints either a table of available properties or, given an individual property, that property's docstring.

Parameters:: prop (str, optional) – The name of the property.
Returns:: Nothing.
Return type:: None

validator_registered() → DataFrame

Lists registered validators for registered feature types.

Returns:: The list of registered validators for registered feature types
Return type:: pandas.DataFrame

Examples

>>> df.ads.validator_registered()
         Column     Feature Type        Validator                 Condition                    Handler
------------------------------------------------------------------------------------------------------
0   PhoneNumber    phone_number   is_phone_number                        ()            default_handler
1   PhoneNumber    phone_number   is_phone_number    {'country_code': '+7'}   specific_country_handler
2    CreditCard    credit_card     is_credit_card                        ()            default_handler

>>> df['PhoneNumber'].ads.validator_registered()
    Feature Type            Validator                 Condition                     Handler
-------------------------------------------------------------------------------------------
0   phone_number      is_phone_number                        ()             default_handler
1   phone_number      is_phone_number    {'country_code': '+7'}    specific_country_handler

warning_registered() → DataFrame

Lists registered warnings for all registered feature types.

Returns:: The list of registered warnings for registered feature types.
Return type:: pandas.DataFrame

Examples

>>> df.ads.warning_registered()
       Column    Feature Type             Warning                    Handler
   -------------------------------------------------------------------------
   0      Age      continuous               zeros              zeros_handler
   1      Age      continuous    high_cardinality   high_cardinality_handler

>>> df["Age"].ads.warning_registered()
       Feature Type             Warning                    Handler
   ---------------------------------------------------------------
   0     continuous               zeros              zeros_handler
   1     continuous    high_cardinality   high_cardinality_handler

ads.feature_engineering.accessor.mixin.utils module

Module contents