Overview

There is a distinction between the data type of a feature and the nature of data that it represents. The data type represents the form of the data that the computer understands. ADS uses the term “feature type” to refer to the nature of the data. For example, a medical record id could be represented as an integer, its data type, but the feature type would be “medical record id”. The feature type represents the data the way the data scientist understands it. Pandas uses the term ‘column’ or ‘Series’ to refer to a column of data. In ADS the term ‘feature’ is used to refer to a column or series when feature types have been assigned to it.

ADS provides the feature type module on top of your Pandas dataframes and series to manage and use the typing information to better understand your data. The feature type framework comes with some common feature types. However, the power of using feature types is that you can easily create your own and apply them to your specific data. You don’t need to try to represent your data in a synthetic way that does not match the nature of your data. This framework allows you to create methods that validate whether the data fits the specifications of your organization. For example, for a medical record type you could create methods to validate that the data is properly formatted. You can also have the system generate warnings to ensure the data is valid as a whole or create graphs for summary plots.

The framework allows you to create and assign multiple feature types. For example, a medical record id could also have a feature type id and an integer feature type.

Key Components

The feature type system allows data scientists to separate the concept of how data is represented physically from what the data actually measures. That is, the data can have feature types that classify the data based on what it represents and not how the data is stored in memory. Each set of data can have multiple feature types through a system of multiple inheritance. For example, an organization that sells cars might have a set of data that represents the price it paid for each car, that is, the wholesale price. That feature could have the feature types wholesale_price, car_price, USD, and continuous. This multiple inheritance allows a data scientist to create feature type warnings and feature type validators for each feature type.

A feature type is a class that inherits from FeatureType. It has several attributes and methods that can be overridden to customize the properties of the feature type. The following is a brief summary of some of the key methods.

Correlations

There are also various correlation methods, such as .correlation_ratio(), .pearson(), and .cramersv() that provide information about the correlation between different features in the form of a dataframe. Each row represents a single correlation metric. This information can also be represented in a plot with the .correlation_ratio_plot(), .pearson_plot(), and .cramersv_plot() methods.

Multiple Inheritance

A feature gets multiple feature types through a system of inheritance. For example, a hospital may have a medical record number for each patient. That data might have the patient_id, id, and integer feature types. The patient_id is the child feature type with id being its parent. The integer is the parent of the id feature type. It’s also the last feature type in the inheritance chain, and is called the default feature type.

When calling attributes and methods on a feature type, ADS searches the inheritance chain for the first matching feature type that defines the attribute or method that you are calling. For example, you want to produce statistics for the previously described patient id feature. Assume that the patient_id class didn’t override the .feature_stat() method. ADS would then look to the id feature type and see if it was overridden. If it was, it dispatches that method.
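The lookup described above can be sketched in plain Python. This is only an illustration of the idea, not how ADS implements it: the classes PatientId, Id, and Integer and the dispatch() helper are hypothetical stand-ins, and the statistics they return are made up for the example.

```python
# Illustrative stand-ins for feature types; these are not ADS classes.
class Integer:
    """Default feature type at the end of the chain."""

    @staticmethod
    def feature_stat(values):
        return {"count": len(values), "unique": len(set(values))}


class Id(Integer):
    """Parent feature type that overrides feature_stat()."""

    @staticmethod
    def feature_stat(values):
        return {"count": len(values), "duplicates": len(values) - len(set(values))}


class PatientId(Id):
    """Child feature type that does not override feature_stat()."""


def dispatch(chain, method_name):
    """Return the method from the first type in the chain that defines it itself."""
    for feature_type in chain:
        if method_name in vars(feature_type):
            return getattr(feature_type, method_name)
    raise AttributeError(method_name)


chain = [PatientId, Id, Integer]  # child first, default type last
stat = dispatch(chain, "feature_stat")
print(stat([101, 102, 102]))  # uses the Id version, because PatientId defines none
```

Because PatientId defines no feature_stat() of its own, the search moves on to Id and stops there. Integer is consulted only when no earlier type in the chain supplies the method.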

This system allows you to override only the methods that are specific to the feature type that you are creating and improves the reusability of your code. The default feature types are specified by ADS, and they have overridden all the attributes and methods with smart defaults. Therefore, you don’t need to override any of these properties unless you want to.

Summary Plot

The .feature_plot() method returns a Seaborn plot object that summarizes the feature. You can define what you want the plot to look like for your feature. Further, you can modify the plot after it’s returned, which allows you to customize it to fit your specific needs.

Summary Statistics

The .feature_stat() method returns a dataframe where each row represents a summary statistic and the numerical value for that statistic. You can customize this so that it returns summary statistics that are relevant to your specific feature type. For example, a credit card feature type may return a count of the financial network that issued the cards.
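As a rough sketch of the credit card example, the following plain Python function builds metric and value rows that count cards per issuing network. It does not use the ADS API. The names issuer() and card_feature_stat() are hypothetical, the prefix rules are deliberately simplified, and the last two card numbers are made-up strings for illustration.

```python
from collections import Counter


def issuer(card_number):
    """Guess the network from the leading digits (simplified rules)."""
    if card_number.startswith("4"):
        return "Visa"
    if card_number[:2] in {"51", "52", "53", "54", "55"}:
        return "Mastercard"
    if card_number[:2] in {"34", "37"}:
        return "American Express"
    return "Unknown"


def card_feature_stat(card_numbers):
    """Return (metric, value) rows: the total count, then one count per issuer."""
    counts = Counter(issuer(number) for number in card_numbers)
    rows = [("count", len(card_numbers))]
    rows.extend(sorted(counts.items()))
    return rows


cards = [
    "4532640527811543",
    "4556929308150929",
    "4539944650919740",
    "5100000000000000",  # made-up number for illustration
    "340000000000000",   # made-up number for illustration
]
for metric, value in card_feature_stat(cards):
    print(metric, value)
```

In ADS the same information would be returned as a dataframe with one row per statistic; the list of tuples here only mirrors that row structure.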

Validators

The feature type validators are a set of is_* methods, where * is generally the name of the feature type. For example, the method .is_wholesale_price() can create a boolean Pandas Series that indicates which values meet the validation criteria. It allows you to quickly identify which values need to be filtered, or require further examination into problems in the data pipeline. The feature type validators can be as complex as necessary. For example, they might take a client ID and call an API to validate that each client ID is active.
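The shape of a validator can be sketched without ADS. The function below is a hypothetical stand-in for .is_wholesale_price(): it returns one boolean per value, using a plain list where ADS would return a boolean Pandas Series. The rule it applies (a positive number) is an assumption chosen for the example, and the prices are made up.

```python
def is_wholesale_price(values):
    """Return one boolean per value: True when the value is a positive number."""
    results = []
    for value in values:
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        results.append(is_number and value > 0)
    return results


prices = [18500.0, -200.0, None, 23999.99]  # made-up values
mask = is_wholesale_price(prices)

# The mask can be used to keep the valid values or to inspect the invalid ones.
valid_prices = [price for price, ok in zip(prices, mask) if ok]
invalid_prices = [price for price, ok in zip(prices, mask) if not ok]
print(mask)
print(valid_prices)
```

Returning a mask rather than a single pass/fail flag is what makes it possible to filter the data or trace individual bad values back to the pipeline.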

Warnings

Feature type warnings are used for rapid validation of the data. For example, the wholesale_price might have a method that ensures that the value is a positive number because you can’t purchase a car with negative money. The car_price feature type may have a check to ensure that it is within a reasonable price range. USD can check the value to make sure that it represents a valid US dollar amount. It can’t have values below one cent. The continuous feature type is the default feature type, and it represents the way the data is stored internally.
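A minimal sketch of these three checks follows, in plain Python rather than the ADS warning API. All function names are hypothetical, the price range bounds are invented for the example, and each check returns a small dict (or None when nothing is wrong) instead of whatever structure ADS uses.

```python
def check_positive(values):
    """wholesale_price: a purchase price must be positive."""
    count = sum(1 for value in values if value <= 0)
    return {"check": "positive", "count": count} if count else None


def check_range(values, low=500.0, high=250_000.0):
    """car_price: flag values outside an assumed reasonable range."""
    count = sum(1 for value in values if not low <= value <= high)
    return {"check": "range", "count": count} if count else None


def check_sub_cent(values):
    """USD: flag positive amounts smaller than one cent."""
    count = sum(1 for value in values if 0 < value < 0.01)
    return {"check": "sub_cent", "count": count} if count else None


def collect_warnings(values, checks):
    """Run every check and keep only those that found a problem."""
    warnings = []
    for check in checks:
        result = check(values)
        if result is not None:
            warnings.append(result)
    return warnings


checks = [check_positive, check_range, check_sub_cent]
prices = [18500.0, -200.0, 0.004, 1_500_000.0]  # made-up values
for warning in collect_warnings(prices, checks):
    print(warning)
```

Each check belongs to one feature type in the chain, so a warning for car_price can be written once and reused by every feature that carries that type.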

Forms of Feature Types

There are several different forms of feature types. These are designed to balance the need to document a feature type and the ease of customization. With each feature that you define you can specify multiple feature types. The custom feature type gives you the most flexibility in that all the attributes and methods of the FeatureType class can be overridden. The tag feature type allows you to create a feature type that essentially is a label. Its attributes and methods cannot be overridden, but it allows you to create a feature type without creating a class. The default type is provided by ADS. It is based on the Pandas dtype, and sets the default attributes and methods. Each inheritance chain automatically ends in a default feature type.

Custom

The most common and powerful feature type is the custom feature type. It is a Python class that inherits from FeatureType. It has attributes and methods that you can override to define the properties of the feature type to fit your specific needs.

As with multiple inheritance, a custom feature type uses an inheritance chain to determine which attribute or method is dispatched when called. The idea is that you would have a feature that has many custom feature types with each feature type being more specific to the nature of the feature’s data. Therefore, you only create the attributes and methods that are specific to the child feature type and the rest are reused from other custom or default feature types. This allows for the abstraction of the concepts that your feature represents and the reusability of your code.

Since a custom feature type is a Python class, you can add user-defined attributes and methods to the feature type to extend its capabilities.

Custom feature types must be registered with ADS before you can use them.

Default

The default feature type is based on the Pandas dtype. Setting the default feature type is optional when specifying the inheritance chain for a feature. ADS automatically appends the default feature type as an ancestor to all custom feature types. The default feature type is listed before the tag feature types in the inheritance chain. Each feature only has one default feature type. You can’t mute or remove it unless the underlying Pandas dtype has changed. For example, you have a Pandas Series called series that has a dtype of string so its default feature type is string. If you change the type by calling series = series.astype('category'), then the default feature type is automatically changed to category.

ADS automatically detects the dtype of each Series and sets the default feature type. The default feature type can be one of the following:

  • boolean

  • category

  • continuous

  • date_time

  • integer

  • object

  • string

This example creates a Pandas Series of credit card numbers, and prints the default feature type:

series = pd.Series(["4532640527811543", "4556929308150929", "4539944650919740"], name='creditcard')
series.ads.default_type
'string'

You can include the default feature type using the .feature_type property. If you do, then the default feature type isn’t added a second time.

series.ads.feature_type = ['credit_card', 'string']
series.ads.feature_type
['credit_card', 'string']

You can’t directly create or modify default feature types.

Tag

It’s often convenient to tag a dataset with additional information without the need to create a custom feature type class. This is the role of the Tag() function, which allows you to create a feature type without having to explicitly define and register a class. The tradeoff is that you can’t define most attributes or any methods of the feature type. Therefore, tools like feature type warnings and validators, and summary statistics and plots cannot be customized.

Tags are semantic and provide more context about the actual meaning of a feature. This could directly affect the interpretation of the information.

The process of creating a tag is the same as setting the feature types because a tag is a feature type. You use the .feature_type property to create tags on a feature.

The next example creates a set of credit card numbers, sets the feature type to credit_card, and tags the data as inactive cards that come from North American financial institutions. You can put any text you want in the Tag() because no underlying feature type class has to exist.

series = pd.Series(["4532640527811543", "4556929308150929", "4539944650919740",
                    "4485348152450846"], name='Credit Card')
series.ads.feature_type = ['credit_card', Tag('Inactive Card'), Tag('North American')]
series.ads.feature_type
['credit_card', 'string', 'Inactive Card', 'North American']

In the returned list, tags are always listed after the other feature types.

A list of tags can be obtained using the tags attribute:

series.ads.tags
['Inactive Card', 'North American']
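The ordering rules described in this section (custom feature types first, then the single default feature type, then tags) can be sketched in plain Python. This is an illustration only and not ADS code: SketchTag and resolve_feature_types() are hypothetical names, and the default type is passed in by hand rather than detected from a dtype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchTag:
    """Hypothetical stand-in for a tag: a label with no behavior."""
    name: str


def resolve_feature_types(requested, default_type):
    """Order the types as custom types, then the default type once, then tags."""
    custom = [t for t in requested if isinstance(t, str) and t != default_type]
    tags = [t.name for t in requested if isinstance(t, SketchTag)]
    return custom + [default_type] + tags


# The default type is appended when it is left out of the request.
with_tags = resolve_feature_types(
    ["credit_card", SketchTag("Inactive Card"), SketchTag("North American")],
    default_type="string",
)
print(with_tags)

# Listing the default type explicitly does not add it a second time.
explicit = resolve_feature_types(["credit_card", "string"], default_type="string")
print(explicit)
```

The two results mirror the lists shown in the examples above: the default feature type appears exactly once, and the tags always come last.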