Feature Type Selection

Pandas provides methods to select columns by name or position. However, data scientists commonly need to select columns that share specific attributes. This is often done by manually examining the column names and building a list of them, or by encoding the attributes in the column names and then using a search pattern to return the matching names.
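As a rough illustration of those two workarounds, the sketch below uses plain Python on a list of column names. The column names and the prefix convention (bool_, cat_, num_) are hypothetical and are not part of the attrition dataset or of ADS.

```python
import re

# Hypothetical column names with an attribute encoded as a prefix.
columns = ["bool_Attrition", "cat_TravelForWork", "cat_JobFunction", "num_Age"]

# Workaround 1: a hand-maintained list of the categorical columns.
manual_selection = ["cat_TravelForWork", "cat_JobFunction"]

# Workaround 2: a search pattern over the encoded column names.
pattern = re.compile(r"^cat_")
pattern_selection = [name for name in columns if pattern.search(name)]

# With a dataframe df holding these columns, df[pattern_selection]
# would return only the matching columns.
```

Both selections break as soon as a column is renamed or a new column ignores the naming convention.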

Neither approach is efficient or robust. The feature type system in ADS allows you to define feature types on the features. Once feature types are assigned to a set of features, feature type selection lets you create a new dataframe that contains only the columns that have, or don’t have, specific feature types associated with them.

Use the .feature_select() method to select a subset of columns based on their feature types. The include parameter takes a list of feature types (feature type objects or feature type names) to include in the returned dataframe. The exclude parameter takes a list of feature types to exclude from the returned dataframe. Both parameters default to None, but you must set at least one of them. A feature type can’t be both included and excluded in the same call.

The following example loads four columns of the attrition dataset and assigns a feature type to each one:

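import os

import ads  # assumed to register the .ads accessor on pandas dataframes
import pandas as pd

# Load four columns of the attrition dataset.
attrition_path = os.path.join('/opt', 'notebooks', 'ads-examples', 'oracle_data', 'orcl_attrition.csv')
df = pd.read_csv(attrition_path,
                 usecols=['Attrition', 'TravelForWork', 'JobFunction', 'EducationalLevel'])
# Assign a feature type to each column.
df.ads.feature_type = {'Attrition': ['boolean'],
                       'TravelForWork': ['category'],
                       'JobFunction': ['category'],
                       'EducationalLevel': ['category']}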

Next, create a dataframe that only has columns that have a Boolean feature type:

df.ads.feature_select(include=['boolean'])

You can create a dataframe that excludes columns that have a Boolean feature type:

df.ads.feature_select(exclude=['boolean'])
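The include and exclude rules described earlier can be sketched in plain Python. This is not the ADS implementation. The select_columns helper is hypothetical, and it assumes that a column matches when any of its feature types appears in the given list. It returns column names instead of a dataframe so that it runs without ADS installed.

```python
from typing import Dict, List, Optional


def select_columns(
    feature_types: Dict[str, List[str]],
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: return the column names kept by include/exclude."""
    include_set = set(include or [])
    exclude_set = set(exclude or [])

    # At least one of the two filters must be set.
    if not include_set and not exclude_set:
        raise ValueError("Set at least one of include or exclude.")

    # A feature type can't be both included and excluded.
    overlap = include_set & exclude_set
    if overlap:
        raise ValueError(
            f"Feature types both included and excluded: {sorted(overlap)}"
        )

    selected = []
    for column, types in feature_types.items():
        column_types = set(types)
        # Assumption: a column matches when any of its feature types is listed.
        if include_set and not (column_types & include_set):
            continue
        if exclude_set and (column_types & exclude_set):
            continue
        selected.append(column)
    return selected


# Same column-to-feature-type mapping as the attrition example.
feature_types = {
    "Attrition": ["boolean"],
    "TravelForWork": ["category"],
    "JobFunction": ["category"],
    "EducationalLevel": ["category"],
}
boolean_columns = select_columns(feature_types, include=["boolean"])
other_columns = select_columns(feature_types, exclude=["boolean"])
```

Under these assumptions, boolean_columns holds only Attrition and other_columns holds the three category columns, which mirrors the two .feature_select() calls in this section.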