If we have a machine learning task, such as predicting whether a client will repay a future loan, we will want to combine all the information about clients into a single table. The tables are related (through the client_id and the loan_id variables) and we could use a series of transformations and aggregations to do this process by hand. However, we will shortly see that we can instead use featuretools to automate the process.

Entities and EntitySets

The first two concepts of featuretools are entities and entitysets. An entity is simply a table (or a DataFrame if you think in Pandas). An EntitySet is a collection of tables and the relationships between them. Think of an entityset as just another Python data structure, with its own methods and attributes.

We can create an empty entityset in featuretools using the following:

import featuretools as ft # Create new entityset

es = ft.EntitySet(id = 'clients')

Now we have to add entities. Each entity must have an index, which is a column with all unique elements. That is, each value in the index must appear in the table only once. The index in the clients dataframe is the client_id because each client has only one row in this dataframe. We add an entity with an existing index to an entityset using the following syntax:

The loans dataframe also has a unique index, loan_id and the syntax to add this to the entityset is the same as for clients . However, for the payments dataframe, there is no unique index. When we add this entity to the entityset, we need to pass in the parameter make_index = True and specify the name of the index. Also, although featuretools will automatically infer the data type of each column in an entity, we can override this by passing in a dictionary of column types to the parameter variable_types .

For this dataframe, even though missed is an integer, this is not a numeric variable since it can only take on 2 discrete values, so we tell featuretools to treat is as a categorical variable. After adding the dataframes to the entityset, we inspect any of them:

The column types have been correctly inferred with the modification we specified. Next, we need to specify how the tables in the entityset are related.

Table Relationships

The best way to think of a relationship between two tables is the analogy of parent to child. This is a one-to-many relationship: each parent can have multiple children. In the realm of tables, a parent table has one row for every parent, but the child table may have multiple rows corresponding to multiple children of the same parent.

For example, in our dataset, the clients dataframe is a parent of the loans dataframe. Each client has only one row in clients but may have multiple rows in loans . Likewise, loans is the parent of payments because each loan will have multiple payments. The parents are linked to their children by a shared variable. When we perform aggregations, we group the child table by the parent variable and calculate statistics across the children of each parent.

To formalize a relationship in featuretools, we only need to specify the variable that links two tables together. The clients and the loans table are linked via the client_id variable and loans and payments are linked with the loan_id . The syntax for creating a relationship and adding it to the entityset are shown below:

The entityset now contains the three entities (tables) and the relationships that link these entities together. After adding entities and formalizing relationships, our entityset is complete and we are ready to make features.

Feature Primitives

Before we can quite get to deep feature synthesis, we need to understand feature primitives. We already know what these are, but we have just been calling them by different names! These are simply the basic operations that we use to form new features:

Aggregations: operations completed across a parent-to-child (one-to-many) relationship that group by the parent and calculate stats for the children. An example is grouping the loan table by the client_id and finding the maximum loan amount for each client.

table by the and finding the maximum loan amount for each client. Transformations: operations done on a single table to one or more columns. An example is taking the difference between two columns in one table or taking the absolute value of a column.

New features are created in featuretools using these primitives either by themselves or stacking multiple primitives. Below is a list of some of the feature primitives in featuretools (we can also define custom primitives):

Feature Primitives

These primitives can be used by themselves or combined to create features. To make features with specified primitives we use the ft.dfs function (standing for deep feature synthesis). We pass in the entityset , the target_entity , which is the table where we want to add the features, the selected trans_primitives (transformations), and agg_primitives (aggregations):

The result is a dataframe of new features for each client (because we made clients the target_entity ). For example, we have the month each client joined which is a transformation feature primitive:

We also have a number of aggregation primitives such as the average payment amounts for each client:

Even though we specified only a few feature primitives, featuretools created many new features by combining and stacking these primitives.

The complete dataframe has 793 columns of new features!

Deep Feature Synthesis

We now have all the pieces in place to understand deep feature synthesis (dfs). In fact, we already performed dfs in the previous function call! A deep feature is simply a feature made of stacking multiple primitives and dfs is the name of process that makes these features. The depth of a deep feature is the number of primitives required to make the feature.

For example, the MEAN(payments.payment_amount) column is a deep feature with a depth of 1 because it was created using a single aggregation. A feature with a depth of two is LAST(loans(MEAN(payments.payment_amount)) This is made by stacking two aggregations: LAST (most recent) on top of MEAN. This represents the average payment size of the most recent loan for each client.

We can stack features to any depth we want, but in practice, I have never gone beyond a depth of 2. After this point, the features are difficult to interpret, but I encourage anyone interested to try “going deeper”.