Sometimes there is a need to work with a table of data where one of its columns has multiple values in each cell. Say, the replies to a “Which social networks do you use?” multiple choice question in a form:

Form replies

Since the respondent could pick multiple options for the social networks they use, each cell can have more than one option, separated by semicolons.

However, this causes a problem for analysis. There’s no problem with the “Gender”, but the data about the social networks is not in a suitable format for analysis. In its current state, it’s not possible to easily analyze the data or create visualizations. In other words, there is a need to split those replies and unpivot them so that there is only one option per row.

And so that’s precisely what I will show you today. We will start with the above dataset and load it into a pandas DataFrame in Python to extract a more detailed form of the dataset:

Resulting data

As the options picked for a single response are unpivoted into separate rows, the remaining columns are equally replicated to keep the integrity of the data.

For the purpose of this demo and taking into acount the data we’ll be working with (the data shown above), I will show you two slightly different scenarios of achieving this result. The scenarios are almost the same, except for a minor detail: in the first one we’ll load the dataset specifying the “Respondent ID” as the index column, while in the second case we let pandas create an automatic index for the DataFrame. The impact of this difference will be in how we select columns to manage the transformations in the DataFrame.

Finally, before diving into the code, I’d like to give credit to this StackOverflow answer. It was thanks to this answer that I found out how to make the code work, and so the scripts I will show you are heavily inspired by that reply, with some adjustments to make it more robust.

First scenario: using a specified column for the index

First up, the solution for the DataFrame having the “Respondent ID” column as its index column:

Unpivot delimited data with a specified index column

Since the scripts are so similar, I will give you the complete overview of the code in the first scenario and then highlight only the differences in the second scenario.

Each script contains three functions, two of which are helper functions:

unpivot_delimited_series : helper function to perform the split and unpivot of the column with delimited data (as a Series/column separated from the remaining data)

: helper function to perform the split and unpivot of the column with delimited data (as a Series/column separated from the remaining data) get_other_columns : helper function to get the name of all the columns available in the DataFrame, excluding the column with delimited data

: helper function to get the name of all the columns available in the DataFrame, excluding the column with delimited data unpivot_delimited_data : main function responsible for managing the whole process (takes in the original DataFrame and returns a new DataFrame with all the transformations needed)

We load the dataset and then feed only the column with delimited data (i.e., a pandas Series) to the unpivot_delimited_series function to perform the transformations. Because the options chosen in a single response are split into multiple rows, the Series ends up with a two-level index, where the first level represents the “Respondent ID”s (each respondent) and the second level represents each option chosen by a single respondent.

Result of the column with delimited data

However, the transformed column is separated from the rest of the data, and the original DataFrame still has the delimited data. Thus, we need to create a brand new DataFrame that merges the other columns from the original DataFrame to this transformed Series.

For this merge to happen, we need to have a common column between the transformed Series and the original DataFrame. Well, they do share the “Respondent ID” ids, but right now that is only the first-level of the Series index, while on the DataFrame it represents the complete index. Plus, we can’t identify it as a “normal” column in either case because it is the index. Thus, we need to reset the index of both the DataFrame and the Series, that is, add a new index column to each, and keep the old indices as “normal” columns.

Transformed delimited data with reset index (first scenario)

As you can see, through the transformations, the Series ends up as a three-column DataFrame of its own, with two columns to represent the previous two-level index (“Respondent ID” and the automatically named “level_1”), and a third column for unpivoted data (automatically named “0”, zero). On the other hand, the original DataFrame looks as follows at this point:

Original DataFrame with reset index

The “Respondent ID” is no longer the index column and now the DataFrame looks exactly as it looks if the “Respondent ID” was not chosen as the index column upon loading.

Before performing the merge, it is also convenient to get the names of the available columns in the original DataFrame (accomplished with the get_other_columns helper function) so that in the merge we don’t include the original delimited data.

With the two DataFrames in hand (one with the original data and another with the unpivoted delimited data), we can now create a third new DataFrame by merging the first two on the “Respondent ID” column, the common element between both DataFrames.

(Note that the original column of delimited data is filtered out of the original DataFrame for the merge)

Matching column for the merge (first scenario)

Merged DataFrame (first scenario)

After the merge, the only thing left to do is some cleaning in the resulting DataFrame. The first thing is to rename the column with transformed data which ended up called “0” (zero). The other thing is to delete unwanted columns, namely those that were created due to the index resets (looking at you “level_1”). After these two operations, we obtain the finalized DataFrame with the delimited data split and unpivoted, without losing any data from the other columns.

Final DataFrame

In this case, I left the column of transformed data as the last column, but the columns could be easily rearranged to keep the original column order with the following command:

merged_data = merged_data[["Respondent ID", "Used Social Networks", "Gender"]]

Regarding the index, I chose to keep an automatic index because the “Respondent ID” no longer has unique data, as the IDs are replicated to keep track of the response each option belongs to. If you wish to keep it as the index column, you can do so with

merged_data = merged_data.set_index("Respondent ID")

Second scenario: using the automatic index

Unpivot delimited data with an automatic index

In this second scemario, we start with a DataFrame that has a common automatic index, that is, in the context of our sample data, the “Respondent ID” column is just another column of data:

Original DataFrame for the second scenario

The major difference in this second approach is in the merge that originates the third DataFrame. Because the Series with delimited data did not include “Respondent ID” in its index, the two-level index is made up of two levels of automatic indices. The principle is the same, the first level for each respondent and the second level for each option of one response, but now they have nothing to do with the “Respondent ID”:

Transformed delimited data with reset index (second scenario)

That first level, “level_0”, was the “Respondent ID” in the first scenario, while the “level_1” is the same second index level.

At least now the first level (“level_0”) is the matching column to the index of the original DataFrame. The only thing keeping us from performing the merge is that we can’t specify the index column of the original DataFrame. To solve that problem, we reset that DataFrame’s index to have it as another “normal” column of data that we can select for the merge (“index”).

Original DataFrame with reset index

Because the matching column does not have the same name in both DataFrames, we need to specify the name used in each DataFrame with the left_on and right_on arguments. Which DataFrame is the left and which one is the right one is decided by the order of appearance in the arguments passed to merge : the DataFrame with the unpivoted data comes up first so it is the left one.

(Note that the original column of delimited data is filtered out of the original DataFrame for the merge)

Matching columns for the merge

merged_data = pd.merge(

unpivoted_series,

source_data_to_merge,

how="inner",

left_on="level_0",

right_on="index"

)

Merged DataFrame (second scenario)

The rest of the script serves the same purpose as in the first approach: to clean up the merged DataFrame!

Final DataFrame

Extra: visualize with Matplotlib and Plotly

Now that we’ve come this far, it would be cool to see how this resulting DataFrame can be visualized. Let me quickly show you how to plot the social network usage, broken down by gender, in a column chart, in both Matplolib and Plotly.

First with Matplotlib

Column chart visualization with Matplotlib

Social Networks by Gender (Matplotlib)

And now for Plotly

Column chart visualization with Plotly

Social Networks by Gender (Plotly)

Conclusions

And that’s it for this explanation on how to split and unpivot delimited data, without losing data from the original dataset. As a summary of the process, the data goes through the following steps:

The data from the column with delimiters is split and unpivoted, creating a two-level index for the now separated response options

The indices are reset in the original DataFrame and in the new DataFrame with unpivoted data to have an accessible matching column

The two DataFrames are merged into a third new DataFrame

Summary of the transformations

This process takes the data through a lot of transformations and I hope the explanation with two possible scenarios was useful :)

You can find all the code used in this demo on GitHub and a working demo on repl.it here.