Setting a column based on another one and multiple conditions in pandas

This short notebook shows a way to set the value of one column in a CSV file, that satisfies multiple conditions, by extracting information from another column using regular expressions. The pandas library is the best tool I know for programmatically working with CSV files. It offers many methods for modifying columns and supports different data types.

While working on ukealong.com, a website with more than 1,000 play along videos for the ukulele, I faced such a task. The meta data for the videos, which includes artists and track names is maintained in a CSV file. The source data comes from several YouTube channels, that create these play along videos. Let's look at a small subset of the data, that includes 10 records and only the relevant columns for demonstrating the column setting.

%load_ext signature import re import pandas as pd df = pd.read_csv('data/ukealong-column-setting-sample.csv') df

artists track title channel_id 0 NaN NaN "Something" (The Beatles) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA 1 NaN NaN Arthur Theme Song Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA 2 NaN NaN "Black Magic Woman" (Santana) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA 3 Israel Kamakawiwo'ole White Sandy Beach of Hawaii "White Sandy Beach of Hawaii" (Braddah IZ) Uku... UCDglnz22aXMzpug5HbD1bCA 4 NaN What Will We Do With a Drunken Sailor What Will We Do With a Drunken Sailor UC6jZYLoYuV1CxY4Stzm6mgg 5 NaN Frere Jacques Frere Jacques UCZjDV_1UEbVsAQA_q9tyTWw 6 NaN NaN "What's Up?" (4 Non Blondes) Ukulele Play Along! UCDglnz22aXMzpug5HbD1bCA 7 NaN NaN "Lucy in the Sky with Diamonds" (The Beatles) ... UCDglnz22aXMzpug5HbD1bCA 8 NaN NaN Lava Ukulele Play-along // Cynthia Lin (Chords... UCD2q6i-C0ZLJUK-VCp49TJA 9 Weezer Island In The Sun "Island In The Sun" Ukulele Play-Along! UCbQn9nS2_W-dsmnr3h_Rpvg

After importing videos from this playlist by Ukulele Underground, I had more than 100 records with empty artists and track columns. I updated some of them manually and thought there must be a better way. Most of their YouTube titles follow the format "TRACK" (ARTIST) Ukulele Play-Along!, so the relevant information could be extracted using regular expressions. I only wanted to update records where the artists and track columns were not already set, which led to the following condition:

condition = (df['channel_id'] == 'UCDglnz22aXMzpug5HbD1bCA') & df['artists'].isna() & df['track'].isna()

The code cell below shows the two statements for extracting the information from the title column, which contains the title used on YouTube, using two regular expression. This can be done comfortably using the pandas.Series.str.extract method. The expand parameter has to be set to False so a Series object is returned.

df.loc[condition, 'artists'] = df.title.str.extract(r'\((.+?)\)', expand=False) df.loc[condition, 'track'] = df.title.str.extract(r'^"(.+?)"', expand=False) df

artists track title channel_id 0 The Beatles Something "Something" (The Beatles) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA 1 NaN NaN Arthur Theme Song Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA 2 Santana Black Magic Woman "Black Magic Woman" (Santana) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA 3 Israel Kamakawiwo'ole White Sandy Beach of Hawaii "White Sandy Beach of Hawaii" (Braddah IZ) Uku... UCDglnz22aXMzpug5HbD1bCA 4 NaN What Will We Do With a Drunken Sailor What Will We Do With a Drunken Sailor UC6jZYLoYuV1CxY4Stzm6mgg 5 NaN Frere Jacques Frere Jacques UCZjDV_1UEbVsAQA_q9tyTWw 6 4 Non Blondes What's Up? "What's Up?" (4 Non Blondes) Ukulele Play Along! UCDglnz22aXMzpug5HbD1bCA 7 The Beatles Lucy in the Sky with Diamonds "Lucy in the Sky with Diamonds" (The Beatles) ... UCDglnz22aXMzpug5HbD1bCA 8 NaN NaN Lava Ukulele Play-along // Cynthia Lin (Chords... UCD2q6i-C0ZLJUK-VCp49TJA 9 Weezer Island In The Sun "Island In The Sun" Ukulele Play-Along! UCbQn9nS2_W-dsmnr3h_Rpvg

After running this code on the actual CSV file, most of the Ukulele Underground records showed the correct information in the two columns.

In the first published version of this notebook I imported the standard library re module and defined two custom functions using regular expressions to extract the information. Thanks to redditor otterom, who made me aware of the extract method, the code is now more succinct and straightforward.

Quite frankly, writing the initial version of the code probably took longer than updating 100 records manually would have, let alone documenting the process in this notebook und updating it. On the other hand I can integrate the code in the process of adding new videos. Moreover, I learned something new, now have a reference for similar tasks in the future and it hopefully helps other people who face a similar problem.

Last but not least, if you play the ukulele or want to learn it, I hope you'll dig ukealong.com, share the site with your friends and show the creators of the play along videos some love by subscribing to their channels and liking their videos on YouTube. Playing music is really a joyful way to spend your time and helps to clear the mind in these times of uncertainty.

%signature

Author: Ramiro Gómez • Last edited: March 31, 2020

Linux 5.3.0-42-generic - CPython 3.7.6 - IPython 7.13.0 - matplotlib 3.2.1 - numpy 1.18.1 - pandas 1.0.3

Shirts for Python Programmers