Python tutorial September 5, 2019

100 pandas tricks to save you time and energy

Below you'll find 100 tricks that will save you time and energy every time you use pandas! These are the best tricks I've learned from 5 years of teaching the pandas library.

"Soooo many nifty little tips that will make my life so much easier!" - C.K. "Kevin, these tips are so practical. I can say without hesitation that you provide the best resources for pandas I have ever used." - N.W.

P.S. You can also watch a video of my top 25 tricks! 🐼🤹

Categories

Reading files

🐼🤹‍♂️ pandas trick:



5 useful "read_csv" parameters that are often overlooked:



➡️ names: specify column names

➡️ usecols: which columns to keep

➡️ dtype: specify data types

➡️ nrows: # of rows to read

➡️ na_values: strings to recognize as NaN

#Python #DataScience #pandastricks — Kevin Markham (@justmarkham) August 19, 2019

🐼🤹‍♂️ pandas trick:



⚠️ Got bad data (or empty rows) at the top of your CSV file? Use these read_csv parameters:



➡️ header = row number of header (start counting at 0)

➡️ skiprows = list of row numbers to skip



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/t1M6XkkPYG — Kevin Markham (@justmarkham) September 3, 2019
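The example image is only linked above, so here is a minimal sketch (with made-up file contents) of both parameters; an in-memory buffer stands in for the CSV file:

```python
import io
import pandas as pd

# Pretend this raw text is a CSV file with two junk lines above the header
raw = """report generated 2019-09-03
do not edit
col_A,col_B
1,2
3,4
"""

# Option 1: header=2 says the column names are on row 2 (counting from 0)
df = pd.read_csv(io.StringIO(raw), header=2)

# Option 2: skiprows=[0, 1] skips the junk rows, then the header is inferred
df2 = pd.read_csv(io.StringIO(raw), skiprows=[0, 1])
```

Both calls produce the same DataFrame here; `skiprows` also accepts scattered row numbers, which `header` alone can't handle.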

🐼🤹‍♂️ pandas trick:



Two easy ways to reduce DataFrame memory usage:

1. Only read in columns you need

2. Use 'category' data type with categorical data.



Example:

df = pd.read_csv('file.csv', usecols=['A', 'C', 'D'], dtype={'D':'category'})

#Python #pandastricks — Kevin Markham (@justmarkham) June 21, 2019

🐼🤹‍♂️ pandas trick:



You can read directly from a compressed file:

df = pd.read_csv('file.csv.zip')



Or write to a compressed file:

df.to_csv('file.csv.zip')



Also supported: .gz, .bz2, .xz

#Python #pandas #pandastricks — Kevin Markham (@justmarkham) July 4, 2019

🐼🤹‍♂️ pandas trick #99:



Do you sometimes end up with an "Unnamed: 0" column in your DataFrame? 🤔



Solution: Set the first column as the index (when reading)

Alternative: Don't save the index to the file (when writing)



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/WuUJb7fMPZ — Kevin Markham (@justmarkham) December 18, 2019
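Since the linked example isn't shown, here is a small sketch of where "Unnamed: 0" comes from and both fixes (in-memory buffers stand in for files):

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Writing the index, then reading it back naively, creates "Unnamed: 0"
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
bad = pd.read_csv(buf)

# Solution: treat the first column as the index when reading
buf.seek(0)
good = pd.read_csv(buf, index_col=0)

# Alternative: don't write the index in the first place
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
buf2.seek(0)
also_good = pd.read_csv(buf2)
```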

🐼🤹‍♂️ pandas trick:



Are your dataset rows spread across multiple files, but you need a single DataFrame?



Solution:

1. Use glob() to list your files

2. Use a generator expression to read files and concat() to combine them

3. 🥳



See example 👇#Python #DataScience #pandastricks pic.twitter.com/qtKpzEoSC3 — Kevin Markham (@justmarkham) June 20, 2019

🐼🤹‍♂️ pandas trick #78:



Do you need to build a DataFrame from multiple files, but also keep track of which row came from which file?



1. List files w/ glob()

2. Read files w/ gen expression, create new column w/ assign(), combine w/ concat()



See example 👇#Python #pandastricks pic.twitter.com/kXgXw69pSW — Kevin Markham (@justmarkham) October 10, 2019

🐼🤹‍♂️ pandas trick #100! 🎉



Want to read a HUGE dataset into pandas but don't have enough memory?



Randomly sample the dataset *during file reading* by passing a function to "skiprows"



See example 👇



Thanks to @TedPetrou for this trick! 🙌#Python #DataScience #pandastricks pic.twitter.com/FOPxURbNgc — Kevin Markham (@justmarkham) December 19, 2019
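A sketch of the idea with a made-up "huge" file (the 10% sampling fraction and the seed are arbitrary choices):

```python
import io
import random
import pandas as pd

# A made-up file: a header plus 1,000 data rows
raw = 'col\n' + '\n'.join(str(i) for i in range(1000))

random.seed(42)
# skiprows accepts a function of the row index: return True to skip that row.
# Keep row 0 (the header) and roughly 10% of the data rows.
df = pd.read_csv(io.StringIO(raw),
                 skiprows=lambda i: i > 0 and random.random() > 0.10)
```

Because the decision is made during parsing, rows that are skipped are never loaded into memory.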

🐼🤹‍♂️ pandas trick:



Need to quickly get data from Excel or Google Sheets into pandas?



1. Copy data to clipboard

2. df = pd.read_clipboard()

3. 🥳



See example 👇



Learn 25 more tips & tricks: https://t.co/6akbxXG6SI#Python #DataScience #pandas #pandastricks pic.twitter.com/M2Yw0NAXRe — Kevin Markham (@justmarkham) July 15, 2019

🐼🤹‍♂️ pandas trick #71:



Want to extract tables from a PDF into a DataFrame? Try tabula-py!



from tabula import read_pdf

df = read_pdf('test.pdf', pages='all')



Documentation: https://t.co/geQh9u4AEr



Thanks for the trick @Netchose! 🙌#Python #DataScience #pandas #pandastricks — Kevin Markham (@justmarkham) September 30, 2019

Reading from the web

🐼🤹‍♂️ pandas trick:



Want to read a JSON file from the web? Use read_json() to read it directly from a URL into a DataFrame! 😎



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/gei6eeudiq — Kevin Markham (@justmarkham) September 9, 2019

🐼🤹‍♂️ pandas trick #68:



Want to scrape a web page? Try read_html()!



Definitely worth trying before bringing out a more complex tool (Beautiful Soup, Selenium, etc.)



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/sPKrea9wk1 — Kevin Markham (@justmarkham) September 18, 2019

🐼🤹‍♂️ pandas trick #74:



Are you scraping a webpage using read_html(), but it returns too many tables? 😕



Use the 'match' parameter to find tables that contain a particular string! 🧶



See example 👇



Thanks to @JrMontana08 for the trick! 🙌#Python #DataScience #pandastricks pic.twitter.com/4Ocbv6H3r7 — Kevin Markham (@justmarkham) October 3, 2019

Creating example DataFrames

🐼🤹‍♂️ pandas trick:



Need to create an example DataFrame? Here are 3 easy options:



pd.DataFrame({'col_one':[10, 20], 'col_two':[30, 40]})

pd.DataFrame(np.random.rand(2, 3), columns=list('abc'))

pd.util.testing.makeMixedDataFrame()



See output 👇#Python #pandas #pandastricks pic.twitter.com/SSlZsd6OEj — Kevin Markham (@justmarkham) June 28, 2019

🐼🤹‍♂️ pandas trick:



Need to create a DataFrame for testing?



pd.util.testing.makeDataFrame() ➡️ contains random values

.makeMissingDataframe() ➡️ some values missing

.makeTimeDataFrame() ➡️ has DateTimeIndex

.makeMixedDataFrame() ➡️ mixed data types

#Python #pandas #pandastricks — Kevin Markham (@justmarkham) July 10, 2019

🐼🤹‍♂️ pandas trick #91:



Need to create a time series dataset for testing? Use pd.util.testing.makeTimeDataFrame()



Need more control over the columns & data? Generate data with np.random & overwrite index with makeDateIndex()



See example 👇#Python #DataScience #pandastricks pic.twitter.com/fLrNWf1tsa — Kevin Markham (@justmarkham) November 22, 2019

Creating columns

🐼🤹‍♂️ pandas trick:



Need to create a bunch of new columns based on existing columns? Use this pattern:



for col in df.columns:
    df[f'{col}_new'] = df[col].apply(my_function)



See example 👇



Thanks to @pmbaumgartner for this trick!#Python #DataScience #pandas #pandastricks pic.twitter.com/7qvKn9UypE — Kevin Markham (@justmarkham) September 16, 2019

🐼🤹‍♂️ pandas trick #73:



Need to remove a column from a DataFrame and store it as a separate Series? Use "pop"! 🍾



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/R45OEMbWVm — Kevin Markham (@justmarkham) October 2, 2019
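The linked example isn't visible here, so here's a two-line sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'label': ['x', 'y']})

# pop removes the column from the DataFrame and returns it as a Series
label = df.pop('label')
```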

🐼🤹‍♂️ pandas trick #90:



Want to insert a new column into a DataFrame at a specific location? Use the "insert" method:



df.insert(location, name, value)



See example 👇



P.S. You can find the other 89 tricks here: https://t.co/TflgUtl6zD#Python #DataScience #pandas #pandastricks pic.twitter.com/zmPvdLq7jG — Kevin Markham (@justmarkham) November 21, 2019
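A minimal sketch of the signature, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'C': [5, 6]})

# insert(loc, name, value): put 'B' at position 1, between 'A' and 'C'
df.insert(1, 'B', [3, 4])
```

Note that `insert` modifies the DataFrame in place rather than returning a new one.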

Renaming columns

🐼🤹‍♂️ pandas trick:



3 ways to rename columns:



1. Most flexible option:

df = df.rename({'A':'a', 'B':'b'}, axis='columns')



2. Overwrite all column names:

df.columns = ['a', 'b']



3. Apply string method:

df.columns = df.columns.str.lower()

#Python #DataScience #pandastricks — Kevin Markham (@justmarkham) July 16, 2019

🐼🤹‍♂️ pandas trick:



Add a prefix to all of your column names:

df.add_prefix('X_')



Add a suffix to all of your column names:

df.add_suffix('_Y')

#Python #DataScience — Kevin Markham (@justmarkham) June 11, 2019

🐼🤹‍♂️ pandas trick:



Need to rename all of your columns in the same way? Use a string method:



Replace spaces with _:

df.columns = df.columns.str.replace(' ', '_')



Make lowercase & remove trailing whitespace:

df.columns = df.columns.str.lower().str.rstrip()

#Python #pandastricks — Kevin Markham (@justmarkham) June 25, 2019

Selecting rows and columns

🐼🤹‍♂️ pandas trick:



Need to select multiple rows/columns? "loc" is usually the solution:



select a slice (inclusive):

df.loc[0:4, 'col_A':'col_D']



select a list:

df.loc[[0, 3], ['col_A', 'col_C']]



select by condition:

df.loc[df.col_A=='val', 'col_D']

#Python #pandastricks — Kevin Markham (@justmarkham) July 3, 2019

🐼🤹‍♂️ pandas trick:



"loc" selects by label, and "iloc" selects by position.



But what if you need to select by label *and* position? You can still use loc or iloc!



See example 👇



P.S. Don't use "ix", it has been deprecated since 2017.#Python #DataScience #pandas #pandastricks pic.twitter.com/SpFkjWYEE0 — Kevin Markham (@justmarkham) August 1, 2019

🐼🤹‍♂️ pandas trick #82:



Want to select from a DataFrame by label *and* position?



Most readable approach is to chain "loc" (selection by label) and "iloc" (selection by position).



See example 👇



Thanks to @Dean_La for this trick!#Python #DataScience #pandas #pandastricks pic.twitter.com/FCbkmaG6uD — Kevin Markham (@justmarkham) November 7, 2019
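A sketch of the chaining idea with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]},
                  index=['x', 'y', 'z'])

# Rows by label (loc), then columns by position (iloc)
result = df.loc[['x', 'z']].iloc[:, [0]]
```

Chaining like this is fine for selection; for assignment you'd still want a single `loc` call to avoid SettingWithCopyWarning.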

🐼🤹‍♂️ pandas trick:



Reverse column order in a DataFrame:

df.loc[:, ::-1]



Reverse row order:

df.loc[::-1]



Reverse row order and reset the index:

df.loc[::-1].reset_index(drop=True)



Want more #pandastricks? Working on a video right now, stay tuned... 🎥#Python #DataScience — Kevin Markham (@justmarkham) June 12, 2019

🐼🤹‍♂️ pandas trick #80:



Want to select multiple slices of columns from a DataFrame?



1. Use df.loc to select & pd.concat to combine

2. Slice df.columns & select using brackets

3. Use np.r_ to combine slices & df.iloc to select



See example 👇#Python #DataScience #pandastricks pic.twitter.com/IhbYbgpLKk — Kevin Markham (@justmarkham) November 5, 2019

Filtering rows by condition

🐼🤹‍♂️ pandas trick:



Filter DataFrame by multiple OR conditions:

df[(df.color == 'red') | (df.color == 'green') | (df.color == 'blue')]



Shorter way:

df[df.color.isin(['red', 'green', 'blue'])]



Invert the filter:

df[~df.color.isin(['red', 'green', 'blue'])]

#Python #pandastricks — Kevin Markham (@justmarkham) June 13, 2019

🐼🤹‍♂️ pandas tricks is back! 🎉



Want to know the *count* of rows that match a condition?

(condition).sum()



Want to know the *percentage* of rows that match a condition?

(condition).mean()



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/COqZy4EB2S — Kevin Markham (@justmarkham) November 4, 2019
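This works because booleans are treated as 0/1 during arithmetic; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
condition = df.color == 'red'

count = condition.sum()   # True counts as 1, so this is the number of matches
pct = condition.mean()    # fraction of rows that match
```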

🐼🤹‍♂️ pandas trick #76:



Want to filter a DataFrame to only include the largest categories?



1. Save the value_counts() output

2. Get the index of its head()

3. Use that index with isin() to filter the DataFrame



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/plzO4qesDH — Kevin Markham (@justmarkham) October 7, 2019

🐼🤹‍♂️ pandas trick #77:



Want to combine the smaller categories in a Series into a single category called "Other"?



1. Save the index of the largest values of value_counts()

2. Use where() to replace all other values with "Other"



See example 👇#Python #DataScience #pandastricks pic.twitter.com/FPxtuzwll4 — Kevin Markham (@justmarkham) October 9, 2019
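A sketch of the two steps with made-up categories (keeping the top 2 is an arbitrary cutoff):

```python
import pandas as pd

s = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'])

# 1. Index of the largest categories
top = s.value_counts().head(2).index

# 2. where() keeps values meeting the condition and replaces the rest
result = s.where(s.isin(top), other='Other')
```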

🐼🤹‍♂️ pandas trick #93:



Want to combine the small categories in a Series (<10% frequency) into a single category?



1. Save the normalized value counts

2. Filter by frequency & save the index

3. Replace small categories with "Other"



See example 👇#Python #pandas #pandastricks pic.twitter.com/z6w1x8s6qg — Kevin Markham (@justmarkham) December 10, 2019

🐼🤹‍♂️ pandas trick:



Are you trying to filter a DataFrame using lots of criteria? It can be hard to write ✏️ and to read! 🔍



Instead, save the criteria as objects and use them to filter. Or, use reduce() to combine the criteria!



See example 👇#Python #DataScience #pandastricks pic.twitter.com/U9NV27RIjQ — Kevin Markham (@justmarkham) August 28, 2019

🐼🤹‍♂️ pandas trick:



Want to filter a DataFrame that doesn't have a name?



Use the query() method to avoid creating an intermediate variable!



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/NyUOOSr7Sc — Kevin Markham (@justmarkham) July 25, 2019

🐼🤹‍♂️ pandas trick:



Need to refer to a local variable within a query() string? Just prefix it with the @ symbol!



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/PfXcASWDdC — Kevin Markham (@justmarkham) August 13, 2019

🐼🤹‍♂️ pandas trick:



If you want to use query() on a column name containing a space, just surround it with backticks! (New in pandas 0.25)



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/M5ZSRVr3no — Kevin Markham (@justmarkham) July 30, 2019

Manipulating strings

🐼🤹‍♂️ pandas trick:



Want to concatenate two string columns?



Option 1: Use a string method 🧶

Option 2: Use plus signs ➕



See example 👇



Which option do you prefer, and why?#Python #DataScience #pandas #pandastricks pic.twitter.com/SsjBAMqkxB — Kevin Markham (@justmarkham) August 22, 2019

🐼🤹‍♂️ pandas trick:



Need to split a string into multiple columns? Use str.split() method, expand=True to return a DataFrame, and assign it to the original DataFrame.



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/wZ4okQZ9Dy — Kevin Markham (@justmarkham) July 9, 2019

🐼🤹‍♂️ pandas trick #89:



Need to split names of variable length into first_name & last_name?



1. Use str.split(n=1) to split only once (returns a Series of lists)

2. Chain str[0] and str[1] on the end to select the list elements



See example 👇#Python #DataScience #pandastricks pic.twitter.com/fkikdaLkus — Kevin Markham (@justmarkham) November 20, 2019
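A sketch with made-up names showing why `n=1` matters for multi-word surnames:

```python
import pandas as pd

names = pd.Series(['John Doe', 'Jane van der Berg'])

# n=1 means split at most once, so everything after the first space stays together
df = pd.DataFrame({'first_name': names.str.split(n=1).str[0],
                   'last_name': names.str.split(n=1).str[1]})
```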

🐼🤹‍♂️ pandas trick #75:



Need to count the number of words in a Series? Just use a string method to count the spaces and add 1!



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/U6quTmrvNT — Kevin Markham (@justmarkham) October 4, 2019

Working with data types

🐼🤹‍♂️ pandas trick:



Numbers stored as strings? Try astype():

df.astype({'col1':'int', 'col2':'float'})



But it will fail if you have any invalid input. Better way:

df.apply(pd.to_numeric, errors='coerce')



Converts invalid input to NaN 🎉

#Python #pandastricks — Kevin Markham (@justmarkham) June 17, 2019

🐼🤹‍♂️ pandas trick #94:



Want to save a *massive* amount of memory? Fix your data types:



➡️ 'int8' for small integers

➡️ 'category' for strings with few unique values

➡️ 'Sparse' if most values are 0 or NaN



More info: https://t.co/yEJnaWnGfj by @itamarst#Python #pandastricks pic.twitter.com/jiBrkldFCt — Kevin Markham (@justmarkham) December 11, 2019

🐼🤹‍♂️ pandas trick #92:



Need to clean an object column with mixed data types? Use "replace" (not str.replace) and regex!



See example 👇



P.S. Not sure when to use "replace" versus "str.replace"? Read this: https://t.co/GF9l1IRzzi#Python #DataScience #pandas #pandastricks pic.twitter.com/qMV17MNvr3 — Kevin Markham (@justmarkham) December 9, 2019

🐼🤹‍♂️ pandas trick:



Two useful properties of ordered categories:

1️⃣ You can sort the values in logical (not alphabetical) order

2️⃣ Comparison operators also work logically



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/HeYZ3P3gPP — Kevin Markham (@justmarkham) August 8, 2019

🐼🤹‍♂️ pandas trick #83:



Problem: Your dataset has many columns and you want to ensure the correct data types



Solution:

1. Create CSV of column names & dtypes

2. Read it into a DF

3. Convert it to dict

4. Use dict to specify dtypes of dataset



👇 Example 👇#Python #pandastricks pic.twitter.com/10DeKtc6wj — Kevin Markham (@justmarkham) November 8, 2019

Encoding data

🐼🤹‍♂️ pandas trick:



Need to convert a column from continuous to categorical? Use cut():



df['age_groups'] = pd.cut(df.age, bins=[0, 18, 65, 99], labels=['child', 'adult', 'elderly'])



0 to 18 ➡️ 'child'

18 to 65 ➡️ 'adult'

65 to 99 ➡️ 'elderly'

#Python #pandas #pandastricks — Kevin Markham (@justmarkham) July 2, 2019

🐼🤹‍♂️ pandas trick #72:



Need to convert a column from continuous to categorical?



➡️ Use cut() to specify bin edges

➡️ Use qcut() to specify number of bins (creates bins of approx. equal size)

➡️ Both allow you to label the bins



See example 👇#Python #DataScience #pandastricks pic.twitter.com/2UhsNEIwDX — Kevin Markham (@justmarkham) October 1, 2019

🐼🤹‍♂️ pandas trick:



Want to dummy encode (or "one hot encode") your DataFrame? Use pd.get_dummies(df) to encode all object & category columns.



Want to drop the first level since it provides redundant info? Set drop_first=True.



See example & read thread 👇#Python #pandastricks pic.twitter.com/g0XjJ44eg2 — Kevin Markham (@justmarkham) August 5, 2019
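A minimal sketch with a made-up column; `drop_first=True` removes the alphabetically first level, since it's implied by the others:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})

# One indicator column per category, minus the redundant first level ('blue')
dummies = pd.get_dummies(df, drop_first=True)
```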

🐼🤹‍♂️ pandas trick #85:



Three useful ways to convert one set of values to another:



1. map() using a dictionary

2. factorize() to encode each value as an integer

3. comparison statement to return boolean values



See example 👇#Python #DataScience #pandastricks @python_tip pic.twitter.com/9G5vcXW7ci — Kevin Markham (@justmarkham) November 13, 2019

🐼🤹‍♂️ pandas trick:



Need to apply the same mapping to multiple columns at once? Use "applymap" (DataFrame method) with "get" (dictionary method).



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/WU4AmeHP4O — Kevin Markham (@justmarkham) August 30, 2019

Extracting data from lists

🐼🤹‍♂️ pandas trick:



Has your data ever been TRAPPED in a Series of Python lists? 🔒



Expand the Series into a DataFrame by using apply() and passing it the Series constructor 🔓



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/ZvysqaRz6S — Kevin Markham (@justmarkham) June 27, 2019
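A sketch of the unlocking step with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col': [[10, 20], [30, 40]]})

# Passing the Series constructor to apply() turns each list into a row
expanded = df.col.apply(pd.Series)
```

The new columns are numbered 0, 1, ...; you can rename them or `pd.concat` them back onto the original DataFrame.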

🐼🤹‍♂️ pandas trick:



Do you have a Series containing lists of items? Create one row for each item using the "explode" method 💥



New in pandas 0.25! See example 👇



🤯#Python #DataScience #pandas #pandastricks pic.twitter.com/ix5d8CLg57 — Kevin Markham (@justmarkham) August 12, 2019

🐼🤹‍♂️ pandas trick:



Does your Series contain comma-separated items? Create one row for each item:



✂️ "str.split" creates a list of strings

⬅️ "assign" overwrites the existing column

💥 "explode" creates the rows (new in pandas 0.25)



See example 👇#Python #pandas #pandastricks pic.twitter.com/OqZNWdarP0 — Kevin Markham (@justmarkham) August 14, 2019

🐼🤹‍♂️ pandas trick:



💥 "explode" takes a list of items and creates one row for each item (new in pandas 0.25)



You can also do the reverse! See example 👇



Thanks to @EForEndeavour for this tip 🙌#Python #DataScience #pandas #pandastricks pic.twitter.com/4UBxbzHS51 — Kevin Markham (@justmarkham) August 16, 2019

Working with time series data

🐼🤹‍♂️ pandas trick:



If you need to create a single datetime column from multiple columns, you can use to_datetime() 📆



See example 👇



You must include: month, day, year

You can also include: hour, minute, second#Python #DataScience #pandas #pandastricks pic.twitter.com/0bip6SRDdF — Kevin Markham (@justmarkham) July 8, 2019

🐼🤹‍♂️ pandas trick #97:



Want to convert "year" and "day of year" into a single datetime column? 📆



1. Combine them into one number

2. Convert to datetime and specify its format



See example 👇



List of all format codes: https://t.co/SSd0dAWxM7#Python #DataScience #pandastricks pic.twitter.com/S7KlTo7rLE — Kevin Markham (@justmarkham) December 16, 2019
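The two steps can be sketched like this with made-up data (`%j` is the day-of-year format code):

```python
import pandas as pd

df = pd.DataFrame({'year': [2019, 2019], 'day_of_year': [1, 32]})

# 1. Combine into numbers like 2019001
combined = df.year * 1000 + df.day_of_year

# 2. Parse with an explicit format: %Y is the year, %j the day of year
df['date'] = pd.to_datetime(combined.astype(str), format='%Y%j')
```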

🐼🤹‍♂️ pandas trick:



One reason to use the datetime data type is that you can access many useful attributes via "dt", like:

df.column.dt.hour



Other attributes include: year, month, day, dayofyear, week, weekday, quarter, days_in_month...



See full list 👇#Python #pandastricks pic.twitter.com/z405STKqKY — Kevin Markham (@justmarkham) August 2, 2019

🐼🤹‍♂️ pandas trick:



Need to perform an aggregation (sum, mean, etc) with a given frequency (monthly, yearly, etc)?



Use resample! It's like a "groupby" for time series data. See example 👇



"Y" means yearly. See list of frequencies: https://t.co/oPDx85yqFT#Python #pandastricks pic.twitter.com/nweqbHXEtd — Kevin Markham (@justmarkham) July 18, 2019

🐼🤹‍♂️ pandas trick #87:



Problem: You have time series data that you want to aggregate by day, but you're only interested in weekends.



Solution:

1. resample by day ('D')

2. filter by day of week (5=Saturday, 6=Sunday)



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/5yCPLpE6kr — Kevin Markham (@justmarkham) November 15, 2019

🐼🤹‍♂️ pandas trick:



Want to calculate the difference between each row and the previous row? Use df.col_name.diff()



Want to calculate the percentage change instead? Use df.col_name.pct_change()



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/5EGYqpNPC3 — Kevin Markham (@justmarkham) August 27, 2019

🐼🤹‍♂️ pandas trick:



Need to convert a datetime Series from UTC to another time zone?



1. Set current time zone ➡️ tz_localize('UTC')

2. Convert ➡️ tz_convert('America/Chicago')



Automatically handles Daylight Savings Time!



See example 👇#Python #DataScience #pandastricks pic.twitter.com/ztzMXcgkFY — Kevin Markham (@justmarkham) July 31, 2019
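A sketch of the two steps with a made-up timestamp (in July, Chicago is on daylight time, UTC-5):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2019-07-31 14:00']))

# 1. Declare the zone the data is currently in, 2. convert it
chicago = s.dt.tz_localize('UTC').dt.tz_convert('America/Chicago')
```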

Handling missing values

🐼🤹‍♂️ pandas trick:



Calculate % of missing values in each column:

df.isna().mean()



Drop columns with any missing values:

df.dropna(axis='columns')



Drop columns in which more than 10% of values are missing:

df.dropna(thresh=len(df)*0.9, axis='columns')

#Python #pandastricks — Kevin Markham (@justmarkham) June 19, 2019

🐼🤹‍♂️ pandas trick #95:



Want to know the *count* of missing values in a DataFrame?

➡️ df.isna().sum().sum()



Just want to know if there are *any* missing values?

➡️ df.isna().any().any()

➡️ df.isna().any(axis=None)



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/BmmYJfk4xo — Kevin Markham (@justmarkham) December 12, 2019

🐼🤹‍♂️ pandas trick:



Need to fill missing values in your time series data? Use df.interpolate()



Defaults to linear interpolation, but many other methods are supported!



Want more pandas tricks? Watch this:

👉 https://t.co/6akbxXXHKg 👈#Python #DataScience #pandas #pandastricks pic.twitter.com/JjH08dvjMK — Kevin Markham (@justmarkham) July 12, 2019

🐼🤹‍♂️ pandas trick:



Do you need to store missing values ("NaN") in an integer Series? Use the "Int64" data type!



See example 👇



(New in v0.24, API is experimental/subject to change)#Python #DataScience #pandas #pandastricks pic.twitter.com/mN7Ud53Rls — Kevin Markham (@justmarkham) August 15, 2019

Using aggregation functions

🐼🤹‍♂️ pandas trick:



Instead of aggregating by a single function (such as 'mean'), you can aggregate by multiple functions by using 'agg' (and passing it a list of functions) or by using 'describe' (for summary statistics 📊)



See example 👇#Python #DataScience #pandastricks pic.twitter.com/Emg3zLAocB — Kevin Markham (@justmarkham) July 19, 2019

🐼🤹‍♂️ pandas trick:



Did you know that "last" is an aggregation function, just like "sum" and "mean"?



Can be used with a groupby to extract the last value in each group. See example 👇



P.S. You can also use "first" and "nth" functions!#Python #DataScience #pandas #pandastricks pic.twitter.com/WKJtNIUxwz — Kevin Markham (@justmarkham) August 9, 2019

🐼🤹‍♂️ pandas trick #86:



Are you applying multiple aggregations after a groupby? Try "named aggregation":



✅ Allows you to name the output columns

❌ Avoids a column MultiIndex



New in pandas 0.25! See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/WIVQVcn4re — Kevin Markham (@justmarkham) November 14, 2019
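A sketch of the syntax with made-up columns; each keyword names an output column as `(input_column, function)`:

```python
import pandas as pd

df = pd.DataFrame({'team': ['a', 'a', 'b'], 'score': [1, 3, 5]})

# new_column_name=(input_column, aggregation_function)
result = df.groupby('team').agg(min_score=('score', 'min'),
                                max_score=('score', 'max'))
```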


🐼🤹‍♂️ pandas trick:



Want to combine the output of an aggregation with the original DataFrame?



Instead of: df.groupby('col1').col2.func()

Use: df.groupby('col1').col2.transform(func)



"transform" changes the output shape



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/9dkcAGpTYK — Kevin Markham (@justmarkham) September 4, 2019
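A sketch with made-up data: `sum()` would collapse to one row per group, while `transform` broadcasts the group result back to the original rows:

```python
import pandas as pd

df = pd.DataFrame({'team': ['a', 'a', 'b'], 'score': [1, 3, 5]})

# Every row gets its group's total, so the result aligns with df
df['team_total'] = df.groupby('team').score.transform('sum')
```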

Using cumulative functions

🐼🤹‍♂️ pandas trick:



Need to calculate a running total (or "cumulative sum")? Use the cumsum() function! Also works with groupby()



See example 👇



Other cumulative functions: cummax(), cummin(), cumprod()#Python #DataScience #pandas #pandastricks pic.twitter.com/H4whqlV2ky — Kevin Markham (@justmarkham) September 6, 2019

Random sampling

🐼🤹‍♂️ pandas trick:



Randomly sample rows from a DataFrame:

df.sample(n=10)

df.sample(frac=0.25)



Useful parameters:

➡️ random_state: use any integer for reproducibility

➡️ replace: sample with replacement

➡️ weights: weight based on values in a column 😎

#Python #pandastricks pic.twitter.com/j2AyoTLRKb — Kevin Markham (@justmarkham) August 20, 2019

🐼🤹‍♂️ pandas trick:



Want to shuffle your DataFrame rows?

df.sample(frac=1, random_state=0)



Want to reset the index after shuffling?

df.sample(frac=1, random_state=0).reset_index(drop=True)

#Python #DataScience #pandas #pandastricks — Kevin Markham (@justmarkham) August 26, 2019

🐼🤹‍♂️ pandas trick:



Split a DataFrame into two random subsets:



df_1 = df.sample(frac=0.75, random_state=42)

df_2 = df.drop(df_1.index)



(Only works if df's index values are unique)



P.S. Working on a video of my 25 best #pandastricks, stay tuned! 📺#Python #pandas #DataScience — Kevin Markham (@justmarkham) June 18, 2019

Merging DataFrames

🐼🤹‍♂️ pandas trick:



When you are merging DataFrames, you can identify the source of each row (left/right/both) by setting indicator=True.



See example 👇



P.S. Learn 25 more #pandastricks in 25 minutes: https://t.co/6akbxXG6SI#Python #DataScience #pandas pic.twitter.com/tkb2LiV4eh — Kevin Markham (@justmarkham) July 23, 2019

🐼🤹‍♂️ pandas trick:



Merging datasets? Check that merge keys are unique in BOTH datasets:

pd.merge(left, right, validate='one_to_one')



✅ Use 'one_to_many' to only check uniqueness in LEFT

✅ Use 'many_to_one' to only check uniqueness in RIGHT

#Python #DataScience #pandastricks — Kevin Markham (@justmarkham) June 26, 2019

Styling DataFrames

🐼🤹‍♂️ pandas trick:



Want to add formatting to your DataFrame? For example:

- hide the index

- add a caption

- format numbers & dates

- highlight min & max values



Watch 👇 to learn how!



Code: https://t.co/HKroWYVIEs



25 more tricks: https://t.co/6akbxXG6SI#Python #pandastricks pic.twitter.com/AKQr7zVR7S — Kevin Markham (@justmarkham) July 17, 2019

Exploring a dataset

🐼🤹‍♂️ pandas trick:



Want to explore a new dataset without too much work?



1. Pick one:

➡️ pip install pandas-profiling

➡️ conda install -c conda-forge pandas-profiling



2. import pandas_profiling

3. df.profile_report()

4. 🥳



See example 👇#Python #DataScience #pandastricks pic.twitter.com/srq5rptEUj — Kevin Markham (@justmarkham) July 29, 2019

🐼🤹‍♂️ pandas trick:



Need to check if two Series contain the same elements?



❌ Don't do this:

df.A == df.B



✅ Do this:

df.A.equals(df.B)



✅ Also works for DataFrames:

df.equals(df2)



equals() properly handles NaNs, whereas == does not

#Python #DataScience #pandas #pandastricks — Kevin Markham (@justmarkham) June 24, 2019

🐼🤹‍♂️ pandas trick #84:



My favorite feature in pandas 0.25: If DataFrame has more than 60 rows, only show 10 rows (saves your screen space!)



You can modify this: pd.set_option('min_rows', 4)



See example 👇



More info: https://t.co/8vwkHWxnPH#Python #DataScience #pandastricks pic.twitter.com/K7NXJXzIgY — Kevin Markham (@justmarkham) November 11, 2019

🐼🤹‍♂️ pandas trick:



Want to examine the "head" of a wide DataFrame, but can't see all of the columns?



Solution #1: Change display options to show all columns

Solution #2: Transpose the head (swaps rows and columns)



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/9sw7O7cPeh — Kevin Markham (@justmarkham) July 24, 2019

🐼🤹‍♂️ pandas trick:



Want to plot a DataFrame? It's as easy as:

df.plot(kind='...')



You can use:

line 📈

bar 📊

barh

hist

box 📦

kde

area

scatter

hexbin

pie 🥧



Other plot types are available via pd.plotting!



Examples: https://t.co/fXYtPeVpZX#Python #dataviz #pandastricks pic.twitter.com/kp82wA15S4 — Kevin Markham (@justmarkham) August 23, 2019

🐼🤹‍♂️ pandas trick #96:



Want to create interactive plots using pandas 0.25? 📊



1. Pick one:

➡️ pip install hvplot

➡️ conda install -c conda-forge hvplot



2. pd.options.plotting.backend = 'hvplot'

3. df.plot(...)

4. 🥳



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/HjH9hTQGqD — Kevin Markham (@justmarkham) December 13, 2019

Handling warnings

🐼🤹‍♂️ pandas trick:



Did you encounter the dreaded SettingWithCopyWarning? 👻



The usual solution is to rewrite your assignment using "loc":



❌ df[df.col == val1].col = val2

✅ df.loc[df.col == val1, 'col'] = val2



See example 👇#Python #DataScience #pandastricks @python_tip pic.twitter.com/6L6IukTpBO — Kevin Markham (@justmarkham) September 10, 2019

🐼🤹‍♂️ pandas trick:



Did you get a "SettingWithCopyWarning" when creating a new column? You are probably assigning to a DataFrame that was created from another DataFrame.



Solution: Use the "copy" method when copying a DataFrame!



See example 👇#Python #DataScience #pandastricks pic.twitter.com/LrRNFyN6Qn — Kevin Markham (@justmarkham) September 12, 2019

Other

🐼🤹‍♂️ pandas trick #88:



Goal: Rearrange the columns in your DataFrame



Options:

1. Specify all column names in desired order

2. Specify columns to move, followed by remaining columns

3. Specify column positions in desired order



See example 👇#Python #pandastricks @python_tip pic.twitter.com/r739QtBims — Kevin Markham (@justmarkham) November 19, 2019

🐼🤹‍♂️ pandas trick #98:



Problem: Your DataFrame is in "wide format" (lots of columns), but you need it in "long format" (lots of rows)



Solution: Use melt()! ⛄➡️💧



See example 👇



Long format is better for analysis, transformation, merges...#Python #DataScience #pandastricks pic.twitter.com/4mmoiuFUGD — Kevin Markham (@justmarkham) December 17, 2019
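A sketch of the reshape with made-up columns:

```python
import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'], '2018': [1, 2], '2019': [3, 4]})

# id_vars stay as identifiers; every other column becomes (variable, value) rows
long_df = pd.melt(wide, id_vars='name', var_name='year', value_name='sales')
```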

🐼🤹‍♂️ pandas trick:



If you've created a groupby object, you can access any of the groups (as a DataFrame) using the get_group() method.



See example 👇#Python #DataScience #pandas #pandastricks pic.twitter.com/6Ya0kxMpgk — Kevin Markham (@justmarkham) September 2, 2019

🐼🤹‍♂️ pandas trick:



Do you have a Series with a MultiIndex?



Reshape it into a DataFrame using the unstack() method. It's easier to read, plus you can interact with it using DataFrame methods!



See example 👇



P.S. Want a video with my top 25 #pandastricks? 📺#Python #pandas pic.twitter.com/DKHwN03A7J — Kevin Markham (@justmarkham) July 1, 2019

🐼🤹 pandas trick:



There are many display options you can change:

max_rows

max_columns

max_colwidth

precision

date_dayfirst

date_yearfirst



How to use:

pd.set_option('display.max_rows', 80)

pd.reset_option('display.max_rows')



See all:

pd.describe_option()

#Python #pandastricks — Kevin Markham (@justmarkham) July 26, 2019

🐼🤹‍♂️ pandas trick:



Show total memory usage of a DataFrame:

df.info(memory_usage='deep')



Show memory used by each column:

df.memory_usage(deep=True)



Need to reduce? Drop unused columns, or convert object columns to 'category' type.

#Python #pandas #pandastricks — Kevin Markham (@justmarkham) July 5, 2019

🐼🤹‍♂️ pandas trick #70:



Need to know which version of pandas you're using?



➡️ pd.__version__



Need to know the versions of its dependencies (numpy, matplotlib, etc)?



➡️ pd.show_versions()



Helpful when reading the documentation! 📚

#Python #pandas #pandastricks — Kevin Markham (@justmarkham) September 20, 2019

🐼🤹‍♂️ pandas trick:



Want to use NumPy without importing it? You can access ALL of its functionality from within pandas! See example 👇



This is probably *not* a good idea since it breaks with a long-standing convention. But it's a neat trick 😎#Python #pandas #pandastricks pic.twitter.com/pZbXwuj6Kz — Kevin Markham (@justmarkham) July 22, 2019