The recommended type of input (at least by me) to the pivot_table is a simple DataFrame like the one I have already created: Your index will be the id of your database (or you could even have an auto-generated index like in the example) and the columns will be the values you want to aggregate and reshape. This is very easy to create either by reading a file (xls/csv) or by a simple SQL query (substituting all foreign keys with a representative value). In the above example, we actually have the following columns: author, genre, name, pages, year, decade, size - this is a pool of data that will be very useful to remember for later and it is important to also keep it in your mind for your data. So, use a unique id as the index and remember the names of your columns.

As we can see in the documentation, the pivot_table method uses four basic parameters:

index : An array of the data that will be used as indexes to the resulting (i.e the reshaped and aggregated) DataFrame

: An array of the data that will be used as indexes to the resulting (i.e the reshaped and aggregated) DataFrame columns : An array of the data that will be used as a columns to the resulting DataFrame

: An array of the data that will be used as a columns to the resulting DataFrame values : An array of the data whose values we want to aggregate in each cell

: An array of the data whose values we want to aggregate in each cell aggfunc : Which is the function (or functions) that will be used for aggregating the values

So, how it actually works? You select a number of the headers from your pool of data and assign them to either index or columns , depending if you want to put them horizontally or vertically. Notice that both index and columns :

take either a string (to denote a single column) or an array to denote multiple columns

are optional (but you must define one of them) — if you skip either columns or index you’ll get a Series instead of a DataFrame

instead of a are interchangable (you can put any header from your pool to either index or columns, depending on how you want to display your data)

are mutually exclusive (you can’t put the same header in both index and columns)

Multiple data headers means that you’ll have hierachical indexes / columns in your pivot (or MultiIndex as it’s called - remember that Index is used to store the axis of the DataFrame), ie the rows/columns would be grouped by a hierarchy. Let’s see an example of multiple indexes:

If we used 'decade' as an index, then the pivot_table index would be like

70s value1 value2 …

value1 value2 … 80s value1 value2 …

value1 value2 … 90s value1 value2 …

while, if we used ['decade', 'year'] we’d hove something like

70s 1975 value1 value2 … 1978 value1 value2 …

80s 1980 value1 value2 … 1982 value1 value2 … …

90s 1990 value1 value2 … …



So, each year would automatically be grouped to its corresponing decade. The same would be true if we used ['decade', 'year'] in columns (but we’ll now have a vertical grouping from top to bottom). Notice that pandas doesn’t know if the values have any parent / child relationship but just goes from left to right (or top to bottom). For example, if we had used ['year', 'decade'] , we’d get something like:

1975 70s' value1 value2 …

value1 value2 … 1978 70s' value1 value2 …

value1 value2 … 1980 80s' value1 value2 …

value1 value2 … 1982 80s' value1 value2 …

Also, pandas doesn’t care if the values of the hierarchical index are actually related. We could for example had selected a multi index of ['decade', 'size', 'genre'] that would

display the list of decades at the left (or at the top if we used it as a column)

for each decade will display the sizes of the book of that decade at the center (header column or row) and finally

at the right header (or bottom correspondingly) will display the available genres for each size.

So, since we have 3 values for each decade, 3 values for each size and 4 values for each genre in our dataset, each decade will appear 1 time (at the left), each size will appear 3 times (one for each decade) in the middle and each genre will appear 3x3 = 9(one for each combination of decade and size) times in the right. The total number of lines that our MultiIndex will contain is 3x3x4 = 36 (one line for each combination of decade/size/genre).

I hope the above clarifies how index and columns are used to create the headers for rows and index of pivot_table . I will show some examples of various index and columns combinations but first, I’d like to talk about contents of the pivot table (since we’ve only talked about the headers of rows/columns until now).

The values that the pivot_table will contain are defined through the other two parameters, values and aggfunc : We select one or more columns of the initial DataFrame through the values parameter and these are aggregated in the corresponding cell of the resulting dataframe using the aggfunc fuction, so for each cell as defined by index and column, pandas will pick the values that correspond to that cell and pass them to a function that will return the result (by combining these values). As can be understood, the values must be different than index and columns (so all three sets of values, index and columns must not intersect). By default, the values and aggfunc parameters may be ommited - this will result in using average as the function and selecting all numerical columns (that are not in indexes or columns of course) in the values.

I know that this is difficult to understand so I’ll give a simple example right away. Let’s first create a nice set of data for our samples: