Code.Report

Insights for productivity and tech related info

Pandas GroupBy Data Aggregation

2019-10-08 Code.ReportData Science, Python

Aggregation and Grouping with Pandas

In this short Python/Pandas tutorial, we shall see some simple commands that can help dealing with data aggregation (such as sum, count, mean, min, max, etc.) and grouping. We will be using Pandas, an open source Python library that provides high-performance, easy-to-use data structures and data analysis tools.

These are commands commonly used in analytics and data science projects, but can also help when you only need a simple data extraction of a CSV file, excel file or some webserver log.

Before we start, you should have already installed the numpy and pandas dependency, depending on your python environment:

NumPy

conda install numpy
or
pip install numpy

Pandas

conda install pandas
or
pip install pandas

First of all, we need to import numpy and pandas library. Then we will populate a panda DataFrame with some random data for this example.

import numpy as np
import pandas as pd

data = {'Company':['GOOGLE','GOOGLE','TESLA','TESLA','LEGO','LEGO'],
           'Person':['Stark','Thor','Hulk','Leela','Cacau','Elon'],
           'Sales':[120,520,320,114,563,248]}

df = pd.DataFrame(data)

Try for yourself executing the dataframe, and you shall see how it looks:

df

Panda.DataFrame.groupby() function

You can, then, specify which company you want to group with the groupby function and pass the result group to a variable

company = df.groupby('Company')

With the variable pointing to the grouped column, we can get some data from it, for example the mean of the values:

company.mean()

code.report

The result of the mean was applied only to the Sales, because Person is not numeric, and the mean function ignores every column that is non numeric You can also get the grouped values and get the standart deviation from it with std()

company.std()

code.report

The grouped data is a dataframe, so you can get the sum and specify only one company with sum() and loc[]

company.sum().loc["LEGO"]
  • Sales 811
  • Name: LEGO, dtype: int64

Same thing with only one line of command:

df.groupby('Company').sum().loc['TESLA']
  • Sales 434
  • Name: TESLA, dtype: int64

Other usefull agregate functions:

Count

df.groupby('Company').count()

code.report

Max value

df.groupby('Company').max()

code.report

Describe

We can also call a describe function that gives us mean, standard deviation, min, max and some other info

df.groupby('Company').describe()

code.report

Transpose

With the transpose function, you can transpose the rows, in this case companies, to to columns.

df.groupby('Company').describe().transpose()

code.report