Code.Report

Insights for productivity and tech related info

Dealing with missing values

2019-10-07 Code.Report Data Science, Python

Preparing data for an analysis is often the most tedious part of the battle. It can take up to 80% of the total analysis time, according to IBM Data Analytics.

Data cleaning is part of every data science project, and while preparing the data we often have to deal with missing values. In this article we will go over simple ways to detect, summarize, and replace missing values using Python with the NumPy and Pandas libraries.

Before we start, make sure the NumPy and Pandas dependencies are installed, using the command that matches your Python environment:

NumPy

conda install numpy
or
pip install numpy

Pandas

conda install pandas
or
pip install pandas

Importing the dependencies

import numpy as np
import pandas as pd

Let’s generate a dictionary with missing values. We can mark a value as missing using NumPy’s np.nan:

d = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}

Now let’s create a Pandas DataFrame with the dictionary and visualize it:

df = pd.DataFrame(d)
df
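Before removing or replacing anything, it helps to see where the gaps are. The following is a minimal sketch of detecting and summarizing missing values with isna; it rebuilds the same DataFrame so that it runs on its own, and the variable names (missing_mask, missing_per_column, total_missing) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the dictionary d defined above
d = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df = pd.DataFrame(d)

# Boolean mask: True wherever a value is missing
missing_mask = df.isna()

# Number of missing values per column (A: 1, B: 2, C: 0)
missing_per_column = df.isna().sum()

# Total number of missing values in the whole DataFrame (3)
total_missing = int(df.isna().sum().sum())

print(missing_mask)
print(missing_per_column)
print(total_missing)
```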


Sometimes we just need to remove the rows that contain missing data. The dropna function of a Pandas DataFrame removes every row that contains at least one missing value. In the example below, only the first row (index 0) has no missing values, so it is the only one kept.

df.dropna()
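One detail worth keeping in mind is that dropna returns a new DataFrame and leaves the original untouched. A small self-contained sketch, assuming the same example data (complete_rows is just an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# dropna returns a new DataFrame; df itself is not modified
complete_rows = df.dropna()

print(complete_rows)  # only the row with index 0 remains
print(len(df))        # the original still has 3 rows
```

To keep the result, assign it to a variable as shown here, or back to df.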


You can also pass the parameter axis=1 to remove columns with missing values instead of rows. In the example below, the only column without missing values is column C.

df.dropna(axis=1)
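As a quick check of the column-wise behaviour, here is a short sketch on the same example data. It also uses the string form axis='columns', which pandas accepts as an equivalent to axis=1; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Drop every column that contains at least one missing value
complete_columns = df.dropna(axis=1)

# axis='columns' is a more readable spelling of axis=1
same_result = df.dropna(axis='columns')

print(list(complete_columns.columns))  # only column C is left
```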


Threshold

You can set a threshold to control how much missing data you accept. The thresh parameter of dropna is the minimum number of non-missing values a row (or a column, with axis=1) must have in order to be kept; anything with fewer is removed. Our DataFrame has three columns, so thresh=2 keeps the rows with at least two non-missing values. In other words, it removes all rows with 2 or more missing values:

df.dropna(thresh=2)
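Because thresh counts the values that are present rather than the ones that are missing, it can help to look at those counts directly. The sketch below assumes the same example data; max_missing and the other variable names are hypothetical, introduced only to show how "allow at most N missing values" translates into a thresh value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Number of non-missing values in each row: 3, 2 and 1
non_missing_per_row = df.notna().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-missing values (indexes 0 and 1)
at_least_two = df.dropna(thresh=2)

# "Allow at most max_missing missing values per row" expressed as a thresh:
# number of columns minus the number of missing values tolerated
max_missing = 1
allow_one_missing = df.dropna(thresh=df.shape[1] - max_missing)

print(non_missing_per_row)
print(at_least_two)
```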


Filling missing values

The fillna function of a Pandas DataFrame lets you easily replace each missing value with the value passed as a parameter. Let’s fill the empty data with the string “Fill”:

df.fillna(value='Fill')
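Filling numeric columns with a string means those columns are no longer purely numeric, which may matter for later calculations. The sketch below, on the same example data, shows the string fill next to a per-column fill using a dictionary; the replacement values 0 and -1 are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Replace every missing value with the same string
filled_with_text = df.fillna(value='Fill')

# Replace missing values with a different value per column
# (0 for column A, -1 for column B; arbitrary example values)
filled_per_column = df.fillna(value={'A': 0, 'B': -1})

print(filled_with_text)
print(filled_with_text.dtypes)  # inspect how the column types changed
print(filled_per_column)
```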


Filling data with mean values

Sometimes it is better to fill missing data with a value that makes more sense for your analysis. In the example below, we take column ‘A’ of the DataFrame df and fill its missing values with the mean of the non-missing values of that same column. The command returns a new column and does not change df itself. Let’s see how that looks with only one command:

df['A'].fillna(value=df['A'].mean())
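To keep the filled values, the result has to be assigned back. Here is a minimal sketch on the same example data, working on a copy so the original stays intact; it also shows one possible way to fill every column with its own mean by passing df.mean() to fillna. The names filled_a and filled_all are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# mean() skips missing values by default: (1 + 2) / 2 = 1.5
mean_a = df['A'].mean()

# Work on a copy and assign the filled column back to keep the result
filled_a = df.copy()
filled_a['A'] = filled_a['A'].fillna(value=mean_a)

# Fill every column with its own mean in one step
filled_all = df.fillna(value=df.mean())

print(filled_a)
print(filled_all)
```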
