Data Detective Series:

Vinoo Jacob
3 min read · Sep 22, 2019


Exploratory Data Analysis Essentials for Beginners — Part 3

Now that we have a good idea of the data we are analysing from Part 1 and Part 2, it's time to understand which parts of it are interesting and which need some more work before we can use them effectively. Think of this as examining your evidence in detail to see which pieces may help you most, and to develop some approaches to solving the puzzle.

  1. Analysing each variable one at a time (univariate analysis)

The first step is to understand how the data is distributed. The distribution can give us clues about outliers that need to be addressed, and can also tell us whether any additional processing is required, which we will talk about in a little bit.

A good way to look at the distribution is a box plot. The Matplotlib library is a good place to start.

import matplotlib.pyplot as plt
df['trip_duration'].plot.box()

How to interpret a box plot —

  • The green line indicates the median of the data
  • The top and bottom of the box represent the 3rd and 1st quartiles respectively
  • Outliers are usually shown as circles outside the box
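Those outlier circles follow a common rule of thumb: points more than 1.5 times the interquartile range (IQR) beyond the box edges. As a rough sketch of how you could compute those fences yourself (the `trip_duration` values here are made up for illustration):

```python
import pandas as pd

# Hypothetical trip durations in seconds; the last value is an extreme outlier
df = pd.DataFrame({"trip_duration": [400, 650, 800, 1200, 1500, 2100, 86400]})

q1 = df["trip_duration"].quantile(0.25)  # bottom of the box
q3 = df["trip_duration"].quantile(0.75)  # top of the box
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are drawn as outlier circles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = df[(df["trip_duration"] < lower_fence) | (df["trip_duration"] > upper_fence)]
print(outliers)
```

Here only the 86,400-second trip falls outside the fences, which matches what the box plot would flag.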

As you can see in this box plot, there is an outlier that completely skews the data; you can see this by comparing the outlier value to the box. Since this is data about taxi trips in New York, you can assume that either the value was entered incorrectly or it is an exception.

Let's assume that most trips in New York take at most 2.5 hours (9,000 seconds), so we can clean up the data by excluding the outliers above this number.

df_trip_duration_clean = df[df['trip_duration'] < 9000]

Let's take a look at the plot again after eliminating these outliers.

As you can see, there are still quite a few outliers; 3,600 seconds (one hour) seems to be a better limit for this data. Again, domain knowledge helps immensely in determining these levels and in checking, from the perspective of the problem, whether such assumptions are valid.

So after we remove the outliers above 3,600 seconds, our plot looks like this.

It is also a good idea to check how many records you will lose by removing these outliers. If the percentage of records is very high, it could affect downstream processes such as training your model.

len(df) - len(df_trip_duration_clean)
(len(df) - len(df_trip_duration_clean)) * 100 / len(df)

Fortunately, in this case it is only about 0.8%, which is unlikely to affect our downstream processes.

As you can see, plotting numerical variables helps us to get a good understanding of the data. In Part 4, we will look at some more ways to analyse the data.

Written by Vinoo Jacob

A data enthusiast, data curator and passionate about transforming businesses using data analytics and digital technologies.