3. Data visualisation

Data visualisation is an important skill for data scientists, and it goes hand in hand with data manipulation. Before any analysis, it is important to visualise the data to explore its distribution, the relationships between variables, and whether assumptions such as normality are plausible.

We will use the matplotlib library for data visualisation in Python. Although there are many other plotting packages, matplotlib is the foundational library, so it is worth mastering before exploring more advanced ones. You can import its pyplot module as follows. We will also need the pandas package to read the data:

import matplotlib.pyplot as plt

import pandas as pd

We will be working with the gapminder dataset, which contains real-world data. I have saved this data as a CSV (comma-separated values) file on GitHub. A CSV file is a plain-text file that stores data in a tabular format. You will use the read_csv() function from pandas to read this file and assign the result to gapminder:

gapminder = pd.read_csv("https://raw.githubusercontent.com/aubreympungose/data-science-course/main/weeks/data/gapminder.csv")

# Take a look at the first five rows of the data
gapminder.head()
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106

We have loaded the dataset: each row is one observation and each column is one variable. The .info() method summarises the columns and their data types:

gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

We can see that gapminder has 6 columns and 1704 rows. The columns in the dataset are:

  • country: Name of the country

  • continent: Continent the country belongs to

  • year: The year the data was collected

  • lifeExp: Life expectancy in the country in that year

  • pop: Population of the country in that year

  • gdpPercap: Gross Domestic Product (GDP) per capita of the country in that year
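The row and column counts can also be read directly from the shape attribute of a DataFrame. The following is a minimal sketch: so that it runs without a network connection, it uses a small hand-built DataFrame (named sample here, a hypothetical name) that copies the five rows printed by head() above, rather than the full gapminder data.

```python
import pandas as pd

# Hypothetical stand-in for gapminder: the five rows printed by head() above
sample = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "continent": ["Asia"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
    "pop": [8425333, 9240934, 10267083, 11537966, 13079460],
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138, 739.981106],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# columns holds the column names in order
print(list(sample.columns))
```

Running the same two lines on gapminder itself should report the 1704 rows and 6 columns listed by .info().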

This is panel (longitudinal) data: the same countries are observed repeatedly over time. Look at the range of years:

print(gapminder["year"].min())

print(gapminder["year"].max())
1952
2007

The dataset contains observations collected from 1952 to 2007.
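Both bounds can be computed in a single call, and a group-by shows how many distinct years each country has. This is only a sketch: it uses a hypothetical sample DataFrame containing the country and year values from the five rows shown earlier, so its output covers 1952 to 1972 rather than the full range.

```python
import pandas as pd

# Hypothetical stand-in for gapminder: country and year from the five rows shown earlier
sample = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
})

# Compute the minimum and maximum year in one call
year_range = sample["year"].agg(["min", "max"])
print(year_range)

# Count how many distinct years are recorded for each country
obs_per_country = sample.groupby("country")["year"].nunique()
print(obs_per_country)
```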

3.1. A first plot

Suppose you want to show the relationship between life expectancy and GDP per capita. We can create a scatter plot:

plt.scatter(x = gapminder["gdpPercap"], y = gapminder["lifeExp"])

plt.show()
[Figure: scatter plot of life expectancy against GDP per capita, without axis labels or a title]

We have created our first plot. Let us examine the code above:

  • We used the scatter() function from the pyplot module of matplotlib, which we imported as plt.

  • We specified the two columns to plot: gdpPercap on the x-axis and lifeExp on the y-axis.

  • We then used the show() function from pyplot to display the plot.
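The same plot can also be written with matplotlib's object-oriented interface, where plt.subplots() returns a figure and an axes object and the plotting methods are called on the axes. This is an alternative to the pyplot calls used in this section, not a replacement for them. The sketch below uses a hypothetical sample DataFrame built from the five rows shown earlier so that it runs without downloading the data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for gapminder: two columns from the five rows shown earlier
sample = pd.DataFrame({
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138, 739.981106],
})

# Create a figure and a single axes to draw on
fig, ax = plt.subplots()

# Draw the scatter plot on that axes; the returned object holds the points
points = ax.scatter(x=sample["gdpPercap"], y=sample["lifeExp"])

plt.show()
```

Keeping a reference to ax becomes useful once a figure contains several panels, because each panel can then be customised separately.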

3.2. Labels

Notice that our first plot has no labels on either axis and no title. We can add all of these:

plt.scatter(x = gapminder["gdpPercap"], y = gapminder["lifeExp"])

# Set the x-axis label
plt.xlabel('GDP per capita')

# Set the y-axis label
plt.ylabel('Life expectancy')

# Set the title of the plot
plt.title('GDP vs Life Expectancy (1952-2007)')

# Show the plot
plt.show()
[Figure: the same scatter plot with axis labels and the title "GDP vs Life Expectancy (1952-2007)"]

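3.2.1. Transforming axis scales

Notice that GDP per capita is strongly right-skewed: most observations are bunched at low values and a few are very large, so the points crowd against the left edge of the plot. A common remedy is to draw the x-axis on a logarithmic (base 10) scale, which spreads the points out. This changes only how the axis is drawn, not the underlying data: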

plt.scatter(x = gapminder["gdpPercap"], y = gapminder["lifeExp"])

plt.xlabel('GDP per capita')

plt.ylabel('Life expectancy')

plt.title('GDP vs Life Expectancy (1952-2007)')

# Apply log scale to x-axis
plt.xscale('log')


plt.show()
[Figure: the labelled scatter plot with GDP per capita on a logarithmic x-axis]
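Setting a log scale on the axis is different from log-transforming the values themselves. The sketch below contrasts the two approaches side by side, using NumPy's log10() for the second. It assumes a hypothetical sample DataFrame built from the five rows shown earlier; with the full dataset the effect is much more visible.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for gapminder: two columns from the five rows shown earlier
sample = pd.DataFrame({
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138, 739.981106],
})

fig, (ax_scale, ax_data) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: leave the data unchanged and draw the x-axis on a log scale.
# Tick labels stay in the original units.
ax_scale.scatter(sample["gdpPercap"], sample["lifeExp"])
ax_scale.set_xscale("log")
ax_scale.set_xlabel("GDP per capita (log scale)")
ax_scale.set_ylabel("Life expectancy")

# Option 2: transform the values with log10 and keep a linear axis.
# Tick labels now show powers of ten (for example 3 means 1000).
log_gdp = np.log10(sample["gdpPercap"])
ax_data.scatter(log_gdp, sample["lifeExp"])
ax_data.set_xlabel("log10(GDP per capita)")
ax_data.set_ylabel("Life expectancy")

plt.show()
```

Both panels arrange the points in the same pattern. The first is usually easier to read because the tick labels remain in the original units.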

3.2.2. Customise: colour, size, shape

Sometimes you may need to change how the data points appear. Suppose you want all the countries in the same continent to share one colour. To do this, create a dictionary in which each continent name is a key and its colour is the value, build a list of colours (one per row), and pass that list to the c argument of scatter():

colour_dict = {
    'Asia':'red',
    'Europe':'green',
    'Africa':'blue',
    'Americas':'yellow',
    'Oceania':'black'
}

# Look up the colour for each row's continent
colors = [colour_dict[continent] for continent in gapminder['continent']]

plt.scatter(gapminder['gdpPercap'], gapminder['lifeExp'], c=colors)

plt.xlabel('GDP per capita')

plt.ylabel('Life expectancy')

plt.title('GDP vs Life Expectancy (1952-2007)')

plt.xscale('log')


plt.show()
[Figure: the log-scale scatter plot with points coloured by continent]
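The coloured plot has no legend, so a reader cannot tell which colour stands for which continent. One possible way to add one, sketched below, is to build legend entries by hand from the same dictionary using Line2D objects. The sketch also swaps the list comprehension for the pandas map() method, which does the same lookup. It uses a hypothetical sample DataFrame built from the five rows shown earlier, so only the Asia colour appears among the points while the legend lists every continent in the dictionary.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.lines import Line2D

colour_dict = {
    'Asia': 'red',
    'Europe': 'green',
    'Africa': 'blue',
    'Americas': 'yellow',
    'Oceania': 'black'
}

# Hypothetical stand-in for gapminder: three columns from the five rows shown earlier
sample = pd.DataFrame({
    "continent": ["Asia"] * 5,
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138, 739.981106],
})

# map() looks up each continent in the dictionary, one colour per row
colours = sample["continent"].map(colour_dict).tolist()

fig, ax = plt.subplots()
ax.scatter(sample["gdpPercap"], sample["lifeExp"], c=colours)

# Build one legend entry per continent: an empty line drawn as a single coloured marker
handles = [
    Line2D([], [], marker="o", linestyle="", color=colour, label=continent)
    for continent, colour in colour_dict.items()
]
ax.legend(handles=handles, title="Continent")

ax.set_xscale("log")
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Life expectancy")

plt.show()
```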