3. Data visualisation

Data visualisation is an important skill for data scientists. In fact, data manipulation and visualisation go hand in hand. Before any analysis, it is important to visualise the data to explore distributions, relationships between variables, normality, and so on.

We will use the matplotlib library for data visualisation in Python. Although there are many other plotting packages, matplotlib is the foundational library that several of them build on, so it is worth mastering before exploring more advanced libraries. You can import matplotlib as follows. We will also need the pandas package to read data:

import matplotlib.pyplot as plt

import pandas as pd

We will be working with the gapminder dataset, which contains real-world data. I have saved this data as a CSV (comma-separated values) file on GitHub. A CSV file is a text file that stores data in a tabular format. You will use the read_csv() function from pandas to read this file and assign the result to gapminder:

gapminder = pd.read_csv("https://raw.githubusercontent.com/aubreympungose/data-science-course/main/weeks/data/gapminder.csv")

# Take a look at the first five rows of the data
gapminder.head()

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

We have loaded the dataset; head() shows its first five rows and all of its columns. To get a summary of the columns and their data types, use info():

gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

We can see that gapminder has 6 columns and 1704 rows. The columns in the dataset are:

country: Name of the country
continent: Continent the country belongs to
year: The year the observation refers to
lifeExp: Life expectancy (in years) in the country in that year
pop: Population of the country in that year
gdpPercap: Gross Domestic Product (GDP) per capita of the country in that year
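
The row and column counts reported by info() can also be read directly from the DataFrame. The sketch below rebuilds the five rows shown in the head() output as a small stand-in DataFrame (the name sample is only for illustration), so it runs without downloading anything; on the full dataset the same calls would be made on gapminder.

```python
import pandas as pd

# The five rows printed by gapminder.head() earlier, rebuilt by hand
sample = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "continent": ["Asia"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
    "pop": [8425333, 9240934, 10267083, 11537966, 13079460],
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138, 739.981106],
})

# shape is a (rows, columns) tuple
print(sample.shape)

# columns lists the column names in order
print(list(sample.columns))
```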

This is panel data: it tracks the same countries over time. Look at the range of years:

print(gapminder["year"].min())

print(gapminder["year"].max())

1952
2007

The dataset contains observations from 1952 to 2007.
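
min() and max() give only the endpoints of the range. To see which years are actually present, one option is unique(). The sketch below uses the years from the five rows shown in the head() output (the name sample is only for illustration); on the full dataset you would call gapminder["year"].unique() instead.

```python
import pandas as pd

# Years taken from the head() output shown earlier
sample = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
})

# unique() returns each distinct value once, in order of first appearance
years = sample["year"].unique()
print(years)
```

In this excerpt the observations are five years apart, so a scatterplot of these rows shows one point per country every five years rather than one per year.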

3.1. A first plot

Suppose you want to show the relationship between life expectancy and GDP per capita. We can create a scatterplot:

plt.scatter(x = gapminder["gdpPercap"], y = gapminder["lifeExp"])

plt.show()

_images/703f3dd86ffe6dede543fb1267b236a73863ed038094638a102aa19af475f58b.png

We have created our first plot. Let us examine the code above:

We used the scatter() function from pyplot, a submodule of matplotlib that we imported as plt.
We specified the two columns to plot: gdpPercap on the x-axis and lifeExp on the y-axis.
We then used the show() function from pyplot to display the plot.
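
The scatter call above passes the two columns as pandas Series. As an alternative, scatter() also accepts column names as strings when the DataFrame is supplied through the data argument. The sketch below assumes a small stand-in DataFrame built from the rows in the head() output (the names sample and points are only for illustration).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Values copied from the head() output shown earlier
sample = pd.DataFrame({
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138, 739.981106],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
})

# With data=..., the strings are looked up as column names in the DataFrame
points = plt.scatter(x="gdpPercap", y="lifeExp", data=sample)

plt.show()
```

Both forms draw the same points; the string form is simply shorter when many columns come from the same DataFrame.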

3.2. Labels

Notice that our first plot has no labels on either axis and no title. We can add all of these:

plt.scatter(x = gapminder["gdpPercap"], y = gapminder["lifeExp"])

# Set the x-axis label
plt.xlabel('GDP per capita')

# Set the y-axis label
plt.ylabel('Life expectancy')

# set title of the plot
plt.title('GDP vs Life Expectancy (1952-2007)')

# show the plot

plt.show()

_images/a0587fd230a11a2610464d09e09c7c584d8c6979fb161b82c97d6947b6981bd7.png
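
The plt.xlabel(), plt.ylabel() and plt.title() calls above act on an implicit "current" plot. Matplotlib also has an explicit interface, where you create the figure and axes yourself and call methods on them. This chapter does not rely on it, so treat the following as an optional sketch; it uses a stand-in DataFrame built from the rows in the head() output (the name sample is only for illustration).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Values copied from the head() output shown earlier
sample = pd.DataFrame({
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138, 739.981106],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
})

# Create a figure and a single set of axes explicitly
fig, ax = plt.subplots()

ax.scatter(sample["gdpPercap"], sample["lifeExp"])

# These Axes methods mirror plt.xlabel(), plt.ylabel() and plt.title()
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Life expectancy")
ax.set_title("GDP vs Life Expectancy (Afghanistan, 1952-1972)")

plt.show()
```

The explicit form becomes more useful once a figure holds several plots, because each set of axes can be labelled independently.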

3.2.1. Transforming axis scales

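Notice that GDP per capita is heavily right-skewed: most points are bunched together at low values, while a few are very large. One way to handle this is to put the x-axis on a log10 scale, which spreads out the low values and makes the relationship easier to see: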

plt.scatter(x = gapminder["gdpPercap"], y = gapminder["lifeExp"])

plt.xlabel('GDP per capita')

plt.ylabel('Life expectancy')

plt.title('GDP vs Life Expectancy (1952-2007)')

# Apply log scale to x-axis
plt.xscale('log')

plt.show()

_images/746ebeb062bd1b5e8a1a82392aaff199d4c76b18abfc45d99cf2df14b3bb00f9.png
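
plt.xscale('log') changes only how the axis is drawn; the data stay in their original units. A different approach is to transform the column itself with NumPy's log10() and plot the result on an ordinary linear axis. The sketch below uses hypothetical values chosen to make the effect obvious (they are not taken from gapminder), and the names sample and log_gdp are only for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values for illustration only; not real gapminder figures
sample = pd.DataFrame({
    "gdpPercap": [100.0, 1000.0, 10000.0, 100000.0],
    "lifeExp": [40.0, 55.0, 70.0, 80.0],
})

# Transform the data itself instead of the axis
log_gdp = np.log10(sample["gdpPercap"])
print(log_gdp.tolist())

# The x-axis is linear here, but its ticks are now in log10 units
plt.scatter(x=log_gdp, y=sample["lifeExp"])
plt.xlabel("log10(GDP per capita)")
plt.ylabel("Life expectancy")

plt.show()
```

The points end up in the same relative positions either way. With plt.xscale('log') the tick labels stay in the original units, which is usually easier to read; transforming the data is handy when the logged values are needed later in an analysis.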

3.2.2. Customise: colour, size, shape

Sometimes you may need to change how data points appear. Suppose you want all the countries belonging to the same continent to share a colour. Here, you would create a dictionary in which each continent name is a key and its colour is the value, build a list of colours with one entry per row, and then pass that list to scatter() through the c argument:

colour_dict = {
    'Asia':'red',
    'Europe':'green',
    'Africa':'blue',
    'Americas':'yellow',
    'Oceania':'black'
}

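# Look up the colour for each row's continent
colors = [colour_dict[continent] for continent in gapminder['continent']]

# Pass the list of colours, one per point, to the c argument
plt.scatter(gapminder['gdpPercap'], gapminder['lifeExp'], c=colors)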

plt.xlabel('GDP per capita')

plt.ylabel('Life expectancy')

plt.title('GDP vs Life Expectancy (1952-2007)')

plt.xscale('log')


plt.show()

_images/3f96c4b4d1d0c848ee380bc228732f3ab27c87bc3a33358444b0b94ea6871d27.png
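
The coloured plot above does not say which colour stands for which continent. One way to add a legend is to draw each continent with its own scatter() call and a label, then call plt.legend(). The sketch below is an assumption about how this could be done, not the approach used above; it uses a few hypothetical rows (not real gapminder figures) and only three continents so that it runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows for illustration only; not real gapminder figures
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa", "Africa"],
    "gdpPercap": [800.0, 3000.0, 9000.0, 30000.0, 500.0, 2000.0],
    "lifeExp": [45.0, 65.0, 70.0, 79.0, 40.0, 55.0],
})

colour_dict = {"Asia": "red", "Europe": "green", "Africa": "blue"}

# One scatter() call per continent, so each gets its own legend entry
for continent, group in sample.groupby("continent"):
    plt.scatter(group["gdpPercap"], group["lifeExp"],
                color=colour_dict[continent], label=continent)

plt.xlabel("GDP per capita")
plt.ylabel("Life expectancy")
plt.xscale("log")

# Build the legend from the label given to each scatter() call
legend = plt.legend(title="Continent")

plt.show()
```

On the full dataset the same loop would run over gapminder.groupby("continent") with the five-colour dictionary defined earlier.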