3 Data Visualisation

Before We Begin

Base R vs External Packages

Until now, we have used functions from Base R, which come already installed with R. Base (built-in) R functions are important, but in many cases we need external packages to get a task done. The same is true of other programming languages such as Python. For example, if we want to do spatial and GIS analysis, we can install the sf package; for machine learning, we can use the caret and tidymodels packages. There are roughly 20,000 R packages, contributed by individuals around the world, and they are stored and curated on the CRAN website. In most cases, you will be working with external packages.

One of the most popular packages in R is the tidyverse meta-package, which includes a collection of packages for working with data. Some of the packages in the tidyverse are:

  • dplyr: for data cleaning, wrangling and transformation

  • ggplot2: for data visualisation

  • tidyr: for tidying up data

  • readr: for importing data

  • purrr: for advanced functional programming

  • stringr: for manipulating string/text data

The tidyverse contains other packages as well. Its underlying philosophy is tidy data:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value. [@r_4_ds]

Tidy data is a principle we are going to stick to throughout this course.

[Figure: Tidy data. Source: @wickham_tidyverse_2019]
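To make the three rules concrete, here is a minimal sketch that reshapes a small untidy table into a tidy one. The table and its values are made up purely for illustration, and the choice of pivot_longer() from the tidyr package is our own; it is one common way to do this, not the only one.

```r
library(tidyr)

# Hypothetical untidy table: the years are spread across column names,
# so the variable "year" is hidden in the header. Values are made up.
untidy <- data.frame(
  country = c("A", "B"),
  `2002` = c(10, 20),
  `2007` = c(12, 25),
  check.names = FALSE
)

# Gather the year columns so that each variable is a column
# and each country-year observation is a row.
tidy <- pivot_longer(
  untidy,
  cols = c(`2002`, `2007`),
  names_to = "year",
  values_to = "value"
)

tidy
```

After reshaping, every row is one country in one year, which is the same layout the gapminder data used later in this section already has.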

All in all, tidyverse makes it easier to work with dataframes, and most people prefer it to base R functions. We will use an example at the end to understand the differences between Base R and tidyverse. You will need to install the package first. In R, you install a package by calling the install.packages() function:

install.packages("tidyverse")

Whenever you need to use an external package and its functions, you first need to load it using the library() function. In our case, we want to load the tidyverse package we have just installed:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

You will load other packages in the same way.
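The Conflicts part of the message above means that, once tidyverse is loaded, a plain call to filter() or lag() uses the dplyr version. A minimal sketch of how you could still reach the masked base function, using the package::function() syntax (the moving-average example itself is our own illustration):

```r
# stats::filter() is the base R function that dplyr::filter() masks.
# Prefixing it with stats:: works whether or not tidyverse is loaded.
# Here it computes a centred moving average over three values.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

as.numeric(moving_avg)
```

The same prefix works for any package, so dplyr::filter() always refers to the dplyr version.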

3.1 Introduction to data visualisation

Data visualisation is an important skill for data scientists. In fact, data manipulation and visualisation go hand in hand. Before any analysis, it is important to visualise data to explore its distribution, relationships, normality, etc.

In this section, we will use the ggplot2 package within tidyverse to learn the foundations of data visualisation. The ggplot2 package takes its philosophy from the book The Grammar of Graphics, written by Leland Wilkinson. It was developed by Hadley Wickham, one of the most influential data scientists of this era.

We will be working with the gapminder dataset, which contains real-world data. You will need to install it first because it comes as a package:

install.packages("gapminder")

After installing the gapminder data, you will have to load it using the library() function:

library(gapminder)

Remember that we said everything we create is an object that we need to assign? Let us assign the gapminder data to an object named simply gapminder, using the <- operator:

gapminder <- gapminder

Explore the data first: how many columns and rows are in the gapminder dataframe? We will use the str() function:

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

We can see that gapminder has 6 columns and 1,704 rows. The columns in the dataset are:

  • country: Simply the country

  • continent: Continent

  • year: The year data was collected

  • lifeExp: Life expectancy at birth in the country in that year

  • pop: Population of the country in that year

  • gdpPercap: Gross Domestic Product (GDP) per capita of the country in that year
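Besides str(), a few other base R functions are handy for a first look at the data. The following is a minimal sketch; the object names n_countries and n_years are our own.

```r
library(gapminder)

# Number of rows, then number of columns
dim(gapminder)

# First three rows of the data
head(gapminder, 3)

# Five-number summary and mean of life expectancy
summary(gapminder$lifeExp)

# How many distinct countries and years are covered?
n_countries <- length(unique(gapminder$country))
n_years <- length(unique(gapminder$year))
n_countries
n_years
```

Each country appears once per survey year, which is why the number of rows equals the number of countries multiplied by the number of years.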

ggplot2 has a sequence of steps you follow to create a plot. Let us illustrate using the gapminder dataset. Load the ggplot2 package first (it is already attached if you loaded tidyverse):

library(ggplot2)

A plot created using ggplot2 has the following components/layers, and we will go through them step by step:

[Figure: the components/layers of a ggplot2 plot]

3.2 Create a plot

From the gapminder dataframe, we will create a scatterplot of life expectancy against GDP per capita, and add all the components of ggplot step by step.

3.2.1 Layer 1: data

We use the ggplot() function to add data, in this case the gapminder dataframe. On its own, this only draws an empty canvas:

ggplot(data = gapminder)

3.2.2 Layer 2: aesthetics

Aesthetics are used to allocate the x and y variables, depending on the type of plot we want to create. In this case, the x variable is gdpPercap and the y variable is lifeExp:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

There are other aesthetics that we can add, such as size, colour, shape, group, etc. We will use these later in this section.

3.2.3 Layer 3: geometry

Geometry is the type of plot/object we want to create. In our case, we want to create a scatterplot by using the geom_point() function:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

We now have our first plot. There are other geometry types in ggplot, depending on the type of data you have:

  • geom_point() : for scatterplots

  • geom_line() : for line plots

  • geom_histogram() : for histograms

  • geom_area() : for area charts

  • geom_boxplot() : for boxplots

  • geom_bar() : for bar graphs

In the code above, we have three steps to create a plot:

  1. ggplot(data = gapminder): we are simply telling ggplot that we are using the gapminder dataset.

  2. ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)): we are adding the aesthetic mapping, allocating variables to the x and y axes.

  3. ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point(): we have added a geometry layer through the geom_point() function to create a scatterplot.
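Because a plot is an ordinary R object, the three steps above can also be written by saving each stage and adding to it with +. This is a minimal sketch; the object names p_data, p_aes and p_points are our own.

```r
library(ggplot2)
library(gapminder)

# Step 1: data only (prints an empty canvas)
p_data <- ggplot(data = gapminder)

# Step 2: data plus aesthetics (axes appear, but no points yet)
p_aes <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

# Step 3: add a geometry layer to the saved object
p_points <- p_aes + geom_point()

# Typing the name of a plot object draws it
p_points
```

Saving intermediate objects like this can be convenient when you want to try several geometries on the same data and aesthetics.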

3.2.4 Layer 4: Labels

The ggplot2 package can handle various plot labels, including axis titles and graph titles. We can set them using the labs() function:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +
  labs(x = "GDP per capita",
       y = "Life expectancy",
       title = "GDP vs Life Expectancy (1952-2007)",
       subtitle = "People in developed countries tend to live longer",
       caption = "Data source: Gapminder")

3.2.5 Facets

Faceting is used when we’d like to split a particular visualisation by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.

When we want to split the plot into sub-plots by a categorical variable, we use the facet_wrap() function. For example, we may want to split the above plot by continent:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +
  facet_wrap(~continent)

facet_wrap() takes other arguments as well. We can specify the number of rows and columns using the nrow and ncol arguments, respectively:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +
  facet_wrap(~continent, ncol = 5)
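As a further sketch, the nrow argument works the same way as ncol, and the scales argument of facet_wrap() relaxes the matching axes mentioned earlier. The particular values chosen here (two rows, free x-axes) are only an illustration.

```r
library(ggplot2)
library(gapminder)

# Arrange the five continent panels in two rows and let each
# panel pick its own x-axis range instead of sharing one.
p_facets <- ggplot(data = gapminder,
                   aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_wrap(~continent, nrow = 2, scales = "free_x")

p_facets
```

Free scales make each panel easier to read, but they also make panels harder to compare with one another, so use them with care.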

3.2.6 Transforming axis scales

Notice that the values on the x-axis are heavily right-skewed: most points are squeezed against the left-hand side of the plot. One way to handle this is to put the x-axis on a log10 scale, which spreads the points out:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

Look how it changes.

There are many scale functions, and you will learn them along the way by coding and exploring ggplot.
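To see why the log10 scale spreads the points out, it helps to look at what log10() does to a few numbers. The values below are made up for illustration and are not taken from gapminder.

```r
# Hypothetical GDP per capita values, each ten times the previous one
gdp_examples <- c(100, 1000, 10000, 100000)

# On the original scale the gaps grow enormously...
diff(gdp_examples)

# ...but on the log10 scale every tenfold increase is one equal step
log10(gdp_examples)
diff(log10(gdp_examples))
```

scale_x_log10() applies this transformation to the axis while still labelling it in the original units.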

3.2.7 Returning to aesthetics

We can add other aesthetics to our plot. For example, we can allocate colour to the continent column:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  scale_x_log10()

Notice how countries in Europe tend to have higher GDP per capita and higher life expectancy compared to African countries.

We can also change the shape of points in the aesthetics:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp, colour = continent, shape = continent)) +
  geom_point() +
  scale_x_log10()

There are many other aesthetic arguments, and most of them are beyond the scope of this course. Getting comfortable with them takes practice.
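One point that often trips up beginners is the difference between mapping an aesthetic to a column (inside aes()) and setting it to a fixed value (outside aes()). Here is a minimal sketch of both; the colour "steelblue" and the transparency value are arbitrary choices of ours.

```r
library(ggplot2)
library(gapminder)

# Mapped: colour depends on the data, so it goes inside aes()
# and ggplot2 adds a legend for it.
p_mapped <- ggplot(data = gapminder,
                   aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(colour = continent)) +
  scale_x_log10()

# Fixed: every point gets the same colour and transparency,
# so the values go outside aes() and no legend is drawn.
p_fixed <- ggplot(data = gapminder,
                  aes(x = gdpPercap, y = lifeExp)) +
  geom_point(colour = "steelblue", alpha = 0.5) +
  scale_x_log10()

p_mapped
p_fixed
```

A quick rule of thumb: if the value is a column name, it belongs inside aes(); if it is a constant such as a colour name or a number, it belongs outside.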

3.2.8 Themes

Try experimenting with the different themes that come with ggplot. theme_minimal() will produce a minimalist theme with less background:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp, colour = continent)) + 
  geom_point() +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy",
       title = "GDP vs Life Expectancy (1952-2007)",
       subtitle = "People in developed countries tend to live longer",
       caption = "Data source: Gapminder") +
  theme_minimal()

There are other themes that can transform your plots to look more elegant.

You can also choose how fonts appear using the theme() function:

ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp, colour = continent)) + 
  geom_point() +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy",
       title = "GDP vs Life Expectancy (1952-2007)",
       subtitle = "People in developed countries tend to live longer",
       caption = "Data source: Gapminder") +
  theme(text = element_text(size = 15))

With the theme() function, you can remove borders, change the colour of fonts, remove the legend, etc.
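As an illustration of those options, the sketch below removes the legend, colours and emboldens the title, and drops the minor grid lines. The specific colour and settings are our own choices, not recommendations.

```r
library(ggplot2)
library(gapminder)

p_themed <- ggplot(data = gapminder,
                   aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "GDP vs Life Expectancy (1952-2007)") +
  theme_minimal() +
  theme(
    legend.position = "none",                       # remove the legend
    plot.title = element_text(face = "bold",        # bold title...
                              colour = "darkblue"), # ...in a different colour
    panel.grid.minor = element_blank()              # drop minor grid lines
  )

p_themed
```

Note that theme() is added after theme_minimal(); a complete theme added later would overwrite the individual settings.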

3.3 Visualising Numerical data

3.3.1 Single variable

For visualising one variable, we mostly use a histogram, a density plot, etc.

Create a histogram of GDP per capita:

ggplot(data = gapminder, aes(gdpPercap)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
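The message above is ggplot2 reminding us that it picked 30 bins by default. A minimal sketch of choosing the bins ourselves follows; the values 50 and 5000 are arbitrary and only meant as a starting point for experimenting.

```r
library(ggplot2)
library(gapminder)

# Option 1: fix the number of bins
p_hist_bins <- ggplot(data = gapminder, aes(gdpPercap)) +
  geom_histogram(bins = 50)

# Option 2: fix the width of each bin, in the units of the x variable
p_hist_width <- ggplot(data = gapminder, aes(gdpPercap)) +
  geom_histogram(binwidth = 5000)

p_hist_bins
p_hist_width
```

Either argument silences the message, and trying a few values is a good way to check that the shape you see does not depend on the binning.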

We can see that the GDP per capita variable is skewed. A density plot is a smoothed alternative to a histogram:

ggplot(data = gapminder, aes(gdpPercap)) +
  geom_density()

3.3.2 Visualising more than one numerical variable

A scatter plot is used to show the relationship between two variables:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + 
    geom_point()

We can add other aesthetics such as shape, colour, etc. Let’s add the colour aesthetic:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) + 
    geom_point()

We can also add the size aesthetic and map it to the population of the country:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent, size = pop)) + 
  geom_point()

3.3.3 Visualising numerical by group/category

A boxplot is useful when we want to view summary statistics by a particular group, say, GDP per capita by continent:

ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) +
  geom_boxplot()

We can also use a column chart, say, to view GDP per capita by continent:

ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) +
  geom_col()
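Note that geom_col() stacks one bar segment per row, so the chart above shows the sum of every country-year value in each continent rather than a typical value. If an average per continent is what you are after, one option is to summarise the data first. The sketch below uses base R's aggregate(); the object name mean_gdp and the choice of the mean are our own.

```r
library(ggplot2)
library(gapminder)

# One row per continent, holding the mean GDP per capita
# across all countries and years in that continent
mean_gdp <- aggregate(gdpPercap ~ continent, data = gapminder, FUN = mean)
mean_gdp

# Now each bar is a single summarised value
ggplot(data = mean_gdp, aes(x = continent, y = gdpPercap)) +
  geom_col() +
  labs(y = "Mean GDP per capita")
```

Because the variable is skewed, the median (FUN = median) may be a fairer summary than the mean.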

3.4 Visualising categorical/character data

In the gapminder dataset, we have two categorical columns: country and continent. One of the plots used for categorical data is the bar chart. In ggplot2, we use geom_bar():

ggplot(data = gapminder, aes(x = continent)) + 
  geom_bar()
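Keep in mind that geom_bar() counts rows, and each country appears once per survey year, so the bars above show country-year records rather than countries. A minimal sketch of checking the counts and then counting countries instead, by keeping a single year (the choice of 2007 and the name gap_2007 are ours):

```r
library(ggplot2)
library(gapminder)

# The same counts that geom_bar() draws
continent_counts <- table(gapminder$continent)
continent_counts

# Keep one row per country by selecting a single year
gap_2007 <- gapminder[gapminder$year == 2007, ]

# Now each bar is the number of countries in the continent
ggplot(data = gap_2007, aes(x = continent)) +
  geom_bar()
```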

Bonus one: Interactive charts

You can make your charts interactive by using the plotly package. You will need to install it first:

install.packages("plotly")

Then load the package:

library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout

First, create a plot using ggplot() and save it using the <- operator:

first_plot <- ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp, colour = continent)) + 
  geom_point() +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy",
       title = "GDP vs Life Expectancy (1952-2007)",
       subtitle = "People in developed countries tend to live longer",
       caption = "Data source: Gapminder") +
  theme_minimal() 

We named the plot first_plot. From the plotly package, you are going to use the ggplotly() function and pass it the plot object you have created:

ggplotly(first_plot)

Experiment with the result: when you hover over the plot, it shows the values of the variables for each point. You can choose which continents are visible by clicking on the legend. Beautiful!

Bonus Two: Animate

You can create an animated chart using the gganimate package. Install first:

install.packages("gganimate")

Load the package:

library(gganimate)

We would like to see how life expectancy and GDP per capita change over time. First create the plot, adding a few extra functions:

animated_plot <- ggplot(data = gapminder,
       aes(x = gdpPercap, 
           y = lifeExp, 
           size = pop,
           colour = continent)) + 
  geom_point() +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy",
       title = "GDP vs Life Expectancy (1952-2007)",
       subtitle = 'Year: {frame_time}',
       caption = "Data source: Gapminder") +
  theme_minimal() +
  transition_time(year) +
  ease_aes('linear') 


animate(animated_plot)

Look at the results!

This section introduced you to the basics of data visualisation using the ggplot2 package. You may need to consult the following materials for intermediate and advanced skills in data visualisation: