4 Data Wrangling and Manipulation

4.1 PART I

Learning objectives

  • Understand dataframes

  • Manipulate, transform and analyse dataframes

4.1.1 Introduction

Dataframes, as we saw the previous section, is a tabular data format, with columns and rows. Columns are variable names and rows are individual observations.

Let us a basic data frame with three columns: country, life expectancy, and population of Southern African countries. First, we create individual vectors:

# Create a character vector
country_name <- c("Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe")

country_name
 [1] "Angola"       "Botswana"     "Lesotho"      "Malawi"       "Mozambique"  
 [6] "Namibia"      "South Africa" "Swaziland"    "Zambia"       "Zimbabwe"    
## Create a numeric vector

life_expectancy <- c(61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3)

life_expectancy
 [1] 61.6 61.1 57.1 53.1 62.9 59.3 59.3 62.3 61.2 59.3
## create another numeric vector

population <- c(34500000, 2590000, 228000, 19890000, 32080000,  2530000, 59390000, 1190000, 19470000, 1599000)

Because we have 3 vectors of equal length, we can create a dataframe, using a data.frame() function:

southern_africa_df <- data.frame(country_name, life_expectancy, population) #combine two vectors to create a dataframe

southern_africa_df ## print the dataframe
   country_name life_expectancy population
1        Angola            61.6   34500000
2      Botswana            61.1    2590000
3       Lesotho            57.1     228000
4        Malawi            53.1   19890000
5    Mozambique            62.9   32080000
6       Namibia            59.3    2530000
7  South Africa            59.3   59390000
8     Swaziland            62.3    1190000
9        Zambia            61.2   19470000
10     Zimbabwe            59.3    1599000

A dataframe is basically made of vectors (columns). In the southern_africa_df dataframe, we have 3 columns: country_name, life_expectancy population. We can extract these elements using the $ sign:

southern_africa_df$country_name
 [1] "Angola"       "Botswana"     "Lesotho"      "Malawi"       "Mozambique"  
 [6] "Namibia"      "South Africa" "Swaziland"    "Zambia"       "Zimbabwe"    
southern_africa_df$life_expectancy
 [1] 61.6 61.1 57.1 53.1 62.9 59.3 59.3 62.3 61.2 59.3
southern_africa_df$population
 [1] 34500000  2590000   228000 19890000 32080000  2530000 59390000  1190000
 [9] 19470000  1599000

4.1.3 Using dplyr

We will use the dplyr package to manipulate data. dplyr is part of tidyverse meta-package and is mostly used to clean, manipulate and transform dataframes. dplyr has 5 main verbs/functions:

  • select(): for selecting particular columns in a dataframe

  • filter(): for sub-setting rows based on the specified criteria

  • arrange(): sort rows/observations based (ascending/descending)

  • summarise(): for summary statistics

  • mutate(): create new columns based on the manipulation of the existing columns.

  • group_by(): get aggregate statistics by a category

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

You can also load dplyr by loading the tidyverse package; this will load all the packages in in `tidyverse at once. This is useful, for example, when you want to clean and visualise data:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ stringr 1.5.0
✔ tidyr   1.3.0     ✔ forcats 1.0.0
✔ readr   2.1.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Note that you will net to install tidyverse if you did not install it from the previous section:

Warning: package 'tidyverse' is in use and will not be installed

Let us load the gapminder dataset that we used from the previous week. If you did not install, you will need to do so.

library(gapminder)

Than assign it to the object, gapminder, using the <- operator:

gapminder <- gapminder

Write the following code to change the country and continent from factor to character, I will explain this in the next chapter:

gapminder$country <- as.character(gapminder$country)

gapminder$continent <- as.character(gapminder$continent)

Now you have an object, which is a dataframe, called gapminder, you can see it in the Environment pane in RStudio. It shows that it has ncol(gapminder) columns/variables and nrow(gapminder) observations/rows

gapminder dataframe in Environment pane

You can use the str() fucntion to get the structure of the dataframe:

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

We have 6 variables:

  • “country”

  • “continent”

  • “year”

  • “lifeExp”: life expectancy of a country

  • “pop”: population

  • “gdpPercap”: GDP per capita

You can get a sample of the first observations/rows using head() function:

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

4.1.4 select() function

We may be interested in some of the columns in the gampminder dataframe, let say, for example, “country” and “pop” variables and discard others; we use the select() function:

library(dplyr)
select(gapminder, "country", "pop")
# A tibble: 1,704 × 2
   country          pop
   <chr>          <int>
 1 Afghanistan  8425333
 2 Afghanistan  9240934
 3 Afghanistan 10267083
 4 Afghanistan 11537966
 5 Afghanistan 13079460
 6 Afghanistan 14880372
 7 Afghanistan 12881816
 8 Afghanistan 13867957
 9 Afghanistan 16317921
10 Afghanistan 22227415
# ℹ 1,694 more rows

R return only the country and population variables. select() function only return the variables/columns we are interested in:

The pipe operator (|>)

The best way to work with R, esepcially tidyverse, is using the pipe (|>) operator. The pipe connect various tasks, for example:

gapminder |>
  select(country, pop)
# A tibble: 1,704 × 2
   country          pop
   <chr>          <int>
 1 Afghanistan  8425333
 2 Afghanistan  9240934
 3 Afghanistan 10267083
 4 Afghanistan 11537966
 5 Afghanistan 13079460
 6 Afghanistan 14880372
 7 Afghanistan 12881816
 8 Afghanistan 13867957
 9 Afghanistan 16317921
10 Afghanistan 22227415
# ℹ 1,694 more rows

The pipe chains your operations. It says:

  • Take gapminder dataframe

  • Then using the select() function to select only pop and country columns

You will that pipe becomes important when having multiple chains of functions at once.

You will notice that in other sources: web, books, etc. have %>% instead of |>. It works the same way. The |> is new. %>% comes from mgttr, an external package, so you needed to load it first before using it. The native pipe |> is built in R, so it there is no need to load it. So, we will stick with the |>.

In your RStudio, follow these instructions:

  • Go to ‘Tools’ and click on ‘Global Options’

  • Navigate to ‘Code’ on the left pane

  • Then select the fourth option, “use native pipe operator, |>”

  • Click ‘Apply’

You can use the short cut Ctrl + Shift + M If you using Windows to insert the pipe instead of typing it. If you are using Mac, use Cmd + Shift + M.

4.1.4 Return to select() function

As I have said, select() only return the columns specified

gapminder %>% 
  select(country, pop)
# A tibble: 1,704 × 2
   country          pop
   <chr>          <int>
 1 Afghanistan  8425333
 2 Afghanistan  9240934
 3 Afghanistan 10267083
 4 Afghanistan 11537966
 5 Afghanistan 13079460
 6 Afghanistan 14880372
 7 Afghanistan 12881816
 8 Afghanistan 13867957
 9 Afghanistan 16317921
10 Afghanistan 22227415
# ℹ 1,694 more rows

Also, notice that we selected “country’ and”pop” columns but did not save them and assigned to an object. To store the results into an object, we have to use assignment operator (<-):

gapminder_selected <- gapminder |>
  select(country, pop)

gapminder_selected # print
# A tibble: 1,704 × 2
   country          pop
   <chr>          <int>
 1 Afghanistan  8425333
 2 Afghanistan  9240934
 3 Afghanistan 10267083
 4 Afghanistan 11537966
 5 Afghanistan 13079460
 6 Afghanistan 14880372
 7 Afghanistan 12881816
 8 Afghanistan 13867957
 9 Afghanistan 16317921
10 Afghanistan 22227415
# ℹ 1,694 more rows

Now we have saved results as gapminder_selected, and we can do other analysis with this object because we have saved it. So, in a nutshell, when you run a code without assigning results into an object, using the <- operator, results will not be saved but will be printed.

You can also select columns based by index/position. Let us select only 1st and 3rd coulmns

gapminder |> 
  select(1, 3)
# A tibble: 1,704 × 2
   country      year
   <chr>       <int>
 1 Afghanistan  1952
 2 Afghanistan  1957
 3 Afghanistan  1962
 4 Afghanistan  1967
 5 Afghanistan  1972
 6 Afghanistan  1977
 7 Afghanistan  1982
 8 Afghanistan  1987
 9 Afghanistan  1992
10 Afghanistan  1997
# ℹ 1,694 more rows

Or you can select a range of columns

gapminder |> 
  select(country:year)
# A tibble: 1,704 × 3
   country     continent  year
   <chr>       <chr>     <int>
 1 Afghanistan Asia       1952
 2 Afghanistan Asia       1957
 3 Afghanistan Asia       1962
 4 Afghanistan Asia       1967
 5 Afghanistan Asia       1972
 6 Afghanistan Asia       1977
 7 Afghanistan Asia       1982
 8 Afghanistan Asia       1987
 9 Afghanistan Asia       1992
10 Afghanistan Asia       1997
# ℹ 1,694 more rows

Here, we have select columns from country to year only.

You can also exclude certain columns by simply adding the minus (-) sign; let say we want to exclude the gdpPercap from the dataframe

gapminder |> 
  select(-gdpPercap)
# A tibble: 1,704 × 5
   country     continent  year lifeExp      pop
   <chr>       <chr>     <int>   <dbl>    <int>
 1 Afghanistan Asia       1952    28.8  8425333
 2 Afghanistan Asia       1957    30.3  9240934
 3 Afghanistan Asia       1962    32.0 10267083
 4 Afghanistan Asia       1967    34.0 11537966
 5 Afghanistan Asia       1972    36.1 13079460
 6 Afghanistan Asia       1977    38.4 14880372
 7 Afghanistan Asia       1982    39.9 12881816
 8 Afghanistan Asia       1987    40.8 13867957
 9 Afghanistan Asia       1992    41.7 16317921
10 Afghanistan Asia       1997    41.8 22227415
# ℹ 1,694 more rows

4.1.5 filter() function

You may be interested in observations/rows that meet certain conditions. We use filter() function to return the list that meet our conditions. filter() has several operators:

  • ==: equal to

  • !=: not equal to

  • >: greater than

  • <: less than

  • >=: greater than or equal to

  • <=: less than or equal to

Let us filter based on the continent column. How many continent are there? Create a basic table first

table(gapminder$continent)

  Africa Americas     Asia   Europe  Oceania 
     624      300      396      360       24 

There are five continents. Let us filter only observations from the African continent:

africa_gapminder <- gapminder |> 
  filter(continent == "Africa")

africa_gapminder  # print 
# A tibble: 624 × 6
   country continent  year lifeExp      pop gdpPercap
   <chr>   <chr>     <int>   <dbl>    <int>     <dbl>
 1 Algeria Africa     1952    43.1  9279525     2449.
 2 Algeria Africa     1957    45.7 10270856     3014.
 3 Algeria Africa     1962    48.3 11000948     2551.
 4 Algeria Africa     1967    51.4 12760499     3247.
 5 Algeria Africa     1972    54.5 14760787     4183.
 6 Algeria Africa     1977    58.0 17152804     4910.
 7 Algeria Africa     1982    61.4 20033753     5745.
 8 Algeria Africa     1987    65.8 23254956     5681.
 9 Algeria Africa     1992    67.7 26298373     5023.
10 Algeria Africa     1997    69.2 29072015     4797.
# ℹ 614 more rows

africa_gapminder object is a dataframe we have just created, with only countries from the African continent. What filter() does is to return observations/rows that only meet our conditions:

Let us filter a numeric column; for example, we may want to filter countries with population of above 50 million and assign results to an object named high_pop:

high_pop <- gapminder |> 
  filter(pop > 50000000)

high_pop ## print
# A tibble: 190 × 6
   country    continent  year lifeExp       pop gdpPercap
   <chr>      <chr>     <int>   <dbl>     <int>     <dbl>
 1 Bangladesh Asia       1957    39.3  51365468      662.
 2 Bangladesh Asia       1962    41.2  56839289      686.
 3 Bangladesh Asia       1967    43.5  62821884      721.
 4 Bangladesh Asia       1972    45.3  70759295      630.
 5 Bangladesh Asia       1977    46.9  80428306      660.
 6 Bangladesh Asia       1982    50.0  93074406      677.
 7 Bangladesh Asia       1987    52.8 103764241      752.
 8 Bangladesh Asia       1992    56.0 113704579      838.
 9 Bangladesh Asia       1997    59.4 123315288      973.
10 Bangladesh Asia       2002    62.0 135656790     1136.
# ℹ 180 more rows

You can also combine 2 conditions. Let us say we want:

  • countries in Africa

  • the year 2007

africa_2007 <- gapminder |> 
  filter(continent == "Africa", year == 2007)

Filter rows from the Africa OR Asia

gapminder |> 
  filter(continent == "Africa" | continent == "Asia")
# A tibble: 1,020 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,010 more rows

Sometimes you may need to filter multiple rows, let say, more than 5 countries. It would be inconvinient to type country == "..." five times. This is where the %in% operator is useful:

africa_2007 |> 
  filter(country %in% c("South Africa", "Zimbabwe", "Ghana", "Tanzania", "Tunisia"))
# A tibble: 5 × 6
  country      continent  year lifeExp      pop gdpPercap
  <chr>        <chr>     <int>   <dbl>    <int>     <dbl>
1 Ghana        Africa     2007    60.0 22873338     1328.
2 South Africa Africa     2007    49.3 43997828     9270.
3 Tanzania     Africa     2007    52.5 38139640     1107.
4 Tunisia      Africa     2007    73.9 10276158     7093.
5 Zimbabwe     Africa     2007    43.5 12311143      470.

4.1.6 arrange() function

arrange() function is self-explanatory: it sort values into ascending or descending order

africa_2007 |> 
  arrange(lifeExp) |> 
  head(n = 10) ## show the first 10 observations
# A tibble: 10 × 6
   country                  continent  year lifeExp      pop gdpPercap
   <chr>                    <chr>     <int>   <dbl>    <int>     <dbl>
 1 Swaziland                Africa     2007    39.6  1133066     4513.
 2 Mozambique               Africa     2007    42.1 19951656      824.
 3 Zambia                   Africa     2007    42.4 11746035     1271.
 4 Sierra Leone             Africa     2007    42.6  6144562      863.
 5 Lesotho                  Africa     2007    42.6  2012649     1569.
 6 Angola                   Africa     2007    42.7 12420476     4797.
 7 Zimbabwe                 Africa     2007    43.5 12311143      470.
 8 Central African Republic Africa     2007    44.7  4369038      706.
 9 Liberia                  Africa     2007    45.7  3193942      415.
10 Rwanda                   Africa     2007    46.2  8860588      863.

The default sorting is ascending order; you can sort into descending order by using desc() function within arrange():

africa_2007 |> 
  arrange(desc(lifeExp)) |> 
  head(n = 10) ## show the first 10 observations
# A tibble: 10 × 6
   country               continent  year lifeExp      pop gdpPercap
   <chr>                 <chr>     <int>   <dbl>    <int>     <dbl>
 1 Reunion               Africa     2007    76.4   798094     7670.
 2 Libya                 Africa     2007    74.0  6036914    12057.
 3 Tunisia               Africa     2007    73.9 10276158     7093.
 4 Mauritius             Africa     2007    72.8  1250882    10957.
 5 Algeria               Africa     2007    72.3 33333216     6223.
 6 Egypt                 Africa     2007    71.3 80264543     5581.
 7 Morocco               Africa     2007    71.2 33757175     3820.
 8 Sao Tome and Principe Africa     2007    65.5   199579     1598.
 9 Comoros               Africa     2007    65.2   710960      986.
10 Mauritania            Africa     2007    64.2  3270065     1803.

4.1.7 summarise()

We use the summarise() function to get the summary statistics. Common statistics include mean, median, standard deviation, minimum, maximum, etc.

gapminder |> 
  summarise(mean_life_exp = mean(lifeExp))
# A tibble: 1 × 1
  mean_life_exp
          <dbl>
1          59.5

Standard deviation:

gapminder |> 
  summarise(sd_life_exp = sd(lifeExp))
# A tibble: 1 × 1
  sd_life_exp
        <dbl>
1        12.9

You can get various summary statistics by combing multiple arguments within summarise():

gapminder |> 
  summarise(mean_life_exp = mean(lifeExp),   
            sd_life_exp = sd(lifeExp),
            min_life_exp = min(lifeExp),
            max_life_exp = max(lifeExp))
# A tibble: 1 × 4
  mean_life_exp sd_life_exp min_life_exp max_life_exp
          <dbl>       <dbl>        <dbl>        <dbl>
1          59.5        12.9         23.6         82.6

Magic!

4.1.8 Aggregate Statisitcs with group_by()

In many instances we are interested in aggregate statistics, that is, summary statistics by a category. Say you want to group rows by the continent column then get the use summarise() to get average GDP per capita for each continent. This is where the group() function is useful:

## First create a dataframe of the year 2007 only

gapminder_2007 <- gapminder |> 
  filter(year == 2007)

## Then use group_by()

gapminder_2007 |> 
  group_by(continent) |> 
  summarise(gdp_continent = mean(gdpPercap)) |> 
  arrange(desc(gdp_continent))  ## arrange by gdp_continent into descending
# A tibble: 5 × 2
  continent gdp_continent
  <chr>             <dbl>
1 Oceania          29810.
2 Europe           25054.
3 Asia             12473.
4 Americas         11003.
5 Africa            3089.

4.1.9 Create columns with mutate()

mutate create new columns based on the existing columns. In most cases we want to transform the data. For example, we know that the GDP per capita is calculated from dividing the total GDP by a population. We may want to create a new column named gdp_total with a formula (gdp_total = gdpPercap * pop). We can use mutate():

gapminder_2007 <- gapminder_2007 |> 
  mutate(gdp_total = gdpPercap * pop)

head(gapminder_2007)
# A tibble: 6 × 7
  country     continent  year lifeExp      pop gdpPercap     gdp_total
  <chr>       <chr>     <int>   <dbl>    <int>     <dbl>         <dbl>
1 Afghanistan Asia       2007    43.8 31889923      975.  31079291949.
2 Albania     Europe     2007    76.4  3600523     5937.  21376411360.
3 Algeria     Africa     2007    72.3 33333216     6223. 207444851958.
4 Angola      Africa     2007    42.7 12420476     4797.  59583895818.
5 Argentina   Americas   2007    75.3 40301927    12779. 515033625357.
6 Australia   Oceania    2007    81.2 20434176    34435. 703658358894.

You can see that the new column named gdp_total has been added.

You can also combine multiple arguments within mutate():

gapminder_2007 <- gapminder_2007 |> 
  mutate(gdp_total = gdpPercap * pop,
         life_exp_squared = lifeExp ** 2)

Wraping all up

The pipe operator is very useful. You can chain multiple operations into a single code. Let say we want to do the following:

  • Find country called South Africa

  • find average life expectancy by year

  • Arrange by year into a descending order

gapminder |> 
  filter(country == "South Africa") |> 
  group_by(year) |> 
  summarise(life_exp = mean(lifeExp)) |> 
  arrange(desc(year))
# A tibble: 12 × 2
    year life_exp
   <int>    <dbl>
 1  2007     49.3
 2  2002     53.4
 3  1997     60.2
 4  1992     61.9
 5  1987     60.8
 6  1982     58.2
 7  1977     55.5
 8  1972     53.7
 9  1967     51.9
10  1962     50.0
11  1957     48.0
12  1952     45.0

You can take it even further by creating a plot. You know that BRICS countries are Brazil, Russia, India, China and South Africa. You want to find trends in life expectancy. Note that there is no data for Russia

gapminder |> 
  filter(country %in% c("Brazil", "Russia", "India", "China", "South Africa")) |> 
  select(country, year, lifeExp) |> 
  ggplot(aes(x = year, y = lifeExp, colour = country)) +
  geom_line(linewidth = 1.5) +
  theme_minimal() +
  scale_x_continuous(breaks = c(1950, 1960, 1980, 1990, 2000, 2007)) +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank()) +
  labs(title = "Trends in life expectancy of BRICS country",
       x = "",
       y = "Life expectancy (in years)",
       caption = "Data source: Gapminder")

Wait a minute: why did south Africa’s life expectancy started to decline rapidly from the 1990s? One plausible reason would be the impact of HIV/AIDS and lack of antiretrovirals (ARVs). Welcome to data analysis.

You can see above the efficiency of the pipe operator. Think of the pipe operator as:

OR

We have introduced to the basics of data manipulation in R using the dplyr package in R. While we have covered a lot, the path is long. However, once you master these basic functions, you will not struggle in the following sections. These resources may be helpful: