4. Data Manipulation#

4.1. Learning objectives#

  • Understand DataFrames

  • Manipulate, transform and analyse DataFrames

4.2. Introduction#

Dataframes, as we saw the previous section, is a tabular data format, with columns and rows. Columns are variable names and rows are individual observations. Just like other data structures dicussed in previous chapters, a DataFrame handles different data types:

  • int64: whole numbers (e.g age = 30)

  • float64; numbers with decimals (e.g. height = 1.9, temperature = 33.0)

  • datetime64: date and time

  • category: catogories or factors (e.g. gender = male, female; race = black, white, …)

  • bool: True or False

  • object: string

According to Hadley Wickman, a DataFrame should be in a tidy format:

  1. Each variable is a column; each column is a variable.

  2. Each observation is row; each row is an observation.

  3. Each value is a cell; each cell is a single value.

The pandas package provide a toolbox of working analysing, querying and transforming DataFrames. We will continue working with the `gapminder dataset. First load pandas and read the data, saving it as “gapminder”:

import pandas as pd

gapminder = pd.read_csv("https://raw.githubusercontent.com/aubreympungose/data-science-course/main/weeks/data/gapminder.csv")

Let’s examine the data by looking at few observations:

gapminder.head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

You can check how many variables are in the data and what type of data:

gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

You can see that gapminder has 6 columns and 1704 rows. The columns in the dataset are:

  • country: Simply the country

  • continent: Continent

  • year: The year data was collected

  • lifeExp: Life expectancy of a country in year

  • pop: Population of the country in a year

  • gdpPercap: Gross Domestic Product of a country in a year

You can also print the columns in the dataset by using .columns attribute.

gapminder.columns
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

You can get the number of rows by using .index attribute:

gapminder.index
RangeIndex(start=0, stop=1704, step=1)

4.3. MANIPULATING COLUMNS#

4.3.1. Access/Select columns#

We may be interested in some of the columns in the gampminder dataframe, let say, for example, “country” only. You can select a specific column by taking the name of the DataFrame, followed by square brackets ([]) and passing the name of the “column” inside square brackets:

gapminder["country"]
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

To Access multiple, columns you follow the same method as above, except that you wrap column names list ([]), so you end up with double sqaure brackets. Let’s say we want to select “country”, “pop”, and “year” columns:

gapminder[["country", "pop", "year"]]
country pop year
0 Afghanistan 8425333 1952
1 Afghanistan 9240934 1957
2 Afghanistan 10267083 1962
3 Afghanistan 11537966 1967
4 Afghanistan 13079460 1972
... ... ... ...
1699 Zimbabwe 9216418 1987
1700 Zimbabwe 10704340 1992
1701 Zimbabwe 11404948 1997
1702 Zimbabwe 11926563 2002
1703 Zimbabwe 12311143 2007

1704 rows × 3 columns

You can alse access columns by their index using .iloc() function. Remember that an index in Python starts at 0, for example, to access the first column

gapminder.iloc[:, 0]
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In the above code, the colon (:) specify that we want to retun all rows; you will more abouth this when selecting by both columns and rows.

To access multiple columns, you pass a list of indexes of columns. For example, select the 1st, 3rd and 5th columns:

gapminder.iloc[:, [0, 2, 4]]
country year pop
0 Afghanistan 1952 8425333
1 Afghanistan 1957 9240934
2 Afghanistan 1962 10267083
3 Afghanistan 1967 11537966
4 Afghanistan 1972 13079460
... ... ... ...
1699 Zimbabwe 1987 9216418
1700 Zimbabwe 1992 10704340
1701 Zimbabwe 1997 11404948
1702 Zimbabwe 2002 11926563
1703 Zimbabwe 2007 12311143

1704 rows × 3 columns

Suppose you want access a range of of columns, for example, from the 1st to the 3rd column:

gapminder.iloc[:, 0:3]
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
... ... ... ...
1699 Zimbabwe Africa 1987
1700 Zimbabwe Africa 1992
1701 Zimbabwe Africa 1997
1702 Zimbabwe Africa 2002
1703 Zimbabwe Africa 2007

1704 rows × 3 columns

4.3.2. Rename columns#

You may need to change column names. For example, in the gapminder dataset, we may change the “pop” column to population. You will use the .rename() function, pass the column argument, and specify column names within the dictionary ({"old_name":"new_name"}). We will assign the results to the new DataFrame named gapminder_new:

gapminder_new = gapminder.rename(columns = {"pop":"population"})

gapminder_new.head()
country continent year lifeExp population gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

The “pop” column has been changed to “population”.

You can also rename multiple columns; in this case, rename “lifeExp” and “gdpPercap”:

gapminder_new = gapminder.rename(columns = {"lifeExp":"life_expectancy", "gdpPercap":"gdp_pc"})

gapminder_new.head()
country continent year life_expectancy pop gdp_pc
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

4.3.3. Making new columns, Removing columns#

To create a new column you simply take the name of the dataframe, pass the string of the new columnn inside sqaure bracket, and assign values:

gapminder["new_column"] = 20

gapminder.head()
country continent year lifeExp pop gdpPercap new_column
0 Afghanistan Asia 1952 28.801 8425333 779.445314 20
1 Afghanistan Asia 1957 30.332 9240934 820.853030 20
2 Afghanistan Asia 1962 31.997 10267083 853.100710 20
3 Afghanistan Asia 1967 34.020 11537966 836.197138 20
4 Afghanistan Asia 1972 36.088 13079460 739.981106 20

There is a new column added that has values of 20.

In many cases, we may need to create a new column resulting from manipulating other columns. For example, we know that the GDP per capita is calculated from dividing the total GDP by a population. If we want to create a new column of the total GDP, we need to multiply GDP per capita by the population size of the country:

gapminder["total_gdp"] = gapminder["gdpPercap"] * gapminder["pop"]

gapminder.head()
country continent year lifeExp pop gdpPercap new_column total_gdp
0 Afghanistan Asia 1952 28.801 8425333 779.445314 20 6.567086e+09
1 Afghanistan Asia 1957 30.332 9240934 820.853030 20 7.585449e+09
2 Afghanistan Asia 1962 31.997 10267083 853.100710 20 8.758856e+09
3 Afghanistan Asia 1967 34.020 11537966 836.197138 20 9.648014e+09
4 Afghanistan Asia 1972 36.088 13079460 739.981106 20 9.678553e+09

We have a new column (“total_gdp”) added to the gapminder DataFrame.

You can remove columns by using `.drop() function:

gapminder = gapminder.drop(["total_gdp", "new_column"], axis = 1)

gapminder.head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

4.4. Manipulate Rows#

You can rows by any column by using .sort_values() function:

gapminder.sort_values(["lifeExp"])
country continent year lifeExp pop gdpPercap
1292 Rwanda Africa 1992 23.599 7290203 737.068595
0 Afghanistan Asia 1952 28.801 8425333 779.445314
552 Gambia Africa 1952 30.000 284320 485.230659
36 Angola Africa 1952 30.015 4232095 3520.610273
1344 Sierra Leone Africa 1952 30.331 2143249 879.787736
... ... ... ... ... ... ...
1487 Switzerland Europe 2007 81.701 7554661 37506.419070
695 Iceland Europe 2007 81.757 301931 36180.789190
802 Japan Asia 2002 82.000 127065841 28604.591900
671 Hong Kong, China Asia 2007 82.208 6980412 39724.978670
803 Japan Asia 2007 82.603 127467972 31656.068060

1704 rows × 6 columns

You can see that Rwanda had the lowest life expectancy.

To sort values into descending order (highest to lowest), you pass ascending = False:

gapminder.sort_values(["pop"], ascending = False)
country continent year lifeExp pop gdpPercap
299 China Asia 2007 72.961 1318683096 4959.114854
298 China Asia 2002 72.028 1280400000 3119.280896
297 China Asia 1997 70.426 1230075000 2289.234136
296 China Asia 1992 68.690 1164970000 1655.784158
707 India Asia 2007 64.698 1110396331 2452.210407
... ... ... ... ... ... ...
1299 Sao Tome and Principe Africa 1967 54.425 70787 1384.840593
1298 Sao Tome and Principe Africa 1962 51.893 65345 1071.551119
420 Djibouti Africa 1952 34.812 63149 2669.529475
1297 Sao Tome and Principe Africa 1957 48.945 61325 860.736903
1296 Sao Tome and Principe Africa 1952 46.471 60011 879.583586

1704 rows × 6 columns

You can see that China had the highest population.

You may be interested in observations/rows that meet certain conditions. For example, in the gapminder data, you may need to keel all observations from the African continent, and save as a new DataFrame called gapminder_africa:

gapminder_africa = gapminder[gapminder["continent"] ==  "Africa"]

gapminder_africa.head()
country continent year lifeExp pop gdpPercap
24 Algeria Africa 1952 43.077 9279525 2449.008185
25 Algeria Africa 1957 45.685 10270856 3013.976023
26 Algeria Africa 1962 48.303 11000948 2550.816880
27 Algeria Africa 1967 51.407 12760499 3246.991771
28 Algeria Africa 1972 54.518 14760787 4182.663766

Filter all the observations that are in the year 2007 and ave it as `gapminder_2007:

gapminder_2007 = gapminder[gapminder["year"] == 2007]

gapminder_2007.head()
country continent year lifeExp pop gdpPercap
11 Afghanistan Asia 2007 43.828 31889923 974.580338
23 Albania Europe 2007 76.423 3600523 5937.029526
35 Algeria Africa 2007 72.301 33333216 6223.367465
47 Angola Africa 2007 42.731 12420476 4797.231267
59 Argentina Americas 2007 75.320 40301927 12779.379640

when filtering rows, you are using boolean operators:

  • ==: equal to

  • !=: not equal to

  • >: greater than

  • <: less than

  • >=: greater than or equal to

  • <=: less than or equal to

You can also filter rows by combining 2 conditions. Let us say we want:

  • countries in Africa

  • the year 2007

and save as africa_2007:

africa_2007 = gapminder[(gapminder["continent"] == "Africa")  & (gapminder["year"] == 2007)]
africa_2007.head()
country continent year lifeExp pop gdpPercap
35 Algeria Africa 2007 72.301 33333216 6223.367465
47 Angola Africa 2007 42.731 12420476 4797.231267
131 Benin Africa 2007 56.728 8078314 1441.284873
167 Botswana Africa 2007 50.728 1639131 12569.851770
203 Burkina Faso Africa 2007 52.295 14326203 1217.032994

Sometimes you may need to filter multiple rows, let say, more than 5 countries. It would be inconvinient to type country == "..." five times. This is where the .isin() function is useful; you pass the list of observations you want. We will work with with observations from the year 2007.

gapminder_2007 = gapminder[gapminder["year"] == 2007]

countries = gapminder_2007[gapminder_2007["country"].isin(["China", "Brazil", "India", "South Africa"])]

countries.head()
country continent year lifeExp pop gdpPercap
179 Brazil Americas 2007 72.390 190010647 9065.800825
299 China Asia 2007 72.961 1318683096 4959.114854
707 India Asia 2007 64.698 1110396331 2452.210407
1415 South Africa Africa 2007 49.339 43997828 9269.657808

You can also filter rows by using their index positions, let say, the 1st and 3rd rows:

gapminder.iloc[[0, 4]]
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
4 Afghanistan Asia 1972 36.088 13079460 739.981106

You can also filter a range of rows For example, we may select top 10 countries with the highes population in 2007. So we will use the gapminder_2007 DataFrame. You will need to arrange by population first

gapminder_2007 = gapminder_2007.sort_values(["pop"], ascending = False)
gapminder_2007.head()
country continent year lifeExp pop gdpPercap
299 China Asia 2007 72.961 1318683096 4959.114854
707 India Asia 2007 64.698 1110396331 2452.210407
1619 United States Americas 2007 78.242 301139947 42951.653090
719 Indonesia Asia 2007 70.650 223547000 3540.651564
179 Brazil Americas 2007 72.390 190010647 9065.800825

You can see that observations are sorted in descending order from highest popultion to the lowest; now filter:

gapminder_2007.iloc[0:10]
country continent year lifeExp pop gdpPercap
299 China Asia 2007 72.961 1318683096 4959.114854
707 India Asia 2007 64.698 1110396331 2452.210407
1619 United States Americas 2007 78.242 301139947 42951.653090
719 Indonesia Asia 2007 70.650 223547000 3540.651564
179 Brazil Americas 2007 72.390 190010647 9065.800825
1175 Pakistan Asia 2007 65.483 169270617 2605.947580
107 Bangladesh Asia 2007 64.062 150448339 1391.253792
1139 Nigeria Africa 2007 46.859 135031164 2013.977305
803 Japan Asia 2007 82.603 127467972 31656.068060
995 Mexico Americas 2007 76.195 108700891 11977.574960

Another way of filtering rows is to use the .query() method. This is how you would filter observations from the African continent:

gapminder_africa = gapminder_2007.query("continent == 'Africa'")
gapminder_africa.head()
country continent year lifeExp pop gdpPercap
1139 Nigeria Africa 2007 46.859 135031164 2013.977305
467 Egypt Africa 2007 71.338 80264543 5581.180998
515 Ethiopia Africa 2007 52.947 76511887 690.805576
335 Congo, Dem. Rep. Africa 2007 46.462 64606759 277.551859
1415 South Africa Africa 2007 49.339 43997828 9269.657808

Filter observations where life expectancy is equal to or greater 80 and the year is 2007; assign the results to high_life_expect:

high_life_expect = gapminder.query("lifeExp >= 80 and year == 2007")

high_life_expect.head()
country continent year lifeExp pop gdpPercap
71 Australia Oceania 2007 81.235 20434176 34435.36744
251 Canada Americas 2007 80.653 33390141 36319.23501
539 France Europe 2007 80.657 61083916 30470.01670
671 Hong Kong, China Asia 2007 82.208 6980412 39724.97867
695 Iceland Europe 2007 81.757 301931 36180.78919

4.5. Summary statistics and aggregating#

One of the important task in data science is to explore the data using summary statistics. There are various methods used to summarise data, including:

Summary stat

pandas function

Mean/average

.mean()

Median/mid-point

.median()

Sum of values

.sum()

Minimum

.min()

Maximum

.max()

Standard deviation: spread/dispersion of data

.std()

Count or total number of observations

.count()

Let explore these functions. What is the average/mean life expectancy of all countries?

gapminder["lifeExp"].mean()
59.47443936619714

What is the median GDP per capita?

gapminder["gdpPercap"].median()
3531.8469885

Find the minimum population:

gapminder["pop"].min()
60011

You can experiment with other functions listed in the above table.

4.5.1. Grouping#

In many cases, you may need to get summary statistics by a category or group. In above examples, although it is important to get the mean population of the world, but this tells us little about variaion in different regions. pandas has .groupby() function that group observations into categories. For example, you may be interested on finding the average the GDP per capita in each continent. Here you would use the .groupby() function, passing the “continent” column inside and follow by the summary statistics function:

gapminder.groupby("continent")["gdpPercap"].mean()
continent
Africa       2193.754578
Americas     7136.110356
Asia         7902.150428
Europe      14469.475533
Oceania     18621.609223
Name: gdpPercap, dtype: float64

The result show mean wealth (GDP per capita) by a continent. Wouldn’t be nice to sort these statistics from highest to lowest?

gapminder.groupby("continent")["gdpPercap"].mean().sort_values(ascending = False)
continent
Oceania     18621.609223
Europe      14469.475533
Asia         7902.150428
Americas     7136.110356
Africa       2193.754578
Name: gdpPercap, dtype: float64

Now We can see that Oceania has the highest average wealth and Africa has the lowest.

Sometimes you may need to find multiple summary statistics: mean, median, standard deviation, minimum,and maximum at the same time. .agg() function is able to do this:

gapminder.groupby("continent")["gdpPercap"].agg(["mean", "median", "std", "min", "max"])
mean median std min max
continent
Africa 2193.754578 1192.138217 2827.929863 241.165876 21951.21176
Americas 7136.110356 5465.509853 6396.764112 1201.637154 42951.65309
Asia 7902.150428 2646.786844 14045.373112 331.000000 113523.13290
Europe 14469.475533 12081.749115 9355.213498 973.533195 49357.19017
Oceania 18621.609223 17983.303955 6358.983321 10039.595640 34435.36744

What if you want to find a summary stat of multiple columns? For example, find the mean pf these columns per continent: life expectancy, population, GDP per capita:

gapminder.groupby("continent").agg(
    mean_life_expect = ("lifeExp", "mean"),
    mean_pop = ("pop", "mean"),
    mean_gdp_pc = ("gdpPercap", "mean")
)
mean_life_expect mean_pop mean_gdp_pc
continent
Africa 48.865330 9.916003e+06 2193.754578
Americas 64.658737 2.450479e+07 7136.110356
Asia 60.064903 7.703872e+07 7902.150428
Europe 71.903686 1.716976e+07 14469.475533
Oceania 74.326208 8.874672e+06 18621.609223

4.6. Conclusion#

In this chapter you have learned about different ways of manipulating DataFrames. Next, we go to other pandas methods.