Data Manipulation

4. Data Manipulation

4.1. Learning objectives

Understand DataFrames
Manipulate, transform and analyse DataFrames

4.2. Introduction

A DataFrame, as we saw in the previous section, is a tabular data format with columns and rows. Columns are variables and rows are individual observations. Just like the other data structures discussed in previous chapters, a DataFrame can hold different data types:

int64: whole numbers (e.g. age = 30)
float64: numbers with decimals (e.g. height = 1.9, temperature = 33.0)
datetime64: date and time
category: categories or factors (e.g. gender = male, female; race = black, white, …)
bool: True or False
object: strings (text)
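As a minimal sketch of the types listed above, the snippet below builds a small DataFrame with one column per data type and prints the dtype pandas assigns to each. The frame `people` and all of its values are made up for illustration and are not part of the gapminder data; the exact dtype shown for text and datetime columns can vary with the pandas version.

```python
import pandas as pd

# Hypothetical toy data: one column for each data type discussed above
people = pd.DataFrame({
    "age": [30, 25, 41],                                   # whole numbers
    "height": [1.9, 1.7, 1.8],                             # numbers with decimals
    "birth_date": pd.to_datetime(["1994-01-15", "1999-06-30", "1983-11-02"]),
    "gender": pd.Categorical(["male", "female", "female"]),  # categories
    "employed": [True, False, True],                       # True or False
    "name": ["Sipho", "Anna", "Maria"],                    # text
})

# .dtypes lists the data type of every column
print(people.dtypes)
```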

According to Hadley Wickham, a DataFrame should be in a tidy format:

Each variable is a column; each column is a variable.

Each observation is a row; each row is an observation.

Each value is a cell; each cell is a single value.
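To make the three rules concrete, here is a small sketch that reshapes a "wide" table (one column per year) into a tidy one. It uses the pandas `.melt()` method, which is not covered in this chapter and is shown only as one possible way to do the reshaping. The life expectancy values are taken from the gapminder rows shown later in the chapter; the names `wide` and `tidy` are illustrative.

```python
import pandas as pd

# Wide layout: the variable "year" is spread across the column names
wide = pd.DataFrame({
    "country": ["Afghanistan", "Algeria"],
    "1952": [28.801, 43.077],
    "1957": [30.332, 45.685],
})

# Tidy layout: one column per variable, one row per country-year observation
tidy = wide.melt(id_vars="country", var_name="year", value_name="lifeExp")
tidy["year"] = tidy["year"].astype(int)  # the years came from column names, so they start as text

print(tidy)
```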

The pandas package provides a toolbox for analysing, querying and transforming DataFrames. We will continue working with the gapminder dataset. First load pandas and read the data, saving it as “gapminder”:

import pandas as pd

gapminder = pd.read_csv("https://raw.githubusercontent.com/aubreympungose/data-science-course/main/weeks/data/gapminder.csv")

Let’s examine the data by looking at the first few observations:

gapminder.head()

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

You can check how many variables are in the data and the data type of each one:

gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

You can see that gapminder has 6 columns and 1704 rows. The columns in the dataset are:

country: The name of the country
continent: The continent the country belongs to
year: The year the data was collected
lifeExp: Life expectancy in the country in that year
pop: Population of the country in that year
gdpPercap: Gross Domestic Product (GDP) per capita of the country in that year

You can also print the columns in the dataset by using the .columns attribute:

gapminder.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

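You can inspect the row index by using the .index attribute; here it runs from 0 up to (but not including) 1704, which tells you the DataFrame has 1704 rows: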

gapminder.index

RangeIndex(start=0, stop=1704, step=1)
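If you only need the counts, a short sketch like the one below may be more direct: `len()` of the index gives the number of rows, and the `.shape` attribute gives rows and columns together. The small frame `df` is a stand-in built from rows shown above, not the full gapminder data.

```python
import pandas as pd

# Small stand-in for gapminder (three rows copied from the output above)
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
})

n_rows = len(df.index)        # number of rows
shape = df.shape              # tuple of (number of rows, number of columns)

print(n_rows)
print(shape)
```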

4.3. Manipulating columns

4.3.1. Access/Select columns

We may be interested in only some of the columns in the gapminder DataFrame, for example “country”. You can select a specific column by writing the name of the DataFrame followed by square brackets ([]), with the name of the column inside the square brackets:

gapminder["country"]

     Afghanistan
     Afghanistan
     Afghanistan
     Afghanistan
     Afghanistan
           ...     
     Zimbabwe
     Zimbabwe
     Zimbabwe
     Zimbabwe
     Zimbabwe
Name: country, Length: 1704, dtype: object

To access multiple columns, you follow the same method as above, except that you wrap the column names in a list ([]), so you end up with double square brackets. Let’s say we want to select the “country”, “pop”, and “year” columns:

gapminder[["country", "pop", "year"]]

	country	pop	year
0	Afghanistan	8425333	1952
1	Afghanistan	9240934	1957
2	Afghanistan	10267083	1962
3	Afghanistan	11537966	1967
4	Afghanistan	13079460	1972
...	...	...	...
1699	Zimbabwe	9216418	1987
1700	Zimbabwe	10704340	1992
1701	Zimbabwe	11404948	1997
1702	Zimbabwe	11926563	2002
1703	Zimbabwe	12311143	2007

1704 rows × 3 columns

You can also access columns by their position using the .iloc indexer, which takes square brackets rather than parentheses. Remember that an index in Python starts at 0. For example, to access the first column:

gapminder.iloc[:, 0]

     Afghanistan
     Afghanistan
     Afghanistan
     Afghanistan
     Afghanistan
           ...     
     Zimbabwe
     Zimbabwe
     Zimbabwe
     Zimbabwe
     Zimbabwe
Name: country, Length: 1704, dtype: object

In the above code, the colon (:) specifies that we want to return all rows; you will learn more about this when selecting by both columns and rows.

To access multiple columns, you pass a list of indexes of columns. For example, select the 1st, 3rd and 5th columns:

gapminder.iloc[:, [0, 2, 4]]

	country	year	pop
0	Afghanistan	1952	8425333
1	Afghanistan	1957	9240934
2	Afghanistan	1962	10267083
3	Afghanistan	1967	11537966
4	Afghanistan	1972	13079460
...	...	...	...
1699	Zimbabwe	1987	9216418
1700	Zimbabwe	1992	10704340
1701	Zimbabwe	1997	11404948
1702	Zimbabwe	2002	11926563
1703	Zimbabwe	2007	12311143

1704 rows × 3 columns

Suppose you want to access a range of columns, for example, from the 1st to the 3rd column. The slice 0:3 starts at position 0 and stops before position 3:

gapminder.iloc[:, 0:3]

	country	continent	year
0	Afghanistan	Asia	1952
1	Afghanistan	Asia	1957
2	Afghanistan	Asia	1962
3	Afghanistan	Asia	1967
4	Afghanistan	Asia	1972
...	...	...	...
1699	Zimbabwe	Africa	1987
1700	Zimbabwe	Africa	1992
1701	Zimbabwe	Africa	1997
1702	Zimbabwe	Africa	2002
1703	Zimbabwe	Africa	2007

1704 rows × 3 columns
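The sketch below contrasts position-based slicing with .iloc against label-based slicing with the .loc indexer. .loc is not used elsewhere in this chapter and is shown here only for comparison: an .iloc slice excludes its stop position, while a .loc slice includes its stop label. The frame `df` is a small stand-in built from rows shown in this chapter.

```python
import pandas as pd

# Small stand-in for gapminder
df = pd.DataFrame({
    "country": ["Afghanistan", "Algeria"],
    "continent": ["Asia", "Africa"],
    "year": [1952, 1952],
    "lifeExp": [28.801, 43.077],
})

# Position-based: columns 0, 1 and 2 (the stop position 3 is excluded)
by_position = df.iloc[:, 0:3]

# Label-based: from "country" through "year" (the stop label is included)
by_label = df.loc[:, "country":"year"]

print(by_position.columns.tolist())
print(by_label.columns.tolist())
```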

4.3.2. Rename columns

You may need to change column names. For example, in the gapminder dataset, we may change the “pop” column to “population”. You use the .rename() method, pass the columns argument, and specify the column names in a dictionary ({"old_name":"new_name"}). We will assign the result to a new DataFrame named gapminder_new:

gapminder_new = gapminder.rename(columns = {"pop":"population"})

gapminder_new.head()

	country	continent	year	lifeExp	population	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

The “pop” column has been changed to “population”.

You can also rename multiple columns; in this case, rename “lifeExp” and “gdpPercap”:

gapminder_new = gapminder.rename(columns = {"lifeExp":"life_expectancy", "gdpPercap":"gdp_pc"})

gapminder_new.head()

	country	continent	year	life_expectancy	pop	gdp_pc
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106
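One detail worth checking for yourself: called as above, .rename() returns a new DataFrame and leaves the original untouched, which is why the result is assigned to a new name. A minimal sketch on a one-row stand-in frame (the names `df` and `renamed` are illustrative):

```python
import pandas as pd

# One-row stand-in for gapminder
df = pd.DataFrame({"country": ["Afghanistan"], "pop": [8425333]})

# rename() returns a renamed copy; df itself keeps its column names
renamed = df.rename(columns={"pop": "population"})

print(df.columns.tolist())
print(renamed.columns.tolist())
```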

4.3.3. Making new columns, Removing columns

To create a new column, you write the name of the DataFrame, put the name of the new column as a string inside square brackets, and assign values:

gapminder["new_column"] = 20

gapminder.head()

	country	continent	year	lifeExp	pop	gdpPercap	new_column
0	Afghanistan	Asia	1952	28.801	8425333	779.445314	20
1	Afghanistan	Asia	1957	30.332	9240934	820.853030	20
2	Afghanistan	Asia	1962	31.997	10267083	853.100710	20
3	Afghanistan	Asia	1967	34.020	11537966	836.197138	20
4	Afghanistan	Asia	1972	36.088	13079460	739.981106	20

There is a new column added that has values of 20.

In many cases, we may need to create a new column by manipulating other columns. For example, we know that GDP per capita is calculated by dividing the total GDP by the population. If we want to create a new column with the total GDP, we need to multiply GDP per capita by the population of the country:

gapminder["total_gdp"] = gapminder["gdpPercap"] * gapminder["pop"]

gapminder.head()

	country	continent	year	lifeExp	pop	gdpPercap	new_column	total_gdp
0	Afghanistan	Asia	1952	28.801	8425333	779.445314	20	6.567086e+09
1	Afghanistan	Asia	1957	30.332	9240934	820.853030	20	7.585449e+09
2	Afghanistan	Asia	1962	31.997	10267083	853.100710	20	8.758856e+09
3	Afghanistan	Asia	1967	34.020	11537966	836.197138	20	9.648014e+09
4	Afghanistan	Asia	1972	36.088	13079460	739.981106	20	9.678553e+09

We have a new column (“total_gdp”) added to the gapminder DataFrame.

You can remove columns by using the .drop() method; axis = 1 tells pandas that the names refer to columns rather than rows:

gapminder = gapminder.drop(["total_gdp", "new_column"], axis = 1)

gapminder.head()

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106
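As an alternative spelling, .drop() also accepts a columns keyword, which some readers find clearer than axis = 1. The sketch below, on a small stand-in frame, checks that both forms give the same result.

```python
import pandas as pd

# Small stand-in for gapminder with an extra column to remove
df = pd.DataFrame({
    "country": ["Afghanistan", "Algeria"],
    "pop": [8425333, 9279525],
    "new_column": [20, 20],
})

dropped_axis = df.drop(["new_column"], axis=1)     # form used in this chapter
dropped_kw = df.drop(columns=["new_column"])       # equivalent keyword form

print(dropped_kw.columns.tolist())
```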

4.4. Manipulate Rows

You can sort rows by any column using the .sort_values() method; by default the order is ascending (lowest to highest):

gapminder.sort_values(["lifeExp"])

	country	continent	year	lifeExp	pop	gdpPercap
1292	Rwanda	Africa	1992	23.599	7290203	737.068595
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
552	Gambia	Africa	1952	30.000	284320	485.230659
36	Angola	Africa	1952	30.015	4232095	3520.610273
1344	Sierra Leone	Africa	1952	30.331	2143249	879.787736
...	...	...	...	...	...	...
1487	Switzerland	Europe	2007	81.701	7554661	37506.419070
695	Iceland	Europe	2007	81.757	301931	36180.789190
802	Japan	Asia	2002	82.000	127065841	28604.591900
671	Hong Kong, China	Asia	2007	82.208	6980412	39724.978670
803	Japan	Asia	2007	82.603	127467972	31656.068060

1704 rows × 6 columns

You can see that Rwanda, in 1992, had the lowest life expectancy in the dataset.

To sort values into descending order (highest to lowest), you pass ascending = False:

gapminder.sort_values(["pop"], ascending = False)

	country	continent	year	lifeExp	pop	gdpPercap
299	China	Asia	2007	72.961	1318683096	4959.114854
298	China	Asia	2002	72.028	1280400000	3119.280896
297	China	Asia	1997	70.426	1230075000	2289.234136
296	China	Asia	1992	68.690	1164970000	1655.784158
707	India	Asia	2007	64.698	1110396331	2452.210407
...	...	...	...	...	...	...
1299	Sao Tome and Principe	Africa	1967	54.425	70787	1384.840593
1298	Sao Tome and Principe	Africa	1962	51.893	65345	1071.551119
420	Djibouti	Africa	1952	34.812	63149	2669.529475
1297	Sao Tome and Principe	Africa	1957	48.945	61325	860.736903
1296	Sao Tome and Principe	Africa	1952	46.471	60011	879.583586

1704 rows × 6 columns

You can see that China had the highest population.
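.sort_values() also accepts several columns at once, with one ascending flag per column. The sketch below sorts a five-row stand-in (2007 rows copied from this chapter) by continent from A to Z and, within each continent, by population from highest to lowest.

```python
import pandas as pd

# Five 2007 rows copied from the gapminder output in this chapter
df = pd.DataFrame({
    "country": ["China", "India", "Brazil", "South Africa", "Nigeria"],
    "continent": ["Asia", "Asia", "Americas", "Africa", "Africa"],
    "pop": [1318683096, 1110396331, 190010647, 43997828, 135031164],
})

# Continent ascending, then population descending within each continent
sorted_df = df.sort_values(["continent", "pop"], ascending=[True, False])

print(sorted_df)
```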

You may be interested in observations/rows that meet certain conditions. For example, in the gapminder data, you may need to keep all observations from the African continent and save them as a new DataFrame called gapminder_africa:

gapminder_africa = gapminder[gapminder["continent"] ==  "Africa"]

gapminder_africa.head()

	country	continent	year	lifeExp	pop	gdpPercap
24	Algeria	Africa	1952	43.077	9279525	2449.008185
25	Algeria	Africa	1957	45.685	10270856	3013.976023
26	Algeria	Africa	1962	48.303	11000948	2550.816880
27	Algeria	Africa	1967	51.407	12760499	3246.991771
28	Algeria	Africa	1972	54.518	14760787	4182.663766

Filter all the observations that are in the year 2007 and save the result as gapminder_2007:

gapminder_2007 = gapminder[gapminder["year"] == 2007]

gapminder_2007.head()

	country	continent	year	lifeExp	pop	gdpPercap
11	Afghanistan	Asia	2007	43.828	31889923	974.580338
23	Albania	Europe	2007	76.423	3600523	5937.029526
35	Algeria	Africa	2007	72.301	33333216	6223.367465
47	Angola	Africa	2007	42.731	12420476	4797.231267
59	Argentina	Americas	2007	75.320	40301927	12779.379640

When filtering rows, you use comparison operators, each of which returns True or False for every row:

==: equal to
!=: not equal to
>: greater than
<: less than
>=: greater than or equal to
<=: less than or equal to

You can also filter rows by combining 2 conditions. Let us say we want:

countries in Africa
the year 2007

and save as africa_2007:

africa_2007 = gapminder[(gapminder["continent"] == "Africa")  & (gapminder["year"] == 2007)]
africa_2007.head()

	country	continent	year	lifeExp	pop	gdpPercap
35	Algeria	Africa	2007	72.301	33333216	6223.367465
47	Angola	Africa	2007	42.731	12420476	4797.231267
131	Benin	Africa	2007	56.728	8078314	1441.284873
167	Botswana	Africa	2007	50.728	1639131	12569.851770
203	Burkina Faso	Africa	2007	52.295	14326203	1217.032994
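The & operator above keeps rows where both conditions are true. As a sketch of the other ways to combine conditions, the example below uses | ("or") and ~ ("not") on a small stand-in frame. Each condition is stored in a variable first; when conditions are written inline, each one needs its own parentheses, as in the example above.

```python
import pandas as pd

# Small stand-in for gapminder
df = pd.DataFrame({
    "country": ["China", "India", "Nigeria", "Algeria", "Brazil", "Afghanistan"],
    "continent": ["Asia", "Asia", "Africa", "Africa", "Americas", "Asia"],
    "year": [2007, 2007, 2007, 1952, 2007, 1952],
})

# Each comparison produces a Series of True/False values, one per row
in_africa = df["continent"] == "Africa"
in_2007 = df["year"] == 2007

africa_or_2007 = df[in_africa | in_2007]   # rows meeting at least one condition
not_africa = df[~in_africa]                # rows where the condition is False

print(africa_or_2007)
print(not_africa)
```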

Sometimes you may need to filter on many values, let us say, several countries. It would be inconvenient to type country == "..." once for every country. This is where the .isin() method is useful; you pass the list of values you want to keep. We will work with observations from the year 2007:

gapminder_2007 = gapminder[gapminder["year"] == 2007]

countries = gapminder_2007[gapminder_2007["country"].isin(["China", "Brazil", "India", "South Africa"])]

countries.head()

	country	continent	year	lifeExp	pop	gdpPercap
179	Brazil	Americas	2007	72.390	190010647	9065.800825
299	China	Asia	2007	72.961	1318683096	4959.114854
707	India	Asia	2007	64.698	1110396331	2452.210407
1415	South Africa	Africa	2007	49.339	43997828	9269.657808

You can also filter rows by using their index positions, let us say, the 1st and 5th rows (positions 0 and 4):

gapminder.iloc[[0, 4]]

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

You can also filter a range of rows. For example, we may select the top 10 countries with the highest population in 2007, so we will use the gapminder_2007 DataFrame. You will need to sort by population first:

gapminder_2007 = gapminder_2007.sort_values(["pop"], ascending = False)
gapminder_2007.head()

	country	continent	year	lifeExp	pop	gdpPercap
299	China	Asia	2007	72.961	1318683096	4959.114854
707	India	Asia	2007	64.698	1110396331	2452.210407
1619	United States	Americas	2007	78.242	301139947	42951.653090
719	Indonesia	Asia	2007	70.650	223547000	3540.651564
179	Brazil	Americas	2007	72.390	190010647	9065.800825

You can see that the observations are sorted in descending order, from the highest population to the lowest; now select the first 10 rows:

gapminder_2007.iloc[0:10]

	country	continent	year	lifeExp	pop	gdpPercap
299	China	Asia	2007	72.961	1318683096	4959.114854
707	India	Asia	2007	64.698	1110396331	2452.210407
1619	United States	Americas	2007	78.242	301139947	42951.653090
719	Indonesia	Asia	2007	70.650	223547000	3540.651564
179	Brazil	Americas	2007	72.390	190010647	9065.800825
1175	Pakistan	Asia	2007	65.483	169270617	2605.947580
107	Bangladesh	Asia	2007	64.062	150448339	1391.253792
1139	Nigeria	Africa	2007	46.859	135031164	2013.977305
803	Japan	Asia	2007	82.603	127467972	31656.068060
995	Mexico	Americas	2007	76.195	108700891	11977.574960
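As a possible shortcut for "sort, then take the first rows", pandas also provides the .nlargest() method. The sketch below, on a six-row stand-in of 2007 data copied from this chapter, checks that it returns the same rows as the two-step approach.

```python
import pandas as pd

# Six 2007 rows copied from the gapminder output in this chapter
df = pd.DataFrame({
    "country": ["Brazil", "China", "India", "Japan", "Nigeria", "South Africa"],
    "pop": [190010647, 1318683096, 1110396331, 127467972, 135031164, 43997828],
})

# Two-step approach used above: sort descending, then slice by position
top3_sorted = df.sort_values(["pop"], ascending=False).iloc[0:3]

# One-step alternative: the 3 rows with the largest "pop" values
top3 = df.nlargest(3, "pop")

print(top3)
```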

Another way of filtering rows is to use the .query() method, which takes the condition as a string. This is how you would filter the 2007 observations from the African continent:

gapminder_africa = gapminder_2007.query("continent == 'Africa'")
gapminder_africa.head()

	country	continent	year	lifeExp	pop	gdpPercap
1139	Nigeria	Africa	2007	46.859	135031164	2013.977305
467	Egypt	Africa	2007	71.338	80264543	5581.180998
515	Ethiopia	Africa	2007	52.947	76511887	690.805576
335	Congo, Dem. Rep.	Africa	2007	46.462	64606759	277.551859
1415	South Africa	Africa	2007	49.339	43997828	9269.657808

Filter observations where life expectancy is greater than or equal to 80 and the year is 2007; assign the results to high_life_expect:

high_life_expect = gapminder.query("lifeExp >= 80 and year == 2007")

high_life_expect.head()

	country	continent	year	lifeExp	pop	gdpPercap
71	Australia	Oceania	2007	81.235	20434176	34435.36744
251	Canada	Americas	2007	80.653	33390141	36319.23501
539	France	Europe	2007	80.657	61083916	30470.01670
671	Hong Kong, China	Asia	2007	82.208	6980412	39724.97867
695	Iceland	Europe	2007	81.757	301931	36180.78919
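Inside a .query() string you can refer to a Python variable by prefixing its name with @, which avoids hard-coding the threshold. The sketch below uses a four-row stand-in copied from this chapter; the variable name `min_life` is illustrative. It also checks that the result matches the square-bracket filter.

```python
import pandas as pd

# Four rows copied from the gapminder output in this chapter
df = pd.DataFrame({
    "country": ["Japan", "Japan", "China", "Iceland"],
    "year": [2002, 2007, 2007, 2007],
    "lifeExp": [82.000, 82.603, 72.961, 81.757],
})

min_life = 80  # threshold kept in an ordinary Python variable

# @min_life tells query() to look up the Python variable of that name
high_life = df.query("lifeExp >= @min_life and year == 2007")

# The same filter written with boolean conditions
high_life_mask = df[(df["lifeExp"] >= min_life) & (df["year"] == 2007)]

print(high_life)
```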

4.5. Summary statistics and aggregating

One of the important tasks in data science is to explore the data using summary statistics. There are various methods used to summarise data, including:

Summary stat	pandas function
Mean/average	`.mean()`
Median/mid-point	`.median()`
Sum of values	`.sum()`
Minimum	`.min()`
Maximum	`.max()`
Standard deviation: spread/dispersion of data	`.std()`
Count or total number of observations	`.count()`

Let us explore these functions. What is the mean life expectancy across all observations in the dataset?

gapminder["lifeExp"].mean()

59.47443936619714

What is the median GDP per capita?

gapminder["gdpPercap"].median()

3531.8469885

Find the minimum population:

gapminder["pop"].min()

You can experiment with other functions listed in the above table.
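If you want several of these statistics at once, the .describe() method may be a convenient starting point: for a numeric column it reports the count, mean, standard deviation, minimum, quartiles and maximum. A minimal sketch on the first five lifeExp values shown in this chapter:

```python
import pandas as pd

# First five lifeExp values from the gapminder output in this chapter
df = pd.DataFrame({"lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088]})

# One call returns a Series of summary statistics
summary = df["lifeExp"].describe()

print(summary)
```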

4.5.1. Grouping

In many cases, you may need summary statistics by category or group. In the examples above, the mean life expectancy across all countries is useful, but it tells us little about the variation between regions. pandas has the .groupby() method, which groups observations into categories. For example, you may be interested in finding the average GDP per capita in each continent. Here you would use .groupby(), passing the “continent” column, then select the column to summarise and follow it with the summary statistic:

gapminder.groupby("continent")["gdpPercap"].mean()

continent
Africa       2193.754578
Americas     7136.110356
Asia         7902.150428
Europe      14469.475533
Oceania     18621.609223
Name: gdpPercap, dtype: float64

The result shows the mean wealth (GDP per capita) of each continent. Wouldn’t it be nice to sort these statistics from highest to lowest?

gapminder.groupby("continent")["gdpPercap"].mean().sort_values(ascending = False)

continent
Oceania     18621.609223
Europe      14469.475533
Asia         7902.150428
Americas     7136.110356
Africa       2193.754578
Name: gdpPercap, dtype: float64

Now we can see that Oceania has the highest average wealth and Africa has the lowest.

Sometimes you may need to find multiple summary statistics (mean, median, standard deviation, minimum, and maximum) at the same time. The .agg() method can do this when you pass it a list of statistic names:

gapminder.groupby("continent")["gdpPercap"].agg(["mean", "median", "std", "min", "max"])

	mean	median	std	min	max
continent
Africa	2193.754578	1192.138217	2827.929863	241.165876	21951.21176
Americas	7136.110356	5465.509853	6396.764112	1201.637154	42951.65309
Asia	7902.150428	2646.786844	14045.373112	331.000000	113523.13290
Europe	14469.475533	12081.749115	9355.213498	973.533195	49357.19017
Oceania	18621.609223	17983.303955	6358.983321	10039.595640	34435.36744

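What if you want to find a summary statistic for multiple columns? For example, find the mean of these columns per continent: life expectancy, population, and GDP per capita. Each argument to .agg() names a new column and gives the source column and the statistic to apply: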

gapminder.groupby("continent").agg(
    mean_life_expect = ("lifeExp", "mean"),
    mean_pop = ("pop", "mean"),
    mean_gdp_pc = ("gdpPercap", "mean")
)

	mean_life_expect	mean_pop	mean_gdp_pc
continent
Africa	48.865330	9.916003e+06	2193.754578
Americas	64.658737	2.450479e+07	7136.110356
Asia	60.064903	7.703872e+07	7902.150428
Europe	71.903686	1.716976e+07	14469.475533
Oceania	74.326208	8.874672e+06	18621.609223
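To check how the grouped means are computed, it can help to run the same pattern on numbers small enough to verify by hand. In the sketch below the frame `toy` and its values are hypothetical. The final .reset_index() call is optional; it turns the "continent" index back into an ordinary column, which can be handy for further filtering or sorting.

```python
import pandas as pd

# Hypothetical toy values chosen so the group means are easy to check by hand
toy = pd.DataFrame({
    "continent": ["A", "A", "B"],
    "lifeExp": [50.0, 70.0, 80.0],
    "pop": [100, 300, 1000],
})

summary = (
    toy.groupby("continent")
    .agg(
        mean_life_expect=("lifeExp", "mean"),  # A: (50 + 70) / 2, B: 80
        mean_pop=("pop", "mean"),              # A: (100 + 300) / 2, B: 1000
    )
    .reset_index()  # make "continent" a regular column again
)

print(summary)
```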

4.6. Conclusion

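In this chapter you have learned different ways of manipulating DataFrames: selecting, renaming, creating and removing columns; sorting and filtering rows; and computing summary statistics, overall and by group. Next, we move on to other pandas methods.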
