2. Data Structures in Python#
In this chapter we deal with different data structures that can hold data in Python. Specifically, we will focus on the following data structures:
Lists
NumPy Arrays
Dictionaries
DataFrames
2.1. Lists#
A list is a data structure that stores a collection of elements (items). For example, in the previous chapter we created a string variable named country that contained the element “South Africa”:
country = "South Africa"
print(country)
South Africa
What if we wanted to create another country variable, named country_2 with “Zimbabwe” as an element? We can also do this:
country_2 = "Zimbabwe"
print(country_2)
Zimbabwe
What if we want to add country_3, country_4, etc.? We would end up with many variables. This is where lists come in. Lists are used to hold many items together. You can create a list in Python by using square brackets ([]):
southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]
print(southern_african_countries)
['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']
We now have a list of countries in the Southern African region. There are various functions that can be used to extract, analyse and manipulate elements in a list. For example, you may be interested in how many elements are in a list; in our case, how many countries are in the southern_african_countries list. You can use the len() function:
southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]
print(len(southern_african_countries))
10
The list has 10 elements/items.
2.1.1. Subset a list#
You can extract a list item by using [] and the index position of the item. Note that Python indexing starts at 0, meaning the first element is at position 0. This is especially important if you come from an R background, where indexing starts at 1. Let us extract the first element:
print(southern_african_countries[0])
Angola
The first element is Angola.
You can access the last element by using the index -1:
print(southern_african_countries[-1])
Zimbabwe
You can access more than one item; for example, extract the first, second and third items in the list by slicing:
print(southern_african_countries[0:3])
['Angola', 'Botswana', 'Lesotho']
Notice that we have included the index 3, which is the 4th item, although that element itself is not printed. When slicing a list, the item at the end index is not included. This is very important to note. There are other slicing options; for example, you can omit the start index:
print(southern_african_countries[:3])
['Angola', 'Botswana', 'Lesotho']
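A few more slicing patterns, shown as a minimal sketch on the same list. The variable names from_eighth, last_two and every_other are only illustrative and do not appear elsewhere in this chapter:

```python
# Same list as in the text above.
southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique",
                              "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]

# Omit the end index to slice from index 7 to the end of the list.
from_eighth = southern_african_countries[7:]

# A negative start index counts from the end: the last two items.
last_two = southern_african_countries[-2:]

# A third number is the step: every second item, starting at index 0.
every_other = southern_african_countries[::2]

print(from_eighth)   # ['Swaziland', 'Zambia', 'Zimbabwe']
print(last_two)      # ['Zambia', 'Zimbabwe']
print(every_other)   # ['Angola', 'Lesotho', 'Mozambique', 'South Africa', 'Zambia']
```

In every case the original list is left unchanged; a slice returns a new list.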
2.1.2. Manipulate a list#
List elements can be changed. For example, in 2018 Swaziland changed its name to “eSwatini”. We can make this change in the list. First re-create the list:
southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]
print(southern_african_countries)
['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']
Then change the list element:
southern_african_countries[7] = "eSwatini"
print(southern_african_countries)
['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'eSwatini', 'Zambia', 'Zimbabwe']
We have changed the list element from “Swaziland” to “eSwatini”.
You can also add new elements to a list. Suppose a geographer told us that our list of Southern African countries missed 4 countries: Democratic Republic of the Congo, Mauritius, Madagascar and Seychelles. In Python, we can update our list and assign it to a new variable called southern_africa_updated. First find the length of the original southern_african_countries:
len(southern_african_countries)
10
Add new items:
southern_africa_updated = southern_african_countries + ["Democratic Republic of the Congo", "Mauritius", "Madagascar", "Seychelles"]
print(southern_africa_updated)
['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'eSwatini', 'Zambia', 'Zimbabwe', 'Democratic Republic of the Congo', 'Mauritius', 'Madagascar', 'Seychelles']
Find the length of the updated variable:
len(southern_africa_updated)
14
The updated variable has a length of 14. This means we have added 4 items.
You can remove an element from a list by using the del statement:
del southern_africa_updated[0]
print(southern_africa_updated)
['Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'eSwatini', 'Zambia', 'Zimbabwe', 'Democratic Republic of the Congo', 'Mauritius', 'Madagascar', 'Seychelles']
We have removed the element at index 0, which was Angola. The list now starts with Botswana.
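Besides the + operator and del, Python lists have built-in methods that add or remove items in place. The following is a minimal sketch on a shortened list; the names countries and last_item are illustrative only:

```python
# A shortened list, used only for this illustration.
countries = ["Angola", "Botswana", "Lesotho"]

# append() adds a single item to the end of the list, in place.
countries.append("Malawi")

# remove() deletes the first item equal to the given value.
countries.remove("Botswana")

# pop() removes the last item and returns it.
last_item = countries.pop()

print(countries)   # ['Angola', 'Lesotho']
print(last_item)   # Malawi
```

Unlike the + operator, which builds a new list, these methods modify the existing list.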
2.1.3. Manipulating list with numeric data#
In the examples above we worked with lists that contain string data: all our elements were strings (country names). Suppose we have the life expectancy of those countries. Life expectancy is the average number of years a person is expected to live.
Let us create a numeric list, named life_expectancy, that holds the average life expectancy of the countries of Southern Africa:
life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
print(life_expectancy)
[61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
We can find the minimum life expectancy:
print(min(life_expectancy))
53.1
Print the maximum life expectancy:
print(max(life_expectancy))
62.9
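Python has no built-in mean() function for lists, but the average can be computed from sum() and len(). This is a minimal sketch; the names average and lowest_index are illustrative, and the result is rounded because floating-point sums are not always exact:

```python
# Same values as the life_expectancy list above.
life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]

# Average = total of all values divided by the number of values.
average = sum(life_expectancy) / len(life_expectancy)
print(round(average, 2))   # 59.72

# index() returns the position of the first matching value,
# here the position of the minimum.
lowest_index = life_expectancy.index(min(life_expectancy))
print(lowest_index)        # 3
```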
Note that a list can hold elements of different data types: string, float, integer, boolean, etc.
person_1 = ["Name", "Aubrey", "Age", 32, "Height", 1.8, "Is male?", True]
print(person_1)
['Name', 'Aubrey', 'Age', 32, 'Height', 1.8, 'Is male?', True]
2.2. NumPy Arrays#
A NumPy array is a data structure designed to hold numeric elements. NumPy is short for Numerical Python. It is an important data structure if you want to manipulate numeric data. First, you will need to install the numpy package if it is not already installed: pip install numpy. Then import the library:
import numpy as np
Suppose we have the lengths, in kilometers, of major South African rivers stored as a list and assigned to the river_lenght_km variable:
river_lenght_km = [2200, 1800, 1210, 502, 560, 645, 520, 480]
print(river_lenght_km)
[2200, 1800, 1210, 502, 560, 645, 520, 480]
We need to convert this list into a NumPy array:
river_lenght_km = np.array(river_lenght_km)
print(type(river_lenght_km))
<class 'numpy.ndarray'>
2.2.1. Summary statistics and mathematical operations#
There are many functions within the numpy library. We can calculate summary statistics:
Get the mean/average:
print(np.mean(river_lenght_km))
989.625
Get the median:
print(np.median(river_lenght_km))
602.5
Get the standard deviation:
print(np.std(river_lenght_km))
631.6316049526021
There are other functions you can use.
Which rivers have a length greater than 1000 kilometers? Find these and assign the result to a variable named longest_rivers:
longest_rivers = river_lenght_km[river_lenght_km > 1000]
print(longest_rivers)
[2200 1800 1210]
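The expression inside the square brackets is itself an array of True/False values, often called a boolean mask. The sketch below shows the mask on its own; the name is_long is illustrative only:

```python
import numpy as np

# Same river lengths as above (the variable name follows the text).
river_lenght_km = np.array([2200, 1800, 1210, 502, 560, 645, 520, 480])

# Comparing an array with a number compares every element,
# giving one True/False value per river.
is_long = river_lenght_km > 1000
print(is_long)

# True counts as 1, so summing the mask counts the matching rivers.
print(is_long.sum())   # 3
```

Indexing the array with this mask keeps only the elements where the mask is True.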
The river_lenght_km array is in kilometers. What if we want to convert it to meters? Since 1 km = 1000 meters, you can convert kilometers to meters by multiplying by 1000. Let’s do this and assign the result to a new variable named river_lenght_meters:
river_lenght_meters = river_lenght_km * 1000
print(river_lenght_meters)
[2200000 1800000 1210000 502000 560000 645000 520000 480000]
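This element-by-element arithmetic is one reason to convert a list to a NumPy array. As a small sketch with illustrative names (lengths_list, lengths_array), multiplying a plain list repeats it instead of scaling its values:

```python
import numpy as np

lengths_list = [2200, 1800]
lengths_array = np.array(lengths_list)

# Multiplying a list by an integer repeats the list.
print(lengths_list * 2)    # [2200, 1800, 2200, 1800]

# Multiplying an array multiplies every element.
print(lengths_array * 2)   # [4400 3600]
```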
Just like lists, you can subset NumPy arrays using the index position of an element. To access the second element of river_lenght_km:
print(river_lenght_km[1])
1800
You can sort elements into ascending or descending order:
np.sort(river_lenght_km)
array([ 480, 502, 520, 560, 645, 1210, 1800, 2200])
To sort into descending order, reverse the sorted array:
np.sort(river_lenght_km)[::-1]
array([2200, 1800, 1210, 645, 560, 520, 502, 480])
Note that NumPy arrays can also be 2-dimensional; a 2D array is matrix-like data with rows and columns:
two_d_array = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print(two_d_array)
[[1 2 3]
[4 5 6]
[7 8 9]]
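Elements of a 2D array are addressed by a row index and a column index. The following is a brief sketch using the same array:

```python
import numpy as np

two_d_array = np.array([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]])

# shape gives (number of rows, number of columns).
print(two_d_array.shape)    # (3, 3)

# Row 0, column 1 (both indices start at 0).
print(two_d_array[0, 1])    # 2

# A colon selects everything along that axis: the whole first column.
print(two_d_array[:, 0])    # [1 4 7]
```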
2.3. Dictionaries#
Remember that we previously created two lists, southern_african_countries and life_expectancy:
southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]
print(southern_african_countries)
life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
print(life_expectancy)
['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']
[61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
We can find the corresponding life expectancy of, for example, “Botswana”. First find the index position of Botswana:
botswana_index = southern_african_countries.index("Botswana")
print(botswana_index)
1
The index of Botswana is 1. We can access the corresponding life expectancy:
life_expectancy[botswana_index]
61.1
The life expectancy of Botswana is 61.1. But this approach is inefficient and error-prone with large data, because the two lists have to be kept in the same order. This is where dictionaries come in:
southern_africa = {"Angola":61.6,
"Botswana":61.1,
"Lesotho":57.1,
"Malawi":53.1,
"Mozambique":62.9,
"Namibia":59.3}
print(southern_africa)
{'Angola': 61.6, 'Botswana': 61.1, 'Lesotho': 57.1, 'Malawi': 53.1, 'Mozambique': 62.9, 'Namibia': 59.3}
We have created a dictionary named southern_africa. A dictionary stores data as key-value pairs. For example, we have Angola as a key with its corresponding value 61.6.
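If the keys and values already exist as two lists of equal length, one possible way to build the dictionary is with the built-in zip() and dict() functions. This is a sketch on shortened lists; the name countries is illustrative:

```python
# Shortened versions of the two lists, for illustration.
countries = ["Angola", "Botswana", "Lesotho"]
life_expectancy = [61.6, 61.1, 57.1]

# zip() pairs the items by position; dict() turns the pairs
# into key-value entries.
southern_africa = dict(zip(countries, life_expectancy))
print(southern_africa)   # {'Angola': 61.6, 'Botswana': 61.1, 'Lesotho': 57.1}
```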
You can look up the value of a key:
print(southern_africa["Malawi"])
53.1
Of Namibia:
print(southern_africa["Namibia"])
59.3
You can list all the keys:
print(southern_africa.keys())
dict_keys(['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia'])
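Dictionaries offer similar methods for values and for key-value pairs. The sketch below uses a shortened dictionary and also shows two ways to handle a key that may be missing:

```python
# A shortened dictionary, for illustration.
southern_africa = {"Angola": 61.6, "Botswana": 61.1, "Lesotho": 57.1}

# values() returns all values; list() turns them into a plain list.
print(list(southern_africa.values()))   # [61.6, 61.1, 57.1]

# items() yields (key, value) pairs, which is convenient in a loop.
for country, years in southern_africa.items():
    print(country, years)

# The in operator checks whether a key exists.
print("Zambia" in southern_africa)      # False

# get() returns None instead of raising KeyError when the key is missing.
print(southern_africa.get("Zambia"))    # None
```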
If you look at the southern_africa dictionary, you will see that we did not include every country in the region. We may need to add, let us say, Zimbabwe:
southern_africa["Zimbabwe"] = 59.3
print(southern_africa)
{'Angola': 61.6, 'Botswana': 61.1, 'Lesotho': 57.1, 'Malawi': 53.1, 'Mozambique': 62.9, 'Namibia': 59.3, 'Zimbabwe': 59.3}
You can see that the dictionary has been updated to include Zimbabwe.
Suppose a demographer points out that the value for Botswana is outdated, and that the life expectancy of the country has increased from 61.1 to 63. We can update this information:
southern_africa["Botswana"] = 63
print(southern_africa["Botswana"])
63
The value of Botswana has been changed.
What if we want to store more information in the dictionary? For example, we may want to add the capital cities of the countries in the southern_africa dictionary. One way is to make each value a dictionary of its own:
southern_africa_2 = {"Angola": {"life_expectancy": 61.6, "capital":"Luanda"},
"Botswana": {"life_expectancy": 61.1, "capital":"Gaborone"},
"Lesotho": {"life_expectancy": 57.1, "capital":"Maseru"},
"Malawi": {"life_expectancy": 53.1, "capital":"Lilongwe"},
"Mozambique": {"life_expectancy": 62.9, "capital":"Maputo"},
"Namibia": {"life_expectancy": 59.3, "capital":"Windhoek"}}
print(southern_africa_2)
{'Angola': {'life_expectancy': 61.6, 'capital': 'Luanda'}, 'Botswana': {'life_expectancy': 61.1, 'capital': 'Gaborone'}, 'Lesotho': {'life_expectancy': 57.1, 'capital': 'Maseru'}, 'Malawi': {'life_expectancy': 53.1, 'capital': 'Lilongwe'}, 'Mozambique': {'life_expectancy': 62.9, 'capital': 'Maputo'}, 'Namibia': {'life_expectancy': 59.3, 'capital': 'Windhoek'}}
As you can see in the results, each country now maps to a nested dictionary that holds both its life expectancy and its capital city.
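Values inside a nested dictionary are reached by chaining keys, first the outer key and then the inner one. A minimal sketch with two of the countries:

```python
# A shortened version of the nested dictionary above.
southern_africa_2 = {"Angola": {"life_expectancy": 61.6, "capital": "Luanda"},
                     "Botswana": {"life_expectancy": 61.1, "capital": "Gaborone"}}

# Outer key selects the country, inner key selects the field.
print(southern_africa_2["Angola"]["capital"])             # Luanda

# The same chaining works for updating a nested value.
southern_africa_2["Botswana"]["life_expectancy"] = 63
print(southern_africa_2["Botswana"]["life_expectancy"])   # 63
```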
2.4. DataFrames#
The data structures we have discussed so far (lists, NumPy arrays, dictionaries) are convenient for small amounts of data. In the real world, most data is large and comes in a tabular format, with columns and rows. In Python, we use the pandas package to handle data in a tabular format. You need to install it first with pip install pandas.
Then import pandas:
import pandas as pd
Let us return to the lists that we created previously: the list of Southern African countries and the corresponding life expectancy:
country = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]
print(country)
['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']
Create life_expectancy list:
life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
print(life_expectancy)
[61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
Let’s add one more list of, for example, population of each country:
population = [500, 600, 1000, 150, 490, 740, 300, 781, 610, 504]
print(population)
[500, 600, 1000, 150, 490, 740, 300, 781, 610, 504]
From these three lists, we can create a DataFrame using pandas:
southern_africa_df = pd.DataFrame({"country_name":country, "life_expect": life_expectancy, "pop":population})
print(southern_africa_df)
country_name life_expect pop
0 Angola 61.6 500
1 Botswana 61.1 600
2 Lesotho 57.1 1000
3 Malawi 53.1 150
4 Mozambique 62.9 490
5 Namibia 59.3 740
6 South Africa 59.3 300
7 Swaziland 62.3 781
8 Zambia 61.2 610
9 Zimbabwe 59.3 504
We have a DataFrame with three columns (country_name, life_expect, and pop) and 10 rows (observations), where each row represents a country. You can use the .head() method to view the first five observations:
print(southern_africa_df.head())
country_name life_expect pop
0 Angola 61.6 500
1 Botswana 61.1 600
2 Lesotho 57.1 1000
3 Malawi 53.1 150
4 Mozambique 62.9 490
You can check how many columns and rows are in the DataFrame, along with the data type of each column, by using the .info() method:
print(southern_africa_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country_name 10 non-null object
1 life_expect 10 non-null float64
2 pop 10 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 368.0+ bytes
None
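The trailing None appears because .info() prints its report directly and returns nothing, so wrapping it in print() is not needed. If you only want the dimensions, the .shape attribute is a shorter option. The sketch below uses a three-row DataFrame with the illustrative name df:

```python
import pandas as pd

# A small DataFrame with the same column names as above.
df = pd.DataFrame({"country_name": ["Angola", "Botswana", "Lesotho"],
                   "life_expect": [61.6, 61.1, 57.1],
                   "pop": [500, 600, 1000]})

# shape is an attribute (no parentheses): (rows, columns).
print(df.shape)           # (3, 3)

# len() gives the number of rows.
print(len(df))            # 3

# columns holds the column names.
print(list(df.columns))   # ['country_name', 'life_expect', 'pop']
```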
You can get more information about your DataFrame by using the .describe() method, which returns summary statistics for all numeric columns:
print(southern_africa_df.describe())
life_expect pop
count 10.000000 10.000000
mean 59.720000 567.500000
std 2.898582 241.687241
min 53.100000 150.000000
25% 59.300000 492.500000
50% 60.200000 552.000000
75% 61.500000 707.500000
max 62.900000 1000.000000
You can subset both rows and columns to return only those you are interested in. Let’s say you want to select only the country_name and pop columns. You can do this by passing a list of those column names inside the DataFrame's square brackets, which results in double square brackets ([[]]):
print(southern_africa_df[["country_name", "pop"]])
country_name pop
0 Angola 500
1 Botswana 600
2 Lesotho 1000
3 Malawi 150
4 Mozambique 490
5 Namibia 740
6 South Africa 300
7 Swaziland 781
8 Zambia 610
9 Zimbabwe 504
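The number of brackets matters. As a sketch on a small DataFrame (the names df, single and double are illustrative), single brackets with one column name return a pandas Series, while double brackets return a DataFrame even for one column:

```python
import pandas as pd

# A small DataFrame with the same column names as above.
df = pd.DataFrame({"country_name": ["Angola", "Botswana", "Lesotho"],
                   "pop": [500, 600, 1000]})

# Single brackets: one column as a Series.
single = df["pop"]
print(type(single).__name__)   # Series

# Double brackets: a DataFrame with the listed columns.
double = df[["pop"]]
print(type(double).__name__)   # DataFrame
```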
You can also select rows. For example, subset observations from Angola and save as a new DataFrame named angola:
angola = southern_africa_df[southern_africa_df["country_name"] == "Angola"]
print(angola)
country_name life_expect pop
0 Angola 61.6 500
Select observations from Angola and Zimbabwe:
angola_zim = southern_africa_df[southern_africa_df["country_name"].isin(["Angola", "Zimbabwe"])]
print(angola_zim)
country_name life_expect pop
0 Angola 61.6 500
9 Zimbabwe 59.3 504
Subset observations where life expectancy is below 60:
low_life_expect = southern_africa_df[southern_africa_df["life_expect"] < 60]
print(low_life_expect)
country_name life_expect pop
2 Lesotho 57.1 1000
3 Malawi 53.1 150
5 Namibia 59.3 740
6 South Africa 59.3 300
9 Zimbabwe 59.3 504
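Conditions can also be combined. The following sketch assumes we want countries with life expectancy below 60 and a pop value above 400, sorted by pop; the threshold 400 and the name result are chosen only for illustration:

```python
import pandas as pd

# Rebuild the DataFrame from the chapter's three lists.
southern_africa_df = pd.DataFrame({
    "country_name": ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique",
                     "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"],
    "life_expect": [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3],
    "pop": [500, 600, 1000, 150, 490, 740, 300, 781, 610, 504],
})

# Wrap each condition in parentheses and join them with & (and) or | (or).
condition = (southern_africa_df["life_expect"] < 60) & (southern_africa_df["pop"] > 400)

# Filter the rows, then sort them from largest to smallest pop.
result = southern_africa_df[condition].sort_values("pop", ascending=False)
print(result)
```

The parentheses around each condition are required because & binds more tightly than the comparison operators.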
2.5. Conclusion#
There are many other ways in which you can manipulate, transform and analyse a DataFrame, and pandas provides many methods to handle DataFrames. We will dive deeper into DataFrames and pandas in a later chapter.
In this chapter we have discussed different data structures that can hold data:
Lists
NumPy Arrays
Dictionaries
DataFrames
In the next chapter, we explore various ways in which we can visualise data.