2. Data Structures in Python

In this chapter we look at the different data structures that can hold data in Python. Specifically, we will focus on the following data structures:

Lists
NumPy Arrays
Dictionaries
DataFrames

2.1. Lists

A list is a data structure that stores a collection of elements (items). For example, in the previous chapter we created a string variable named country that contained the value “South Africa”:

country = "South Africa"

print(country)

South Africa

What if we wanted to create another country variable, named country_2, that holds “Zimbabwe”? We can do that too:

country_2 = "Zimbabwe"
print(country_2)

Zimbabwe

What if we want to add country_3, country_4, and so on? We would end up with many variables. This is where lists come in. Lists are used to hold many items together. You can create a list in Python by using square brackets ([]):

southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]

print(southern_african_countries)

['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']

We now have a list of countries in the Southern African region. There are various functions that can be used to extract, analyse and manipulate elements in a list. For example, you may be interested in how many elements are in a list; in our case, how many countries are in the southern_african_countries list. You can use the len() function:

southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]

print(len(southern_african_countries))

The list has 10 elements/items.

2.1.1. Subset a list

You can extract a list item by using square brackets ([]) and the index position of the item. Note that Python indexing starts at 0, meaning the first element is at position 0. This is especially important if you come from an R background, where indexing starts at 1. Let’s extract the first element:

print(southern_african_countries[0])

Angola

The first element is Angola.

You can access the last element by using the index -1:

print(southern_african_countries[-1])

Zimbabwe

You can access more than one item; for example, extract the first, second and third items in the list by slicing:

print(southern_african_countries[0:3])

['Angola', 'Botswana', 'Lesotho']

Notice that the slice mentions index 3, which is the 4th item, although that element is not printed. When slicing a list, the last index mentioned is not included. This is very important to note. There are other slicing options; for example, you can leave out the start index to slice from the beginning of the list:

print(southern_african_countries[:3])

['Angola', 'Botswana', 'Lesotho']
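A few more slicing patterns are sketched below as an illustration. The variable names last_three, last_three_negative and every_second are made up for this example and do not appear elsewhere in the chapter.

```python
# Same list as in the chapter.
southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique",
                              "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]

# Leave out the end index to slice from index 7 to the end of the list.
last_three = southern_african_countries[7:]

# A negative start index counts from the end of the list.
last_three_negative = southern_african_countries[-3:]

# A third number sets the step: here, every second item.
every_second = southern_african_countries[::2]

print(last_three)           # ['Swaziland', 'Zambia', 'Zimbabwe']
print(last_three_negative)  # ['Swaziland', 'Zambia', 'Zimbabwe']
print(every_second)         # ['Angola', 'Lesotho', 'Mozambique', 'South Africa', 'Zambia']
```

Slicing always returns a new list, so the original southern_african_countries list is left unchanged.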

2.1.2. Manipulate a list

List elements can be changed. For example, in 2018 Swaziland changed its name to “eSwatini”. We can change this in the list. First re-create the list:

southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]

print(southern_african_countries)

['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']

Then change the list element:

southern_african_countries[7] = "eSwatini"
print(southern_african_countries)

['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'eSwatini', 'Zambia', 'Zimbabwe']

We have changed the list element from “Swaziland” to “eSwatini”.

You can also add new elements to a list. Suppose a geographer told us that our list of Southern African countries missed four countries: Democratic Republic of the Congo, Mauritius, Madagascar and Seychelles. In Python, we can update our list and assign it to a new variable called southern_africa_updated. First find the length of the original southern_african_countries:

len(southern_african_countries)

Add new items:

southern_africa_updated = southern_african_countries + ["Democratic Republic of the Congo", "Mauritius", "Madagascar", "Seychelles"]

print(southern_africa_updated)

['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'eSwatini', 'Zambia', 'Zimbabwe', 'Democratic Republic of the Congo', 'Mauritius', 'Madagascar', 'Seychelles']

Find the length of the updated variable:

len(southern_africa_updated)

The updated variable has a length of 14. This means we have added 4 items.

You can remove an element from a list by using the del statement:

del southern_africa_updated[0]

print(southern_africa_updated)

['Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'eSwatini', 'Zambia', 'Zimbabwe', 'Democratic Republic of the Congo', 'Mauritius', 'Madagascar', 'Seychelles']

We have removed the element at index 0, which is Angola.
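Besides the + operator and del, Python lists have built-in methods that change a list in place. The sketch below uses a short, made-up list named countries (not a variable from this chapter) to show append(), remove() and pop().

```python
# A short example list; the name "countries" is only for illustration.
countries = ["Angola", "Botswana", "Lesotho"]

# append() adds a single item to the end of the list, in place.
countries.append("Malawi")

# remove() deletes the first item that equals the given value.
countries.remove("Botswana")

# pop() removes the last item and returns it.
last_item = countries.pop()

print(countries)  # ['Angola', 'Lesotho']
print(last_item)  # Malawi
```

Use del or pop() when you know the position of the item, and remove() when you know its value.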

2.1.3. Manipulating lists with numeric data

In the above examples we have worked with lists that contain string data: all our elements were strings (country names). Suppose we have the life expectancy of those countries. Life expectancy is the average number of years a person is expected to live.

Let us create a numeric list named life_expectancy that holds the average life expectancy of the countries of Southern Africa:

life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]

print(life_expectancy)

[61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]

We can find the minimum life expectancy:

print(min(life_expectancy))

53.1

Print the maximum life expectancy:

print(max(life_expectancy))

62.9
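Plain Python has no built-in function for the average of a list, but it can be computed from sum() and len(). The sketch below is one way to do it; the names average_life_expectancy and sorted_life_expectancy are introduced here only for illustration.

```python
# Same values as the life_expectancy list in the chapter.
life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]

# The average is the total divided by the number of elements.
average_life_expectancy = sum(life_expectancy) / len(life_expectancy)

# sorted() returns a new list in ascending order; the original is unchanged.
sorted_life_expectancy = sorted(life_expectancy)

print(round(average_life_expectancy, 2))
print(sorted_life_expectancy)
```

Rounding is used because floating-point sums can carry tiny rounding errors in the last decimal places.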

Note that a list can hold elements of different data types: string, float, integer, boolean, etc.

person_1 = ["Name", "Aubrey", "Age", 32, "Height", 1.8, "Is male?", True]
print(person_1)

['Name', 'Aubrey', 'Age', 32, 'Height', 1.8, 'Is male?', True]

2.2. NumPy Arrays

A NumPy array is a data structure designed to hold numeric elements. NumPy is short for Numerical Python. It is an important data structure if you want to manipulate numeric data. First, you will need to install the numpy package if it is not already installed: pip install numpy. Then load the library as:

import numpy as np

Suppose we have the lengths in kilometers of major South African rivers stored as a list and assigned to the river_lenght_km variable:

river_lenght_km = [2200, 1800, 1210, 502, 560, 645, 520, 480]
print(river_lenght_km)

[2200, 1800, 1210, 502, 560, 645, 520, 480]

We need to convert this list into a NumPy array:

river_lenght_km = np.array(river_lenght_km)

print(type(river_lenght_km))

<class 'numpy.ndarray'>

2.2.1. Summary statistics and mathematical operations

There are many functions within the numpy library. We can calculate summary statistics:

Get the mean/average:

print(np.mean(river_lenght_km))

989.625

Get the median:

print(np.median(river_lenght_km))

602.5

Get the standard deviation:

print(np.std(river_lenght_km))

631.6316049526021

There are other functions you can use.
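As an illustration of a few of these other functions, the sketch below applies np.min(), np.max() and np.sum() to the same river lengths. The names shortest, longest and total are made up for this example.

```python
import numpy as np

# Same values as river_lenght_km in the chapter.
river_lenght_km = np.array([2200, 1800, 1210, 502, 560, 645, 520, 480])

shortest = np.min(river_lenght_km)  # smallest value in the array
longest = np.max(river_lenght_km)   # largest value in the array
total = np.sum(river_lenght_km)     # sum of all values

print(shortest)  # 480
print(longest)   # 2200
print(total)     # 7917
```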

Which rivers are longer than 1000 kilometers? Find these and assign the result to a variable named longest_rivers:

longest_rivers = river_lenght_km[river_lenght_km > 1000]
print(longest_rivers)

[2200 1800 1210]
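The filtering above works in two steps: the comparison produces an array of True/False values (often called a boolean mask), and indexing with that mask keeps only the elements where it is True. The sketch below makes the intermediate step visible; the name is_long is introduced only for illustration.

```python
import numpy as np

# Same values as river_lenght_km in the chapter.
river_lenght_km = np.array([2200, 1800, 1210, 502, 560, 645, 520, 480])

# Step 1: the comparison is applied to every element and gives a boolean array.
is_long = river_lenght_km > 1000
print(is_long)

# Step 2: indexing with the boolean array keeps the elements where it is True.
print(river_lenght_km[is_long])

# True counts as 1, so summing the mask counts the matching rivers.
print(is_long.sum())
```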

The river_lenght_km array is in kilometers. What if we want to convert to meters? Since 1 km = 1000 meters, you can convert kilometers to meters by multiplying by 1000. Let’s do this and assign the result to a new variable named river_lenght_meters:

river_lenght_meters = river_lenght_km * 1000

print(river_lenght_meters)

[2200000 1800000 1210000  502000  560000  645000  520000  480000]
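This element-by-element arithmetic is one reason for converting a list to a NumPy array first. The sketch below contrasts the two behaviours using a shortened, made-up example (lengths_list and lengths_array are not variables from the chapter) and a factor of 2 to keep the output small.

```python
import numpy as np

lengths_list = [2200, 1800, 1210]        # plain Python list
lengths_array = np.array(lengths_list)   # the same values as a NumPy array

# Multiplying a list repeats its contents.
repeated = lengths_list * 2
print(repeated)  # [2200, 1800, 1210, 2200, 1800, 1210]

# Multiplying an array multiplies every element.
scaled = lengths_array * 2
print(scaled)    # [4400 3600 2420]
```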

Just like lists, you can subset NumPy arrays using the index position of an element. To access the second element of river_lenght_km:

print(river_lenght_km[1])

You can sort elements into ascending or descending order. np.sort() sorts into ascending order:

np.sort(river_lenght_km)

array([ 480,  502,  520,  560,  645, 1210, 1800, 2200])

To sort into descending order, reverse the sorted array:

np.sort(river_lenght_km)[::-1]

array([2200, 1800, 1210,  645,  560,  520,  502,  480])

Note that NumPy arrays can also be two-dimensional; a 2D array is matrix-like data with rows and columns:

two_d_array = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(two_d_array)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
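Elements of a 2D array are accessed with a row index and a column index, both starting at 0. The sketch below is a small illustration on the same array; the names second_row and third_column are introduced here only for the example.

```python
import numpy as np

two_d_array = np.array([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]])

# shape gives (number of rows, number of columns).
print(two_d_array.shape)   # (3, 3)

# A single element: row 0, column 1.
print(two_d_array[0, 1])   # 2

# A whole row, and a whole column (":" means "all rows").
second_row = two_d_array[1]
third_column = two_d_array[:, 2]
print(second_row)          # [4 5 6]
print(third_column)        # [3 6 9]
```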

2.3. Dictionaries

Remember that we created two lists previously, southern_african_countries and life_expectancy:

southern_african_countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]

print(southern_african_countries)

life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]

print(life_expectancy)

['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']
[61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]

We can find the corresponding life expectancy of, for example, “Botswana”. First find the index position of Botswana:

botswana_index = southern_african_countries.index("Botswana")

print(botswana_index)

The index of Botswana is 1. We can access the corresponding life expectancy:

life_expectancy[botswana_index]

61.1

This is the life expectancy of Botswana. But this approach is not efficient if we have large data. This is where dictionaries come in:

southern_africa = {"Angola":61.6, 
                   "Botswana":61.1,
                   "Lesotho":57.1,
                   "Malawi":53.1,
                   "Mozambique":62.9,
                   "Namibia":59.3}

print(southern_africa)

{'Angola': 61.6, 'Botswana': 61.1, 'Lesotho': 57.1, 'Malawi': 53.1, 'Mozambique': 62.9, 'Namibia': 59.3}

We have created a dictionary named southern_africa. A dictionary stores data as key-value pairs. For example, we have Angola as a key with its corresponding value 61.6.
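If the keys and values already exist as two lists of the same length, the dictionary does not have to be typed out by hand. The sketch below is one possible way to build it, using the built-in zip() and dict() functions; the list names countries and life_expectancy_values are made up for this example.

```python
# The first six countries and their life expectancy, as two parallel lists.
countries = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia"]
life_expectancy_values = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3]

# zip() pairs the items by position; dict() turns the pairs into key-value pairs.
southern_africa = dict(zip(countries, life_expectancy_values))

print(southern_africa)
```

The result has the same keys and values as the dictionary written out above.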

You can look up the value of a key:

print(southern_africa["Malawi"])

53.1

And of Namibia:

print(southern_africa["Namibia"])

59.3

You can list all the keys:

print(southern_africa.keys())

dict_keys(['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia'])
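Dictionaries have a few related methods that are worth knowing alongside keys(). The sketch below shows values(), items() and get() on the same dictionary; the lookup of "Zambia" is a made-up example of a key that is not in the dictionary.

```python
southern_africa = {"Angola": 61.6,
                   "Botswana": 61.1,
                   "Lesotho": 57.1,
                   "Malawi": 53.1,
                   "Mozambique": 62.9,
                   "Namibia": 59.3}

# values() gives all the values, in the same order as the keys.
print(list(southern_africa.values()))

# items() gives key-value pairs, which is handy in a for loop.
for country, years in southern_africa.items():
    print(country, years)

# Square brackets raise a KeyError for a missing key; get() returns None instead.
missing = southern_africa.get("Zambia")
print(missing)

# "in" checks whether a key exists.
print("Malawi" in southern_africa)
```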

If you look at the southern_africa dictionary, we did not include all the countries in the region. We may need to add, let’s say, Zimbabwe:

southern_africa["Zimbabwe"] = 59.3

print(southern_africa)

{'Angola': 61.6, 'Botswana': 61.1, 'Lesotho': 57.1, 'Malawi': 53.1, 'Mozambique': 62.9, 'Namibia': 59.3, 'Zimbabwe': 59.3}

You can see that the dictionary has been updated to include Zimbabwe.

Suppose a demographer points out that the value for Botswana is outdated; that the life expectancy of the country has increased from 61.1 to 63! We can update this information:

southern_africa["Botswana"] = 63

print(southern_africa["Botswana"])

The value of Botswana has been changed.

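What if we want to add new information to the dictionary? For example, we may want to add the capital cities of the countries in the southern_africa dictionary. One way is to create a new dictionary, southern_africa_2, in which each value is itself a dictionary: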

southern_africa_2 = {"Angola": {"life_expectancy": 61.6, "capital":"Luanda"},
                     "Botswana": {"life_expectancy": 61.1, "capital":"Gaborone"},
                     "Lesotho": {"life_expectancy": 57.1, "capital":"Maseru"},
                     "Malawi": {"life_expectancy": 53.1, "capital":"Lilongwe"},
                     "Mozambique": {"life_expectancy": 62.9, "capital":"Maputo"},
                     "Namibia": {"life_expectancy": 59.3, "capital":"Windhoek"}}

print(southern_africa_2)

{'Angola': {'life_expectancy': 61.6, 'capital': 'Luanda'}, 'Botswana': {'life_expectancy': 61.1, 'capital': 'Gaborone'}, 'Lesotho': {'life_expectancy': 57.1, 'capital': 'Maseru'}, 'Malawi': {'life_expectancy': 53.1, 'capital': 'Lilongwe'}, 'Mozambique': {'life_expectancy': 62.9, 'capital': 'Maputo'}, 'Namibia': {'life_expectancy': 59.3, 'capital': 'Windhoek'}}

As you can see in the results, each country now holds both its life expectancy and its capital city.
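Values in a nested dictionary are reached by chaining square brackets: the first key selects the country, the second selects the field. The sketch below uses a shortened version of southern_africa_2 with two countries to keep it brief; the names lesotho_capital and angola_record are introduced only for illustration.

```python
# A shortened version of the nested dictionary from the chapter.
southern_africa_2 = {"Angola": {"life_expectancy": 61.6, "capital": "Luanda"},
                     "Lesotho": {"life_expectancy": 57.1, "capital": "Maseru"}}

# First key picks the country, second key picks the field.
lesotho_capital = southern_africa_2["Lesotho"]["capital"]
print(lesotho_capital)  # Maseru

# The inner dictionary for one country can also be retrieved as a whole.
angola_record = southern_africa_2["Angola"]
print(angola_record)    # {'life_expectancy': 61.6, 'capital': 'Luanda'}
```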

2.4. DataFrames

The previous data structures we have discussed (lists, NumPy arrays, dictionaries) become awkward to work with as the amount of data grows. In the real world, most data comes in a tabular format, with columns and rows. In Python, we use the pandas package to handle data in a tabular format. You need to install it first with pip install pandas.

Then import pandas:

import pandas as pd

Let’s return to the lists that we created previously: the list of Southern African countries and the corresponding life expectancy:

country = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique", "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]

print(country)

['Angola', 'Botswana', 'Lesotho', 'Malawi', 'Mozambique', 'Namibia', 'South Africa', 'Swaziland', 'Zambia', 'Zimbabwe']

Create life_expectancy list:

life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
print(life_expectancy)

[61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]

Let’s add one more list, for example the population of each country:

population = [500, 600, 1000, 150, 490, 740, 300, 781, 610, 504]

print(population)

[500, 600, 1000, 150, 490, 740, 300, 781, 610, 504]

From these three lists, we can create a DataFrame using pandas:

southern_africa_df = pd.DataFrame({"country_name":country, "life_expect": life_expectancy, "pop":population})

print(southern_africa_df)

   country_name  life_expect   pop
0        Angola         61.6   500
1      Botswana         61.1   600
2       Lesotho         57.1  1000
3        Malawi         53.1   150
4    Mozambique         62.9   490
5       Namibia         59.3   740
6  South Africa         59.3   300
7     Swaziland         62.3   781
8        Zambia         61.2   610
9      Zimbabwe         59.3   504

We have a DataFrame with three columns (country_name, life_expect and pop) and 10 rows (observations), where each row represents a country. You can use the .head() method to view the first five observations:

print(southern_africa_df.head())

  country_name  life_expect   pop
0       Angola         61.6   500
1     Botswana         61.1   600
2      Lesotho         57.1  1000
3       Malawi         53.1   150
4   Mozambique         62.9   490

You can check how many columns and rows are in the DataFrame by using the .info() method:

print(southern_africa_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country_name  10 non-null     object 
 1   life_expect   10 non-null     float64
 2   pop           10 non-null     int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 368.0+ bytes
None
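If you only need the number of rows and columns rather than the full report, the shape and columns attributes are a lighter alternative. The sketch below rebuilds the same DataFrame so that it runs on its own; the names n_rows and n_columns are made up for this example.

```python
import pandas as pd

country = ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique",
           "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"]
life_expectancy = [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3]
population = [500, 600, 1000, 150, 490, 740, 300, 781, 610, 504]

southern_africa_df = pd.DataFrame({"country_name": country,
                                   "life_expect": life_expectancy,
                                   "pop": population})

# shape is a tuple: (number of rows, number of columns).
n_rows, n_columns = southern_africa_df.shape
print(n_rows, n_columns)  # 10 3

# columns holds the column names.
print(list(southern_africa_df.columns))  # ['country_name', 'life_expect', 'pop']
```

Note that .info() prints its report directly and returns None, which is why None appears at the end of the output when it is wrapped in print().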

You can get more information about your DataFrame by using the .describe() method, which returns summary statistics of all numeric columns:

print(southern_africa_df.describe())

       life_expect          pop
count    10.000000    10.000000
mean     59.720000   567.500000
std       2.898582   241.687241
min      53.100000   150.000000
25%      59.300000   492.500000
50%      60.200000   552.000000
75%      61.500000   707.500000
max      62.900000  1000.000000

You can subset both rows and columns to return only those you are interested in. Let’s say you want to select only the country_name and pop columns. You can do this by passing a list of those column names inside the square brackets, which results in double square brackets ([[]]):

print(southern_africa_df[["country_name", "pop"]])

   country_name   pop
0        Angola   500
1      Botswana   600
2       Lesotho  1000
3        Malawi   150
4    Mozambique   490
5       Namibia   740
6  South Africa   300
7     Swaziland   781
8        Zambia   610
9      Zimbabwe   504

You can also select rows. For example, subset observations from Angola and save as a new DataFrame named angola:

angola = southern_africa_df[southern_africa_df["country_name"] == "Angola"]

print(angola)

  country_name  life_expect  pop
0       Angola         61.6  500

Select observations from Angola and Zimbabwe:

angola_zim = southern_africa_df[southern_africa_df["country_name"].isin(["Angola", "Zimbabwe"])]
print(angola_zim)

  country_name  life_expect  pop
0       Angola         61.6  500
9     Zimbabwe         59.3  504

Subset observations where life expectancy is below 60:

low_life_expect = southern_africa_df[southern_africa_df["life_expect"] < 60]

print(low_life_expect)

   country_name  life_expect   pop
2       Lesotho         57.1  1000
3        Malawi         53.1   150
5       Namibia         59.3   740
6  South Africa         59.3   300
9      Zimbabwe         59.3   504
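Conditions can also be combined. The sketch below is one way to select rows that meet two conditions at once and keep only some columns, using the & operator and .loc. The threshold of 500 for pop and the names mask and subset are made up for this example.

```python
import pandas as pd

southern_africa_df = pd.DataFrame({
    "country_name": ["Angola", "Botswana", "Lesotho", "Malawi", "Mozambique",
                     "Namibia", "South Africa", "Swaziland", "Zambia", "Zimbabwe"],
    "life_expect": [61.6, 61.1, 57.1, 53.1, 62.9, 59.3, 59.3, 62.3, 61.2, 59.3],
    "pop": [500, 600, 1000, 150, 490, 740, 300, 781, 610, 504],
})

# Each condition needs its own parentheses; & means "and" (| would mean "or").
mask = (southern_africa_df["life_expect"] < 60) & (southern_africa_df["pop"] > 500)

# .loc takes the row selection first, then the columns to keep.
subset = southern_africa_df.loc[mask, ["country_name", "life_expect"]]

print(subset)
```

With these values, the rows for Lesotho, Namibia and Zimbabwe meet both conditions.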

2.5. Conclusion

There are many other ways in which you can manipulate, transform and analyse a DataFrame, and pandas provides many methods to handle DataFrames. We will dive deeper into DataFrames and pandas in one of the later chapters.

In this chapter we have discussed different data structures that can hold data:

Lists
NumPy Arrays
Dictionaries
DataFrames

In the next chapter, we explore various ways in which we can visualise data.