Why data structures matter

Depending on what you need to do with your data, different structures become more or less appropriate. Some are optimised for read operations, others for writes. Some are inherently ordered, some are not. Each has its relative benefits and drawbacks. A good developer is aware of the options and can choose the most appropriate structure for their use case.

Real world scenario

For one of my blog posts, I need to scrape data from the Fantasy Premier League (FPL) API: the scores for a large number of players, for every gameweek that has occurred. There are 38 gameweeks in a season and around 400 players in the league in total, which makes roughly 15,000 player-gameweeks, i.e. rows of data that I may need to handle.

What is really important to me is being able to find the points score for a specific player in a specific gameweek. I want to calculate my total accumulated points for the players in my team. My FPL team has 16 players in it each week (not always the same players) and I need to find out how many points each has accrued in each week. In pseudo-code terms: For each gameweek, for each player in my team, I need to look up the points that player got on that gameweek and calculate a running total.

Setup

Here we will test two structures: a flat list (like a spreadsheet) of player, gameweek and score rows, and a hashmap (a Python dict) of player_id -> gameweek -> score.

team_data = (
    # gameweek, list of player_ids in my team
    (1, (123, 234, 345, 456, 567, 678, 789, ...)),
    (2, (123, 234, 345, 456, 567, 678, 789, ...)),
    (3, (321, 234, 345, 456, 567, 678, 789, ...)),  # player 123 swapped for player 321
)


player_gameweek_data_flat = (
    # player_id, gameweek, points
    (1, 1, 5),
    (1, 2, 10),
    (1, 3, 0),
    ...
    (2, 1, 8),
    (2, 2, 9),
    (2, 3, 1),
    ...
)

player_gameweek_data_nested = {
    # player_id at top level
    1: {
        # gameweek at next level
        1: 5,  # value is the points
        2: 10,
        3: 0,
        ...
    },
    2: {
        1: 8,
        2: 9,
        3: 1,
        ...
    },
    ...
}

Then, to calculate the total points accumulated, the code might look something like this:

total = 0
for gameweek, player_list in team_data:
    for player_id in player_list:
        points = get_player_points(player_id, gameweek)
        total += points

print(f"Total points: {total}")

Here's a simple script to generate some test data:

import random
random.seed(22)

team_data = []
for gameweek in range(1, 39):
    team_data.append((gameweek, [random.randint(1, 400) for _ in range(16)]))

player_gameweek_data_flat = []
player_gameweek_data_nested = {}

for player_id in range(1, 401):

    player_gameweek_data_nested[player_id] = {}

    for gameweek in range(1, 39):
        score = random.randint(0, 20)

        player_gameweek_data_flat.append((player_id, gameweek, score))
        player_gameweek_data_nested[player_id][gameweek] = score

Now, to run our calculations, let's implement a get_player_points function for each structure:

def get_player_points_flat(player_id: int, gameweek: int) -> int:

    for row in player_gameweek_data_flat:
        if row[0] == player_id and row[1] == gameweek:
            return row[2]

    raise KeyError(f'Data not found for this player and this gameweek: {player_id=}, {gameweek=}')


def get_player_points_nested(player_id: int, gameweek: int) -> int:

    score = player_gameweek_data_nested.get(player_id, {}).get(gameweek)
    if score is None:
        raise KeyError(f'Data not found for this player and this gameweek: {player_id=}, {gameweek=}')

    return score
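Worth noting as an aside (it's not one of the two structures benchmarked in this post): a flat dict keyed on the (player_id, gameweek) tuple gives the same constant-time lookup as the nested dict, with only one level of structure. A quick sketch with made-up sample values:

```python
# Sketch of a third option: a flat dict keyed on the (player_id, gameweek)
# tuple. Same O(1) lookup as the nested dict, one level of nesting fewer.
player_gameweek_data_keyed = {
    (1, 1): 5,
    (1, 2): 10,
    (2, 1): 8,
}

def get_player_points_keyed(player_id: int, gameweek: int) -> int:
    try:
        return player_gameweek_data_keyed[(player_id, gameweek)]
    except KeyError:
        raise KeyError(f'Data not found for this player and this gameweek: {player_id=}, {gameweek=}') from None

print(get_player_points_keyed(1, 2))  # 10
```

The trade-off is that you lose the easy "give me all gameweeks for player X" access the nested dict offers; which shape wins depends on the queries you run most.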

Runtime

Let's time how long each of these functions takes to run. What do you think: is either of them likely to be faster than the other, or will they be comparable in speed?
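A small side note: the timings below use datetime.now(), which is fine at this scale, but for finer-grained benchmarks time.perf_counter() is the clock designed for the job (monotonic, high resolution). A minimal sketch, with a sum over a range standing in for the loop being timed:

```python
import time

# time.perf_counter() is a monotonic, high-resolution clock intended for
# timing code, making it a safer choice than datetime.now() for benchmarks.
t0 = time.perf_counter()
total = sum(range(1_000_000))  # stand-in workload for the lookup loop
t1 = time.perf_counter()

print(f"Total: {total}, time: {t1 - t0:.6f}s")
```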

from datetime import datetime

# Run for FLAT approach
dt0 = datetime.now()
total_flat = 0
for gameweek, player_list in team_data:
    for player_id in player_list:
        points = get_player_points_flat(player_id, gameweek)
        total_flat += points

dt1 = datetime.now()
time_taken_flat = (dt1-dt0).total_seconds()

# Run for NESTED approach
dt0 = datetime.now()
total_nested = 0
for gameweek, player_list in team_data:
    for player_id in player_list:
        points = get_player_points_nested(player_id, gameweek)
        total_nested += points

dt1 = datetime.now()
time_taken_nested = (dt1-dt0).total_seconds()

# Results
print('-----     -----     -----')
print("Lookup results:")

print("FLAT")
print(f"Total points: {total_flat}")
print(f"Total time: {time_taken_flat}")

print("NESTED")
print(f"Total points: {total_nested}")
print(f"Total time: {time_taken_nested}")
print('-----     -----     -----')

Results

Output:

Lookup results:
FLAT
Total points: 6411
Total time: 0.054334
NESTED
Total points: 6411
Total time: 0.000376

They both produced the same value, so they agree, which is good to know. The value also looks about right: 16 players × 38 gameweeks × an average of 10 points per week ≈ 6,080 points, so we're in the right ballpark at least.

But look at the time taken! The nested approach was about 144 times faster than the flat approach! That is incredible for a genuine real-world scenario.

Explanation

Anybody who has done some software engineering will of course know exactly why this is: hashmap lookups are O(1), "constant time", meaning the lookup time is independent of the size of the structure, while searching the flat structure is O(N), "linear time", meaning the search time grows with the number of rows we have to loop through.

For the hashmap, the key is either present or it isn't; there is no searching. For the flat structure, we must loop through the rows until we hit our specific player_id and gameweek combination: a successful lookup scans about half the rows on average, and only once we have been through every row can we be sure a value is absent.
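To put a rough number on that, here's a small sketch using the same flat layout as above (players 1-400, gameweeks 1-38, rows ordered by player); rows_scanned is a hypothetical helper just for illustration:

```python
# Rough sketch: how many rows does a linear scan over the flat layout
# inspect before finding a hit? rows_scanned is a hypothetical helper.
rows = [(p, g) for p in range(1, 401) for g in range(1, 39)]

def rows_scanned(player_id: int, gameweek: int) -> int:
    for i, row in enumerate(rows, start=1):
        if row == (player_id, gameweek):
            return i
    return len(rows)  # a miss has to scan everything

# Sample lookups spread across the start, middle and end of the data
samples = [(p, g) for p in (1, 200, 400) for g in (1, 19, 38)]
avg = sum(rows_scanned(p, g) for p, g in samples) / len(samples)

print(f"{len(rows)} rows, average scanned: {avg:.0f}")  # roughly half the rows
```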

But hashmaps are not a silver bullet that just 'works' in every single scenario. What if we wanted to know how many times players have scored more than 15 points in a single gameweek?

def count_players_above_flat(min_points: int) -> tuple[int, float]:
    dt0 = datetime.now()

    count = 0
    for _, _, points in player_gameweek_data_flat:
        if points > min_points:
            count += 1

    dt1 = datetime.now()
    time_taken = (dt1-dt0).total_seconds()

    return count, time_taken


def count_players_above_nested(min_points: int) -> tuple[int, float]:
    dt0 = datetime.now()

    count = 0
    for player_id, player_data in player_gameweek_data_nested.items():
        for gameweek, points in player_data.items():
            if points > min_points:
                count += 1

    dt1 = datetime.now()
    time_taken = (dt1-dt0).total_seconds()

    return count, time_taken

result_flat = count_players_above_flat(15)
result_nested = count_players_above_nested(15)

print('-----     -----     -----')
print("Filtering results:")

print("FLAT")
print(f"Total count: {result_flat[0]}")
print(f"Total time: {result_flat[1]}")

print("NESTED")
print(f"Total count: {result_nested[0]}")
print(f"Total time: {result_nested[1]}")
print('-----     -----     -----')

Results

Filtering results:
FLAT
Total count: 3657
Total time: 0.000226
NESTED
Total count: 3657
Total time: 0.000322

This time, we can see that the flat structure is roughly 40% faster than the nested structure (0.000226s vs 0.000322s). Iterating a single flat list is cheap, whereas the nested structure pays the overhead of a second level of dict iteration for every player.

I may also add that the cognitive load of writing these examples was far higher for the nested structure than for the flat one. Maybe it's a skill issue, but where developer productivity is a factor, these things can matter.

Conclusion

These are just two of the more common data structures that I use daily. There are other structures optimised for other use cases, e.g. insertions and deletions. It's on the developer to choose the right structure for the task at hand.
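As a taste of that (a quick sketch, not part of the benchmarks above): collections.deque is optimised for appending and popping at either end in O(1), whereas inserting at the front of a Python list is O(N), because every existing element has to shift along.

```python
from collections import deque
import time

n = 50_000

# list.insert(0, x) shifts every existing element: O(N) per insert
t0 = time.perf_counter()
lst = []
for i in range(n):
    lst.insert(0, i)
list_time = time.perf_counter() - t0

# deque.appendleft(x) is O(1) per insert
t0 = time.perf_counter()
dq = deque()
for i in range(n):
    dq.appendleft(i)
deque_time = time.perf_counter() - t0

print(f"list.insert(0, ...): {list_time:.4f}s, deque.appendleft: {deque_time:.4f}s")
```

Both end up holding the same elements in the same order; only the cost of getting there differs, and the gap widens as n grows.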

Full script file

I'm aware that the code isn't structured well at all, but it's just a blog.

from datetime import datetime
import random


# Generate test data

random.seed(22)
team_data = []
for gameweek in range(1, 39):
    team_data.append((gameweek, [random.randint(1, 400) for _ in range(16)]))

player_gameweek_data_flat = []
player_gameweek_data_nested = {}

for player_id in range(1, 401):

    player_gameweek_data_nested[player_id] = {}

    for gameweek in range(1, 39):
        score = random.randint(0, 20)

        player_gameweek_data_flat.append((player_id, gameweek, score))
        player_gameweek_data_nested[player_id][gameweek] = score


# Define lookup funcs
def get_player_points_flat(player_id: int, gameweek: int) -> int:

    for row in player_gameweek_data_flat:
        if row[0] == player_id and row[1] == gameweek:
            return row[2]

    raise KeyError(f'Data not found for this player and this gameweek: {player_id=}, {gameweek=}')


def get_player_points_nested(player_id: int, gameweek: int) -> int:

    score = player_gameweek_data_nested.get(player_id, {}).get(gameweek)
    if score is None:
        raise KeyError(f'Data not found for this player and this gameweek: {player_id=}, {gameweek=}')

    return score


# Run for FLAT approach
dt0 = datetime.now()
total_flat = 0
for gameweek, player_list in team_data:
    for player_id in player_list:
        points = get_player_points_flat(player_id, gameweek)
        total_flat += points

dt1 = datetime.now()
time_taken_flat = (dt1-dt0).total_seconds()

# Run for NESTED approach
dt0 = datetime.now()
total_nested = 0
for gameweek, player_list in team_data:
    for player_id in player_list:
        points = get_player_points_nested(player_id, gameweek)
        total_nested += points

dt1 = datetime.now()
time_taken_nested = (dt1-dt0).total_seconds()


print('-----     -----     -----')
print("Lookup results:")

print("FLAT")
print(f"Total points: {total_flat}")
print(f"Total time: {time_taken_flat}")

print("NESTED")
print(f"Total points: {total_nested}")
print(f"Total time: {time_taken_nested}")
print('-----     -----     -----')


# Define filtering example funcs

def count_players_above_flat(min_points: int) -> tuple[int, float]:
    dt0 = datetime.now()

    count = 0
    for _, _, points in player_gameweek_data_flat:
        if points > min_points:
            count += 1

    dt1 = datetime.now()
    time_taken = (dt1-dt0).total_seconds()

    return count, time_taken


def count_players_above_nested(min_points: int) -> tuple[int, float]:
    dt0 = datetime.now()

    count = 0
    for player_id, player_data in player_gameweek_data_nested.items():
        for gameweek, points in player_data.items():
            if points > min_points:
                count += 1

    dt1 = datetime.now()
    time_taken = (dt1-dt0).total_seconds()

    return count, time_taken

result_flat = count_players_above_flat(15)
result_nested = count_players_above_nested(15)

print('-----     -----     -----')
print("Filtering results:")

print("FLAT")
print(f"Total count: {result_flat[0]}")
print(f"Total time: {result_flat[1]}")

print("NESTED")
print(f"Total count: {result_nested[0]}")
print(f"Total time: {result_nested[1]}")
print('-----     -----     -----')
