4  Reshaping and Tidy Data

There’s a whole paper by Hadley Wickham about tidy data but it’s in a PDF so you’re probably not going to read it. Here’s the paper’s definition of tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

As Dr Wickham notes, this is Codd’s 3rd Normal Form but in statspeak rather than databasespeak.

The meanings of “variable” and “observation” depend on what we’re studying, so tidy data is a concept you mostly learn through experience and ✨vibes✨.

Now we’ll explore what tidy data looks like for an NBA results dataset.

4.1 Get the data

from pathlib import Path
import polars as pl
import pandas as pd

pl.Config.set_tbl_rows(5)
pd.options.display.max_rows = 5

nba_dir = Path("../data/nba/")

column_names = {
    "Date": "date",
    "Visitor/Neutral": "away_team",
    "PTS": "away_points",
    "Home/Neutral": "home_team",
    "PTS.1": "home_points",
}

if not nba_dir.exists():
    nba_dir.mkdir()
    for month in (
        "october",
        "november",
        "december",
        "january",
        "february",
        "march",
        "april",
        "may",
        "june",
    ):
        # In practice we would do more data cleaning here, and save to parquet not CSV.
        # But we save messy data here so we can clean it later for pedagogical purposes.
        url = f"http://www.basketball-reference.com/leagues/NBA_2016_games-{month}.html"
        tables = pd.read_html(url)
        raw = (
            pl.from_pandas(tables[0].query("Date != 'Playoffs'"))
            .rename(column_names)
            .select(column_names.values())
        )
        raw.write_csv(nba_dir / f"{month}.csv")

nba_glob = nba_dir / "*.csv"
pl.scan_csv(nba_glob).head().collect()
shape: (5, 5)
date               away_team          away_points  home_team          home_points
str                str                i64          str                i64
"Fri, Apr 1, 20…   "Philadelphia 7…   91           "Charlotte Horn…   100
"Fri, Apr 1, 20…   "Dallas Maveric…   98           "Detroit Piston…   89
"Fri, Apr 1, 20…   "Brooklyn Nets"    91           "New York Knick…   105
"Fri, Apr 1, 20…   "Cleveland Cava…   110          "Atlanta Hawks"    108
"Fri, Apr 1, 20…   "Toronto Raptor…   99           "Memphis Grizzl…   95

4.2 Cleaning 🧹

Nothing super interesting here:

games_pl = (
    pl.scan_csv(nba_glob)
    .with_columns(
        pl.col("date").str.strptime(pl.Date, "%a, %b %d, %Y"),
    )
    .sort("date")
    .with_row_index("game_id")
)
games_pl.head().collect()
shape: (5, 6)
game_id  date        away_team          away_points  home_team          home_points
u32      date        str                i64          str                i64
0        2015-10-27  "Cleveland Cava…   95           "Chicago Bulls"    97
1        2015-10-27  "Detroit Piston…   106          "Atlanta Hawks"    94
2        2015-10-27  "New Orleans Pe…   95           "Golden State W…   111
3        2015-10-28  "Washington Wiz…   88           "Orlando Magic"    87
4        2015-10-28  "Philadelphia 7…   95           "Boston Celtics…   112
games_pd = (
    pl.read_csv(nba_glob)
    .to_pandas()
    .dropna(how="all")
    .assign(date=lambda x: pd.to_datetime(x["date"], format="%a, %b %d, %Y"))
    .sort_values("date")
    .reset_index(drop=True)
    .set_index("date", append=True)
    .rename_axis(["game_id", "date"])
    .sort_index()
)
games_pd.head()
away_team away_points home_team home_points
game_id date
0 2015-10-27 Cleveland Cavaliers 95 Chicago Bulls 97
1 2015-10-27 Detroit Pistons 106 Atlanta Hawks 94
2 2015-10-27 New Orleans Pelicans 95 Golden State Warriors 111
3 2015-10-28 Philadelphia 76ers 95 Boston Celtics 112
4 2015-10-28 Washington Wizards 88 Orlando Magic 87

Polars does have a drop_nulls method, but its only parameter is subset, which (as in Pandas) restricts the null check to a chosen set of columns. Pandas additionally lets you pass how="all" to drop a row only when every value is null; the Polars drop_nulls has no such parameter and drops a row if any value is null. If you only want to drop rows where all values are null, the docs recommend .filter(~pl.all_horizontal(pl.all().is_null())).
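
Here’s a minimal sketch of the difference on a throwaway frame (the data is made up for illustration):

tiny = pl.DataFrame({"a": [1, None, None], "b": [1, 2, None]})

# any null in the row drops it: only the first row survives
tiny.drop_nulls()

# drop only rows where *every* value is null: the first two rows survive
tiny.filter(~pl.all_horizontal(pl.all().is_null()))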

Note

A previous version of the Polars example used pl.fold, which is for fast horizontal operations. It doesn’t come up anywhere else in this book, so consider this your warning that it exists.
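
Since we’ve mentioned it, here’s a minimal sketch of pl.fold for a horizontal sum; the frame and column names are made up for illustration:

df = pl.DataFrame({"a": [1, 2], "b": [10, 20]})
df.with_columns(
    pl.fold(
        acc=pl.lit(0),
        function=lambda acc, col: acc + col,
        exprs=pl.all(),
    ).alias("row_sum")  # horizontal sum across all columns: [11, 22]
)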

4.3 Pivot and Melt

I recently came across someone who was doing advanced quantitative research in Python but had never heard of the Pandas .pivot method. I shudder to imagine the code he must have written in the absence of this knowledge, so here’s a simple explanation of pivoting and melting, lest anyone else suffer in ignorance. If you already know what pivot and melt are, feel free to scroll past this bit.

4.3.1 Pivot

Suppose you have a dataframe that looks like this:

from datetime import date
prices = pl.DataFrame({
    "date": [*[date(2020, 1, 1)]*4, *[date(2020, 1, 2)]*4, *[date(2020, 1, 3)]*4],
    "ticker": [*["AAPL", "TSLA", "MSFT", "NFLX"]*3],
    "price": [100, 200, 300, 400, 110, 220, 330, 420, 105, 210, 315, 440],
})
prices
shape: (12, 3)
date        ticker  price
date        str     i64
2020-01-01  "AAPL"  100
2020-01-01  "TSLA"  200
2020-01-01  "MSFT"  300
…           …       …
2020-01-03  "MSFT"  315
2020-01-03  "NFLX"  440

In both Polars and Pandas you can call df.pivot(index="date", values="price", columns="ticker") to get a dataframe that looks like this:

pivoted = prices.pivot(index="date", values="price", columns="ticker")
pivoted
shape: (3, 5)
date        AAPL  TSLA  MSFT  NFLX
date        i64   i64   i64   i64
2020-01-01  100   200   300   400
2020-01-02  110   220   330   420
2020-01-03  105   210   315   440

As you can see, .pivot creates a dataframe where the columns are the unique labels from one column (“ticker”), alongside the index column (“date”). The values for the non-index columns are taken from the corresponding rows of the values column (“price”).

If our dataframe had multiple prices for the same ticker on the same date, we would use the aggregate_function parameter of the .pivot method, e.g. prices.pivot(..., aggregate_function="mean"). Pivoting with an aggregate function gives us behaviour similar to what Excel calls “pivot tables”.
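
For instance, here’s a sketch with duplicated, made-up rows; note that the pivot parameter names have shifted across Polars versions, so check your version’s docs:

dupes = pl.DataFrame({
    "date": [date(2020, 1, 1), date(2020, 1, 1)],
    "ticker": ["AAPL", "AAPL"],
    "price": [100, 110],
})
# the two AAPL prices collapse into their mean: 105.0
dupes.pivot(index="date", values="price", columns="ticker", aggregate_function="mean")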

4.3.2 Melt

Melt is the inverse of pivot. While pivot takes us from long data to wide data, melt goes from wide to long. If we call .melt(id_vars="date", value_name="price") on our pivoted dataframe we get our original dataframe back, though with the rows ordered by ticker rather than by date:

pivoted.melt(id_vars="date", value_name="price")
shape: (12, 3)
date        variable  price
date        str       i64
2020-01-01  "AAPL"    100
2020-01-02  "AAPL"    110
2020-01-03  "AAPL"    105
…           …         …
2020-01-02  "NFLX"    420
2020-01-03  "NFLX"    440

4.4 Tidy NBA data

Suppose we want to calculate the days of rest each team had before each game. In the current structure this is difficult because we need to track both the home_team and away_team columns. We’ll use .melt so that there’s a single team column. This makes it easier to add a rest column with the per-team rest days between games.
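
The heavy lifting below is done by .diff().over("team"), a window function that computes differences within each team. Here’s roughly what it does on a toy frame of our own:

pl.DataFrame({
    "team": ["A", "A", "B", "B"],
    "date": [date(2020, 1, 1), date(2020, 1, 5), date(2020, 1, 1), date(2020, 1, 2)],
}).with_columns(pl.col("date").diff().over("team").alias("gap"))
# gap: null, 4d, null, 1d (the first game of each team has no previous game)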

tidy_pl = (
    games_pl
    .melt(
        id_vars=["game_id", "date"],
        value_vars=["away_team", "home_team"],
        value_name="team",
    )
    .sort("game_id")
    .with_columns(
        # days between consecutive games for each team, minus 1 = full rest days
        (pl.col("date").diff().over("team").dt.total_days() - 1)
        .cast(pl.Int8)
        .alias("rest")
    )
    .drop_nulls("rest")
    .collect()
)
tidy_pl
shape: (2_602, 5)
game_id  date        variable     team               rest
u32      date        str          str                i8
5        2015-10-28  "away_team"  "Chicago Bulls"    0
6        2015-10-28  "home_team"  "Detroit Piston…   0
11       2015-10-28  "away_team"  "Cleveland Cava…   0
…        …           …            …                  …
1315     2016-06-19  "away_team"  "Cleveland Cava…   2
1315     2016-06-19  "home_team"  "Golden State W…   2
tidy_pd = (
    games_pd.reset_index()
    .melt(
        id_vars=["game_id", "date"],
        value_vars=["away_team", "home_team"],
        value_name="team",
    )
    .sort_values("game_id")
    .assign(
        rest=lambda df: (
            df
            .sort_values("date")
            .groupby("team")
            ["date"]
            .diff()
            .dt.days
            .sub(1)
        )
    )
    .dropna(subset=["rest"])
    .astype({"rest": pd.Int8Dtype()})
)
tidy_pd
game_id date variable team rest
7 7 2015-10-28 away_team New Orleans Pelicans 0
11 11 2015-10-28 away_team Chicago Bulls 0
... ... ... ... ... ...
1315 1315 2016-06-19 away_team Cleveland Cavaliers 2
2631 1315 2016-06-19 home_team Golden State Warriors 2

2602 rows × 5 columns

Now we use .pivot so that this days-of-rest data can be added back to the original dataframe. We’ll also add columns for the spread between the home team’s rest and away team’s rest, and a flag for whether the home team won.

by_game_pl = (
    tidy_pl
    .pivot(
        values="rest",
        index=["game_id", "date"],
        columns="variable"
    )
    .rename({"away_team": "away_rest", "home_team": "home_rest"})
)
joined_pl = (
    by_game_pl
    .join(games_pl.collect(), on=["game_id", "date"])
    .with_columns([
        (pl.col("home_points") > pl.col("away_points")).alias("home_win"),
        (pl.col("home_rest") - pl.col("away_rest")).alias("rest_spread"),
    ])
)
joined_pl
shape: (1_303, 10)
game_id  date        away_rest  home_rest  away_team          away_points  home_team          home_points  home_win  rest_spread
u32      date        i8         i8         str                i64          str                i64          bool      i8
5        2015-10-28  0          null       "Chicago Bulls"    115          "Brooklyn Nets"    100          false     null
6        2015-10-28  null       0          "Utah Jazz"        87           "Detroit Piston…   92           true      null
11       2015-10-28  0          null       "Cleveland Cava…   106          "Memphis Grizzl…   76           false     null
…        …           …          …          …                  …            …                  …            …         …
1314     2016-06-16  2          2          "Golden State W…   101          "Cleveland Cava…   115          true      0
1315     2016-06-19  2          2          "Cleveland Cava…   93           "Golden State W…   89           false     0
by_game_pd = (
    tidy_pd
    .pivot(
        values="rest",
        index=["game_id", "date"],
        columns="variable"
    )
    .rename(
        columns={"away_team": "away_rest", "home_team": "home_rest"}
    )
)
joined_pd = by_game_pd.join(games_pd).assign(
    home_win=lambda df: df["home_points"] > df["away_points"],
    rest_spread=lambda df: df["home_rest"] - df["away_rest"],
)
joined_pd
away_rest home_rest away_team away_points home_team home_points home_win rest_spread
game_id date
7 2015-10-28 0 <NA> New Orleans Pelicans 94 Portland Trail Blazers 112 True <NA>
11 2015-10-28 0 <NA> Chicago Bulls 115 Brooklyn Nets 100 False <NA>
... ... ... ... ... ... ... ... ... ...
1314 2016-06-16 2 2 Golden State Warriors 101 Cleveland Cavaliers 115 True 0
1315 2016-06-19 2 2 Cleveland Cavaliers 93 Golden State Warriors 89 False 0

1303 rows × 8 columns

Here’s a lightly edited quote from Modern Pandas:

  One somewhat subtle point: an “observation” depends on the question being asked. So really, we have two tidy datasets: tidy for answering team-level questions, and joined for answering game-level questions.

Let’s use the team-level dataframe to see each team’s average days of rest, both at home and away:

import seaborn as sns
sns.set_theme(font_scale=0.6)
sns.catplot(
    tidy_pl,
    x="variable",
    y="rest",
    col="team",
    col_wrap=5,
    kind="bar",
    height=1.5,
)
[Figure: bar plots of average rest, away vs home, one panel per team]

Plotting the distribution of rest_spread:

import numpy as np
delta_pl = joined_pl["rest_spread"]
ax = (
    delta_pl
    .value_counts()
    .drop_nulls()
    .to_pandas()
    .set_index("rest_spread")
    ["count"]
    # reindex over the full range so spreads that never occurred show as 0
    .reindex(np.arange(delta_pl.min(), delta_pl.max() + 1), fill_value=0)
    .sort_index()
    .plot(kind="bar", color="k", width=0.9, rot=0, figsize=(9, 6))
)
ax.set(xlabel="Difference in Rest (Home - Away)", ylabel="Games")
[Figure: bar chart of game counts; x: Difference in Rest (Home - Away), y: Games]

delta_pd = joined_pd["rest_spread"]
ax = (
    delta_pd
    .value_counts()
    .reindex(np.arange(delta_pd.min(), delta_pd.max() + 1), fill_value=0)
    .sort_index()
    .plot(kind="bar", color="k", width=0.9, rot=0, figsize=(9, 6))
)
ax.set(xlabel="Difference in Rest (Home - Away)", ylabel="Games")
[Figure: same bar chart, produced from the Pandas data]

Plotting the win percent by rest_spread:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(
    x="rest_spread",
    y="home_win",
    data=joined_pl.filter(pl.col("rest_spread").is_between(-3, 3, closed="both")),
    color="#4c72b0",
    ax=ax,
)
[Figure: bar plot of home win rate; x: rest_spread, y: home_win]

fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(
    x="rest_spread",
    y="home_win",
    data=joined_pd.query('-3 <= rest_spread <= 3'),
    color="#4c72b0",
    ax=ax,
)
[Figure: same bar plot, produced from the Pandas data]

4.5 Stack / Unstack vs Melt / Pivot

Pandas has special methods stack and unstack for reshaping data with a MultiIndex. Polars doesn’t have an index, so anywhere you see stack / unstack in Pandas, the equivalent Polars code will use melt / pivot.
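
To make the correspondence concrete, here’s a small sketch with made-up values shaped like the data below:

wide = pd.DataFrame(
    {"away_team": [0.0, 0.333333], "home_team": [0.0, 0.0]},
    index=pd.Index(["2015-10-28", "2015-10-29"], name="date"),
)
long = wide.stack()  # columns become an inner index level, like melt
long.unstack()       # the inner index level becomes columns again, like pivot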

rest_pl = (
    tidy_pl
    .group_by(["date", "variable"], maintain_order=True)
    .agg(pl.col("rest").mean())
)
rest_pl
shape: (418, 3)
date        variable     rest
date        str          f64
2015-10-28  "away_team"  0.0
2015-10-28  "home_team"  0.0
2015-10-29  "away_team"  0.333333
…           …            …
2016-06-19  "away_team"  2.0
2016-06-19  "home_team"  2.0
rest_pd = (
    tidy_pd
    .groupby(["date", "variable"])
    ["rest"]
    .mean()
)
rest_pd
date        variable 
2015-10-28  away_team    0.0
            home_team    0.0
                        ... 
2016-06-19  away_team    2.0
            home_team    2.0
Name: rest, Length: 418, dtype: Float64

In Polars we use .pivot to do what in Pandas would require .unstack:

rest_pl.pivot(index="date", columns="variable", values="rest")
shape: (209, 3)
date        away_team  home_team
date        f64        f64
2015-10-28  0.0        0.0
2015-10-29  0.333333   0.0
2015-10-30  1.083333   0.916667
…           …          …
2016-06-16  2.0        2.0
2016-06-19  2.0        2.0
rest_pd.unstack()
variable away_team home_team
date
2015-10-28 0.0 0.0
2015-10-29 0.333333 0.0
... ... ...
2016-06-16 2.0 2.0
2016-06-19 2.0 2.0

209 rows × 2 columns

Plotting the moving average of rest days:

ax = (
    rest_pl.pivot(index="date", values="rest", columns="variable")
    .filter(pl.col("away_team") < 7)
    .sort("date")
    # pl.col(pl.Float64) selects every float column, i.e. both rest columns
    .select([pl.col("date"), pl.col(pl.Float64).rolling_mean(7)])
    .to_pandas()
    .set_index("date")
    .plot(figsize=(9, 6), linewidth=3)
)
ax.set(ylabel="Rest (7 day MA)")
[Figure: line plot of rest by date; y: Rest (7 day MA)]

ax = (
    rest_pd.unstack()
    .query('away_team < 7')
    .sort_index()
    .rolling(7)
    .mean()
    .plot(figsize=(9, 6), linewidth=3)
)
ax.set(ylabel="Rest (7 day MA)")
[Figure: same line plot, produced from the Pandas data]

4.6 Mini Project: Home Court Advantage?

We may as well do some (not very rigorous) analysis: let’s see if home advantage is a real thing.
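
As a baseline before controlling for anything, the raw home win rate is one line (a quick sketch; we don’t show its output here):

joined_pl["home_win"].mean()  # share of games won by the home team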

4.6.1 Find the win percent for each team

We want to control for the strength of the teams playing. The team’s victory percentage is probably not a good control but it’s what we’ll use:

win_col = pl.col("win")
wins_pl = (
    joined_pl.melt(
        id_vars=["game_id", "date", "home_win"],
        value_name="team",
        variable_name="is_home",
        value_vars=["home_team", "away_team"],
    )
    # a team won when home_win matches whether that team was the home team
    .with_columns(
        (pl.col("home_win") == (pl.col("is_home") == "home_team")).alias("win")
    )
    .group_by(["team", "is_home"])
    .agg(
        [
            win_col.sum().alias("n_wins"),
            win_col.count().alias("n_games"),
            win_col.mean().alias("win_pct"),
        ]
    )
    .sort(["team", "is_home"])
)
wins_pl
shape: (60, 5)
team               is_home      n_wins  n_games  win_pct
str                str          u32     u32      f64
"Atlanta Hawks"    "away_team"  22      46       0.478261
"Atlanta Hawks"    "home_team"  30      45       0.666667
"Boston Celtics…   "away_team"  20      44       0.454545
…                  …            …       …        …
"Washington Wiz…   "away_team"  18      40       0.45
"Washington Wiz…   "home_team"  22      41       0.536585
wins_pd = (
    joined_pd
    .reset_index()
    .melt(
        id_vars=["game_id", "date", "home_win"],
        value_name="team",
        var_name="is_home",
        value_vars=["home_team", "away_team"],
    )
    .assign(win=lambda df: df["home_win"] == (df["is_home"] == "home_team"))
    .groupby(["team", "is_home"])["win"]
    .agg(['sum', 'count', 'mean'])
    .rename(columns={
        "sum": 'n_wins',
        "count": 'n_games',
        "mean": 'win_pct'
    })
)
wins_pd
n_wins n_games win_pct
team is_home
Atlanta Hawks away_team 22 46 0.478261
home_team 30 45 0.666667
... ... ... ... ...
Washington Wizards away_team 18 40 0.450000
home_team 22 41 0.536585

60 rows × 3 columns

4.7 Some visualisations

g = sns.FacetGrid(wins_pl, hue="team", aspect=0.8, palette=["k"], height=5)
g.map(
    sns.pointplot,
    "is_home",
    "win_pct",
    order=["away_team", "home_team"],
).set(ylim=(0, 1))
[Figure: overlaid point plots of win_pct, away vs home, one line per team]

sns.catplot(
    wins_pl,
    x="is_home",
    y="win_pct",
    col="team",
    col_wrap=5,
    hue="team",
    kind="point",
    height=1.5,
)
[Figure: point plots of win_pct, away vs home, one panel per team]

Now we calculate the win percent by team, regardless of whether they’re home or away:

win_percent_pl = (
    wins_pl.group_by("team", maintain_order=True).agg(
        (pl.col("n_wins").sum() / pl.col("n_games").sum()).alias("win_pct")
    )
)
win_percent_pl
shape: (30, 2)
team               win_pct
str                f64
"Atlanta Hawks"    0.571429
"Boston Celtics…   0.563218
"Brooklyn Nets"    0.256098
…                  …
"Utah Jazz"        0.487805
"Washington Wiz…   0.493827
win_percent_pd = (
    wins_pd
    .groupby(level="team", as_index=True)
    .apply(lambda x: x["n_wins"].sum() / x["n_games"].sum())
)
win_percent_pd
team
Atlanta Hawks         0.571429
Boston Celtics        0.563218
                        ...   
Utah Jazz             0.487805
Washington Wizards    0.493827
Length: 30, dtype: float64
(
    win_percent_pl
    .sort("win_pct")
    .to_pandas()
    .set_index("team")
    .plot.barh(figsize=(6, 12), width=0.85, color="k")
)
plt.xlabel("Win Percent")
[Figure: horizontal bar chart of team win percent; x: Win Percent]

Here’s a plot of team home court advantage against team overall win percentage:

wins_to_plot_pl = (
    wins_pl.pivot(index="team", columns="is_home", values="win_pct")
    .with_columns(
        [
            (pl.col("home_team") - pl.col("away_team")).alias("Home Win % - Away %"),
            ((pl.col("home_team") + pl.col("away_team")) / 2).alias("Overall %"),
        ]
    )
)
sns.regplot(data=wins_to_plot_pl, x='Overall %', y='Home Win % - Away %')
[Figure: regression plot of Home Win % - Away % against Overall %]

wins_to_plot_pd = (
    wins_pd
    ["win_pct"]
    .unstack()
    .assign(**{'Home Win % - Away %': lambda x: x["home_team"] - x["away_team"],
               'Overall %': lambda x: (x["home_team"] + x["away_team"]) / 2})
)
sns.regplot(data=wins_to_plot_pd, x='Overall %', y='Home Win % - Away %')
[Figure: same regression plot, produced from the Pandas data]

Let’s add the win percent back to the dataframe and run a regression:

reg_df_pl = (
    joined_pl.join(win_percent_pl, left_on="home_team", right_on="team")
    .rename({"win_pct": "home_strength"})
    .join(win_percent_pl, left_on="away_team", right_on="team")
    .rename({"win_pct": "away_strength"})
    .with_columns(
        [
            (pl.col("home_points") - pl.col("away_points")).alias("point_diff"),
            (pl.col("home_rest") - pl.col("away_rest")).alias("rest_diff"),
            pl.col("home_win").cast(pl.UInt8),  # statsmodels wants numeric, not boolean
        ]
    )
)
reg_df_pl.head()
shape: (5, 14)
game_id  date        away_rest  home_rest  away_team          away_points  home_team          home_points  home_win  rest_spread  home_strength  away_strength  point_diff  rest_diff
u32      date        i8         i8         str                i64          str                i64          u8        i8           f64            f64            i64         i8
5        2015-10-28  0          null       "Chicago Bulls"    115          "Brooklyn Nets"    100          0         null         0.256098       0.506173       -15         null
6        2015-10-28  null       0          "Utah Jazz"        87           "Detroit Piston…   92           1         null         0.505882       0.487805       5           null
11       2015-10-28  0          null       "Cleveland Cava…   106          "Memphis Grizzl…   76           0         null         0.488372       0.715686       -30         null
14       2015-10-28  0          null       "New Orleans Pe…   94           "Portland Trail…   112          1         null         0.526882       0.37037        18          null
17       2015-10-29  0          0          "Memphis Grizzl…   112          "Indiana Pacers…   103          0         0            0.545455       0.488372       -9          0
reg_df_pd = (
    joined_pd.assign(
        away_strength=joined_pd['away_team'].map(win_percent_pd),
        home_strength=joined_pd['home_team'].map(win_percent_pd),
        point_diff=joined_pd['home_points'] - joined_pd['away_points'],
        rest_diff=joined_pd['home_rest'] - joined_pd['away_rest'])
)
reg_df_pd.head()
away_rest home_rest away_team away_points home_team home_points home_win rest_spread away_strength home_strength point_diff rest_diff
game_id date
7 2015-10-28 0 <NA> New Orleans Pelicans 94 Portland Trail Blazers 112 True <NA> 0.370370 0.526882 18 <NA>
11 2015-10-28 0 <NA> Chicago Bulls 115 Brooklyn Nets 100 False <NA> 0.506173 0.256098 -15 <NA>
15 2015-10-28 <NA> 0 Utah Jazz 87 Detroit Pistons 92 True <NA> 0.487805 0.505882 5 <NA>
16 2015-10-28 0 <NA> Cleveland Cavaliers 106 Memphis Grizzlies 76 False <NA> 0.715686 0.488372 -30 <NA>
17 2015-10-29 1 0 Atlanta Hawks 112 New York Knicks 101 False -1 0.571429 0.382716 -11 -1
import statsmodels.formula.api as sm

mod = sm.logit(
    "home_win ~ home_strength + away_strength + home_rest + away_rest",
    reg_df_pl.to_pandas(),
)
res = mod.fit()
res.summary()
Optimization terminated successfully.
         Current function value: 0.554797
         Iterations 6
Logit Regression Results
Dep. Variable:    home_win          No. Observations:  1299
Model:            Logit             Df Residuals:      1294
Method:           MLE               Df Model:          4
Date:             Sun, 24 Mar 2024  Pseudo R-squ.:     0.1777
Time:             21:59:11          Log-Likelihood:    -720.68
converged:        True              LL-Null:           -876.38
Covariance Type:  nonrobust         LLR p-value:       3.748e-66

                  coef    std err        z      P>|z|    [0.025    0.975]
Intercept       -0.0019     0.304    -0.006    0.995    -0.597     0.593
home_strength    5.7161     0.466    12.272    0.000     4.803     6.629
away_strength   -4.9133     0.456   -10.786    0.000    -5.806    -4.020
home_rest        0.1045     0.076     1.381    0.167    -0.044     0.253
away_rest       -0.0347     0.066    -0.526    0.599    -0.164     0.095
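
One way to read these coefficients: exponentiating a logit coefficient gives an odds ratio, so exp(0.1045) ≈ 1.11 means each extra day of home rest is associated with roughly 11% higher odds of a home win, holding the other terms fixed. A quick sketch:

np.exp(res.params)  # odds ratios for each term in the model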

You can play around with the regressions yourself but we’ll end them here.

4.8 Summary

This was mostly a demonstration of .pivot and .melt, with several different examples of reshaping data in Polars and Pandas.