4  Reshaping and Tidy Data

There’s a whole paper by Hadley Wickham about tidy data but it’s in a PDF so you’re probably not going to read it. Here’s the paper’s definition of tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

As Dr Wickham notes, this is Codd’s 3rd Normal Form but in statspeak rather than databasespeak.

The meaning of “variable” and “observation” depend on what we’re studying, so tidy data is a concept that you mostly learn through experience and ✨vibes✨.

Now we’ll explore what tidy data looks like for an NBA results dataset.

4.1 Get the data

from pathlib import Path
import polars as pl
import pandas as pd

pl.Config.set_tbl_rows(5)
pd.options.display.max_rows = 5

nba_dir = Path("../data/nba/")

column_names = {
    "Date": "date",
    "Visitor/Neutral": "away_team",
    "PTS": "away_points",
    "Home/Neutral": "home_team",
    "PTS.1": "home_points",
}

if not nba_dir.exists():
    nba_dir.mkdir()
    for month in (
        "october",
        "november",
        "december",
        "january",
        "february",
        "march",
        "april",
        "may",
        "june",
    ):
        # In practice we would do more data cleaning here, and save to parquet not CSV.
        # But we save messy data here so we can clean it later for pedagogical purposes.
        url = f"http://www.basketball-reference.com/leagues/NBA_2016_games-{month}.html"
        tables = pd.read_html(url)
        raw = (
            pl.from_pandas(tables[0].query("Date != 'Playoffs'"))
            .rename(column_names)
            .select(column_names.values())
        )
        raw.write_csv(nba_dir / f"{month}.csv")

nba_glob = nba_dir / "*.csv"
pl.scan_csv(nba_glob).head().collect()
shape: (5, 5)
date away_team away_points home_team home_points
str str i64 str i64
"Fri, Apr 1, 20... "Philadelphia 7... 91 "Charlotte Horn... 100
"Fri, Apr 1, 20... "Dallas Maveric... 98 "Detroit Piston... 89
"Fri, Apr 1, 20... "Brooklyn Nets" 91 "New York Knick... 105
"Fri, Apr 1, 20... "Cleveland Cava... 110 "Atlanta Hawks" 108
"Fri, Apr 1, 20... "Toronto Raptor... 99 "Memphis Grizzl... 95

4.2 Cleaning 🧹

Nothing super interesting here:

games_pl = (
    pl.scan_csv(nba_glob)
    .filter(~pl.all(pl.all().is_null()))
    .with_columns(
        pl.col("date").str.strptime(pl.Date, "%a, %b %d, %Y"),
    )
    .sort("date")
    .with_row_count("game_id")
)
games_pl.head().collect()
shape: (5, 6)
game_id date away_team away_points home_team home_points
u32 date str i64 str i64
0 2015-10-27 "Cleveland Cava... 95 "Chicago Bulls" 97
1 2015-10-27 "Detroit Piston... 106 "Atlanta Hawks" 94
2 2015-10-27 "New Orleans Pe... 95 "Golden State W... 111
3 2015-10-28 "Washington Wiz... 88 "Orlando Magic" 87
4 2015-10-28 "Philadelphia 7... 95 "Boston Celtics... 112
games_pd = (
    pl.read_csv(nba_glob)
    .to_pandas()
    .dropna(how="all")
    .assign(date=lambda x: pd.to_datetime(x["date"], format="%a, %b %d, %Y"))
    .sort_values("date")
    .reset_index(drop=True)
    .set_index("date", append=True)
    .rename_axis(["game_id", "date"])
    .sort_index()
)
games_pd.head()
away_team away_points home_team home_points
game_id date
0 2015-10-27 Cleveland Cavaliers 95 Chicago Bulls 97
1 2015-10-27 Detroit Pistons 106 Atlanta Hawks 94
2 2015-10-27 New Orleans Pelicans 95 Golden State Warriors 111
3 2015-10-28 Philadelphia 76ers 95 Boston Celtics 112
4 2015-10-28 Washington Wizards 88 Orlando Magic 87

Polars does have a drop_nulls method but the only parameter it takes is subset, which — like in Pandas — lets you consider null values just for a subset of the columns. Pandas additionally lets you specify how="all" to drop a row only if every value is null, but Polars drop_nulls has no such parameter and will drop the row if any values are null. If you only want to drop when all values are null, the docs recommend .filter(~pl.all(pl.all().is_null())).

Note

A previous version of the Polars example used pl.fold, which is for fast horizontal operations. It doesn’t come up anywhere else in this book, so consider this your warning that it exists.

4.3 Pivot and Melt

I recently came across someone who was doing advanced quantitative research in Python but had never heard of the Pandas .pivot method. I shudder to imagine the code he must have written in the absence of this knowledge, so here’s a simple explanation of pivoting and melting, lest anyone else suffer in ignorance. If you already know what pivot and melt are, feel free to scroll past this bit.

4.3.1 Pivot

Suppose you have a dataframe that looks like this:

Code
from datetime import date
prices = pl.DataFrame({
    "date": [*[date(2020, 1, 1)]*4, *[date(2020, 1, 2)]*4, *[date(2020, 1, 3)]*4],
    "ticker": [*["AAPL", "TSLA", "MSFT", "NFLX"]*3],
    "price": [100, 200, 300, 400, 110, 220, 330, 420, 105, 210, 315, 440],
})
prices
shape: (12, 3)
date ticker price
date str i64
2020-01-01 "AAPL" 100
2020-01-01 "TSLA" 200
... ... ...
2020-01-03 "MSFT" 315
2020-01-03 "NFLX" 440

In both Polars and Pandas you can call df.pivot(index="date", values="price", columns="ticker") to get a dataframe that looks like this:

Code
pivoted = prices.pivot(index="date", values="price", columns="ticker")
pivoted
shape: (3, 5)
date AAPL TSLA MSFT NFLX
date i64 i64 i64 i64
2020-01-01 100 200 300 400
2020-01-02 110 220 330 420
2020-01-03 105 210 315 440

As you can see, .pivot creates a dataframe where the columns are the unique labels from one column (“ticker”), alongside the index column (“date”). The values for the non-index columns are taken from the corresponding rows of the values column (“price”).

If our dataframe had multiple prices for the same ticker on the same date, we would use the aggregate_fn parameter of the .pivot method, e.g.: prices.pivot(..., aggregate_fn="mean"). Pivoting with an aggregate function gives us similar behaviour to what Excel calls “pivot tables”.

4.3.2 Melt

Melt is the inverse of pivot. While pivot takes us from long data to wide data, melt goes from wide to long. If we call .melt(id_vars="date", value_name="price") on our pivoted dataframe we get our original dataframe back:

Code
pivoted.melt(id_vars="date", value_name="price")
shape: (12, 3)
date variable price
date str i64
2020-01-01 "AAPL" 100
2020-01-02 "AAPL" 110
... ... ...
2020-01-02 "NFLX" 420
2020-01-03 "NFLX" 440

4.4 Tidy NBA data

Suppose we want to calculate the days of rest each team had before each game. In the current structure this is difficult because we need to track both the home_team and away_team columns. We’ll use .melt so that there’s a single team column. This makes it easier to add a rest column with the per-team rest days between games.

tidy_pl = (
    games_pl
    .melt(
        id_vars=["game_id", "date"],
        value_vars=["away_team", "home_team"],
        value_name="team",
    )
    .sort("game_id")
    .with_columns((
        pl.col("date")
        .alias("rest")
        .diff().over("team")
        .dt.days() - 1).cast(pl.Int8))
    .drop_nulls("rest")
    .collect()
)
tidy_pl
shape: (2602, 5)
game_id date variable team rest
u32 date str str i8
5 2015-10-28 "away_team" "Chicago Bulls" 0
6 2015-10-28 "home_team" "Detroit Piston... 0
... ... ... ... ...
1315 2016-06-19 "away_team" "Cleveland Cava... 2
1315 2016-06-19 "home_team" "Golden State W... 2
tidy_pd = (
    games_pd.reset_index()
    .melt(
        id_vars=["game_id", "date"],
        value_vars=["away_team", "home_team"],
        value_name="team",
    )
    .sort_values("game_id")
    .assign(
        rest=lambda df: (
            df
            .sort_values("date")
            .groupby("team")
            ["date"]
            .diff()
            .dt.days
            .sub(1)
        )
    )
    .dropna(subset=["rest"])
    .astype({"rest": pd.Int8Dtype()})
)
tidy_pd
game_id date variable team rest
7 7 2015-10-28 away_team New Orleans Pelicans 0
11 11 2015-10-28 away_team Chicago Bulls 0
... ... ... ... ... ...
1315 1315 2016-06-19 away_team Cleveland Cavaliers 2
2631 1315 2016-06-19 home_team Golden State Warriors 2

2602 rows × 5 columns

Now we use .pivot so that this days-of-rest data can be added back to the original dataframe. We’ll also add columns for the spread between the home team’s rest and away team’s rest, and a flag for whether the home team won.

by_game_pl = (
    tidy_pl
    .pivot(
        values="rest",
        index=["game_id", "date"],
        columns="variable"
    )
    .rename({"away_team": "away_rest", "home_team": "home_rest"})
)
joined_pl = (
    by_game_pl
    .join(games_pl.collect(), on=["game_id", "date"])
    .with_columns([
        pl.col("home_points").alias("home_win") > pl.col("away_points"),
        pl.col("home_rest").alias("rest_spread") - pl.col("away_rest"),
    ])
)
joined_pl
shape: (1303, 10)
game_id date away_rest home_rest away_team away_points home_team home_points home_win rest_spread
u32 date i8 i8 str i64 str i64 bool i8
5 2015-10-28 0 null "Chicago Bulls" 115 "Brooklyn Nets" 100 false null
6 2015-10-28 null 0 "Utah Jazz" 87 "Detroit Piston... 92 true null
... ... ... ... ... ... ... ... ... ...
1314 2016-06-16 2 2 "Golden State W... 101 "Cleveland Cava... 115 true 0
1315 2016-06-19 2 2 "Cleveland Cava... 93 "Golden State W... 89 false 0
by_game_pd = (
    tidy_pd
    .pivot(
        values="rest",
        index=["game_id", "date"],
        columns="variable"
    )
    .rename(
        columns={"away_team": "away_rest", "home_team": "home_rest"}
    )
)
joined_pd = by_game_pd.join(games_pd).assign(
    home_win=lambda df: df["home_points"] > df["away_points"],
    rest_spread=lambda df: df["home_rest"] - df["away_rest"],
)
joined_pd
away_rest home_rest away_team away_points home_team home_points home_win rest_spread
game_id date
7 2015-10-28 0 <NA> New Orleans Pelicans 94 Portland Trail Blazers 112 True <NA>
11 2015-10-28 0 <NA> Chicago Bulls 115 Brooklyn Nets 100 False <NA>
... ... ... ... ... ... ... ... ... ...
1314 2016-06-16 2 2 Golden State Warriors 101 Cleveland Cavaliers 115 True 0
1315 2016-06-19 2 2 Cleveland Cavaliers 93 Golden State Warriors 89 False 0

1303 rows × 8 columns

Here’s a lightly edited quote from Modern Pandas:

One somewhat subtle point: an “observation” depends on the question being asked. So really, we have two tidy datasets, tidy_pl for answering team-level questions, and joined for answering game-level questions.

Let’s use the team-level dataframe to see each team’s average days of rest, both at home and away:

import seaborn as sns
sns.set_theme(font_scale=0.6)
sns.catplot(
    tidy_pl.to_pandas(),
    x="variable",
    y="rest",
    col="team",
    col_wrap=5,
    kind="bar",
    height=1.5,
)
<seaborn.axisgrid.FacetGrid at 0x7f56f40ad310>

Plotting the distribution of rest_spread:

import numpy as np
delta_pl = joined_pl["rest_spread"]
ax = (
    delta_pl
    .value_counts()
    .drop_nulls()
    .to_pandas()
    .set_index("rest_spread")
    ["counts"]
    .reindex(np.arange(delta_pl.min(), delta_pl.max() + 1), fill_value=0)
    .sort_index()
    .plot(kind="bar", color="k", width=0.9, rot=0, figsize=(9, 6))
)
ax.set(xlabel="Difference in Rest (Home - Away)", ylabel="Games")
[Text(0.5, 0, 'Difference in Rest (Home - Away)'), Text(0, 0.5, 'Games')]

delta_pd = joined_pd["rest_spread"]
ax = (
    delta_pd
    .value_counts()
    .reindex(np.arange(delta_pd.min(), delta_pd.max() + 1), fill_value=0)
    .sort_index()
    .plot(kind="bar", color="k", width=0.9, rot=0, figsize=(9, 6))
)
ax.set(xlabel="Difference in Rest (Home - Away)", ylabel="Games")
[Text(0.5, 0, 'Difference in Rest (Home - Away)'), Text(0, 0.5, 'Games')]

Plotting the win percent by rest_spread:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(
    x="rest_spread",
    y="home_win",
    data=joined_pl.filter(pl.col("rest_spread").is_between(-3, 3, closed="both")).to_pandas(),
    color="#4c72b0",
    ax=ax,
)
<AxesSubplot: xlabel='rest_spread', ylabel='home_win'>

fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(
    x="rest_spread",
    y="home_win",
    data=joined_pd.query('-3 <= rest_spread <= 3'),
    color="#4c72b0",
    ax=ax,
)
<AxesSubplot: xlabel='rest_spread', ylabel='home_win'>

4.5 Stack / Unstack vs Melt / Pivot

Pandas has special methods stack and unstack for reshaping data with a MultiIndex. Polars doesn’t have an index, so anywhere you see stack / unstack in Pandas, the equivalent Polars code will use melt / pivot.

rest_pl = (
    tidy_pl
    .groupby(["date", "variable"], maintain_order=True)
    .agg(pl.col("rest").mean())
)
rest_pl
shape: (418, 3)
date variable rest
date str f64
2015-10-28 "away_team" 0.0
2015-10-28 "home_team" 0.0
... ... ...
2016-06-19 "away_team" 2.0
2016-06-19 "home_team" 2.0
rest_pd = (
    tidy_pd
    .groupby(["date", "variable"])
    ["rest"]
    .mean()
)
rest_pd
date        variable 
2015-10-28  away_team    0.0
            home_team    0.0
                        ... 
2016-06-19  away_team    2.0
            home_team    2.0
Name: rest, Length: 418, dtype: Float64

In Polars we use .pivot to do what in Pandas would require .unstack:

rest_pl.pivot(index="date", columns="variable", values="rest")
shape: (209, 3)
date away_team home_team
date f64 f64
2015-10-28 0.0 0.0
2015-10-29 0.333333 0.0
... ... ...
2016-06-16 2.0 2.0
2016-06-19 2.0 2.0
rest_pd.unstack()
variable away_team home_team
date
2015-10-28 0.0 0.0
2015-10-29 0.333333 0.0
... ... ...
2016-06-16 2.0 2.0
2016-06-19 2.0 2.0

209 rows × 2 columns

Plotting the moving average of rest days:

ax = (
    rest_pl.pivot(index="date", values="rest", columns="variable")
    .filter(pl.col("away_team") < 7)
    .sort("date")
    .select([pl.col("date"), pl.col(pl.Float64).rolling_mean(7)])
    .to_pandas()
    .set_index("date")
    .plot(figsize=(9, 6), linewidth=3)
)
ax.set(ylabel="Rest (7 day MA)")
[Text(0, 0.5, 'Rest (7 day MA)')]

ax = (
    rest_pd.unstack()
    .query('away_team < 7')
    .sort_index()
    .rolling(7)
    .mean()
    .plot(figsize=(9, 6), linewidth=3)
)
ax.set(ylabel="Rest (7 day MA)")
[Text(0, 0.5, 'Rest (7 day MA)')]

4.6 Mini Project: Home Court Advantage?

We may as well do some (not very rigorous) analysis: let’s see if home advantage is a real thing.

4.6.1 Find the win percent for each team

We want to control for the strength of the teams playing. The team’s victory percentage is probably not a good control but it’s what we’ll use:

win_col = pl.col("win")
wins_pl = (
    joined_pl.melt(
        id_vars=["game_id", "date", "home_win"],
        value_name="team",
        variable_name="is_home",
        value_vars=["home_team", "away_team"],
    )
    .with_columns(pl.col("home_win").alias("win") == (pl.col("is_home") == "home_team"))
    .groupby(["team", "is_home"])
    .agg(
        [
            win_col.sum().alias("n_wins"),
            win_col.count().alias("n_games"),
            win_col.mean().alias("win_pct"),
        ]
    )
    .sort(["team", "is_home"])
)
wins_pl
shape: (60, 5)
team is_home n_wins n_games win_pct
str str u32 u32 f64
"Atlanta Hawks" "away_team" 22 46 0.478261
"Atlanta Hawks" "home_team" 30 45 0.666667
... ... ... ... ...
"Washington Wiz... "away_team" 18 40 0.45
"Washington Wiz... "home_team" 22 41 0.536585
wins_pd = (
    joined_pd
    .reset_index()
    .melt(
        id_vars=["game_id", "date", "home_win"],
        value_name="team",
        var_name="is_home",
        value_vars=["home_team", "away_team"],
    )
    .assign(win=lambda df: df["home_win"] == (df["is_home"] == "home_team"))
    .groupby(["team", "is_home"])["win"]
    .agg(['sum', 'count', 'mean'])
    .rename(columns={
        "sum": 'n_wins',
        "count": 'n_games',
        "mean": 'win_pct'
    })
)
wins_pd
n_wins n_games win_pct
team is_home
Atlanta Hawks away_team 22 46 0.478261
home_team 30 45 0.666667
... ... ... ... ...
Washington Wizards away_team 18 40 0.450000
home_team 22 41 0.536585

60 rows × 3 columns

4.7 Some visualisations

g = sns.FacetGrid(wins_pl.to_pandas(), hue="team", aspect=0.8, palette=["k"], height=5)
g.map(
    sns.pointplot,
    "is_home",
    "win_pct",
    order=["away_team", "home_team"]).set(ylim=(0, 1))
<seaborn.axisgrid.FacetGrid at 0x7f56e56394d0>

sns.catplot(
    wins_pl.to_pandas(),
    x="is_home",
    y="win_pct",
    col="team",
    col_wrap=5,
    hue="team",
    kind="point",
    height=1.5,
)
<seaborn.axisgrid.FacetGrid at 0x7f56f40acc50>

Now we calculate the win percent by team, regardless of whether they’re home or away:

win_percent_pl = (
    wins_pl.groupby("team", maintain_order=True).agg(
        pl.col("n_wins").sum().alias("win_pct") / pl.col("n_games").sum()
    )
)
win_percent_pl
shape: (30, 2)
team win_pct
str f64
"Atlanta Hawks" 0.571429
"Boston Celtics... 0.563218
... ...
"Utah Jazz" 0.487805
"Washington Wiz... 0.493827
win_percent_pd = (
    wins_pd
    .groupby(level="team", as_index=True)
    .apply(lambda x: x["n_wins"].sum() / x["n_games"].sum())
)
win_percent_pd
team
Atlanta Hawks         0.571429
Boston Celtics        0.563218
                        ...   
Utah Jazz             0.487805
Washington Wizards    0.493827
Length: 30, dtype: float64
(
    win_percent_pl
    .sort("win_pct")
    .to_pandas()
    .set_index("team")
    .plot.barh(figsize=(6, 12), width=0.85, color="k")
)
plt.xlabel("Win Percent")
Text(0.5, 0, 'Win Percent')

Here’s a plot of team home court advantage against team overall win percentage:

(
    wins_pl.pivot(index="team", columns="is_home", values="win_pct")
    .with_columns(
        [
            pl.col("home_team").alias("Home Win % - Away %") - pl.col("away_team"),
            (pl.col("home_team").alias("Overall %") + pl.col("away_team")) / 2,
        ]
    )
    .to_pandas()
    .pipe((sns.regplot, "data"), x="Overall %", y="Home Win % - Away %")
)
<AxesSubplot: xlabel='Overall %', ylabel='Home Win % - Away %'>

(
    wins_pd
    ["win_pct"]
    .unstack()
    .assign(**{'Home Win % - Away %': lambda x: x["home_team"] - x["away_team"],
               'Overall %': lambda x: (x["home_team"] + x["away_team"]) / 2})
     .pipe((sns.regplot, 'data'), x='Overall %', y='Home Win % - Away %')
)
<AxesSubplot: xlabel='Overall %', ylabel='Home Win % - Away %'>

Let’s add the win percent back to the dataframe and run a regression:

reg_df_pl = (
    joined_pl.join(win_percent_pl, left_on="home_team", right_on="team")
    .rename({"win_pct": "home_strength"})
    .join(win_percent_pl, left_on="away_team", right_on="team")
    .rename({"win_pct": "away_strength"})
    .with_columns(
        [
            pl.col("home_points").alias("point_diff") - pl.col("away_points"),
            pl.col("home_rest").alias("rest_diff") - pl.col("away_rest"),
            pl.col("home_win").cast(pl.UInt8),  # for statsmodels
        ]
    )
)
reg_df_pl.head()
shape: (5, 14)
game_id date away_rest home_rest away_team away_points home_team home_points home_win rest_spread home_strength away_strength point_diff rest_diff
u32 date i8 i8 str i64 str i64 u8 i8 f64 f64 i64 i8
5 2015-10-28 0 null "Chicago Bulls" 115 "Brooklyn Nets" 100 0 null 0.256098 0.506173 -15 null
6 2015-10-28 null 0 "Utah Jazz" 87 "Detroit Piston... 92 1 null 0.505882 0.487805 5 null
11 2015-10-28 0 null "Cleveland Cava... 106 "Memphis Grizzl... 76 0 null 0.488372 0.715686 -30 null
14 2015-10-28 0 null "New Orleans Pe... 94 "Portland Trail... 112 1 null 0.526882 0.37037 18 null
17 2015-10-29 0 0 "Memphis Grizzl... 112 "Indiana Pacers... 103 0 0 0.545455 0.488372 -9 0
reg_df_pd = (
    joined_pd.assign(
        away_strength=joined_pd['away_team'].map(win_percent_pd),
        home_strength=joined_pd['home_team'].map(win_percent_pd),
        point_diff=joined_pd['home_points'] - joined_pd['away_points'],
        rest_diff=joined_pd['home_rest'] - joined_pd['away_rest'])
)
reg_df_pd.head()
away_rest home_rest away_team away_points home_team home_points home_win rest_spread away_strength home_strength point_diff rest_diff
game_id date
7 2015-10-28 0 <NA> New Orleans Pelicans 94 Portland Trail Blazers 112 True <NA> 0.370370 0.526882 18 <NA>
11 2015-10-28 0 <NA> Chicago Bulls 115 Brooklyn Nets 100 False <NA> 0.506173 0.256098 -15 <NA>
15 2015-10-28 <NA> 0 Utah Jazz 87 Detroit Pistons 92 True <NA> 0.487805 0.505882 5 <NA>
16 2015-10-28 0 <NA> Cleveland Cavaliers 106 Memphis Grizzlies 76 False <NA> 0.715686 0.488372 -30 <NA>
17 2015-10-29 1 0 Atlanta Hawks 112 New York Knicks 101 False -1 0.571429 0.382716 -11 -1
import statsmodels.formula.api as sm

mod = sm.logit(
    "home_win ~ home_strength + away_strength + home_rest + away_rest",
    reg_df_pl.to_pandas(),
)
res = mod.fit()
res.summary()
Optimization terminated successfully.
         Current function value: 0.554797
         Iterations 6
Logit Regression Results
Dep. Variable: home_win No. Observations: 1299
Model: Logit Df Residuals: 1294
Method: MLE Df Model: 4
Date: Sat, 18 Mar 2023 Pseudo R-squ.: 0.1777
Time: 12:29:30 Log-Likelihood: -720.68
converged: True LL-Null: -876.38
Covariance Type: nonrobust LLR p-value: 3.748e-66
coef std err z P>|z| [0.025 0.975]
Intercept -0.0019 0.304 -0.006 0.995 -0.597 0.593
home_strength 5.7161 0.466 12.272 0.000 4.803 6.629
away_strength -4.9133 0.456 -10.786 0.000 -5.806 -4.020
home_rest 0.1045 0.076 1.381 0.167 -0.044 0.253
away_rest -0.0347 0.066 -0.526 0.599 -0.164 0.095

You can play around with the regressions yourself but we’ll end them here.

4.8 Summary

This was mostly a demonstration of .pivot and .melt, with several different examples of reshaping data in Polars and Pandas.