3  Performance

As you’ve probably heard by now, Polars is very very fast. Well-written Polars is quicker than well-written Pandas, and it’s easier to write Polars well. With that in mind…

3.1 Six fairly obvious performance rules

Here are some tips that are almost always a good idea:

  • Use the lazy API.
  • Use Exprs, and don’t use .apply unless you really have to.
  • Use the smallest necessary numeric types (so if you have an integer between 0 and 255, use pl.UInt8, not pl.Int64). This will save both time and space.
  • Use efficient storage (if you’re dumping stuff in files, Parquet is a good choice).
  • Use categoricals for recurring strings (but note that it may not be worth it if there’s not much repetition).
  • Only select the columns you need.
Tip

If your colleagues are happy with CSVs and can’t be convinced to use something else, tell them that the Modern Polars book says they should feel bad.

These are basically the same rules you’d follow when using Pandas, except for the one about the lazy API. Now for some comparisons between the performance of idiomatic Pandas and Polars.

3.2 Polars is faster at the boring stuff

Here we’ll clean up a messy dataset, kindly provided by Kaggle user Rachit Toshniwal as a deliberate example of a really crap CSV. Most of the cleanup involves extracting numeric data from awkward strings.

Also, the data is too small so I’ve concatenated it to itself 20 times. We’re not doing anything that will care about the duplication. Here’s how the raw table looks:

Code
import pandas as pd
pd.read_csv("../data/fifa21_raw_big.csv", dtype="string", nrows=2)
ID Name LongName photoUrl playerUrl Nationality Age ↓OVA POT Club ... A/W D/W IR PAC SHO PAS DRI DEF PHY Hits
0 158023 L. Messi Lionel Messi https://cdn.sofifa.com/players/158/023/21_60.png http://sofifa.com/player/158023/lionel-messi/2... Argentina 33 93 93 FC Barcelona ... Medium Low 5 ★ 85 92 91 95 38 65 771
1 20801 Cristiano Ronaldo C. Ronaldo dos Santos Aveiro https://cdn.sofifa.com/players/020/801/21_60.png http://sofifa.com/player/20801/c-ronaldo-dos-s... Portugal 35 92 92 Juventus ... High Low 5 ★ 89 93 81 89 35 77 562

2 rows × 77 columns

For this exercise we’ll assume we want to make use of all the columns. First some boilerplate where we map out the different data types:

Code
import pandas as pd
import polars as pl
import numpy as np
import math
str_cols = [
    "Name",
    "LongName",
    "playerUrl",
    "photoUrl",
]
initial_category_cols_pl = [
    "Nationality",
    "Preferred Foot",
    "Best Position",
    "A/W",
    "D/W"
]
category_cols = [*initial_category_cols_pl, "Club"]
date_cols = [
    "Joined",
    "Loan Date End"
]
# these all start with the euro symbol and end with 0, M or K
money_cols = [
    "Value",
    "Wage",
    "Release Clause"
]
star_cols = [
    "W/F",
    "SM",
    "IR",
]
# Contract col is a range of years
# Positions is a list of positions
# Height is in cm
# Weight is in kg
# Hits is numbers with K and M 
messy_cols = [
    "Contract",
    "Positions",
    "Height",
    "Weight",
    "Hits"
]
initially_str_cols = str_cols + date_cols + money_cols + star_cols + messy_cols
initially_str_cols_pl = [*initially_str_cols, "Club"]
u32_cols = [
    "ID",
    "Total Stats"
]
u8_cols = [
    'Age',
    '↓OVA',
    'POT',
    'BOV',
    'Crossing',
    'Finishing',
    'Heading Accuracy',
    'Short Passing',
    'Volleys',
    'Dribbling',
    'Curve',
    'FK Accuracy',
    'Long Passing',
    'Ball Control',
    'Acceleration',
    'Sprint Speed',
    'Agility',
    'Reactions',
    'Balance',
    'Shot Power',
    'Jumping',
    'Stamina',
    'Strength',
    'Long Shots',
    'Aggression',
    'Interceptions',
    'Positioning',
    'Vision',
    'Penalties',
    'Composure',
    'Marking',
    'Standing Tackle',
    'Sliding Tackle',
    'GK Diving',
    'GK Handling',
    'GK Kicking',
    'GK Positioning',
    'GK Reflexes',
    'PAC',
    'SHO',
    'PAS',
    'DRI',
    'DEF',
    'PHY'
]

u16_cols = [
    'Attacking',
    'Skill',
    'Movement',
    'Power',
    'Mentality',
    'Defending',
    'Goalkeeping',
    'Total Stats',
    'Base Stats'
]

3.2.1 Dtypes

Here are the initial dtypes for the two dataframes:

# can't use UInt8/16 in scan_csv
dtypes_pl = (
    {col: pl.Utf8 for col in initially_str_cols_pl}
    | {col: pl.Categorical for col in initial_category_cols_pl}
    | {col: pl.UInt32 for col in [*u32_cols, *u16_cols, *u8_cols]}
)
dtypes_pd = (
    {col: pd.StringDtype() for col in initially_str_cols}
    | {col: pd.CategoricalDtype() for col in category_cols}
    | {col: "uint32" for col in u32_cols}
    | {col: "uint8" for col in u8_cols}
    | {col: "uint16" for col in u16_cols}
)

One thing I’ll note here is that Pandas numeric types are somewhat confusing: "uint32" means np.uint32 which is not the same thing as pd.UInt32Dtype(). Only the latter is nullable. On the other hand, Polars has just one unsigned 32-bit integer type, and it’s nullable.

Tip

Polars expressions have a shrink_dtype method that can be more convenient than manually specifying the dtypes yourself. It’s not magic though, and it has to spend time finding the min and max of the column.

3.2.2 Data cleaning

There’s not much that you haven’t seen here already, so we won’t explain the code line by line. The main new thing here is pl.when for ternary expressions.

def parse_date_pl(col: pl.Expr) -> pl.Expr:
    return col.str.strptime(pl.Date, format="%b %d, %Y")

def parse_suffixed_num_pl(col: pl.Expr) -> pl.Expr:
    suffix = col.str.slice(-1, 1)
    suffix_value = (
        pl.when(suffix == "K")
        .then(1_000)
        .when(suffix == "M")
        .then(1_000_000)
        .otherwise(1)
        .cast(pl.UInt32)
    )
    without_suffix = (
        col
        .str.replace("K", "", literal=True)
        .str.replace("M", "", literal=True)
        .cast(pl.Float32)
    )
    original_name = col.meta.output_name()
    return (suffix_value * without_suffix).alias(original_name)

def parse_money_pl(col: pl.Expr) -> pl.Expr:
    return parse_suffixed_num_pl(col.str.slice(1)).cast(pl.UInt32)

def parse_star_pl(col: pl.Expr) -> pl.Expr:
    return col.str.slice(0, 1).cast(pl.UInt8)

def feet_to_cm_pl(col: pl.Expr) -> pl.Expr:
    feet_inches_split = col.str.split_exact("'", 1)
    total_inches = (
        (feet_inches_split.struct.field("field_0").cast(pl.UInt8, strict=False) * 12)
        + feet_inches_split.struct.field("field_1").str.strip_chars_end('"').cast(pl.UInt8, strict=False)
    )
    return (total_inches * 2.54).round(0).cast(pl.UInt8)

def parse_height_pl(col: pl.Expr) -> pl.Expr:
    is_cm = col.str.ends_with("cm")
    return (
        pl.when(is_cm)
        .then(col.str.slice(0, 3).cast(pl.UInt8, strict=False))
        .otherwise(feet_to_cm_pl(col))
    )

def parse_weight_pl(col: pl.Expr) -> pl.Expr:
    is_kg = col.str.ends_with("kg")
    without_unit = col.str.extract(r"(\d+)").cast(pl.UInt8)
    return (
        pl.when(is_kg)
        .then(without_unit)
        .otherwise((without_unit * 0.453592).round(0).cast(pl.UInt8))
    )

def parse_contract_pl(col: pl.Expr) -> list[pl.Expr]:
    contains_tilde = col.str.contains(" ~ ", literal=True)
    loan_str = " On Loan"
    loan_col = col.str.ends_with(loan_str)
    split = (
        pl.when(contains_tilde)
        .then(col)
        .otherwise(None)
        .str.split_exact(" ~ ", 1)
    )
    start = split.struct.field("field_0").cast(pl.UInt16).alias("contract_start")
    end = split.struct.field("field_1").cast(pl.UInt16).alias("contract_end")
    free_agent = (col == "Free").alias("free_agent").fill_null(False)
    loan_date = (
        pl.when(loan_col)
        .then(col)
        .otherwise(None)
        .str.split_exact(" On Loan", 1)
        .struct.field("field_0")
        .alias("loan_date_start")
    )
    return [start, end, free_agent, parse_date_pl(loan_date)]
def parse_date_pd(col: pd.Series) -> pd.Series:
    return pd.to_datetime(col, format="%b %d, %Y")

def parse_suffixed_num_pd(col: pd.Series) -> pd.Series:
    suffix_value = (
        col
        .str[-1]
        .map({"K": 1_000, "M": 1_000_000})
        .fillna(1)
        .astype("uint32")
    )
    without_suffix = (
        col
        .str.replace("K", "", regex=False)
        .str.replace("M", "", regex=False)
        .astype("float")
    )
    return suffix_value * without_suffix

def parse_money_pd(col: pd.Series) -> pd.Series:
    return parse_suffixed_num_pd(col.str[1:]).astype("uint32")

def parse_star_pd(col: pd.Series) -> pd.Series:
    return col.str[0].astype("uint8")

def feet_to_cm_pd(col: pd.Series) -> pd.Series:
    feet_inches_split = col.str.split("'", expand=True)
    total_inches = (
        feet_inches_split[0].astype("uint8").mul(12)
        + feet_inches_split[1].str[:-1].astype("uint8")
    )
    return total_inches.mul(2.54).round().astype("uint8")

def parse_height_pd(col: pd.Series) -> pd.Series:
    is_cm = col.str.endswith("cm")
    cm_values = col.loc[is_cm].str[:-2].astype("uint8")
    inches_as_cm = feet_to_cm_pd(col.loc[~is_cm])
    return pd.concat([cm_values, inches_as_cm])

def parse_weight_pd(col: pd.Series) -> pd.Series:
    is_kg = col.str.endswith("kg")
    without_unit = col.where(is_kg, col.str[:-3]).mask(is_kg, col.str[:-2]).astype("uint8")
    return without_unit.where(is_kg, without_unit.mul(0.453592).round().astype("uint8"))

def parse_contract_pd(df: pd.DataFrame) -> pd.DataFrame:
    contract_col = df["Contract"]
    contains_tilde = contract_col.str.contains(" ~ ", regex=False)
    split = (
        contract_col.loc[contains_tilde].str.split(" ~ ", expand=True).astype(pd.UInt16Dtype())
    )
    split.columns = ["contract_start", "contract_end"]
    not_tilde = contract_col.loc[~contains_tilde]
    free_agent = (contract_col == "Free").rename("free_agent").fillna(False)
    loan_date = parse_date_pd(not_tilde.loc[~free_agent].str[:-8]).rename("loan_date_start")
    return pd.concat([df.drop("Contract", axis=1), split, free_agent, loan_date], axis=1)

3.2.3 Performance comparison

In this example, Polars is ~150x faster than Pandas:

%%time
new_cols_pl = ([
    pl.col("Club").str.strip_chars().cast(pl.Categorical),
    parse_suffixed_num_pl(pl.col("Hits")).cast(pl.UInt32),
    pl.col("Positions").str.split(","),
    parse_height_pl(pl.col("Height")),
    parse_weight_pl(pl.col("Weight")),
]
+ [parse_date_pl(pl.col(col)) for col in date_cols]
+ [parse_money_pl(pl.col(col)) for col in money_cols]
+ [parse_star_pl(pl.col(col)) for col in star_cols]
+ parse_contract_pl(pl.col("Contract"))
+ [pl.col(col).cast(pl.UInt16) for col in u16_cols]
+ [pl.col(col).cast(pl.UInt8) for col in u8_cols]
)
fifa_pl = (
    pl.scan_csv("../data/fifa21_raw_big.csv", schema_overrides=dtypes_pl)
    .with_columns(new_cols_pl)
    .drop("Contract")
    .rename({"↓OVA": "OVA"})
    .collect()
)
CPU times: user 1.78 s, sys: 197 ms, total: 1.98 s
Wall time: 216 ms
<timed exec>:20: CategoricalRemappingWarning: Local categoricals have different encodings, expensive re-encoding is done to perform this merge operation. Consider using a StringCache or an Enum type if the categories are known in advance
%%time
fifa_pd = (
    pd.read_csv("../data/fifa21_raw_big.csv", dtype=dtypes_pd)
    .assign(Club=lambda df: df["Club"].cat.rename_categories(lambda c: c.strip()),
        **{col: lambda df: parse_date_pd(df[col]) for col in date_cols},
        **{col: lambda df: parse_money_pd(df[col]) for col in money_cols},
        **{col: lambda df: parse_star_pd(df[col]) for col in star_cols},
        Hits=lambda df: parse_suffixed_num_pd(df["Hits"]).astype(pd.UInt32Dtype()),
        Positions=lambda df: df["Positions"].str.split(","),
        Height=lambda df: parse_height_pd(df["Height"]),
        Weight=lambda df: parse_weight_pd(df["Weight"])
    )
    .pipe(parse_contract_pd)
    .rename(columns={"↓OVA": "OVA"})
)
CPU times: user 3.55 s, sys: 347 ms, total: 3.9 s
Wall time: 3.9 s

Output:

fifa_pl.head()
shape: (5, 80)
ID Name LongName photoUrl playerUrl Nationality Age OVA POT Club Positions Height Weight Preferred Foot BOV Best Position Joined Loan Date End Value Wage Release Clause Attacking Crossing Finishing Heading Accuracy Short Passing Volleys Skill Dribbling Curve FK Accuracy Long Passing Ball Control Movement Acceleration Sprint Speed Agility Strength Long Shots Mentality Aggression Interceptions Positioning Vision Penalties Composure Defending Marking Standing Tackle Sliding Tackle Goalkeeping GK Diving GK Handling GK Kicking GK Positioning GK Reflexes Total Stats Base Stats W/F SM A/W D/W IR PAC SHO PAS DRI DEF PHY Hits contract_start contract_end free_agent loan_date_start
u32 str str str str cat u8 u8 u8 cat list[str] u8 u8 cat u8 cat date date u32 u32 u32 u16 u8 u8 u8 u8 u8 u16 u8 u8 u8 u8 u8 u16 u8 u8 u8 u8 u8 u16 u8 u8 u8 u8 u8 u8 u16 u8 u8 u8 u16 u8 u8 u8 u8 u8 u16 u16 u8 u8 cat cat u8 u8 u8 u8 u8 u8 u8 u32 u16 u16 bool date
158023 "L. Messi" "Lionel Messi" "https://cdn.sofifa.com/players… "http://sofifa.com/player/15802… "Argentina" 33 93 93 "FC Barcelona" ["RW", " ST", " CF"] 170 72 "Left" 93 "RW" 2004-07-01 null 103500000 560000 138399993 429 85 95 70 91 88 470 96 93 94 91 96 451 91 80 91 69 94 347 44 40 93 95 75 96 91 32 35 24 54 6 11 15 14 8 2231 466 4 4 "Medium" "Low" 5 85 92 91 95 38 65 771 2004 2021 false null
20801 "Cristiano Ronaldo" "C. Ronaldo dos Santos Aveiro" "https://cdn.sofifa.com/players… "http://sofifa.com/player/20801… "Portugal" 35 92 92 "Juventus" ["ST", " LW"] 187 83 "Right" 92 "ST" 2018-07-10 null 63000000 220000 75900001 437 84 95 90 82 86 414 88 81 76 77 92 431 87 91 87 78 93 353 63 29 95 82 84 95 84 28 32 24 58 7 11 15 14 11 2221 464 4 5 "High" "Low" 5 89 93 81 89 35 77 562 2018 2022 false null
200389 "J. Oblak" "Jan Oblak" "https://cdn.sofifa.com/players… "http://sofifa.com/player/20038… "Slovenia" 27 91 93 "Atlético Madrid" ["GK"] 188 87 "Right" 91 "GK" 2014-07-16 null 120000000 125000 159399993 95 13 11 15 43 13 109 12 13 14 40 30 307 43 60 67 78 12 140 34 19 11 65 11 68 57 27 12 18 437 87 92 78 90 90 1413 489 3 1 "Medium" "Medium" 3 87 92 78 90 52 90 150 2014 2023 false null
192985 "K. De Bruyne" "Kevin De Bruyne" "https://cdn.sofifa.com/players… "http://sofifa.com/player/19298… "Belgium" 29 91 91 "Manchester City" ["CAM", " CM"] 181 70 "Right" 91 "CAM" 2015-08-30 null 129000000 370000 161000000 407 94 82 55 94 82 441 88 85 83 93 92 398 77 76 78 74 91 408 76 66 88 94 84 91 186 68 65 53 56 15 13 5 10 13 2304 485 5 4 "High" "High" 4 76 86 93 88 64 78 207 2015 2023 false null
190871 "Neymar Jr" "Neymar da Silva Santos Jr." "https://cdn.sofifa.com/players… "http://sofifa.com/player/19087… "Brazil" 28 91 91 "Paris Saint-Germain" ["LW", " CAM"] 175 68 "Right" 91 "LW" 2017-08-03 null 132000000 270000 166500000 408 85 87 62 87 87 448 95 88 89 81 95 453 94 89 96 50 84 356 51 36 87 90 92 93 94 35 30 29 59 9 9 15 15 11 2175 451 5 5 "High" "Medium" 5 91 85 86 94 36 59 595 2017 2022 false null
fifa_pd.head()
ID Name LongName photoUrl playerUrl Nationality Age OVA POT Club ... SHO PAS DRI DEF PHY Hits contract_start contract_end free_agent loan_date_start
0 158023 L. Messi Lionel Messi https://cdn.sofifa.com/players/158/023/21_60.png http://sofifa.com/player/158023/lionel-messi/2... Argentina 33 93 93 FC Barcelona ... 92 91 95 38 65 771 2004 2021 False NaT
1 20801 Cristiano Ronaldo C. Ronaldo dos Santos Aveiro https://cdn.sofifa.com/players/020/801/21_60.png http://sofifa.com/player/20801/c-ronaldo-dos-s... Portugal 35 92 92 Juventus ... 93 81 89 35 77 562 2018 2022 False NaT
2 200389 J. Oblak Jan Oblak https://cdn.sofifa.com/players/200/389/21_60.png http://sofifa.com/player/200389/jan-oblak/210006/ Slovenia 27 91 93 Atlético Madrid ... 92 78 90 52 90 150 2014 2023 False NaT
3 192985 K. De Bruyne Kevin De Bruyne https://cdn.sofifa.com/players/192/985/21_60.png http://sofifa.com/player/192985/kevin-de-bruyn... Belgium 29 91 91 Manchester City ... 86 93 88 64 78 207 2015 2023 False NaT
4 190871 Neymar Jr Neymar da Silva Santos Jr. https://cdn.sofifa.com/players/190/871/21_60.png http://sofifa.com/player/190871/neymar-da-silv... Brazil 28 91 91 Paris Saint-Germain ... 85 86 94 36 59 595 2017 2022 False NaT

5 rows × 80 columns

You could play around with the timings here and even try the .profile method to see what Polars spends its time on. In this scenario the speed advantage of Polars likely comes down to three things:

  1. It is much faster at reading CSVs.
  2. It is much faster at processing strings.
  3. It can select/assign columns in parallel.

3.3 NumPy might make Polars faster sometimes

Polars gets along well with NumPy ufuncs, even in lazy mode (which is interesting because NumPy has no lazy API). Let’s see how this looks by calculating the great-circle distance between a bunch of coordinates.

3.3.1 Get the data

We create a lazy dataframe containing pairs of airports and their coordinates:

airports = pl.scan_csv("../data/airports.csv").drop_nulls().unique(subset=["AIRPORT"])
pairs = airports.join(airports, on="AIRPORT", how="cross").filter(
    (pl.col("AIRPORT") != pl.col("AIRPORT_right"))
    & (pl.col("LATITUDE") != pl.col("LATITUDE_right"))
    & (pl.col("LONGITUDE") != pl.col("LONGITUDE_right"))
)

3.3.2 Calculate great-circle distance

One use case for NumPy ufuncs is doing computations that Polars expressions don’t support. In this example Polars can do everything we need, though the ufunc version ends up being slightly faster:

def deg2rad_pl(degrees: pl.Expr) -> pl.Expr:
    return degrees * math.pi / 180

def gcd_pl(lat1: pl.Expr, lng1: pl.Expr, lat2: pl.Expr, lng2: pl.Expr):
    ϕ1 = deg2rad_pl(90 - lat1)
    ϕ2 = deg2rad_pl(90 - lat2)

    θ1 = deg2rad_pl(lng1)
    θ2 = deg2rad_pl(lng2)

    cos = ϕ1.sin() * ϕ2.sin() *1 - θ2).cos() + ϕ1.cos() * ϕ2.cos()
    arc = cos.arccos()
    return arc * 6373
def gcd_np(lat1, lng1, lat2, lng2):
    ϕ1 = np.deg2rad(90 - lat1)
    ϕ2 = np.deg2rad(90 - lat2)

    θ1 = np.deg2rad(lng1)
    θ2 = np.deg2rad(lng2)

    cos = np.sin(ϕ1) * np.sin(ϕ2) * np.cos(θ1 - θ2) + np.cos(ϕ1) * np.cos(ϕ2)
    arc = np.arccos(cos)
    return arc * 6373

We can pass Polars expressions directly to our gcd_np function, which is pretty nice since these things don’t even store the data themselves:

%%timeit
pairs.select(
    gcd_np(
        pl.col("LATITUDE"),
        pl.col("LONGITUDE"),
        pl.col("LATITUDE_right"),
        pl.col("LONGITUDE_right")
    )
).collect()
3.22 s ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On my machine the NumPy version used to be 5-20% faster than the pure Polars version, but this is no longer the case. Still you may want to see if it helps you:

%%timeit
pairs.select(
    gcd_pl(
        pl.col("LATITUDE"),
        pl.col("LONGITUDE"),
        pl.col("LATITUDE_right"),
        pl.col("LONGITUDE_right")
    )
).collect()
3.87 s ± 79.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This may not be a huge performance difference, but it at least means you don’t sacrifice speed when relying on NumPy. There are some gotchas though so watch out for those.

Also watch out for .to_numpy() - you don’t always need to call this and it can slow things down:

%%timeit
collected = pairs.collect()
gcd_np(
    collected["LATITUDE"].to_numpy(),
    collected["LONGITUDE"].to_numpy(),
    collected["LATITUDE_right"].to_numpy(),
    collected["LONGITUDE_right"].to_numpy()
)
3.94 s ± 47.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3.4 Polars can be slower than Pandas sometimes, maybe

Here’s an example where we calculate z-scores, using window functions in Polars and using groupby-transform in Pandas:

def create_frame(n, n_groups):
    return pl.DataFrame(
        {"name": np.random.randint(0, n_groups, size=n), "value2": np.random.randn(n)}
    )

def pandas_transform(df: pd.DataFrame) -> pd.DataFrame:
    g = df.groupby("name")["value2"]
    v = df["value2"]
    return (v - g.transform("mean")) / g.transform("std")


def polars_transform() -> pl.Expr:
    v = pl.col("value2")
    return (v - v.mean().over("name")) / v.std().over("name")

rand_df_pl = create_frame(50_000_000, 50_000)
rand_df_pd = rand_df_pl.to_pandas()

The Polars version tends to be 10-100% slower on my machine:

%timeit rand_df_pl.select(polars_transform())
1.81 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pandas_transform(rand_df_pd)
1.35 s ± 62.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This example isn’t telling you to use Pandas in this specific situation. Once you add in the time spent reading a file, Polars likely wins.

And even here, if you sort by the name col, Polars wins again. It has fast-track algorithms for sorted data.

3.5 Summary

  • Polars is really fast. Pandas was already respectably fast and Polars wipes the floor with it.
  • You can still make Polars slow if you do silly things with it, but compared to Pandas it’s easier to do the right thing in the first place.
  • Polars works well with NumPy ufuncs.
  • There are still some situations where Pandas can be faster. They are probably not compelling, but we shouldn’t pretend they don’t exist.