3  Performance

As you’ve probably heard by now, Polars is very very fast. Well-written Polars is quicker than well-written Pandas, and it’s easier to write Polars well. With that in mind…

3.1 Six fairly obvious performance rules

Here are some tips that are almost always a good idea:

  • Use the lazy API.
  • Use Exprs, and don’t use .apply unless you really have to.
  • Use the smallest necessary numeric types (so if you have an integer between 0 and 255, use pl.UInt8, not pl.Int64). This will save both time and space.
  • Use efficient storage (if you’re dumping stuff in files, Parquet is a good choice).
  • Use categoricals for recurring strings (but note that it may not be worth it if there’s not much repetition).
  • Only select the columns you need.
Tip

If your colleagues are happy with CSVs and can’t be convinced to use something else, tell them that the Modern Polars book says they should feel bad.

These are basically the same rules you’d follow when using Pandas, except for the one about the lazy API. Now for some comparisons between the performance of idiomatic Pandas and Polars.

3.2 Polars is faster at the boring stuff

Here we’ll clean up a messy dataset, kindly provided by Kaggle user Rachit Toshniwal as a deliberate example of a really crap CSV. Most of the cleanup involves extracting numeric data from awkward strings.

Also, the data is too small so I’ve concatenated it to itself 20 times. We’re not doing anything that will care about the duplication. Here’s how the raw table looks:

Code
import pandas as pd
pd.read_csv("../data/fifa21_raw_big.csv", dtype="string", nrows=2)
ID Name LongName photoUrl playerUrl Nationality Age ↓OVA POT Club ... A/W D/W IR PAC SHO PAS DRI DEF PHY Hits
0 158023 L. Messi Lionel Messi https://cdn.sofifa.com/players/158/023/21_60.png http://sofifa.com/player/158023/lionel-messi/2... Argentina 33 93 93 FC Barcelona ... Medium Low 5 ★ 85 92 91 95 38 65 771
1 20801 Cristiano Ronaldo C. Ronaldo dos Santos Aveiro https://cdn.sofifa.com/players/020/801/21_60.png http://sofifa.com/player/20801/c-ronaldo-dos-s... Portugal 35 92 92 Juventus ... High Low 5 ★ 89 93 81 89 35 77 562

2 rows × 77 columns

For this exercise we’ll assume we want to make use of all the columns. First some boilerplate where we map out the different data types:

Code
import pandas as pd
import polars as pl
import numpy as np
import math
str_cols = [
    "Name",
    "LongName",
    "playerUrl",
    "photoUrl",
]
initial_category_cols_pl = [
    "Nationality",
    "Preferred Foot",
    "Best Position",
    "A/W",
    "D/W"
]
category_cols = [*initial_category_cols_pl, "Club"]
date_cols = [
    "Joined",
    "Loan Date End"
]
# these all start with the euro symbol and end with 0, M or K
money_cols = [
    "Value",
    "Wage",
    "Release Clause"
]
star_cols = [
    "W/F",
    "SM",
    "IR",
]
# Contract col is a range of years
# Positions is a list of positions
# Height is in cm
# Weight is in kg
# Hits is numbers with K and M 
messy_cols = [
    "Contract",
    "Positions",
    "Height",
    "Weight",
    "Hits"
]
initially_str_cols = str_cols + date_cols + money_cols + star_cols + messy_cols
initially_str_cols_pl = [*initially_str_cols, "Club"]
u32_cols = [
    "ID",
    "Total Stats"
]
u8_cols = [
    'Age',
    '↓OVA',
    'POT',
    'BOV',
    'Crossing',
    'Finishing',
    'Heading Accuracy',
    'Short Passing',
    'Volleys',
    'Dribbling',
    'Curve',
    'FK Accuracy',
    'Long Passing',
    'Ball Control',
    'Acceleration',
    'Sprint Speed',
    'Agility',
    'Reactions',
    'Balance',
    'Shot Power',
    'Jumping',
    'Stamina',
    'Strength',
    'Long Shots',
    'Aggression',
    'Interceptions',
    'Positioning',
    'Vision',
    'Penalties',
    'Composure',
    'Marking',
    'Standing Tackle',
    'Sliding Tackle',
    'GK Diving',
    'GK Handling',
    'GK Kicking',
    'GK Positioning',
    'GK Reflexes',
    'PAC',
    'SHO',
    'PAS',
    'DRI',
    'DEF',
    'PHY'
]

u16_cols = [
    'Attacking',
    'Skill',
    'Movement',
    'Power',
    'Mentality',
    'Defending',
    'Goalkeeping',
    'Total Stats',
    'Base Stats'
]

3.2.1 Dtypes

Here are the initial dtypes for the two dataframes:

# can't use UInt8/16 in scan_csv
dtypes_pl = (
    {col: pl.Utf8 for col in initially_str_cols_pl}
    | {col: pl.Categorical for col in initial_category_cols_pl}
    | {col: pl.UInt32 for col in [*u32_cols, *u16_cols, *u8_cols]}
)
dtypes_pd = (
    {col: pd.StringDtype() for col in initially_str_cols}
    | {col: pd.CategoricalDtype() for col in category_cols}
    | {col: "uint32" for col in u32_cols}
    | {col: "uint8" for col in u8_cols}
    | {col: "uint16" for col in u16_cols}
)

One thing I’ll note here is that Pandas numeric types are somewhat confusing: "uint32" means np.uint32 which is not the same thing as pd.UInt32Dtype(). Only the latter is nullable. On the other hand, Polars has just one unsigned 32-bit integer type, and it’s nullable.

Tip

Polars expressions have a shrink_dtype method that can be more convenient than manually specifying the dtypes yourself. It’s not magic though, and it has to spend time finding the min and max of the column.

3.2.2 Data cleaning

There’s not much that you haven’t seen here already, so we won’t explain the code line by line. The main new thing here is pl.when for ternary expressions.

def parse_date_pl(col: pl.Expr) -> pl.Expr:
    return col.str.strptime(pl.Date, format="%b %d, %Y")

def parse_suffixed_num_pl(col: pl.Expr) -> pl.Expr:
    suffix = col.str.slice(-1, 1)
    suffix_value = (
        pl.when(suffix == "K")
        .then(1_000)
        .when(suffix == "M")
        .then(1_000_000)
        .otherwise(1)
        .cast(pl.UInt32)
    )
    without_suffix = (
        col
        .str.replace("K", "", literal=True)
        .str.replace("M", "", literal=True)
        .cast(pl.Float32)
    )
    original_name = col.meta.output_name()
    return (suffix_value * without_suffix).alias(original_name)

def parse_money_pl(col: pl.Expr) -> pl.Expr:
    return parse_suffixed_num_pl(col.str.slice(1)).cast(pl.UInt32)

def parse_star_pl(col: pl.Expr) -> pl.Expr:
    return col.str.slice(0, 1).cast(pl.UInt8)

def feet_to_cm_pl(col: pl.Expr) -> pl.Expr:
    feet_inches_split = col.str.split_exact("'", 1)
    total_inches = (
        (feet_inches_split.struct.field("field_0").cast(pl.UInt8, strict=False) * 12)
        + feet_inches_split.struct.field("field_1").str.strip_chars_end('"').cast(pl.UInt8, strict=False)
    )
    return (total_inches * 2.54).round(0).cast(pl.UInt8)

def parse_height_pl(col: pl.Expr) -> pl.Expr:
    is_cm = col.str.ends_with("cm")
    return (
        pl.when(is_cm)
        .then(col.str.slice(0, 3).cast(pl.UInt8, strict=False))
        .otherwise(feet_to_cm_pl(col))
    )

def parse_weight_pl(col: pl.Expr) -> pl.Expr:
    is_kg = col.str.ends_with("kg")
    without_unit = col.str.extract(r"(\d+)").cast(pl.UInt8)
    return (
        pl.when(is_kg)
        .then(without_unit)
        .otherwise((without_unit * 0.453592).round(0).cast(pl.UInt8))
    )

def parse_contract_pl(col: pl.Expr) -> list[pl.Expr]:
    contains_tilde = col.str.contains(" ~ ", literal=True)
    loan_str = " On Loan"
    loan_col = col.str.ends_with(loan_str)
    split = (
        pl.when(contains_tilde)
        .then(col)
        .otherwise(None)
        .str.split_exact(" ~ ", 1)
    )
    start = split.struct.field("field_0").cast(pl.UInt16).alias("contract_start")
    end = split.struct.field("field_1").cast(pl.UInt16).alias("contract_end")
    free_agent = (col == "Free").alias("free_agent").fill_null(False)
    loan_date = (
        pl.when(loan_col)
        .then(col)
        .otherwise(None)
        .str.split_exact(" On Loan", 1)
        .struct.field("field_0")
        .alias("loan_date_start")
    )
    return [start, end, free_agent, parse_date_pl(loan_date)]
def parse_date_pd(col: pd.Series) -> pd.Series:
    return pd.to_datetime(col, format="%b %d, %Y")

def parse_suffixed_num_pd(col: pd.Series) -> pd.Series:
    suffix_value = (
        col
        .str[-1]
        .map({"K": 1_000, "M": 1_000_000})
        .fillna(1)
        .astype("uint32")
    )
    without_suffix = (
        col
        .str.replace("K", "", regex=False)
        .str.replace("M", "", regex=False)
        .astype("float")
    )
    return suffix_value * without_suffix

def parse_money_pd(col: pd.Series) -> pd.Series:
    return parse_suffixed_num_pd(col.str[1:]).astype("uint32")

def parse_star_pd(col: pd.Series) -> pd.Series:
    return col.str[0].astype("uint8")

def feet_to_cm_pd(col: pd.Series) -> pd.Series:
    feet_inches_split = col.str.split("'", expand=True)
    total_inches = (
        feet_inches_split[0].astype("uint8").mul(12)
        + feet_inches_split[1].str[:-1].astype("uint8")
    )
    return total_inches.mul(2.54).round().astype("uint8")

def parse_height_pd(col: pd.Series) -> pd.Series:
    is_cm = col.str.endswith("cm")
    cm_values = col.loc[is_cm].str[:-2].astype("uint8")
    inches_as_cm = feet_to_cm_pd(col.loc[~is_cm])
    return pd.concat([cm_values, inches_as_cm])

def parse_weight_pd(col: pd.Series) -> pd.Series:
    is_kg = col.str.endswith("kg")
    without_unit = col.where(is_kg, col.str[:-3]).mask(is_kg, col.str[:-2]).astype("uint8")
    return without_unit.where(is_kg, without_unit.mul(0.453592).round().astype("uint8"))

def parse_contract_pd(df: pd.DataFrame) -> pd.DataFrame:
    contract_col = df["Contract"]
    contains_tilde = contract_col.str.contains(" ~ ", regex=False)
    split = (
        contract_col.loc[contains_tilde].str.split(" ~ ", expand=True).astype(pd.UInt16Dtype())
    )
    split.columns = ["contract_start", "contract_end"]
    not_tilde = contract_col.loc[~contains_tilde]
    free_agent = (contract_col == "Free").rename("free_agent").fillna(False)
    loan_date = parse_date_pd(not_tilde.loc[~free_agent].str[:-8]).rename("loan_date_start")
    return pd.concat([df.drop("Contract", axis=1), split, free_agent, loan_date], axis=1)

3.2.3 Performance comparison

In this example, Polars is ~150x faster than Pandas:

%%time
new_cols_pl = ([
    pl.col("Club").str.strip_chars().cast(pl.Categorical),
    parse_suffixed_num_pl(pl.col("Hits")).cast(pl.UInt32),
    pl.col("Positions").str.split(","),
    parse_height_pl(pl.col("Height")),
    parse_weight_pl(pl.col("Weight")),
]
+ [parse_date_pl(pl.col(col)) for col in date_cols]
+ [parse_money_pl(pl.col(col)) for col in money_cols]
+ [parse_star_pl(pl.col(col)) for col in star_cols]
+ parse_contract_pl(pl.col("Contract"))
+ [pl.col(col).cast(pl.UInt16) for col in u16_cols]
+ [pl.col(col).cast(pl.UInt8) for col in u8_cols]
)
fifa_pl = (
    pl.scan_csv("../data/fifa21_raw_v2.csv", dtypes=dtypes_pl)
    .with_columns(new_cols_pl)
    .drop("Contract")
    .rename({"↓OVA": "OVA"})
    .collect()
)
CPU times: user 149 ms, sys: 20.8 ms, total: 170 ms
Wall time: 42.8 ms
%%time
fifa_pd = (
    pd.read_csv("../data/fifa21_raw_big.csv", dtype=dtypes_pd)
    .assign(Club=lambda df: df["Club"].cat.rename_categories(lambda c: c.strip()),
        **{col: lambda df: parse_date_pd(df[col]) for col in date_cols},
        **{col: lambda df: parse_money_pd(df[col]) for col in money_cols},
        **{col: lambda df: parse_star_pd(df[col]) for col in star_cols},
        Hits=lambda df: parse_suffixed_num_pd(df["Hits"]).astype(pd.UInt32Dtype()),
        Positions=lambda df: df["Positions"].str.split(","),
        Height=lambda df: parse_height_pd(df["Height"]),
        Weight=lambda df: parse_weight_pd(df["Weight"])
    )
    .pipe(parse_contract_pd)
    .rename(columns={"↓OVA": "OVA"})
)
CPU times: user 6.53 s, sys: 336 ms, total: 6.87 s
Wall time: 6.86 s

Output:

fifa_pl.head()
shape: (5, 80)
IDNameLongNamephotoUrlplayerUrlNationalityAgeOVAPOTClubPositionsHeightWeightPreferred FootBOVBest PositionJoinedLoan Date EndValueWageRelease ClauseAttackingCrossingFinishingHeading AccuracyShort PassingVolleysSkillDribblingCurveFK AccuracyLong PassingBall ControlMovementAccelerationSprint SpeedAgilityStrengthLong ShotsMentalityAggressionInterceptionsPositioningVisionPenaltiesComposureDefendingMarkingStanding TackleSliding TackleGoalkeepingGK DivingGK HandlingGK KickingGK PositioningGK ReflexesTotal StatsBase StatsW/FSMA/WD/WIRPACSHOPASDRIDEFPHYHitscontract_startcontract_endfree_agentloan_date_start
u32strstrstrstrcatu8u8u8catlist[str]u8u8catu8catdatedateu32u32u32u16u8u8u8u8u8u16u8u8u8u8u8u16u8u8u8u8u8u16u8u8u8u8u8u8u16u8u8u8u16u8u8u8u8u8u16u16u8u8catcatu8u8u8u8u8u8u8u32u16u16booldate
158023"L. Messi""Lionel Messi""https://cdn.so…"http://sofifa.…"Argentina"339393"FC Barcelona"["RW", " ST", " CF"]17072"Left"93"RW"2004-07-01null103500000560000138399993429859570918847096939491964519180916994347444093957596913235245461115148223146644"Medium""Low"585929195386577120042021falsenull
20801"Cristiano Rona…"C. Ronaldo dos…"https://cdn.so…"http://sofifa.…"Portugal"359292"Juventus"["ST", " LW"]18783"Right"92"ST"2018-07-10null63000000220000759000014378495908286414888176779243187918778933536329958284958428322458711151411222146445"High""Low"589938189357756220182022falsenull
200389"J. Oblak""Jan Oblak""https://cdn.so…"http://sofifa.…"Slovenia"279193"Atlético Madri…["GK"]18887"Right"91"GK"2014-07-16null12000000012500015939999395131115431310912131440303074360677812140341911651168572712184378792789090141348931"Medium""Medium"387927890529015020142023falsenull
192985"K. De Bruyne""Kevin De Bruyn…"https://cdn.so…"http://sofifa.…"Belgium"299191"Manchester Cit…["CAM", " CM"]18170"Right"91"CAM"2015-08-30null12900000037000016100000040794825594824418885839392398777678749140876668894849118668655356151351013230448554"High""High"476869388647820720152023falsenull
190871"Neymar Jr""Neymar da Silv…"https://cdn.so…"http://sofifa.…"Brazil"289191"Paris Saint-Ge…["LW", " CAM"]17568"Right"91"LW"2017-08-03null132000000270000166500000408858762878744895888981954539489965084356513687909293943530295999151511217545155"High""Medium"591858694365959520172022falsenull
fifa_pd.head()
ID Name LongName photoUrl playerUrl Nationality Age OVA POT Club ... SHO PAS DRI DEF PHY Hits contract_start contract_end free_agent loan_date_start
0 158023 L. Messi Lionel Messi https://cdn.sofifa.com/players/158/023/21_60.png http://sofifa.com/player/158023/lionel-messi/2... Argentina 33 93 93 FC Barcelona ... 92 91 95 38 65 771 2004 2021 False NaT
1 20801 Cristiano Ronaldo C. Ronaldo dos Santos Aveiro https://cdn.sofifa.com/players/020/801/21_60.png http://sofifa.com/player/20801/c-ronaldo-dos-s... Portugal 35 92 92 Juventus ... 93 81 89 35 77 562 2018 2022 False NaT
2 200389 J. Oblak Jan Oblak https://cdn.sofifa.com/players/200/389/21_60.png http://sofifa.com/player/200389/jan-oblak/210006/ Slovenia 27 91 93 Atlético Madrid ... 92 78 90 52 90 150 2014 2023 False NaT
3 192985 K. De Bruyne Kevin De Bruyne https://cdn.sofifa.com/players/192/985/21_60.png http://sofifa.com/player/192985/kevin-de-bruyn... Belgium 29 91 91 Manchester City ... 86 93 88 64 78 207 2015 2023 False NaT
4 190871 Neymar Jr Neymar da Silva Santos Jr. https://cdn.sofifa.com/players/190/871/21_60.png http://sofifa.com/player/190871/neymar-da-silv... Brazil 28 91 91 Paris Saint-Germain ... 85 86 94 36 59 595 2017 2022 False NaT

5 rows × 80 columns

You could play around with the timings here and even try the .profile method to see what Polars spends its time on. In this scenario the speed advantage of Polars likely comes down to three things:

  1. It is much faster at reading CSVs.
  2. It is much faster at processing strings.
  3. It can select/assign columns in parallel.

3.3 NumPy can make Polars faster

Polars gets along well with NumPy ufuncs, even in lazy mode (which is interesting because NumPy has no lazy API). Let’s see how this looks by calculating the great-circle distance between a bunch of coordinates.

3.3.1 Get the data

We create a lazy dataframe containing pairs of airports and their coordinates:

airports = pl.scan_csv("../data/airports.csv").drop_nulls().unique(subset=["AIRPORT"])
pairs = airports.join(airports, on="AIRPORT", how="cross").filter(
    (pl.col("AIRPORT") != pl.col("AIRPORT_right"))
    & (pl.col("LATITUDE") != pl.col("LATITUDE_right"))
    & (pl.col("LONGITUDE") != pl.col("LONGITUDE_right"))
)

3.3.2 Calculate great-circle distance

One use case for NumPy ufuncs is doing computations that Polars expressions don’t support. In this example Polars can do everything we need, though the ufunc version ends up being slightly faster:

def deg2rad_pl(degrees: pl.Expr) -> pl.Expr:
    return degrees * math.pi / 180

def gcd_pl(lat1: pl.Expr, lng1: pl.Expr, lat2: pl.Expr, lng2: pl.Expr):
    ϕ1 = deg2rad_pl(90 - lat1)
    ϕ2 = deg2rad_pl(90 - lat2)

    θ1 = deg2rad_pl(lng1)
    θ2 = deg2rad_pl(lng2)

    cos = ϕ1.sin() * ϕ2.sin() *1 - θ2).cos() + ϕ1.cos() * ϕ2.cos()
    arc = cos.arccos()
    return arc * 6373
def gcd_np(lat1, lng1, lat2, lng2):
    ϕ1 = np.deg2rad(90 - lat1)
    ϕ2 = np.deg2rad(90 - lat2)

    θ1 = np.deg2rad(lng1)
    θ2 = np.deg2rad(lng2)

    cos = np.sin(ϕ1) * np.sin(ϕ2) * np.cos(θ1 - θ2) + np.cos(ϕ1) * np.cos(ϕ2)
    arc = np.arccos(cos)
    return arc * 6373

We can pass Polars expressions directly to our gcd_np function, which is pretty nice since these things don’t even store the data themselves:

%%timeit
pairs.select(
    gcd_np(
        pl.col("LATITUDE"),
        pl.col("LONGITUDE"),
        pl.col("LATITUDE_right"),
        pl.col("LONGITUDE_right")
    )
).collect()
3.91 s ± 53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On my machine the NumPy version tends to be 5-20% faster than the pure Polars version:

%%timeit
pairs.select(
    gcd_pl(
        pl.col("LATITUDE"),
        pl.col("LONGITUDE"),
        pl.col("LATITUDE_right"),
        pl.col("LONGITUDE_right")
    )
).collect()
4.64 s ± 51.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This may not be a huge performance difference, but it at least means you don’t sacrifice speed when relying on NumPy. There are some gotchas though so watch out for those.

Also watch out for .to_numpy() - you don’t always need to call this and it can slow things down:

%%timeit
collected = pairs.collect()
gcd_np(
    collected["LATITUDE"].to_numpy(),
    collected["LONGITUDE"].to_numpy(),
    collected["LATITUDE_right"].to_numpy(),
    collected["LONGITUDE_right"].to_numpy()
)
5.22 s ± 187 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3.4 Polars can be slower than Pandas sometimes, maybe

Here’s an example where we calculate z-scores, using window functions in Polars and using groupby-transform in Pandas:

def create_frame(n, n_groups):
    return pl.DataFrame(
        {"name": np.random.randint(0, n_groups, size=n), "value2": np.random.randn(n)}
    )

def pandas_transform(df: pd.DataFrame) -> pd.DataFrame:
    g = df.groupby("name")["value2"]
    v = df["value2"]
    return (v - g.transform("mean")) / g.transform("std")


def polars_transform() -> pl.Expr:
    v = pl.col("value2")
    return (v - v.mean().over("name")) / v.std().over("name")

rand_df_pl = create_frame(50_000_000, 50_000)
rand_df_pd = rand_df_pl.to_pandas()

The Polars version tends to be 10-100% slower on my machine:

%timeit rand_df_pl.select(polars_transform())
3.08 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pandas_transform(rand_df_pd)
2.18 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This example isn’t telling you to use Pandas in this specific situation. Once you add in the time spent reading a file, Polars likely wins.

And even here, if you sort by the name col, Polars wins again. It has fast-track algorithms for sorted data.

3.5 Summary

  • Polars is really fast. Pandas was already respectably fast and Polars wipes the floor with it.
  • You can still make Polars slow if you do silly things with it, but compared to Pandas it’s easier to do the right thing in the first place.
  • Polars works well with NumPy ufuncs.
  • There are still some situations where Pandas can be faster. They are probably not compelling, but we shouldn’t pretend they don’t exist.