As you’ve probably heard by now, Polars is very, very fast. Well-written Polars is quicker than well-written Pandas, and it’s easier to write Polars well. With that in mind…
3.1 Six fairly obvious performance rules
Here are some tips that are almost always a good idea:
Use the lazy API.
Use Exprs, and don’t use .apply unless you really have to.
Use the smallest necessary numeric types (so if you have an integer between 0 and 255, use pl.UInt8, not pl.Int64). This will save both time and space.
Use efficient storage (if you’re dumping stuff in files, Parquet is a good choice).
Use categoricals for recurring strings (but note that it may not be worth it if there’s not much repetition).
Only select the columns you need.
Tip
If your colleagues are happy with CSVs and can’t be convinced to use something else, tell them that the Modern Polars book says they should feel bad.
These are basically the same rules you’d follow when using Pandas, except for the one about the lazy API. Now for some comparisons between the performance of idiomatic Pandas and Polars.
3.2 Polars is faster at the boring stuff
Here we’ll clean up a messy dataset, kindly provided by Kaggle user Rachit Toshniwal as a deliberate example of a really crap CSV. Most of the cleanup involves extracting numeric data from awkward strings.
Also, the data is too small so I’ve concatenated it to itself 20 times. We’re not doing anything that will care about the duplication. Here’s how the raw table looks:
```python
import pandas as pd

pd.read_csv("../data/fifa21_raw_big.csv", dtype="string", nrows=2)
```
|   | ID | Name | LongName | photoUrl | playerUrl | Nationality | Age | ↓OVA | POT | Club | ... | A/W | D/W | IR | PAC | SHO | PAS | DRI | DEF | PHY | Hits |
|---|----|------|----------|----------|-----------|-------------|-----|------|-----|------|-----|-----|-----|----|-----|-----|-----|-----|-----|-----|------|
| 0 | 158023 | L. Messi | Lionel Messi | https://cdn.sofifa.com/players/158/023/21_60.png | http://sofifa.com/player/158023/lionel-messi/2... | Argentina | 33 | 93 | 93 | FC Barcelona | ... | Medium | Low | 5 ★ | 85 | 92 | 91 | 95 | 38 | 65 | 771 |
| 1 | 20801 | Cristiano Ronaldo | C. Ronaldo dos Santos Aveiro | https://cdn.sofifa.com/players/020/801/21_60.png | http://sofifa.com/player/20801/c-ronaldo-dos-s... | Portugal | 35 | 92 | 92 | Juventus | ... | High | Low | 5 ★ | 89 | 93 | 81 | 89 | 35 | 77 | 562 |

2 rows × 77 columns
For this exercise we’ll assume we want to make use of all the columns. First some boilerplate where we map out the different data types:
```python
import pandas as pd
import polars as pl
import numpy as np
import math

str_cols = [
    "Name",
    "LongName",
    "playerUrl",
    "photoUrl",
]
initial_category_cols_pl = [
    "Nationality",
    "Preferred Foot",
    "Best Position",
    "A/W",
    "D/W",
]
category_cols = [*initial_category_cols_pl, "Club"]
date_cols = ["Joined", "Loan Date End"]
# these all start with the euro symbol and end with 0, M or K
money_cols = ["Value", "Wage", "Release Clause"]
star_cols = ["W/F", "SM", "IR"]
# Contract col is a range of years
# Positions is a list of positions
# Height is in cm
# Weight is in kg
# Hits is numbers with K and M
messy_cols = ["Contract", "Positions", "Height", "Weight", "Hits"]
initially_str_cols = str_cols + date_cols + money_cols + star_cols + messy_cols
initially_str_cols_pl = [*initially_str_cols, "Club"]
u32_cols = ["ID", "Total Stats"]
u8_cols = [
    "Age", "↓OVA", "POT", "BOV", "Crossing", "Finishing", "Heading Accuracy",
    "Short Passing", "Volleys", "Dribbling", "Curve", "FK Accuracy",
    "Long Passing", "Ball Control", "Acceleration", "Sprint Speed", "Agility",
    "Reactions", "Balance", "Shot Power", "Jumping", "Stamina", "Strength",
    "Long Shots", "Aggression", "Interceptions", "Positioning", "Vision",
    "Penalties", "Composure", "Marking", "Standing Tackle", "Sliding Tackle",
    "GK Diving", "GK Handling", "GK Kicking", "GK Positioning", "GK Reflexes",
    "PAC", "SHO", "PAS", "DRI", "DEF", "PHY",
]
u16_cols = [
    "Attacking", "Skill", "Movement", "Power", "Mentality", "Defending",
    "Goalkeeping", "Total Stats", "Base Stats",
]
```
3.2.1 Dtypes
Here are the initial dtypes for the two dataframes:
```python
# can't use UInt8/16 in scan_csv
dtypes_pl = (
    {col: pl.Utf8 for col in initially_str_cols_pl}
    | {col: pl.Categorical for col in initial_category_cols_pl}
    | {col: pl.UInt32 for col in [*u32_cols, *u16_cols, *u8_cols]}
)
```
```python
dtypes_pd = (
    {col: pd.StringDtype() for col in initially_str_cols}
    | {col: pd.CategoricalDtype() for col in category_cols}
    | {col: "uint32" for col in u32_cols}
    | {col: "uint8" for col in u8_cols}
    | {col: "uint16" for col in u16_cols}
)
```
One thing I’ll note here is that Pandas numeric types are somewhat confusing: "uint32" means np.uint32 which is not the same thing as pd.UInt32Dtype(). Only the latter is nullable. On the other hand, Polars has just one unsigned 32-bit integer type, and it’s nullable.
Tip
Polars expressions have a shrink_dtype method that can be more convenient than specifying dtypes manually. It’s not magic, though: it has to spend time finding the min and max of each column.
3.2.2 Data cleaning
There’s not much that you haven’t seen here already, so we won’t explain the code line by line. The main new thing here is pl.when for ternary expressions.
```python
%%time
new_cols_pl = (
    [
        pl.col("Club").str.strip_chars().cast(pl.Categorical),
        parse_suffixed_num_pl(pl.col("Hits")).cast(pl.UInt32),
        pl.col("Positions").str.split(","),
        parse_height_pl(pl.col("Height")),
        parse_weight_pl(pl.col("Weight")),
    ]
    + [parse_date_pl(pl.col(col)) for col in date_cols]
    + [parse_money_pl(pl.col(col)) for col in money_cols]
    + [parse_star_pl(pl.col(col)) for col in star_cols]
    + parse_contract_pl(pl.col("Contract"))
    + [pl.col(col).cast(pl.UInt16) for col in u16_cols]
    + [pl.col(col).cast(pl.UInt8) for col in u8_cols]
)
fifa_pl = (
    pl.scan_csv("../data/fifa21_raw_big.csv", schema_overrides=dtypes_pl)
    .with_columns(new_cols_pl)
    .drop("Contract")
    .rename({"↓OVA": "OVA"})
    .collect()
)
```
CPU times: user 1.78 s, sys: 197 ms, total: 1.98 s
Wall time: 216 ms
<timed exec>:20: CategoricalRemappingWarning: Local categoricals have different encodings, expensive re-encoding is done to perform this merge operation. Consider using a StringCache or an Enum type if the categories are known in advance
```python
%%time
fifa_pd = (
    pd.read_csv("../data/fifa21_raw_big.csv", dtype=dtypes_pd)
    .assign(
        Club=lambda df: df["Club"].cat.rename_categories(lambda c: c.strip()),
        # bind col as a default argument so each lambda sees its own column,
        # rather than all of them closing over the comprehension's last value
        **{col: (lambda df, col=col: parse_date_pd(df[col])) for col in date_cols},
        **{col: (lambda df, col=col: parse_money_pd(df[col])) for col in money_cols},
        **{col: (lambda df, col=col: parse_star_pd(df[col])) for col in star_cols},
        Hits=lambda df: parse_suffixed_num_pd(df["Hits"]).astype(pd.UInt32Dtype()),
        Positions=lambda df: df["Positions"].str.split(","),
        Height=lambda df: parse_height_pd(df["Height"]),
        Weight=lambda df: parse_weight_pd(df["Weight"]),
    )
    .pipe(parse_contract_pd)
    .rename(columns={"↓OVA": "OVA"})
)
```
CPU times: user 3.55 s, sys: 347 ms, total: 3.9 s
Wall time: 3.9 s
You could play around with the timings here and even try the .profile method to see what Polars spends its time on. In this scenario the speed advantage of Polars likely comes down to three things:
It is much faster at reading CSVs.
It is much faster at processing strings.
It can select/assign columns in parallel.
3.3 NumPy might make Polars faster sometimes
Polars gets along well with NumPy ufuncs, even in lazy mode (which is interesting because NumPy has no lazy API). Let’s see how this looks by calculating the great-circle distance between a bunch of coordinates.
3.3.1 Get the data
We create a lazy dataframe containing pairs of airports and their coordinates:
One use case for NumPy ufuncs is doing computations that Polars expressions don’t support. In this example Polars can do everything we need, though the ufunc version ends up being slightly faster:
3.22 s ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On my machine the NumPy version used to be 5-20% faster than the pure Polars version, but this is no longer the case. Still, you may want to test whether it helps in your case:
3.87 s ± 79.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This may not be a huge performance difference, but it at least means you don’t sacrifice speed when relying on NumPy. There are some gotchas, though, so watch out for those.
Also watch out for .to_numpy(): you don’t always need to call it, and it can slow things down: