This chapter is about indexes, which are the source of much confusion in Pandas and simply don’t exist in Polars.
1.1 Fetch Data (No Dataframes Here)
First we fetch some flight delay data. This part isn’t about dataframes so feel free to skip the code.
Code
from pathlib import Pathfrom zipfile import ZipFileimport requestsdata_dir = Path("../data") # replace this with a directory of your choicedest = data_dir /"flights.csv.zip"ifnot dest.exists(): r = requests.get("https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip", verify=False, stream=True, ) data_dir.mkdir(exist_ok=True)with dest.open("wb") as f:for chunk in r.iter_content(chunk_size=102400):if chunk: f.write(chunk)with ZipFile(dest) as zf: zf.extract(zf.filelist[0].filename, path=data_dir)extracted = data_dir /"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"
1.2 Read the data
Tip
The examples in this book use the lazy evaluation feature of Polars less than you should. It’s just inconvenient to use the lazy API when displaying dozens of intermediate results for educational purposes.
import polars as plpl.Config.set_tbl_rows(5) # don't print too many rows in the bookdf_pl = pl.read_csv(extracted, truncate_ragged_lines=True)df_pl
shape: (537_902, 110)
Year
Quarter
Month
DayofMonth
DayOfWeek
FlightDate
Reporting_Airline
DOT_ID_Reporting_Airline
IATA_CODE_Reporting_Airline
Tail_Number
Flight_Number_Reporting_Airline
OriginAirportID
OriginAirportSeqID
OriginCityMarketID
Origin
OriginCityName
OriginState
OriginStateFips
OriginStateName
OriginWac
DestAirportID
DestAirportSeqID
DestCityMarketID
Dest
DestCityName
DestState
DestStateFips
DestStateName
DestWac
CRSDepTime
DepTime
DepDelay
DepDelayMinutes
DepDel15
DepartureDelayGroups
DepTimeBlk
TaxiOut
…
Div1TotalGTime
Div1LongestGTime
Div1WheelsOff
Div1TailNum
Div2Airport
Div2AirportID
Div2AirportSeqID
Div2WheelsOn
Div2TotalGTime
Div2LongestGTime
Div2WheelsOff
Div2TailNum
Div3Airport
Div3AirportID
Div3AirportSeqID
Div3WheelsOn
Div3TotalGTime
Div3LongestGTime
Div3WheelsOff
Div3TailNum
Div4Airport
Div4AirportID
Div4AirportSeqID
Div4WheelsOn
Div4TotalGTime
Div4LongestGTime
Div4WheelsOff
Div4TailNum
Div5Airport
Div5AirportID
Div5AirportSeqID
Div5WheelsOn
Div5TotalGTime
Div5LongestGTime
Div5WheelsOff
Div5TailNum
i64
i64
i64
i64
i64
str
str
i64
str
str
i64
i64
i64
i64
str
str
str
i64
str
i64
i64
i64
i64
str
str
str
i64
str
i64
i64
str
f64
f64
f64
i64
str
f64
…
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
str
2022
1
1
14
5
"2022-01-14"
"YX"
20452
"YX"
"N119HQ"
4879
11066
1106606
31066
"CMH"
"Columbus, OH"
"OH"
39
"Ohio"
44
11278
1127805
30852
"DCA"
"Washington, DC"
"VA"
51
"Virginia"
38
1224
"1221"
-3.0
0.0
0.0
-1
"1200-1259"
28.0
…
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
null
2022
1
1
15
6
"2022-01-15"
"YX"
20452
"YX"
"N122HQ"
4879
11066
1106606
31066
"CMH"
"Columbus, OH"
"OH"
39
"Ohio"
44
11278
1127805
30852
"DCA"
"Washington, DC"
"VA"
51
"Virginia"
38
1224
"1214"
-10.0
0.0
0.0
-1
"1200-1259"
19.0
…
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
null
2022
1
1
16
7
"2022-01-16"
"YX"
20452
"YX"
"N412YX"
4879
11066
1106606
31066
"CMH"
"Columbus, OH"
"OH"
39
"Ohio"
44
11278
1127805
30852
"DCA"
"Washington, DC"
"VA"
51
"Virginia"
38
1224
"1218"
-6.0
0.0
0.0
-1
"1200-1259"
16.0
…
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
null
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
2022
1
1
6
4
"2022-01-06"
"DL"
19790
"DL"
"N989AT"
1579
11057
1105703
31057
"CLT"
"Charlotte, NC"
"NC"
37
"North Carolina"
36
10397
1039707
30397
"ATL"
"Atlanta, GA"
"GA"
13
"Georgia"
34
1258
"1257"
-1.0
0.0
0.0
-1
"1200-1259"
15.0
…
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
null
2022
1
1
6
4
"2022-01-06"
"DL"
19790
"DL"
"N815DN"
1580
14869
1486903
34614
"SLC"
"Salt Lake City, UT"
"UT"
49
"Utah"
87
14057
1405702
34057
"PDX"
"Portland, OR"
"OR"
41
"Oregon"
92
2240
"2231"
-9.0
0.0
0.0
-1
"2200-2259"
10.0
…
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
""
null
null
""
null
null
""
""
null
import pandas as pdpd.options.display.max_rows =5df_pd = pd.read_csv(extracted)df_pd
/tmp/ipykernel_667767/2805799744.py:3: DtypeWarning: Columns (76,77,84) have mixed types. Specify dtype option on import or set low_memory=False.
df_pd = pd.read_csv(extracted)
Year
Quarter
Month
DayofMonth
DayOfWeek
FlightDate
Reporting_Airline
DOT_ID_Reporting_Airline
IATA_CODE_Reporting_Airline
Tail_Number
...
Div4TailNum
Div5Airport
Div5AirportID
Div5AirportSeqID
Div5WheelsOn
Div5TotalGTime
Div5LongestGTime
Div5WheelsOff
Div5TailNum
Unnamed: 109
0
2022
1
1
14
5
2022-01-14
YX
20452
YX
N119HQ
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1
2022
1
1
15
6
2022-01-15
YX
20452
YX
N122HQ
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
537900
2022
1
1
6
4
2022-01-06
DL
19790
DL
N989AT
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
537901
2022
1
1
6
4
2022-01-06
DL
19790
DL
N815DN
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
537902 rows × 110 columns
1.3 Indexing
Pandas uses a special index type that is quite powerful for selecting rows and columns but is also very complicated. To quote Modern Pandas:
The complexity of pandas’ indexing is a microcosm for the complexity of the pandas API in general. There’s a reason for the complexity (well, most of it), but that’s not much consolation while you’re learning. Still, all of these ways of indexing really are useful enough to justify their inclusion in the library.
It’s true that Pandas indexing is quite useful, but it’s also true that everyone always forgets how to do anything non-trivial with it. The benefits of being able to put df.loc[pd.IndexSlice[:, 'B0':'B1'], :]] in your code base are somewhat dubious.
Polars avoids this complexity by simply not having an index. It just has ordinary methods like .select, .filter and .head for accessing a subset of rows or columns.
1.4 Slicing vs Selecting
In Pandas you can subset a dataframe with .loc[], .iloc[] or just []. In Polars you select rows and columns with expressions as noted above.
You can also use square bracket indexing in Polars, but it doesn’t work in lazy mode so is only to be used when convenience is most important.
The Polars docs recommend doing this the evil way with square brackets, so make of that what you will. Selecting columns by number isn’t a very common operation anyway.
Pandas has this cute thing where if you assign values to some subset of the dataframe with square bracket indexing, it doesn’t work and gives the notorious SettingWithCopyWarning. To be fair, this warning also tells you to assign using .loc. Unfortunately many people in the Pandas community can’t read and instead just ignore the warning.
Polars is not yet popular enough to attract the same crowd, but when it does it should not run into the same problem, as the only way to add or overwrite columns in Polars is the with_columns method.
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})f[f['a'] <=3]['b'] = f['b'] //10f
/tmp/ipykernel_667767/1317853993.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
f[f['a'] <= 3]['b'] = f['b'] // 10
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})f.assign(b=f["b"].mask(f["a"] <=3, f["b"] //10))
a
b
0
1
1
1
2
2
2
3
3
3
4
40
4
5
50
1.6 Summary
Basically, there’s no index in Polars and square brackets are bad most of the time. I think the lack of an index is quite acceptable even if there are cases where it’s useful. Most Pandas users just call .reset_index() all the time anyway.