1 Indexing (Or Lack Thereof)

1.1 Fetch Data (No Dataframes Here)

First we fetch some flight delay data. This part isn’t about dataframes so feel free to skip the code.

Code

from pathlib import Path
from zipfile import ZipFile
import requests

data_dir = Path("../data") # replace this with a directory of your choice
dest = data_dir / "flights.csv.zip"

if not dest.exists():
    r = requests.get(
        "https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
        verify=False,
        stream=True,
    )

    data_dir.mkdir(exist_ok=True)
    with dest.open("wb") as f:
        for chunk in r.iter_content(chunk_size=102400):
            if chunk:
                f.write(chunk)

    with ZipFile(dest) as zf:
        zf.extract(zf.filelist[0].filename, path=data_dir)

extracted = data_dir / "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"

1.2 Read the data

Tip

The examples in this book use the lazy evaluation feature of Polars less than you should. It’s just inconvenient to use the lazy API when displaying dozens of intermediate results for educational purposes.

Polars
Pandas

import polars as pl
pl.Config.set_tbl_rows(5) # don't print too many rows in the book
df_pl = pl.read_csv(extracted, truncate_ragged_lines=True)
df_pl

shape: (537_902, 110)

Year	Quarter	Month	DayofMonth	DayOfWeek	FlightDate	Reporting_Airline	DOT_ID_Reporting_Airline	IATA_CODE_Reporting_Airline	Tail_Number	Flight_Number_Reporting_Airline	OriginAirportID	OriginAirportSeqID	OriginCityMarketID	Origin	OriginCityName	OriginState	OriginStateFips	OriginStateName	OriginWac	DestAirportID	DestAirportSeqID	DestCityMarketID	Dest	DestCityName	DestState	DestStateFips	DestStateName	DestWac	CRSDepTime	DepTime	DepDelay	DepDelayMinutes	DepDel15	DepartureDelayGroups	DepTimeBlk	TaxiOut	…	Div1TotalGTime	Div1LongestGTime	Div1WheelsOff	Div1TailNum	Div2Airport	Div2AirportID	Div2AirportSeqID	Div2WheelsOn	Div2TotalGTime	Div2LongestGTime	Div2WheelsOff	Div2TailNum	Div3Airport	Div3AirportID	Div3AirportSeqID	Div3WheelsOn	Div3TotalGTime	Div3LongestGTime	Div3WheelsOff	Div3TailNum	Div4Airport	Div4AirportID	Div4AirportSeqID	Div4WheelsOn	Div4TotalGTime	Div4LongestGTime	Div4WheelsOff	Div4TailNum	Div5Airport	Div5AirportID	Div5AirportSeqID	Div5WheelsOn	Div5TotalGTime	Div5LongestGTime	Div5WheelsOff	Div5TailNum
i64	i64	i64	i64	i64	str	str	i64	str	str	i64	i64	i64	i64	str	str	str	i64	str	i64	i64	i64	i64	str	str	str	i64	str	i64	i64	str	f64	f64	f64	i64	str	f64	…	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str
2022	1	1	14	5	"2022-01-14"	"YX"	20452	"YX"	"N119HQ"	4879	11066	1106606	31066	"CMH"	"Columbus, OH"	"OH"	39	"Ohio"	44	11278	1127805	30852	"DCA"	"Washington, DC"	"VA"	51	"Virginia"	38	1224	"1221"	-3.0	0.0	0.0	-1	"1200-1259"	28.0	…	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	null
2022	1	1	15	6	"2022-01-15"	"YX"	20452	"YX"	"N122HQ"	4879	11066	1106606	31066	"CMH"	"Columbus, OH"	"OH"	39	"Ohio"	44	11278	1127805	30852	"DCA"	"Washington, DC"	"VA"	51	"Virginia"	38	1224	"1214"	-10.0	0.0	0.0	-1	"1200-1259"	19.0	…	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	null
2022	1	1	16	7	"2022-01-16"	"YX"	20452	"YX"	"N412YX"	4879	11066	1106606	31066	"CMH"	"Columbus, OH"	"OH"	39	"Ohio"	44	11278	1127805	30852	"DCA"	"Washington, DC"	"VA"	51	"Virginia"	38	1224	"1218"	-6.0	0.0	0.0	-1	"1200-1259"	16.0	…	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	null
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2022	1	1	6	4	"2022-01-06"	"DL"	19790	"DL"	"N989AT"	1579	11057	1105703	31057	"CLT"	"Charlotte, NC"	"NC"	37	"North Carolina"	36	10397	1039707	30397	"ATL"	"Atlanta, GA"	"GA"	13	"Georgia"	34	1258	"1257"	-1.0	0.0	0.0	-1	"1200-1259"	15.0	…	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	null
2022	1	1	6	4	"2022-01-06"	"DL"	19790	"DL"	"N815DN"	1580	14869	1486903	34614	"SLC"	"Salt Lake City, UT"	"UT"	49	"Utah"	87	14057	1405702	34057	"PDX"	"Portland, OR"	"OR"	41	"Oregon"	92	2240	"2231"	-9.0	0.0	0.0	-1	"2200-2259"	10.0	…	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	""	null	null	""	null	null	""	""	null

import pandas as pd
pd.options.display.max_rows = 5
df_pd = pd.read_csv(extracted)
df_pd

/tmp/ipykernel_667767/2805799744.py:3: DtypeWarning: Columns (76,77,84) have mixed types. Specify dtype option on import or set low_memory=False.
  df_pd = pd.read_csv(extracted)

	Year	Quarter	Month	DayofMonth	DayOfWeek	FlightDate	Reporting_Airline	DOT_ID_Reporting_Airline	IATA_CODE_Reporting_Airline	Tail_Number	...	Div4TailNum	Div5Airport	Div5AirportID	Div5AirportSeqID	Div5WheelsOn	Div5TotalGTime	Div5LongestGTime	Div5WheelsOff	Div5TailNum	Unnamed: 109
0	2022	1	1	14	5	2022-01-14	YX	20452	YX	N119HQ	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2022	1	1	15	6	2022-01-15	YX	20452	YX	N122HQ	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
537900	2022	1	1	6	4	2022-01-06	DL	19790	DL	N989AT	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
537901	2022	1	1	6	4	2022-01-06	DL	19790	DL	N815DN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

537902 rows × 110 columns

1.3 Indexing

Pandas uses a special index type that is quite powerful for selecting rows and columns but is also very complicated. To quote Modern Pandas:

The complexity of pandas’ indexing is a microcosm for the complexity of the pandas API in general. There’s a reason for the complexity (well, most of it), but that’s not much consolation while you’re learning. Still, all of these ways of indexing really are useful enough to justify their inclusion in the library.

It’s true that Pandas indexing is quite useful, but it’s also true that everyone always forgets how to do anything non-trivial with it. The benefits of being able to put df.loc[pd.IndexSlice[:, 'B0':'B1'], :]] in your code base are somewhat dubious.

Polars avoids this complexity by simply not having an index. It just has ordinary methods like .select, .filter and .head for accessing a subset of rows or columns.

1.4 Slicing vs Selecting

In Pandas you can subset a dataframe with .loc[], .iloc[] or just []. In Polars you select rows and columns with expressions as noted above.

You can also use square bracket indexing in Polars, but it doesn’t work in lazy mode so is only to be used when convenience is most important.

Here are some examples:

1.4.1 Rows by number, columns by name

Using head and tail:

df_pl.select(["Dest", "Tail_Number"]).head(16).tail(4)

shape: (4, 2)

Dest	Tail_Number
str	str
"DCA"	"N132HQ"
"DCA"	"N109HQ"
"DCA"	"N421YX"
"DCA"	"N137HQ"

Or using gather:

df_pl.select(pl.col(["Dest", "Tail_Number"]).gather(list(range(12, 16))))

shape: (4, 2)

Dest	Tail_Number
str	str
"DCA"	"N132HQ"
"DCA"	"N109HQ"
"DCA"	"N421YX"
"DCA"	"N137HQ"

df_pl[12:16, ["Dest", "Tail_Number"]]

shape: (4, 2)

Dest	Tail_Number
str	str
"DCA"	"N132HQ"
"DCA"	"N109HQ"
"DCA"	"N421YX"
"DCA"	"N137HQ"

df_pd.loc[12:15, ["Dest", "Tail_Number"]]

	Dest	Tail_Number
12	DCA	N132HQ
13	DCA	N109HQ
14	DCA	N421YX
15	DCA	N137HQ

1.4.2 Rows by string index, columns by name

Since there’s no such thing as an index in Polars, so we just use .filter:

Polars
Pandas

(
    df_pl
    .filter(pl.col("IATA_CODE_Reporting_Airline").is_in(['AA', 'DL']))
    .select(["IATA_CODE_Reporting_Airline", "Dest", "Tail_Number"])
)

shape: (138_363, 3)

IATA_CODE_Reporting_Airline	Dest	Tail_Number
str	str	str
"DL"	"LGA"	"N315DN"
"DL"	"FLL"	"N545US"
"DL"	"ATL"	"N545US"
…	…	…
"DL"	"ATL"	"N989AT"
"DL"	"PDX"	"N815DN"

(
    df_pd
    .set_index("IATA_CODE_Reporting_Airline")
    .loc[['AA', 'DL'], ["Dest", "Tail_Number"]]
)

	Dest	Tail_Number
IATA_CODE_Reporting_Airline
AA	LAX	N106NN
AA	LAX	N112AN
...	...	...
DL	ATL	N989AT
DL	PDX	N815DN

138363 rows × 2 columns

1.4.3 Rows by number, columns by number

The Polars docs recommend doing this the evil way with square brackets, so make of that what you will. Selecting columns by number isn’t a very common operation anyway.

Polars
Pandas

df_pl[[0, 1, 3], [0, 1]]

shape: (3, 2)

Year	Quarter
i64	i64
2022	1
2022	1
2022	1

df_pd.iloc[[0, 1, 3], [0, 1]]

	Year	Quarter
0	2022	1
1	2022	1
3	2022	1

1.5 SettingWithCopy

Pandas has this cute thing where if you assign values to some subset of the dataframe with square bracket indexing, it doesn’t work and gives the notorious SettingWithCopyWarning. To be fair, this warning also tells you to assign using .loc. Unfortunately many people in the Pandas community can’t read and instead just ignore the warning.

Polars is not yet popular enough to attract the same crowd, but when it does it should not run into the same problem, as the only way to add or overwrite columns in Polars is the with_columns method.

f = pl.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.with_columns(
    pl.when(pl.col("a") <= 3)
    .then(pl.col("b") // 10)
    .otherwise(pl.col("b"))
)

shape: (5, 2)

a	b
i64	i64
1	1
2	2
3	3
4	40
5	50

f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f[f['a'] <= 3]['b'] = f['b'] // 10
f

/tmp/ipykernel_667767/1317853993.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  f[f['a'] <= 3]['b'] = f['b'] // 10

	a	b
0	1	10
1	2	20
2	3	30
3	4	40
4	5	50

f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.loc[f['a'] <= 3, "b"] = f['b'] // 10
f

	a	b
0	1	1
1	2	2
2	3	3
3	4	40
4	5	50

f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.assign(b=f["b"].mask(f["a"] <=3, f["b"] // 10))

	a	b
0	1	1
1	2	2
2	3	3
3	4	40
4	5	50

1.6 Summary

Basically, there’s no index in Polars and square brackets are bad most of the time. I think the lack of an index is quite acceptable even if there are cases where it’s useful. Most Pandas users just call .reset_index() all the time anyway.