1  Indexing (Or Lack Thereof)

This chapter is about indexes, which are the source of much confusion in Pandas and simply don’t exist in Polars.

1.1 Fetch Data (No Dataframes Here)

First we fetch some flight delay data. This part isn’t about dataframes so feel free to skip the code.

Code
from pathlib import Path
from zipfile import ZipFile
import requests

data_dir = Path("../data") # replace this with a directory of your choice
dest = data_dir / "flights.csv.zip"

if not dest.exists():
    r = requests.get(
        "https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
        verify=False,
        stream=True,
    )

    data_dir.mkdir(exist_ok=True)
    with dest.open("wb") as f:
        for chunk in r.iter_content(chunk_size=102400):
            if chunk:
                f.write(chunk)

    with ZipFile(dest) as zf:
        zf.extract(zf.filelist[0].filename, path=data_dir)

extracted = data_dir / "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"

1.2 Read the data

Tip

The examples in this book use the lazy evaluation feature of Polars less than you should. It’s just inconvenient to use the lazy API when displaying dozens of intermediate results for educational purposes.

import polars as pl
pl.Config.set_tbl_rows(5) # don't print too many rows in the book
df_pl = pl.read_csv(extracted, truncate_ragged_lines=True)
df_pl
shape: (537_902, 110)
YearQuarterMonthDayofMonthDayOfWeekFlightDateReporting_AirlineDOT_ID_Reporting_AirlineIATA_CODE_Reporting_AirlineTail_NumberFlight_Number_Reporting_AirlineOriginAirportIDOriginAirportSeqIDOriginCityMarketIDOriginOriginCityNameOriginStateOriginStateFipsOriginStateNameOriginWacDestAirportIDDestAirportSeqIDDestCityMarketIDDestDestCityNameDestStateDestStateFipsDestStateNameDestWacCRSDepTimeDepTimeDepDelayDepDelayMinutesDepDel15DepartureDelayGroupsDepTimeBlkTaxiOutDiv1TotalGTimeDiv1LongestGTimeDiv1WheelsOffDiv1TailNumDiv2AirportDiv2AirportIDDiv2AirportSeqIDDiv2WheelsOnDiv2TotalGTimeDiv2LongestGTimeDiv2WheelsOffDiv2TailNumDiv3AirportDiv3AirportIDDiv3AirportSeqIDDiv3WheelsOnDiv3TotalGTimeDiv3LongestGTimeDiv3WheelsOffDiv3TailNumDiv4AirportDiv4AirportIDDiv4AirportSeqIDDiv4WheelsOnDiv4TotalGTimeDiv4LongestGTimeDiv4WheelsOffDiv4TailNumDiv5AirportDiv5AirportIDDiv5AirportSeqIDDiv5WheelsOnDiv5TotalGTimeDiv5LongestGTimeDiv5WheelsOffDiv5TailNum
i64i64i64i64i64strstri64strstri64i64i64i64strstrstri64stri64i64i64i64strstrstri64stri64i64strf64f64f64i64strf64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
202211145"2022-01-14""YX"20452"YX""N119HQ"487911066110660631066"CMH""Columbus, OH""OH"39"Ohio"4411278112780530852"DCA""Washington, DC…"VA"51"Virginia"381224"1221"-3.00.00.0-1"1200-1259"28.0nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""null
202211156"2022-01-15""YX"20452"YX""N122HQ"487911066110660631066"CMH""Columbus, OH""OH"39"Ohio"4411278112780530852"DCA""Washington, DC…"VA"51"Virginia"381224"1214"-10.00.00.0-1"1200-1259"19.0nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""null
202211167"2022-01-16""YX"20452"YX""N412YX"487911066110660631066"CMH""Columbus, OH""OH"39"Ohio"4411278112780530852"DCA""Washington, DC…"VA"51"Virginia"381224"1218"-6.00.00.0-1"1200-1259"16.0nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""null
20221164"2022-01-06""DL"19790"DL""N989AT"157911057110570331057"CLT""Charlotte, NC""NC"37"North Carolina…3610397103970730397"ATL""Atlanta, GA""GA"13"Georgia"341258"1257"-1.00.00.0-1"1200-1259"15.0nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""null
20221164"2022-01-06""DL"19790"DL""N815DN"158014869148690334614"SLC""Salt Lake City…"UT"49"Utah"8714057140570234057"PDX""Portland, OR""OR"41"Oregon"922240"2231"-9.00.00.0-1"2200-2259"10.0nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""""nullnull""nullnull""""null
import pandas as pd
pd.options.display.max_rows = 5
df_pd = pd.read_csv(extracted)
df_pd
/tmp/ipykernel_28258/2805799744.py:3: DtypeWarning:

Columns (76,77,84) have mixed types. Specify dtype option on import or set low_memory=False.
Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline DOT_ID_Reporting_Airline IATA_CODE_Reporting_Airline Tail_Number ... Div4TailNum Div5Airport Div5AirportID Div5AirportSeqID Div5WheelsOn Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum Unnamed: 109
0 2022 1 1 14 5 2022-01-14 YX 20452 YX N119HQ ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2022 1 1 15 6 2022-01-15 YX 20452 YX N122HQ ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
537900 2022 1 1 6 4 2022-01-06 DL 19790 DL N989AT ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
537901 2022 1 1 6 4 2022-01-06 DL 19790 DL N815DN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

537902 rows × 110 columns

1.3 Indexing

Pandas uses a special index type that is quite powerful for selecting rows and columns but is also very complicated. To quote Modern Pandas:

The complexity of pandas’ indexing is a microcosm for the complexity of the pandas API in general. There’s a reason for the complexity (well, most of it), but that’s not much consolation while you’re learning. Still, all of these ways of indexing really are useful enough to justify their inclusion in the library.

It’s true that Pandas indexing is quite useful, but it’s also true that everyone always forgets how do to anything non-trivial with it. The benefits of being able to put df.loc[pd.IndexSlice[:, 'B0':'B1'], :]] in your code base are somewhat dubious.

Polars avoids this complexity by simply not having an index. It just has ordinary methods like .select, .filter and .head for accessing a subset of rows or columns.

1.4 Slicing vs Selecting

In Pandas you can subset a dataframe with .loc[], .iloc[] or just []. In Polars you select rows and columns with expressions as noted above.

You can also use square bracket indexing in Polars, but it doesn’t work in lazy mode so is only to be used when convenience is most important.

Here are some examples:

1.4.1 Rows by number, columns by name

Using head and tail:

df_pl.select(["Dest", "Tail_Number"]).head(16).tail(4)
shape: (4, 2)
DestTail_Number
strstr
"DCA""N132HQ"
"DCA""N109HQ"
"DCA""N421YX"
"DCA""N137HQ"

Or using take:

df_pl.select(pl.col(["Dest", "Tail_Number"]).gather(list(range(12, 16))))
shape: (4, 2)
DestTail_Number
strstr
"DCA""N132HQ"
"DCA""N109HQ"
"DCA""N421YX"
"DCA""N137HQ"
df_pl[12:16, ["Dest", "Tail_Number"]]
shape: (4, 2)
DestTail_Number
strstr
"DCA""N132HQ"
"DCA""N109HQ"
"DCA""N421YX"
"DCA""N137HQ"
df_pd.loc[12:15, ["Dest", "Tail_Number"]]
Dest Tail_Number
12 DCA N132HQ
13 DCA N109HQ
14 DCA N421YX
15 DCA N137HQ

1.4.2 Rows by string index, columns by name

Since there’s no such thing as an index in Polars, so we just use .filter:

(
    df_pl
    .filter(pl.col("IATA_CODE_Reporting_Airline").is_in(['AA', 'DL']))
    .select(["IATA_CODE_Reporting_Airline", "Dest", "Tail_Number"])
)
shape: (138_363, 3)
IATA_CODE_Reporting_AirlineDestTail_Number
strstrstr
"DL""LGA""N315DN"
"DL""FLL""N545US"
"DL""ATL""N545US"
"DL""ATL""N989AT"
"DL""PDX""N815DN"
(
    df_pd
    .set_index("IATA_CODE_Reporting_Airline")
    .loc[['AA', 'DL'], ["Dest", "Tail_Number"]]
)
Dest Tail_Number
IATA_CODE_Reporting_Airline
AA LAX N106NN
AA LAX N112AN
... ... ...
DL ATL N989AT
DL PDX N815DN

138363 rows × 2 columns

1.4.3 Rows by number, columns by number

The Polars docs recommend doing this the evil way with square brackets, so make of that what you will. Selecting columns by number isn’t a very common operation anyway.

df_pl[[0, 1, 3], [0, 1]]
shape: (3, 2)
YearQuarter
i64i64
20221
20221
20221
df_pd.iloc[[0, 1, 3], [0, 1]]
Year Quarter
0 2022 1
1 2022 1
3 2022 1

1.5 SettingWithCopy

Pandas has this cute thing where if you assign values to some subset of the dataframe with square bracket indexing, it doesn’t work and gives the notorious SettingWithCopyWarning. To be fair, this warning also tells you to assign using .loc. Unfortunately many people in the Pandas community can’t read and instead just ignore the warning.

Polars is not yet popular enough to attact the same crowd, but when it does it should not run into the same problem, as the only way to add or overwrite columns in Polars is the with_columns method.

f = pl.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.with_columns(
    pl.when(pl.col("a") <= 3)
    .then(pl.col("b") // 10)
    .otherwise(pl.col("b"))
)
shape: (5, 2)
ab
i64i64
11
22
33
440
550
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f[f['a'] <= 3]['b'] = f['b'] // 10
f
/tmp/ipykernel_28258/1317853993.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
a b
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.loc[f['a'] <= 3, "b"] = f['b'] // 10
f
a b
0 1 1
1 2 2
2 3 3
3 4 40
4 5 50
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.assign(b=f["b"].mask(f["a"] <=3, f["b"] // 10))
a b
0 1 1
1 2 2
2 3 3
3 4 40
4 5 50

1.6 Summary

Basically, there’s no index in Polars and square brackets are bad most of the time. I think the lack of an index is quite acceptable even if there are cases where it’s useful. Most Pandas users just call .reset_index() all the time anyway.