1  Indexing (Or Lack Thereof)

This chapter is about indexes, which are the source of much confusion in Pandas and simply don’t exist in Polars.

1.1 Fetch Data (No Dataframes Here)

First we fetch some flight delay data. This part isn’t about dataframes so feel free to skip the code.

Code
from pathlib import Path
from zipfile import ZipFile
import requests

data_dir = Path("../data") # replace this with a directory of your choice
dest = data_dir / "flights.csv.zip"

if not dest.exists():
    r = requests.get(
        "https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
        verify=False,
        stream=True,
    )

    data_dir.mkdir(exist_ok=True)
    with dest.open("wb") as f:
        for chunk in r.iter_content(chunk_size=102400):
            if chunk:
                f.write(chunk)

    with ZipFile(dest) as zf:
        zf.extract(zf.filelist[0].filename, path=data_dir)

extracted = data_dir / "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"

1.2 Read the data

Tip

The examples in this book use the lazy evaluation feature of Polars less than you should. It’s just inconvenient to use the lazy API when displaying dozens of intermediate results for educational purposes.

import polars as pl
pl.Config.set_tbl_rows(5) # don't print too many rows in the book
df_pl = pl.read_csv(extracted, truncate_ragged_lines=True)
df_pl
shape: (537_902, 110)
Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline DOT_ID_Reporting_Airline IATA_CODE_Reporting_Airline Tail_Number Flight_Number_Reporting_Airline OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCityName OriginState OriginStateFips OriginStateName OriginWac DestAirportID DestAirportSeqID DestCityMarketID Dest DestCityName DestState DestStateFips DestStateName DestWac CRSDepTime DepTime DepDelay DepDelayMinutes DepDel15 DepartureDelayGroups DepTimeBlk TaxiOut Div1TotalGTime Div1LongestGTime Div1WheelsOff Div1TailNum Div2Airport Div2AirportID Div2AirportSeqID Div2WheelsOn Div2TotalGTime Div2LongestGTime Div2WheelsOff Div2TailNum Div3Airport Div3AirportID Div3AirportSeqID Div3WheelsOn Div3TotalGTime Div3LongestGTime Div3WheelsOff Div3TailNum Div4Airport Div4AirportID Div4AirportSeqID Div4WheelsOn Div4TotalGTime Div4LongestGTime Div4WheelsOff Div4TailNum Div5Airport Div5AirportID Div5AirportSeqID Div5WheelsOn Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 str str i64 str str i64 i64 i64 i64 str str str i64 str i64 i64 i64 i64 str str str i64 str i64 i64 str f64 f64 f64 i64 str f64 str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str str
2022 1 1 14 5 "2022-01-14" "YX" 20452 "YX" "N119HQ" 4879 11066 1106606 31066 "CMH" "Columbus, OH" "OH" 39 "Ohio" 44 11278 1127805 30852 "DCA" "Washington, DC" "VA" 51 "Virginia" 38 1224 "1221" -3.0 0.0 0.0 -1 "1200-1259" 28.0 null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" null
2022 1 1 15 6 "2022-01-15" "YX" 20452 "YX" "N122HQ" 4879 11066 1106606 31066 "CMH" "Columbus, OH" "OH" 39 "Ohio" 44 11278 1127805 30852 "DCA" "Washington, DC" "VA" 51 "Virginia" 38 1224 "1214" -10.0 0.0 0.0 -1 "1200-1259" 19.0 null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" null
2022 1 1 16 7 "2022-01-16" "YX" 20452 "YX" "N412YX" 4879 11066 1106606 31066 "CMH" "Columbus, OH" "OH" 39 "Ohio" 44 11278 1127805 30852 "DCA" "Washington, DC" "VA" 51 "Virginia" 38 1224 "1218" -6.0 0.0 0.0 -1 "1200-1259" 16.0 null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" null
2022 1 1 6 4 "2022-01-06" "DL" 19790 "DL" "N989AT" 1579 11057 1105703 31057 "CLT" "Charlotte, NC" "NC" 37 "North Carolina" 36 10397 1039707 30397 "ATL" "Atlanta, GA" "GA" 13 "Georgia" 34 1258 "1257" -1.0 0.0 0.0 -1 "1200-1259" 15.0 null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" null
2022 1 1 6 4 "2022-01-06" "DL" 19790 "DL" "N815DN" 1580 14869 1486903 34614 "SLC" "Salt Lake City, UT" "UT" 49 "Utah" 87 14057 1405702 34057 "PDX" "Portland, OR" "OR" 41 "Oregon" 92 2240 "2231" -9.0 0.0 0.0 -1 "2200-2259" 10.0 null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" "" null null "" null null "" "" null
import pandas as pd
pd.options.display.max_rows = 5
df_pd = pd.read_csv(extracted)
df_pd
/tmp/ipykernel_667767/2805799744.py:3: DtypeWarning: Columns (76,77,84) have mixed types. Specify dtype option on import or set low_memory=False.
  df_pd = pd.read_csv(extracted)
Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline DOT_ID_Reporting_Airline IATA_CODE_Reporting_Airline Tail_Number ... Div4TailNum Div5Airport Div5AirportID Div5AirportSeqID Div5WheelsOn Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum Unnamed: 109
0 2022 1 1 14 5 2022-01-14 YX 20452 YX N119HQ ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2022 1 1 15 6 2022-01-15 YX 20452 YX N122HQ ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
537900 2022 1 1 6 4 2022-01-06 DL 19790 DL N989AT ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
537901 2022 1 1 6 4 2022-01-06 DL 19790 DL N815DN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

537902 rows × 110 columns

1.3 Indexing

Pandas uses a special index type that is quite powerful for selecting rows and columns but is also very complicated. To quote Modern Pandas:

The complexity of pandas’ indexing is a microcosm for the complexity of the pandas API in general. There’s a reason for the complexity (well, most of it), but that’s not much consolation while you’re learning. Still, all of these ways of indexing really are useful enough to justify their inclusion in the library.

It’s true that Pandas indexing is quite useful, but it’s also true that everyone always forgets how to do anything non-trivial with it. The benefits of being able to put df.loc[pd.IndexSlice[:, 'B0':'B1'], :]] in your code base are somewhat dubious.

Polars avoids this complexity by simply not having an index. It just has ordinary methods like .select, .filter and .head for accessing a subset of rows or columns.

1.4 Slicing vs Selecting

In Pandas you can subset a dataframe with .loc[], .iloc[] or just []. In Polars you select rows and columns with expressions as noted above.

You can also use square bracket indexing in Polars, but it doesn’t work in lazy mode so is only to be used when convenience is most important.

Here are some examples:

1.4.1 Rows by number, columns by name

Using head and tail:

df_pl.select(["Dest", "Tail_Number"]).head(16).tail(4)
shape: (4, 2)
Dest Tail_Number
str str
"DCA" "N132HQ"
"DCA" "N109HQ"
"DCA" "N421YX"
"DCA" "N137HQ"

Or using gather:

df_pl.select(pl.col(["Dest", "Tail_Number"]).gather(list(range(12, 16))))
shape: (4, 2)
Dest Tail_Number
str str
"DCA" "N132HQ"
"DCA" "N109HQ"
"DCA" "N421YX"
"DCA" "N137HQ"
df_pl[12:16, ["Dest", "Tail_Number"]]
shape: (4, 2)
Dest Tail_Number
str str
"DCA" "N132HQ"
"DCA" "N109HQ"
"DCA" "N421YX"
"DCA" "N137HQ"
df_pd.loc[12:15, ["Dest", "Tail_Number"]]
Dest Tail_Number
12 DCA N132HQ
13 DCA N109HQ
14 DCA N421YX
15 DCA N137HQ

1.4.2 Rows by string index, columns by name

Since there’s no such thing as an index in Polars, so we just use .filter:

(
    df_pl
    .filter(pl.col("IATA_CODE_Reporting_Airline").is_in(['AA', 'DL']))
    .select(["IATA_CODE_Reporting_Airline", "Dest", "Tail_Number"])
)
shape: (138_363, 3)
IATA_CODE_Reporting_Airline Dest Tail_Number
str str str
"DL" "LGA" "N315DN"
"DL" "FLL" "N545US"
"DL" "ATL" "N545US"
"DL" "ATL" "N989AT"
"DL" "PDX" "N815DN"
(
    df_pd
    .set_index("IATA_CODE_Reporting_Airline")
    .loc[['AA', 'DL'], ["Dest", "Tail_Number"]]
)
Dest Tail_Number
IATA_CODE_Reporting_Airline
AA LAX N106NN
AA LAX N112AN
... ... ...
DL ATL N989AT
DL PDX N815DN

138363 rows × 2 columns

1.4.3 Rows by number, columns by number

The Polars docs recommend doing this the evil way with square brackets, so make of that what you will. Selecting columns by number isn’t a very common operation anyway.

df_pl[[0, 1, 3], [0, 1]]
shape: (3, 2)
Year Quarter
i64 i64
2022 1
2022 1
2022 1
df_pd.iloc[[0, 1, 3], [0, 1]]
Year Quarter
0 2022 1
1 2022 1
3 2022 1

1.5 SettingWithCopy

Pandas has this cute thing where if you assign values to some subset of the dataframe with square bracket indexing, it doesn’t work and gives the notorious SettingWithCopyWarning. To be fair, this warning also tells you to assign using .loc. Unfortunately many people in the Pandas community can’t read and instead just ignore the warning.

Polars is not yet popular enough to attract the same crowd, but when it does it should not run into the same problem, as the only way to add or overwrite columns in Polars is the with_columns method.

f = pl.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.with_columns(
    pl.when(pl.col("a") <= 3)
    .then(pl.col("b") // 10)
    .otherwise(pl.col("b"))
)
shape: (5, 2)
a b
i64 i64
1 1
2 2
3 3
4 40
5 50
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f[f['a'] <= 3]['b'] = f['b'] // 10
f
/tmp/ipykernel_667767/1317853993.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  f[f['a'] <= 3]['b'] = f['b'] // 10
a b
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.loc[f['a'] <= 3, "b"] = f['b'] // 10
f
a b
0 1 1
1 2 2
2 3 3
3 4 40
4 5 50
f = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
f.assign(b=f["b"].mask(f["a"] <=3, f["b"] // 10))
a b
0 1 1
1 2 2
2 3 3
3 4 40
4 5 50

1.6 Summary

Basically, there’s no index in Polars and square brackets are bad most of the time. I think the lack of an index is quite acceptable even if there are cases where it’s useful. Most Pandas users just call .reset_index() all the time anyway.