5 Timeseries

Temporal data is an area in which Pandas actually far outshines R’s dataframe libraries. Things like resampling and rolling calculations are baked into the dataframe library and work quite well. Fortunately this is also true of Polars!

5.1 Get the data

We’ll download a year’s worth of daily price and volume data for Bitcoin:

from pathlib import Path
from io import StringIO
from datetime import datetime, date

import requests
import polars as pl
import pandas as pd
import matplotlib.pyplot as plt

# Keep printed output small.
pl.Config.set_tbl_rows(5)
pd.options.display.max_rows = 5

data_path = Path("../data/ohlcv.pq")


def epoch_ms(dt: datetime) -> int:
    # The Binance API expects epoch timestamps in milliseconds.
    return int(dt.timestamp()) * 1000


if data_path.exists():
    ohlcv_pl = pl.read_parquet(data_path).set_sorted("time")
else:
    start = epoch_ms(datetime(2021, 1, 1))
    end = epoch_ms(datetime(2022, 1, 1))
    url = (
        "https://api.binance.com/api/v3/klines?symbol=BTCUSDT&"
        f"interval=1d&startTime={start}&endTime={end}"
    )
    resp = requests.get(url)
    time_col = "time"
    ohlcv_cols = [
        "open",
        "high",
        "low",
        "close",
        "volume",
    ]
    cols_to_use = [time_col, *ohlcv_cols]
    # Each kline row carries extra fields after the OHLCV data; name them so we can drop them.
    cols = cols_to_use + [f"ignore_{i}" for i in range(6)]
    ohlcv_pl = pl.from_records(resp.json(), orient="row", schema=cols).select(
        [
            pl.col(time_col).cast(pl.Datetime).dt.with_time_unit("ms").cast(pl.Date),
            pl.col(ohlcv_cols).cast(pl.Float64),
        ]
    ).set_sorted("time")
    # Cache locally so repeated runs don't hit the API.
    ohlcv_pl.write_parquet(data_path)

ohlcv_pd = (
    ohlcv_pl
    .with_columns(pl.col("time").cast(pl.Datetime))
    .to_pandas()
    .set_index("time")
)

5.2 Filtering

Pandas has special methods for filtering data with a DatetimeIndex. Since Polars doesn’t have an index, we just use .filter. I will admit the Pandas code is more convenient for things like filtering for a specific month:
ohlcv_pl.filter(
    pl.col("time").is_between(
        date(2021, 2, 1),
        date(2021, 3, 1),
        closed="left"
    )
)
time | open | high | low | close | volume |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2021-02-01 | 33092.97 | 34717.27 | 32296.16 | 33526.37 | 82718.276882 |
2021-02-02 | 33517.09 | 35984.33 | 33418.0 | 35466.24 | 78056.65988 |
2021-02-03 | 35472.71 | 37662.63 | 35362.38 | 37618.87 | 80784.333663 |
… | … | … | … | … | … |
2021-02-27 | 46276.88 | 48394.0 | 45000.0 | 46106.43 | 66060.834292 |
2021-02-28 | 46103.67 | 46638.46 | 43000.0 | 45135.66 | 83055.369042 |
"2021-02"] ohlcv_pd.loc[
open | high | low | close | volume | |
---|---|---|---|---|---|
time | |||||
2021-02-01 | 33092.97 | 34717.27 | 32296.16 | 33526.37 | 82718.276882 |
2021-02-02 | 33517.09 | 35984.33 | 33418.00 | 35466.24 | 78056.659880 |
... | ... | ... | ... | ... | ... |
2021-02-27 | 46276.88 | 48394.00 | 45000.00 | 46106.43 | 66060.834292 |
2021-02-28 | 46103.67 | 46638.46 | 43000.00 | 45135.66 | 83055.369042 |
28 rows × 5 columns
5.3 Resampling
Resampling is like a special case of groupby for a time column. You can of course use regular .groupby with a time column, but it won’t be as powerful because it doesn’t understand time the way the resampling methods do.
There are two kinds of resampling: downsampling and upsampling.
5.3.1 Downsampling
Downsampling moves from a higher time frequency to a lower time frequency. This requires some aggregation or subsetting, since we’re reducing the number of rows in our data.
In Polars we use the .group_by_dynamic method for downsampling (we also use group_by_dynamic when we want to combine resampling with regular groupby logic; there’s a sketch of that further down).
(
    ohlcv_pl
    .group_by_dynamic("time", every="5d")
    .agg(pl.col(pl.Float64).mean())
)
time | open | high | low | close | volume |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2020-12-29 | 29127.665 | 31450.0 | 28785.55 | 30755.01 | 92088.399186 |
2021-01-03 | 33577.028 | 36008.464 | 31916.198 | 35027.986 | 127574.470245 |
2021-01-08 | 38733.61 | 39914.548 | 34656.422 | 37655.352 | 143777.954392 |
… | … | … | … | … | … |
2021-12-24 | 50707.08 | 51407.656 | 49540.152 | 50048.066 | 29607.160572 |
2021-12-29 | 46836.5525 | 48135.4925 | 45970.84 | 46881.2775 | 31098.406725 |
"5d").mean() ohlcv_pd.resample(
open | high | low | close | volume | |
---|---|---|---|---|---|
time | |||||
2021-01-01 | 31084.316 | 33127.622 | 29512.818 | 32089.662 | 112416.849570 |
2021-01-06 | 38165.310 | 40396.842 | 35983.822 | 39004.538 | 118750.076685 |
... | ... | ... | ... | ... | ... |
2021-12-27 | 48521.240 | 49475.878 | 47087.400 | 47609.528 | 35886.943710 |
2022-01-01 | 46216.930 | 47954.630 | 46208.370 | 47722.650 | 19604.463250 |
74 rows × 5 columns
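The parenthetical above about combining resampling with regular groupby logic deserves a quick illustration. This is only a sketch, not from the original: the up_day column is invented here, and the keyword for the extra grouping keys is group_by in recent Polars releases (older releases called it by).

(
    ohlcv_pl
    # Invented helper column: did the day close higher than it opened?
    .with_columns((pl.col("close") > pl.col("open")).alias("up_day"))
    # Resample to monthly windows *and* group by the ordinary key at the same time.
    .group_by_dynamic("time", every="1mo", group_by="up_day")
    .agg(pl.col("volume").mean())
)

# A roughly equivalent Pandas version uses pd.Grouper for the time bucket:
(
    ohlcv_pd
    .assign(up_day=ohlcv_pd["close"] > ohlcv_pd["open"])
    .groupby([pd.Grouper(freq="MS"), "up_day"])["volume"]
    .mean()
)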
Resampling and applying multiple aggregations to each column:
(
    ohlcv_pl
    .group_by_dynamic("time", every="1w", start_by="friday")
    .agg([
        pl.col(pl.Float64).mean().name.suffix("_mean"),
        pl.col(pl.Float64).sum().name.suffix("_sum"),
    ])
)
time | open_mean | high_mean | low_mean | close_mean | volume_mean | open_sum | high_sum | low_sum | close_sum | volume_sum |
---|---|---|---|---|---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
2021-01-01 | 32305.781429 | 34706.045714 | 31021.727143 | 33807.135714 | 117435.5928 | 226140.47 | 242942.32 | 217152.09 | 236649.95 | 822049.149598 |
2021-01-08 | 37869.797143 | 39646.105714 | 34623.334286 | 37827.52 | 135188.296617 | 265088.58 | 277522.74 | 242363.34 | 264792.64 | 946318.076319 |
2021-01-15 | 36527.891429 | 37412.2 | 33961.551429 | 35343.847143 | 94212.715129 | 255695.24 | 261885.4 | 237730.86 | 247406.93 | 659489.005903 |
… | … | … | … | … | … | … | … | … | … | … |
2021-12-24 | 49649.114286 | 50439.622857 | 48528.25 | 49117.98 | 31126.709793 | 347543.8 | 353077.36 | 339697.75 | 343825.86 | 217886.96855 |
2021-12-31 | 46668.905 | 48251.445 | 45943.185 | 46969.79 | 27271.230605 | 93337.81 | 96502.89 | 91886.37 | 93939.58 | 54542.46121 |
"W-Fri", closed="left", label="left").agg(['mean', 'sum']) ohlcv_pd.resample(
open | high | low | close | volume | ||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | sum | mean | sum | mean | sum | mean | sum | mean | sum | |
time | ||||||||||
2021-01-01 | 32305.781429 | 226140.47 | 34706.045714 | 242942.32 | 31021.727143 | 217152.09 | 33807.135714 | 236649.95 | 117435.592800 | 822049.149598 |
2021-01-08 | 37869.797143 | 265088.58 | 39646.105714 | 277522.74 | 34623.334286 | 242363.34 | 37827.520000 | 264792.64 | 135188.296617 | 946318.076319 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2021-12-24 | 49649.114286 | 347543.80 | 50439.622857 | 353077.36 | 48528.250000 | 339697.75 | 49117.980000 | 343825.86 | 31126.709793 | 217886.968550 |
2021-12-31 | 46668.905000 | 93337.81 | 48251.445000 | 96502.89 | 45943.185000 | 91886.37 | 46969.790000 | 93939.58 | 27271.230605 | 54542.461210 |
53 rows × 10 columns
5.3.2 Upsampling
Upsampling moves in the opposite direction, from low-frequency data to high frequency data. Since we can’t create new data by magic, upsampling defaults to filling the new rows with nulls (which we could then interpolate, perhaps). In Polars we have a special upsample
method for this, while Pandas reuses its resample
method.
"time", every="6h") ohlcv_pl.upsample(
time | open | high | low | close | volume |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2021-01-01 | 28923.63 | 29600.0 | 28624.57 | 29331.69 | 54182.925011 |
2021-01-01 | null | null | null | null | null |
2021-01-01 | null | null | null | null | null |
… | … | … | … | … | … |
2021-12-31 | null | null | null | null | null |
2022-01-01 | 46216.93 | 47954.63 | 46208.37 | 47722.65 | 19604.46325 |
"6h").mean() ohlcv_pd.resample(
open | high | low | close | volume | |
---|---|---|---|---|---|
time | |||||
2021-01-01 00:00:00 | 28923.63 | 29600.00 | 28624.57 | 29331.69 | 54182.925011 |
2021-01-01 06:00:00 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... |
2021-12-31 18:00:00 | NaN | NaN | NaN | NaN | NaN |
2022-01-01 00:00:00 | 46216.93 | 47954.63 | 46208.37 | 47722.65 | 19604.463250 |
1461 rows × 5 columns
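Following up on the interpolation idea, here’s a sketch (not from the original) that fills the nulls created by upsampling; linearly interpolating daily prices like this is purely illustrative.

(
    ohlcv_pl
    .upsample("time", every="6h")
    # Fill the nulls created by upsampling by linearly interpolating the float columns.
    .with_columns(pl.col(pl.Float64).interpolate())
)

# Pandas: resample and interpolate in one chain.
ohlcv_pd.resample("6h").interpolate()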
5.4 Rolling / Expanding / EW
Polars supports all three of these but they’re not quite as powerful as in Pandas, since they don’t have as many different methods. The expanding
support is more limited again, though there are workarounds for this (see below):
close = pl.col("close")
ohlcv_pl.select(
    [
        pl.col("time"),
        close.alias("Raw"),
        close.rolling_mean(28).alias("28D MA"),
        close.alias("Expanding Average").cum_sum() / (close.cum_count() + 1),
        close.ewm_mean(alpha=0.03).alias("EWMA($\\alpha=.03$)"),
    ]
).to_pandas().set_index("time").plot()
plt.ylabel("Close ($)")
(Figure: raw close price with 28-day MA, expanding average, and EWMA overlays; y-axis "Close ($)".)
"close"].plot(label="Raw")
ohlcv_pd["close"].rolling(28).mean().plot(label="28D MA")
ohlcv_pd["close"].expanding().mean().plot(label="Expanding Average")
ohlcv_pd["close"].ewm(alpha=0.03).mean().plot(label="EWMA($\\alpha=.03$)")
ohlcv_pd[
=(0.63, 0.27))
plt.legend(bbox_to_anchor"Close ($)") plt.ylabel(
(Figure: the same overlays produced with Pandas; y-axis "Close ($)".)
Polars doesn’t have an expanding_mean yet, so we make do by combining cum_sum and cum_count (the +1 assumes a Polars version where cum_count counts from 0; newer releases count from 1).
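Another possible workaround (a sketch, not from the original) is cumulative_eval, which evaluates an expression over an ever-growing window, though it can be slow on long series:

ohlcv_pl.select(
    pl.col("time"),
    # Expanding mean: the inner expression sees all rows up to and including the current one.
    pl.col("close").cumulative_eval(pl.element().mean()).alias("expanding_mean"),
)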
5.4.1 Combining rolling aggregations
mean_std_pl = ohlcv_pl.select(
    [
        "time",
        pl.col("close").rolling_mean(30, center=True).alias("mean"),
        pl.col("close").rolling_std(30, center=True).alias("std"),
    ]
)
ax = mean_std_pl.to_pandas().set_index("time")["mean"].plot()
ax.fill_between(
    mean_std_pl["time"].to_numpy(),
    mean_std_pl["mean"] - mean_std_pl["std"],
    mean_std_pl["mean"] + mean_std_pl["std"],
    alpha=0.25,
)
plt.tight_layout()
plt.ylabel("Close ($)")
(Figure: 30-day centered rolling mean of the close price with a ±1 standard deviation band; y-axis "Close ($)".)
roll_pd = ohlcv_pd["close"].rolling(30, center=True)
mean_std_pd = roll_pd.agg(["mean", "std"])
ax = mean_std_pd["mean"].plot()
ax.fill_between(
    mean_std_pd.index,
    mean_std_pd["mean"] - mean_std_pd["std"],
    mean_std_pd["mean"] + mean_std_pd["std"],
    alpha=0.25,
)
plt.tight_layout()
plt.ylabel("Close ($)")
(Figure: the same rolling mean and band produced with Pandas; y-axis "Close ($)".)
5.5 Grab Bag
5.5.1 Offsets
Pandas has two similar objects for datetime arithmetic: DateOffset, which respects calendar arithmetic, and Timedelta, which respects absolute time arithmetic. DateOffset understands things like daylight saving time, and can work with holidays too (there’s a small sketch of the difference after the examples below). Polars just has a Duration type, which is like Pandas’ Timedelta.
"time") + pl.duration(days=80)) ohlcv_pl.select(pl.col(
time |
---|
date |
2021-03-22 |
2021-03-23 |
2021-03-24 |
… |
2022-03-21 |
2022-03-22 |
ohlcv_pd.index + pd.Timedelta(80, "D")
DatetimeIndex(['2021-03-22', '2021-03-23', '2021-03-24', '2021-03-25',
'2021-03-26', '2021-03-27', '2021-03-28', '2021-03-29',
'2021-03-30', '2021-03-31',
...
'2022-03-13', '2022-03-14', '2022-03-15', '2022-03-16',
'2022-03-17', '2022-03-18', '2022-03-19', '2022-03-20',
'2022-03-21', '2022-03-22'],
dtype='datetime64[ns]', name='time', length=366, freq=None)
ohlcv_pd.index + pd.DateOffset(months=3, days=-10)
DatetimeIndex(['2021-03-22', '2021-03-23', '2021-03-24', '2021-03-25',
'2021-03-26', '2021-03-27', '2021-03-28', '2021-03-29',
'2021-03-30', '2021-03-31',
...
'2022-03-13', '2022-03-14', '2022-03-15', '2022-03-16',
'2022-03-17', '2022-03-18', '2022-03-19', '2022-03-20',
'2022-03-21', '2022-03-22'],
dtype='datetime64[us]', name='time', length=366, freq=None)
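To make the calendar-vs-absolute distinction concrete, here’s a small sketch (not from the original) crossing the US spring-forward transition on 2021-03-14:

ts = pd.Timestamp("2021-03-13 12:00", tz="US/Eastern")
ts + pd.DateOffset(days=1)  # 2021-03-14 12:00-04:00: same wall-clock time the next day
ts + pd.Timedelta(days=1)   # 2021-03-14 13:00-04:00: exactly 24 hours later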
5.5.2 Holiday calendars
Not many people know this, but Pandas can do some quite powerful stuff with Holiday Calendars. There is an open issue to add this functionality to Polars.
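For reference, here’s a brief sketch (not from the original) of the Pandas side, using its built-in US federal holiday calendar:

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

cal = USFederalHolidayCalendar()
# DatetimeIndex of US federal holidays in 2021.
cal.holidays(start="2021-01-01", end="2021-12-31")

# A business-day offset that skips weekends *and* those holidays.
bday_us = CustomBusinessDay(calendar=cal)
pd.Timestamp("2021-07-02") + bday_us  # Friday -> Tuesday 2021-07-06 (Monday the 5th is the observed July 4th)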
5.5.3 Timezones
Suppose we know that our timestamps are UTC, and we want to see what time it was in US/Eastern:
(
    ohlcv_pl
    .with_columns(
        pl.col("time")
        .cast(pl.Datetime)
        .dt.replace_time_zone("UTC")
        .dt.convert_time_zone("US/Eastern")
    )
)
time | open | high | low | close | volume |
---|---|---|---|---|---|
datetime[μs, US/Eastern] | f64 | f64 | f64 | f64 | f64 |
2020-12-31 19:00:00 EST | 28923.63 | 29600.0 | 28624.57 | 29331.69 | 54182.925011 |
2021-01-01 19:00:00 EST | 29331.7 | 33300.0 | 28946.53 | 32178.33 | 129993.873362 |
2021-01-02 19:00:00 EST | 32176.45 | 34778.11 | 31962.99 | 33000.05 | 120957.56675 |
… | … | … | … | … | … |
2021-12-30 19:00:00 EST | 47120.88 | 48548.26 | 45678.0 | 46216.93 | 34937.99796 |
2021-12-31 19:00:00 EST | 46216.93 | 47954.63 | 46208.37 | 47722.65 | 19604.46325 |
(
    ohlcv_pd
    .tz_localize('UTC')
    .tz_convert('US/Eastern')
)
open | high | low | close | volume | |
---|---|---|---|---|---|
time | |||||
2020-12-31 19:00:00-05:00 | 28923.63 | 29600.00 | 28624.57 | 29331.69 | 54182.925011 |
2021-01-01 19:00:00-05:00 | 29331.70 | 33300.00 | 28946.53 | 32178.33 | 129993.873362 |
... | ... | ... | ... | ... | ... |
2021-12-30 19:00:00-05:00 | 47120.88 | 48548.26 | 45678.00 | 46216.93 | 34937.997960 |
2021-12-31 19:00:00-05:00 | 46216.93 | 47954.63 | 46208.37 | 47722.65 | 19604.463250 |
366 rows × 5 columns
5.6 Conclusion
Polars has really good time series support, though expanding aggregations and holiday calendars are niches in which it is lacking. Pandas DatetimeIndexes are quite cool too, even if they do bring some pain.