A faster way to generate lagged values
At Novi Labs, we spend a lot of time working with timeseries data. Generically speaking, these data are formatted something like this:
id  time  value
---------------
 a     0      1
 a     1      2
 a     2      3
 b     0      4
 b     1      5
 c     0      6
where we have individual sensors represented by string IDs, and a sequence of data per sensor that is indexed by time. The time always starts at 0, and each time step is the same duration. Because different sensors come online at different times, the ending time varies per sensor.
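To make the examples below concrete, here's a minimal sketch of a toy frame in that shape (the post doesn't show its construction, so this frame is an assumption):

import pandas as pd

# Hypothetical reconstruction of the toy data shown above.
df = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'b', 'b', 'c'],
    'time':  [0, 1, 2, 0, 1, 0],
    'value': [1, 2, 3, 4, 5, 6],
})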
Our timeseries data eventually get fed into forecasting models, and typically we include the last value from each sensor (or the last two values, or the last three values, etc.) as features in the forecasting model. Initially, we were using Pandas operations for generating these features, like so:
def get_lags_pandas(df, lags=1):
    lagged = pd.DataFrame(index=df.index)
    for lag in range(1, lags + 1):
        # Shift each sensor's values down by `lag` rows, padding with zeros.
        lagged[lag] = df.groupby('id')['value'].shift(lag, fill_value=0)
    return lagged
Which, for our toy example, returns something like this when called with lags=3:
1  2  3
-------
0  0  0
1  0  0
2  1  0
0  0  0
4  0  0
0  0  0
Unfortunately, this operation was very slow. Slow enough to dominate the runtime of model inference. Some careful profiling showed that, for a single sensor, generating lagged features accounted for nearly 60% of the time to produce a forecast.
>>> %timeit get_lags_pandas(df)
2.9 ms ± 31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
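(The post doesn't say which profiler produced that breakdown; one plausible way to get a per-call profile of the lagging function itself is IPython's %prun magic:)

>>> %prun -s cumulative get_lags_pandas(df)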
We end up losing a lot of time to the series of calls underlying groupby specifically, and to time spent finding where Pandas has stored things in memory more generally. We could operate directly on NumPy arrays instead of Pandas DataFrames to reduce access time into the data structures, but that gain could be offset by the extra logic needed at the Python level. To get the best of both worlds, we can try writing a lagging function in Numba and have it compiled ahead of time (declaring an explicit signature makes Numba compile eagerly, rather than on the first call).
import numba
import numpy as np

@numba.njit([
    numba.int64[:, :](numba.int64[:], numba.int64[:], numba.int64),
])
def _get_lags_numba(ids, values, lags):
    lagged = np.zeros(shape=(len(values), lags), dtype=np.int64)
    for j in range(lags):
        offset = j + 1
        time = 0
        old_id = -1
        for i in range(len(values)):
            new_id = ids[i]
            if new_id == old_id:
                # Only look back once this sensor has at least
                # `offset` earlier values.
                if time >= offset:
                    lagged[i, j] = values[i - offset]
                time += 1
            else:
                # Crossed a sensor boundary: reset the per-sensor clock.
                old_id = new_id
                time = 1
    return lagged
Numba itself can't represent Pandas objects in compiled code, so we'll also need a little wrapper function to pull the values we need out of our DataFrame as individual NumPy arrays. Since the compiled function expects integer IDs, we use np.unique's return_inverse to encode the string IDs as integers first:
def get_lags_numba(df, lags=1):
    # Map each string ID to an integer code; runs of equal IDs stay runs.
    _, ids = np.unique(df.id.values, return_inverse=True)
    values = df.value.values
    return pd.DataFrame(_get_lags_numba(ids, values, lags), index=df.index)
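As a quick sanity check (not from the original post), the two implementations should agree on the toy frame, though note the Pandas version labels its columns 1 through lags while the Numba version uses 0 through lags-1:

assert np.array_equal(
    get_lags_pandas(df, lags=3).to_numpy(),
    get_lags_numba(df, lags=3).to_numpy(),
)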
For our toy example, the Numba compiled code is roughly 20x faster.
>>> %timeit get_lags_numba(df)
166 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Now, of that 166 µs, about 120 µs are spent recreating the Pandas DataFrame at the end. Since that construction happens exactly once per call, it dominates the runtime for our small example, and we'll see a much bigger difference with a more realistically sized dataset. Let's imagine that we have 1000 sensors instead of just 3, and that we have 50 timesteps recorded for each one.
 id  time  value
----------------
  0     0      0
  0     1      1
  0     2      2
  0     3      3
  0     4      4
...   ...    ...
999    45  49995
999    46  49996
999    47  49997
999    48  49998
999    49  49999

[50000 rows x 3 columns]
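The post doesn't show how big_df is built; a frame matching that printout could be constructed like this (an assumed reconstruction):

n_sensors, n_steps = 1000, 50
big_df = pd.DataFrame({
    'id':    np.repeat(np.arange(n_sensors), n_steps),
    'time':  np.tile(np.arange(n_steps), n_sensors),
    'value': np.arange(n_sensors * n_steps),
})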
Pushing these data through IPython's %timeit magic shows about a 200x speed-up over the Pandas implementation.
>>> %timeit get_lags_pandas(big_df)
310 ms ± 6.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit get_lags_numba(big_df)
1.49 ms ± 70.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is a nice improvement in runtime performance, but you'll have noticed that it came with two unpleasant side effects.
First, the Numba compiled code is less readable. It mostly involves low-level operations directly on arrays of data, where the intention behind the code is less obvious than the Pandasish "group by sensor ID and shift by the lag".
Second, the Numba compiled code is less generic. To take one example: our function only shifts integers. We could update the signature to allow floating point inputs, but we cannot lag over strings or arbitrary Python objects the way Pandas can.
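For instance, allowing floats might look like the sketch below (an assumption, not from the original post; it also relies on Numba accepting values.dtype when allocating the output, so one body can serve both signatures):

# Sketch: declaring two signatures makes Numba compile one
# specialization per dtype, eagerly, from the same body.
@numba.njit([
    numba.int64[:, :](numba.int64[:], numba.int64[:], numba.int64),
    numba.float64[:, :](numba.int64[:], numba.float64[:], numba.int64),
])
def _get_lags_numba(ids, values, lags):
    # The output dtype follows the input dtype, so the zero padding
    # works for both int64 and float64 values.
    lagged = np.zeros((len(values), lags), dtype=values.dtype)
    for j in range(lags):
        offset = j + 1
        time = 0
        old_id = -1
        for i in range(len(values)):
            new_id = ids[i]
            if new_id == old_id:
                if time >= offset:
                    lagged[i, j] = values[i - offset]
                time += 1
            else:
                old_id = new_id
                time = 1
    return lagged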