A faster way to generate lagged values

At Novi Labs, we spend a lot of time working with timeseries data. Generically speaking, these data are formatted something like this:

id  time  value
---------------
a     0      1
a     1      2
a     2      3
b     0      4
b     1      5
c     0      6

where we have individual sensors represented by string IDs, and a sequence of data per sensor that is indexed by time. The time always starts at 0, and each time step is the same duration. Because different sensors come online at different times, the ending time varies per sensor.

Our timeseries data eventually get fed into forecasting models, and typically we include the last value from each sensor (or the last two values, or the last three values, etc.) as features in the forecasting model. Initially, we were using Pandas operations for generating these features, like so:

def get_lags_pandas(df, lags=1):
    lagged = pd.DataFrame(index=df.index)
    for lag in range(1, lags+1):
        lagged[lag] = df.groupby('id')['value'].shift(lag, fill_value=0)
    return lagged

Which, for our toy example, returns something like this when called with lags=3:

1  2  3
-------
0  0  0
1  0  0
2  1  0
0  0  0
4  0  0
0  0  0

Unfortunately, this operation was very slow. Slow enough to dominate the runtime of model inference. Some careful profiling showed that, for a single sensor, generating lagged features accounted for nearly 60% of time to produce a forecast.

>>> %timeit get_lags_pandas(df)
2.9 ms ± 31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We end up losing a lot of time to the series of calls underlying groupby specifically, and in time spent finding where Pandas has stored things in memory more generally. We could operate directly on NumPy arrays instead of Pandas DataFrames to reduce access time into the datastructures, but this performance gain could be offset by needing extra logic at the Python level. To get the best of both worlds, we can try writing a lagging function in Numba, and have it compiled ahead of time.

@numba.njit([
    numba.int64[:, :](numba.int64[:], numba.int64[:], numba.int64),
])
def _get_lags_numba(ids, values, lags):
    lagged = np.zeros(shape=(len(values), lags), dtype=np.int64)
    for j in range(lags):
        offset = j + 1
        time = 0
        old_id = -1
        for i in range(len(values)):
            new_id = ids[i]
            if new_id == old_id:
                if time >= offset:
                    lagged[i, j] = values[i - offset]
                time += 1
            else:
                old_id = new_id
                time = 1
    return lagged

Numba itself can't represent Pandas objects in compiled code, so we'll also need a little wrapper function to pull the values we need out of our DataFrame as individual NumPy arrays:

def get_lags_numba(df, lags=1):
    _, ids = np.unique(df.id.values, return_inverse=True)
    values = df.value.values
    return pd.DataFrame(_get_lags_numba(ids, values, lags), index=df.index)

For our toy example, the numba compiled code is roughly 20x faster.

>>> %timeit get_lags_numba(df)
166 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Now, of that 166µs, about 120 of them are spent recreating the Pandas DataFrame at the end. Given that this only happens once, it is dominating the runtime for our small example, and we'll see a much bigger difference with a more realistically sized dataset. Let's imagine that we have 1000 sensors instead of just 3, and that we have 50 timesteps recorded for each one.

id  time  value
---------------
0     0      0
0     1      1
0     2      2
0     3      3
0     4      4
...   ...    ...
999    45  49995
999    46  49996
999    47  49997
999    48  49998
999    49  49999

[50000 rows x 3 columns]

Pushing these data through IPython's magic timeit shows about a 200x speed-up over the Pandas implementation.

>>> %timeit get_lags_pandas(big_df)
310 ms ± 6.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit get_lags_numba(big_df)
1.49 ms ± 70.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This is a nice improvement in runtime performance, but you'll have noticed that it came with two unpleasant side effects.

First, the Numba compiled code is less readable. It mostly involves low-level operations directly on arrays of data, where the intention behind the code is less obvious than the Pandasish group by sensor ID and shift by the lag.

Second, the Numba compiled code is less generic. To take one example: our function only shifts integers. We could update the signature to allow floating point inputs, but cannot lag over strings or arbitrary Python objects, as Pandas does.