Adding data augmentation to torchtext datasets

It is universally acknowledged that artificially augmented datasets lead to models which are both more accurate and more generalizable. They do this by introducing variability which is likely to be encountered in ecologically valid settings but is not present in the training data, and by providing negative examples of spurious …

SciPy Proceedings 2020 Survey

The mission of the SciPy Proceedings Committee (Proccom) is to celebrate and promote the work of the members of the SciPy community. This is taken in the broad sense to include a community of the authors and maintainers of core libraries; the scientists and engineers who use these libraries to …

Installing CUDA on Ubuntu 18.04 for PyTorch or TensorFlow

I recently needed to update some servers running an old Ubuntu LTS (Xenial, 16.04) to a slightly less old Ubuntu LTS (Bionic, 18.04). I had been putting it off for some time, mostly due to the noise I heard about problems installing the Nvidia CUDA toolkit. But that …

Three reasons to use Shapley values

Last time, we discussed Shapley values and how they are defined, mathematically. This time, let's turn our attention to how to use them.

We discussed how explainable artificial intelligence (XAI) focuses on taking models which have high predictive power (high variance, or high VC models) and providing an …

How Shapley values work

A common concern in machine learning (ML) solutions is that apparent predictive power comes from a problematic source. For example, a model might learn to predict burrito quality from latitude and longitude. In this case, the actual signal is likely coming from a particular city or neighborhood having …

A multi-file torchtext data loader

To help models generalize, it's common to use some form of data augmentation. This is where the original training data are modified in some way that preserves their semantics while changing their values. The model is fitted to the original training data, plus one or more augmented versions of it …
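
As a rough illustration of the idea in this excerpt, here is a minimal sketch in plain Python. It does not use torchtext, and it is not the loader the post describes. The synonym table, `augment`, and `build_training_set` are hypothetical names made up for this example; the assumption is that swapping a word for a synonym changes the tokens while leaving the label valid.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"]}


def augment(tokens, rng, p=0.5):
    """Return a copy of tokens with some words swapped for a synonym."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok)
        if options and rng.random() < p:
            out.append(rng.choice(options))  # change the value, keep the meaning
        else:
            out.append(tok)
    return out


def build_training_set(examples, n_augmented=1, seed=0):
    """Original examples plus n_augmented modified copies of each one.

    Labels are carried over unchanged, since the augmentation is assumed
    to preserve semantics.
    """
    rng = random.Random(seed)
    augmented = [
        (augment(tokens, rng), label)
        for _ in range(n_augmented)
        for tokens, label in examples
    ]
    return list(examples) + augmented


examples = [(["a", "good", "movie"], "pos"), (["not", "good"], "neg")]
train = build_training_set(examples, n_augmented=2)
```

The originals come first in `train`, followed by the augmented copies, so a model fitted to `train` sees both.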

A faster way to generate lagged values

At Novi Labs, we spend a lot of time working with time series data. Generically speaking, these data are formatted something like this:

id  time  value
a     0      1
a     1      2
a     2      3
b     0      4
b     1      5
c     0      6

where we have individual sensors represented by …
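
For reference, a lagged value for data in this shape can be sketched in plain Python as follows. This is a straightforward baseline and not the faster method the title refers to. The function name `add_lag` is hypothetical, and the sketch assumes that the first column identifies a series, that rows within a series are evenly spaced in time, and that `lag` is at least 1.

```python
from collections import defaultdict

# The example table from above, as (id, time, value) rows.
rows = [
    ("a", 0, 1),
    ("a", 1, 2),
    ("a", 2, 3),
    ("b", 0, 4),
    ("b", 1, 5),
    ("c", 0, 6),
]


def add_lag(rows, lag=1):
    """Append to each row the value from `lag` steps earlier in the same id.

    Rows with no earlier observation get None. Assumes lag >= 1.
    """
    by_id = defaultdict(list)
    # Sort by id, then time, so each series is in chronological order.
    for id_, time, value in sorted(rows, key=lambda r: (r[0], r[1])):
        by_id[id_].append((time, value))

    out = []
    for id_, series in by_id.items():
        for i, (time, value) in enumerate(series):
            lagged = series[i - lag][1] if i >= lag else None
            out.append((id_, time, value, lagged))
    return out


lagged = add_lag(rows)
```

Each id is lagged independently, so the first row of `b` does not pick up the last value of `a`.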