blog.neater-hut

How to deploy conda-based docker images

By dillon niederhut under python conda docker

The scientific Python community has settled on conda and conda-forge as the easiest way to compile and install dependencies that have complicated build recipes:

As long as the Python community thinks of their problems as “Python packaging problems”, and not “legacy C compiler & linker issues from the 1970s which we …

Fast operations on scikit-learn decision trees with numba

By dillon niederhut under python scikit-learn numba decision tree performance

The title is a bit wordy. But that's what this post is about.

To start with, you might be wondering why someone would want to operate on a decision tree from inside numba in the first place. After all, the scikit-learn implementation of trees uses Cython, which should be providing …

Writing an image annotation tool in 50 lines of Python

By dillon niederhut under python chaco traits image data image annotation

There are a couple of really nice image annotation libraries that are free and open source. For example, I use LabelImg whenever I need to hand-annotate bounding boxes to create new (or augment existing) datasets for object detection. It can output labels in both Pascal and YOLO formats, which is …

How to add plots to docstrings

By dillon niederhut under python visualization open source

Recently, we released functionality in niacin for performing data augmentation on timeseries. As a part of this, we wanted the documentation to show before-and-after plots of how a timeseries (in this case, a sine curve) gets transformed by any particular augmenting function. In a lot …

Getting started with timeseries data augmentation

By dillon niederhut under timeseries augmentation python niacin

Data augmentation is a critical component in modern machine learning practice due to its benefits for model accuracy, generalizability, and robustness to adversarial examples. Elucidating the precise mechanisms by which this occurs is an active area of research, but a simplified explanation of the current proposals might look like …

Virtual epochs for PyTorch

By dillon niederhut under python pytorch ml large datasets

A common problem when training neural networks is the size of the data¹. There are several strategies for storing and querying large amounts of data, or for increasing model throughput to speed up training when there are large amounts of data, but scale causes problems in much more mundane …

Superconvergence in PyTorch

By dillon niederhut under python pytorch ml papers

In Super-Convergence: Very fast training of neural networks using large learning rates¹, Smith and Topin present evidence for a learning rate parametrization scheme that can result in a 10x decrease in training time, while maintaining similar accuracy. Specifically, they propose the use of a cyclical learning rate, which starts …
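
The teaser above stops before describing the schedule itself, so here is only a rough sketch of the general idea of a cyclical learning rate. It uses the generic triangular form, which may not match the exact schedule in the paper or the post. The function name `triangular_lr` and its parameters are made up for illustration.

```python
def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Learning rate at `step` for a triangular cyclical schedule.

    The rate climbs linearly from `base_lr` to `max_lr` over `half_cycle`
    steps, falls back to `base_lr` over the next `half_cycle` steps,
    and then repeats. All names here are hypothetical.
    """
    # Position inside the current cycle, in steps
    cycle_pos = step % (2 * half_cycle)
    # 0.0 at the start and end of a cycle, 1.0 at the peak
    frac = 1 - abs(cycle_pos / half_cycle - 1)
    return base_lr + (max_lr - base_lr) * frac


if __name__ == "__main__":
    # Print one full cycle plus the first step of the next one
    for step in range(9):
        print(step, triangular_lr(step, base_lr=0.1, max_lr=1.0, half_cycle=4))
```

In PyTorch, a function like this would typically be replaced by one of the built-in schedulers in `torch.optim.lr_scheduler`.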

A faster way to generate thin plate splines

By dillon niederhut under python numpy object detection adversarial attack performance

In Evading real-time person detectors by adversarial t-shirt¹, Xu and coauthors show that the adversarial patch attack described by Thys, Van Ranst, and Goedemé² is less successful when applied to flexible media like fabric, due to the warping and folding that occurs.

[Figure: low success rate with adversarial patch from AUTHORS]

They propose to remedy this failure …

How to combine variable length sequences in PyTorch DataLoaders

By dillon niederhut under python pytorch torchtext nlp

If you're getting started with PyTorch for text, you've probably encountered an error that looks something like:

Sizes of tensors must match except in dimension 0.

The short explanation for this error is that sequences are often different lengths, but tensors are required to be rectangular. The fix for this …
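
The teaser cuts off before naming the fix, so the following is only one common way to make ragged sequences rectangular, and not necessarily what the post does: pad every sequence in a batch to the length of the longest one. The sketch uses plain Python lists so the idea is visible without any dependencies; `pad_batch` and `pad_value` are hypothetical names.

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length sequences so the batch is rectangular.

    Returns the padded batch and the original lengths, which are often
    needed later to mask out or ignore the padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [
        list(seq) + [pad_value] * (max_len - len(seq))
        for seq in sequences
    ]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


if __name__ == "__main__":
    batch = [[1, 2, 3], [4], [5, 6]]
    padded, lengths = pad_batch(batch)
    print(padded)   # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
    print(lengths)  # [3, 1, 2]
```

With PyTorch, logic like this usually lives in a function passed to the `collate_fn` argument of `DataLoader`, and `torch.nn.utils.rnn.pad_sequence` does the padding on tensors.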

Adding data augmentation to torchtext datasets

By dillon niederhut under python pytorch torchtext nlp augmentation

It is universally acknowledged that artificially augmented datasets lead to models which are both more accurate and more generalizable. They do this by introducing variability that is likely to be encountered in ecologically valid settings but is not present in the training data, and by providing negative examples of spurious …

A multi-file torchtext data loader

By dillon niederhut under python pytorch torchtext nlp

To help models generalize, it's common to use some form of data augmentation. This is where the original training data are modified in some way that preserves their semantics while changing their values. The model is fitted to the original training data, plus one or more augmented versions of it …

A faster way to generate lagged values

By dillon niederhut under python pandas numba timeseries performance

At Novi Labs, we spend a lot of time working with timeseries data. Generically speaking, these data are formatted something like this:

id  time  value
---------------
a     0      1
a     1      2
a     2      3
b     0      4
b     1      5
c     0      6

where we have individual sensors represented by …
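
To make "lagged values" concrete for data shaped like the table above, here is a minimal pure-Python sketch. It is not the faster method the post goes on to describe. It assumes the lag is taken by row position within each `id`, with rows ordered by `time`; `lag_by_id` is a hypothetical helper.

```python
from collections import defaultdict

# The example rows from the table: (id, time, value)
rows = [
    ("a", 0, 1),
    ("a", 1, 2),
    ("a", 2, 3),
    ("b", 0, 4),
    ("b", 1, 5),
    ("c", 0, 6),
]


def lag_by_id(rows, periods=1):
    """Append to each row the value seen `periods` rows earlier for the same id.

    Rows without enough history get None as their lagged value.
    """
    if periods < 1:
        raise ValueError("periods must be at least 1")
    history = defaultdict(list)  # values seen so far, per id
    out = []
    for id_, time, value in sorted(rows, key=lambda r: (r[0], r[1])):
        past = history[id_]
        lagged = past[-periods] if len(past) >= periods else None
        out.append((id_, time, value, lagged))
        past.append(value)
    return out


if __name__ == "__main__":
    for row in lag_by_id(rows):
        print(row)
```

In pandas, the same idea is usually written as `df.groupby("id")["value"].shift(1)`.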