A multi-file torchtext data loader

To help models generalize, it's common to use some form of data augmentation, where the original training data are modified in a way that preserves their semantics while changing their values. The model is then fitted to the original training data, plus one or more augmented versions of them, over one or more epochs.

In PyTorch, for images, this capability is built into the data loading process. In the constructor of a VisionDataset, you can pass a sequence of transformations, helpfully defined for you in torchvision.transforms. When the data are loaded, the transformations are applied on the fly.
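As a quick sketch -- using CIFAR-10 and a couple of standard augmentations as stand-ins, not anything specific to this project -- that looks something like:

from torchvision import datasets, transforms

# Every epoch sees a freshly augmented copy of each image, because
# the transformations run at load time.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10(root='./.data', train=True, download=True,
                             transform=train_transform)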

For text data, there is no such transforms sequence that can be passed to the torchtext dataset object. And, since the input data are eagerly vectorized, the transforms can't be applied after the fact either. So we'll need a way to pre-compute all the kinds of augmentation we want to use, and to have the dataset cover multiple input files at the same time.

You might have wanted to do this anyway, even when using torchvision. If you are

  1. using a very large number of epochs;
  2. using a lot of augmentation methods;
  3. using transformations that are computationally expensive to perform; or,
  4. doing all of the above; then,

it can be more efficient to perform those transformations once, persist them somewhere, then read them back in as needed.
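As a rough sketch of that pre-compute-and-persist step -- the augment_text function and the output file names here are hypothetical placeholders, not part of torchtext:

import csv

def augment_text(text):
    # Placeholder: substitute your augmentation of choice here
    # (synonym replacement, back-translation, random deletion, ...).
    return text

def write_augmented_copies(src_path, n_copies=2):
    # Write n_copies augmented versions of a (label, text, ...) CSV and
    # return the list of every file the model should be trained on.
    paths = [src_path]
    with open(src_path, newline='', encoding='utf8') as f:
        rows = list(csv.reader(f))
    for i in range(1, n_copies + 1):
        out_path = f'{src_path}.aug{i}.csv'
        with open(out_path, 'w', newline='', encoding='utf8') as f:
            writer = csv.writer(f)
            for row in rows:
                writer.writerow([row[0], augment_text(' '.join(row[1:]))])
        paths.append(out_path)
    return paths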

The built-in dataset constructor functions in torchtext only operate on a single file path, so this won't work for us right out of the box. The good news is that the changes we'll need to make are relatively minor.
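For reference, in the torchtext releases this code comes from (roughly 0.4/0.5), the built-in constructors are called like this -- note that there is nowhere to hand them additional training files:

from torchtext.datasets import text_classification

# Downloads AG_NEWS and builds the train/test datasets from one CSV each.
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=2)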

First, we'll want to take a quick look at how torchtext does it. The text datasets that ship with torchtext (e.g. the YahooAnswers dataset) are constructed on the fly by generating a vocabulary, a collection of input tensors, and a collection of labels (or other targets).

The tensors and labels are generated by a helper function called _create_data_from_iterator, which looks like this:

def _create_data_from_iterator(vocab, iterator, include_unk):
    data = []
    labels = []
    with tqdm(unit_scale=0, unit='lines') as t:
        for cls, tokens in iterator:
            if include_unk:
                tokens = torch.tensor([vocab[token] for token in tokens])
            else:
                token_ids = list(filter(lambda x: x is not Vocab.UNK, [vocab[token]
                                        for token in tokens]))
                tokens = torch.tensor(token_ids)
            if len(tokens) == 0:
                logging.info('Row contains no tokens.')
            data.append((cls, tokens))
            labels.append(cls)
            t.update(1)
    return data, set(labels)

The iterator that this function receives as an argument comes from a helper called _csv_iterator, which reads a comma-delimited file and, for each line, yields an n-grams iterator over everything after the first field -- paired with that first field (assumed to be the label) when yield_cls is set:

def _csv_iterator(data_path, ngrams, yield_cls=False):
    tokenizer = get_tokenizer("basic_english")
    with io.open(data_path, encoding="utf8") as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            tokens = ' '.join(row[1:])
            tokens = tokenizer(tokens)
            if yield_cls:
                yield int(row[0]) - 1, ngrams_iterator(tokens, ngrams)
            else:
                yield ngrams_iterator(tokens, ngrams)
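To make the n-gram handling concrete, here is a small illustration (the sentence is made up) of what ngrams_iterator yields for a tokenized line:

from torchtext.data.utils import get_tokenizer, ngrams_iterator

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("the cat sat on the mat")

# With ngrams=2, every unigram is yielded first, followed by the
# space-joined bigrams: 'the cat', 'cat sat', 'sat on', ...
print(list(ngrams_iterator(tokens, 2)))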

When we construct our own custom TextClassificationDataset objects, we can use the same code that torchtext does internally, with one modification -- we'll create one more helper function for yielding from multiple files at once, threading the yield_cls flag through it. E.g. something like the following:

def _csvs_iterator(train_files, ngrams, yield_cls=False):
    # Chain the per-file iterators together, passing yield_cls through so
    # the same helper works for both vocab-building and data-building.
    for fp in train_files:
        yield from _csv_iterator(fp, ngrams, yield_cls)

Finally, our dataset construction logic will look like the following:

vocab = build_vocab_from_iterator(_csvs_iterator(filenames, ngrams))
data, labels = _create_data_from_iterator(
    vocab, _csvs_iterator(filenames, ngrams, yield_cls=True), include_unk)

return TextClassificationDataset(vocab, data, labels)
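Assuming that construction logic is wrapped in a function -- call it build_multifile_dataset(filenames, ngrams=1, include_unk=False), a name of my own choosing -- using it with the original training file plus its pre-computed augmented copies is then just:

# The original training file plus its augmented copies
# (file names are illustrative).
train_dataset = build_multifile_dataset(
    ['train.csv', 'train.csv.aug1.csv', 'train.csv.aug2.csv'], ngrams=2)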