MLOps lessons from Creativity, Inc. (Part 1)

I recently finished listening to the audiobook version of Creativity, Inc., Ed Catmull's book on the history of Pixar. Many of the company policies and managerial decisions discussed in the book, especially the parts about experimentation and feedback, sound very similar to what you would hear in an agile software development environment. This is not an accident -- Ed draws explicit inspiration from the lean manufacturing movement, which is also the inspiration behind modern DevOps practices.

However, on reflection I was struck by how directly many of Pixar's challenges apply to machine learning model development and deployment. Maybe this is just an example of the frequency illusion, but there is something to be said for the similarities between the two: Pixar writes software to render images with the ultimate objective of telling a story; a machine learning engineer writes software to train models with the ultimate objective of answering a question.

So, here follows an incomplete list of the MLOps lessons from Creativity Inc. that I remembered long enough to write down after I got back from whatever run I was on.

1. Get Feedback Early

This is an old tenet of agile and also one of the principles of DevOps. Briefly stated, you want to know whether a final product will have any issues. The earlier you know (or suspect!) that there might be a problem, the sooner you can intervene, saving time and resources. This means that you will need to collect feedback at intermediate steps in the process.

At Pixar, at the time Creativity, Inc. was written anyway, an entire film is recompiled every 2-4 weeks. The film is the "end product", so this is the ultimate cycle time for generating feedback to incorporate into the film's development. However, the directors of a movie will review every scene under development on a daily basis, providing feedback to the crew working on the film on a much tighter timeline.

However, this isn't only about reducing cycle time. In particular, a film can work as a whole while an individual scene in that film is still bad. At Pixar, the directors intervene at the level of the individual film component both to ensure that the component itself works as a standalone effort, and to provide a top-down view of how that piece fits into the broader vision.

In MLOps, this translates into two separate concerns: a) does a model component do what it's supposed to do; and b) is it working in a way that stakeholders will find acceptable? The first case is, I think, obvious, as it's a fairly direct translation of the concept of unit testing into data transforms, but the second is a bit more subtle.
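To make that first concern concrete, here is a minimal sketch of what a unit test for a data transform might look like. The transform and its behavior are hypothetical, invented here purely for illustration; the point is that the component gets checked on its own, long before the full model is retrained.

```python
import numpy as np
import pandas as pd


def fill_missing_with_median(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical transform: replace missing values in `column` with the column median."""
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    return out


def test_fill_missing_with_median():
    df = pd.DataFrame({"sales": [1.0, np.nan, 3.0]})
    result = fill_missing_with_median(df, "sales")
    # The median of the observed values [1.0, 3.0] is 2.0, so the gap should be filled with 2.0.
    assert result["sales"].isna().sum() == 0
    assert result.loc[1, "sales"] == 2.0
    # The transform should not mutate its input.
    assert df["sales"].isna().sum() == 1
```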

The second concern is harder to pin down. Say, for instance, you have a model that achieves sufficient accuracy at some task, and uses something like listwise deletion or single imputation to handle missing values. If one of your particularly savvy customers doesn't approve of this choice, then the model is a failure, even if it technically works just fine.

To take another example, let's say you are modeling some outcome that is irregularly distributed -- maybe something like sales per person, where the typical person buys $0 worth of things. A typical regression model -- even a very good one -- will hesitate to predict 0 for non-buying people and will tend to predict very small numbers, maybe 0.1. In your wisdom, you decide to replace your simple regression model with a cascade architecture that first classifies people into buy/no-buy, then only predicts sale amounts for the "buy" group. Your model becomes much more accurate, but if anyone downstream was normalizing by expected sales, they now have a bunch of infs showing up in their analyses.
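Here is a rough sketch of that cascade and the downstream surprise it causes, with made-up data and plain scikit-learn models standing in for whatever the real system would use:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Toy data: most people buy nothing, a few buy a positive amount.
X = rng.normal(size=(200, 3))
bought = rng.random(200) < 0.2
sales = np.where(bought, rng.gamma(2.0, 50.0, size=200), 0.0)

# Stage 1: classify buy / no-buy.  Stage 2: regress amounts on buyers only.
clf = LogisticRegression().fit(X, bought)
reg = LinearRegression().fit(X[bought], sales[bought])

will_buy = clf.predict(X).astype(bool)
predicted_sales = np.zeros(len(X))   # predicted non-buyers now get an exact 0, not ~0.1
if will_buy.any():
    predicted_sales[will_buy] = reg.predict(X[will_buy])

# A downstream consumer who normalizes by expected sales now divides by zero.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = sales / predicted_sales  # inf (or nan) wherever the cascade predicts exactly 0
print(f"{np.isinf(ratio).sum()} infs that the old regression model never produced")
```

The cascade is a perfectly reasonable modeling choice; the problem is that the shape of its output changed (exact zeros are now possible) without anyone downstream agreeing to it.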

It's not enough to get feedback only about the whole model, or on model metrics. The affordances of the individual components also matter.

2. Embrace Mistakes and Learn From Them

This one sounds pretty obvious in theory but can be hard to pull off in practice. Failure is the best teacher! If you aren't learning from your mistakes, they are costing you twice as much! Etc!

The whole book is filled with examples of mistakes that Ed has made (doing a direct-to-DVD film), or mistakes that his directors have made (they had to rewrite... I think it was Toy Story 2? at the last minute because the story was so unbelievable), so it's hard to choose (or remember 😅) one particular motivating example from Pixar for this one.

One really important thing that Ed mentions is that corporate culture flows from the top down, so if you want your team to embrace failures then you need to be the one modeling that behavior. One is tempted to look at Ed's book as an extended example of modeling how to honestly and openly discuss failure and the things you have learned.

For ML, though, learning from failure goes beyond bad personal decisions to include bad machine decisions. In some cases, this is a simple consequence of a model minimizing its error weighted by prevalence in the dataset, so underrepresented segments of the population will tend to have higher errors. In other cases it is an indication that a model works well for one segment, but a different segment might need a different model architecture to perform well.
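One cheap way to surface this kind of machine mistake is to stop reporting a single global metric and break the error out by segment. A minimal sketch, assuming a pandas DataFrame with hypothetical `segment`, `y_true`, and `y_pred` columns:

```python
import numpy as np
import pandas as pd


def error_by_segment(df: pd.DataFrame) -> pd.DataFrame:
    """Count and mean absolute error per segment, worst segment first."""
    return (
        df.assign(abs_error=(df["y_true"] - df["y_pred"]).abs())
          .groupby("segment")
          .agg(n=("abs_error", "size"), mae=("abs_error", "mean"))
          .sort_values("mae", ascending=False)
    )


# A small, underrepresented segment "B" can hide behind a respectable global MAE.
scores = pd.DataFrame({
    "segment": ["A"] * 90 + ["B"] * 10,
    "y_true": np.concatenate([np.full(90, 10.0), np.full(10, 10.0)]),
    "y_pred": np.concatenate([np.full(90, 10.5), np.full(10, 20.0)]),
})
print(error_by_segment(scores))  # global MAE is ~1.45, but segment B's MAE is 10.0
```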

In yet other cases, this may mean that your model only had high accuracy because things weren't changing much! In a temperate climate, the amount of time people spend outside might be a simple linear function of the temperature: more warm, more time outdoors. What happens if you include areas of the world where the temperature varies from 30C to 40C? The relationship flips: past a certain point, more heat means less time outside, and the linear model quietly falls apart. For more on this particular failure mode, see the McKinsey whitepaper "Leadership's role in fixing the models that COVID-19 broke".

Speaking generally, however, every model will start to see its errors increase, either because time is passing and the future is not the past, or because the model is being applied to new locations, new domains, or new problems. The technical term for this is distribution shift, and it implies that it doesn't really matter how good your model was to start with -- it is going to fail. And that's okay! The failure itself is only bad if you aren't ready for it, and ready to learn from it. Your ML practice should be monitoring model errors in real time. Are the failure rates increasing? If so, what do those failures have in common?
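The monitoring doesn't need to be elaborate to be useful. As a rough sketch (the window size, baseline, and alert threshold below are arbitrary choices, not recommendations), you can compare a rolling error rate against the rate you measured at deployment time and raise a flag when it drifts too far:

```python
from collections import deque


class ErrorRateMonitor:
    """Toy drift monitor: alert when the rolling error rate climbs well above a baseline."""

    def __init__(self, baseline_error_rate: float, window: int = 500, tolerance: float = 1.5):
        self.baseline = baseline_error_rate  # error rate observed around deployment time
        self.tolerance = tolerance           # alert at 1.5x the baseline (arbitrary)
        self.recent = deque(maxlen=window)   # 1 = prediction was wrong, 0 = it was fine

    def record(self, was_error: bool) -> None:
        self.recent.append(int(was_error))

    def should_alert(self) -> bool:
        if len(self.recent) < self.recent.maxlen:  # wait for a full window before judging
            return False
        current = sum(self.recent) / len(self.recent)
        return current > self.baseline * self.tolerance


# In the serving path, hypothetically:
# monitor = ErrorRateMonitor(baseline_error_rate=0.08)
# monitor.record(abs(y_true - y_pred) > acceptable_error)
# if monitor.should_alert():
#     ...page someone, and start asking what the recent failures have in common.
```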

3. Make Teams Responsible for Their Own Decisions

One of the mistakes that Ed talks about in his book is setting up separate teams for producing scenes and for approving them for use in the film. The intention behind this should sound familiar to anyone who has worked with a dedicated testing team or a dedicated deployment team: separating these functions allows each team to focus on one area where they can specialize. Additionally, you get a second pair of eyes, and specifically disinterested eyes (it's not their baby!), looking over a product before release.

The downside to this setup is that it creates misaligned incentives. The production team is incentivized to get their work past the audit team, even if that means compromising on creativity. The audit team is incentivized to avoid failures, even if that means there are lots of false positives and good work gets cut. Neither of these incentives is aligned with the company as a whole, whose goal is to make good movies!

At Pixar, the solution was to dissolve the audit team and make the animators and producers themselves responsible for the decision to ship, and for any consequences arising from that choice. In DevOps, the analogous workflow unites development with deployment, so that a single team is responsible for a unit of work from ideation to deployment and monitoring.

In MLOps, the solution is similar -- the ML engineering team should own this process, the decisions, and the consequences, end-to-end. In practice this can be a lot harder, because not every data scientist knows how to write production code, let alone deploy it.

We believe pretty strongly (at Novi Labs) in end-to-end ownership, so our data scientists deploy their own models on their own infrastructure. There is a nontrivial cost to this approach, namely in finding and hiring data scientists who can do all the things. But the benefit is substantially lower communication costs: the team's goals are exactly aligned with company goals (ship a product!), and shipping that product does not depend on the extent to which other teams in the company understand some of the subtler nuances of statistical learning.

Cliffhanger!

This turned out to be a lot more typing than I was expecting, so we're going to break this into a two-parter. Stay tuned for the next three lessons from a book I sort of remember listening to while running!

Update: here is part 2!