How to deploy conda-based Docker images

The scientific Python community has settled on conda and conda-forge as the easiest way to compile and install dependencies that have complicated build recipes.

However, conda has some tradeoffs when it comes to building and deploying Docker-based services. In particular:

  1. conda activate does not work inside a Dockerfile (see this SO question)
  2. the image itself can be several gigabytes in size

The good news is that the first issue is relatively easy to work around -- if you know the name of your environment and the location of the conda install, you can prepend the environment's bin directory directly to the shell's PATH:

ENV PATH=$INSTALL_DIR/conda/envs/$ENV_NAME/bin:$PATH
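
If you would rather not manage PATH yourself, recent versions of conda also ship a conda run subcommand that activates an environment for the duration of a single command. A minimal sketch, assuming conda is installed in /opt/conda and the environment is named app (note that this keeps conda itself in the final image, which will matter below):

ENTRYPOINT [ "/opt/conda/bin/conda", "run", "--no-capture-output", "-n", "app", "python", "-m", "app" ]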

The bad news is that the second problem is a bit trickier. Jim Crist-Harif wrote a widely referenced blog post about this in 2019, and Uwe Korn followed up two years later with another set of guidelines for smaller images. Some of their suggestions (like avoiding MKL) no longer provide any size improvement. Others, like using a distroless image, might be infeasible for other reasons -- if, say, your deployment depends on having access to something installed via apt or yum.

The image size matters for two reasons:

First, the larger the image is, the longer it takes to transfer over the network to a host. This isn't a huge deal if you do it, say, once a month for a persistent service. However, if the service needs to scale up and down quickly in response to compute demands, the image size can add a lot of latency. We've also run into a problem with AWS Batch specifically, where deploying several large images onto the same EC2 host will saturate its bandwidth allocation and your deployment will fail with TCP timeouts.

Second, larger images mean more storage. This usually isn't a problem for the container repository itself, but it can cause issues on the image host. For example, if you have some embarrassingly parallel computation, you might launch 400 containers, each handling one small portion of the input. If 50 of those containers land on the same Docker host (because the compute and memory requirements are relatively light), and each conda image is 4GB, you'll need to attach 200GB of hard disk storage just for the images -- and this ignores the room you'll need for the host OS, the Docker engine, any disk caching your processes use, etc.

The good news is that there are a few tricks we can use to reduce the size of our conda image by nearly a factor of six. To see what these look like, let's start with this example conda environment specification:

name: docker-example
channels:
  - conda-forge
  - defaults
dependencies:
  - aiobotocore>=2.5.0
  - boto3
  - fsspec
  - numpy<2.0.0
  - pandas<2.0.0
  - pip
  - pyarrow
  - python=3.9
  - s3fs>=2023.5.0

It has some pretty standard data-science stuff in there, like numpy and pandas, plus a few extra libraries to help us read Apache Parquet files out of AWS S3 buckets. Note the lower bounds on aiobotocore and s3fs -- these can be finicky about compatibility with each other and with boto3, so it's helpful to keep them within fairly tight bounds to avoid runtime errors. Note also the upper bounds on numpy and pandas -- if you aren't already pinning major versions like this, you will probably want to start.
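
As an aside, nothing about this file is Docker-specific. If you want to poke at the environment locally before baking it into an image, the usual commands will build and activate it (the environment name comes from the yaml's name field):

conda env create --file env.yaml
conda activate docker-example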

Next up is this Dockerfile:

FROM debian:stable-slim

RUN apt-get update && apt-get install -y -q build-essential wget

ENV CONDA_DIR=/opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/miniconda.sh && /bin/bash /opt/miniconda.sh -b -p /opt/conda

COPY env.yaml /opt/app/
COPY app.py /opt/app/

RUN /opt/conda/bin/conda env create -n app --file=/opt/app/env.yaml

ENV PATH=$CONDA_DIR/envs/app/bin:$PATH

WORKDIR /opt/app/
ENTRYPOINT [ "python", "-m", "app" ]

which starts from a trimmed-down version of Debian, and installs the build tools we'll need to compile any dependencies, along with wget, which we use to fetch the conda installer. We download and install Miniconda (because it's mini!), use it to create a virtual environment with the packages we need, then put our brand new conda environment at the front of our PATH.
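
To reproduce the numbers below, build the image and ask Docker for its size (the docker-conda:N tags here just mirror the tables that follow):

docker build -t docker-conda:0 .
docker images docker-conda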

How large is the result?

REPOSITORY     TAG       IMAGE ID       CREATED          SIZE
docker-conda   0         d80558a36b24   2 hours ago      3.12GB

😵‍💫 more than 3GB!

1. Delete unneeded files

We'll start here, since this is part of the advice given in the two earlier blog posts. Conda caches the package archives it downloads, which we don't need once the environment is built, so we'll start by getting rid of them with conda clean.

The packages themselves might include bytecode-compiled Python files (__pycache__ directories and .pyc files) and some static build artifacts (.a files). We can safely delete those, since they are either no longer needed or can be quickly regenerated when our image starts running.

We'll also ask the system package manager (apt) to clear out anything it knows it no longer needs with apt-get clean and apt-get autoremove. The next version of our Dockerfile looks like this:

FROM debian:stable-slim

RUN apt-get update \
    && apt-get install -y -q build-essential wget \
    && apt-get clean \
    && apt-get autoremove -y

ENV CONDA_DIR=/opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/miniconda.sh && /bin/bash /opt/miniconda.sh -b -p /opt/conda

COPY env.yaml /opt/app/
COPY app.py /opt/app/

RUN /opt/conda/bin/conda env create -n app --file=/opt/app/env.yaml \
    && /opt/conda/bin/conda clean -afy \
    && find /opt/conda/ -follow -type f -name '*.a' -delete \
    && find /opt/conda/ -follow -type f -name '*.pyc' -delete \
    && find /opt/conda/ -follow -type d -name '__pycache__' -delete

ENV PATH=$CONDA_DIR/envs/app/bin:$PATH

WORKDIR /opt/app/
ENTRYPOINT [ "python", "-m", "app" ]

With all those cleanup steps, how did we do?

REPOSITORY     TAG       IMAGE ID       CREATED        SIZE
docker-conda   1         76d0e38d2ad8   2 hours ago    2.09GB
docker-conda   0         d80558a36b24   2 hours ago    3.12GB

Our image got 30% smaller, which is nice, but 2GB is still a very large image 😦.

2. Use a multi-stage build

In the previous step, we went through and deleted by hand a bunch of cached files and build artifacts that were no longer needed. We didn't get all of them, though: the compilers and headers we installed via build-essential are still there, since apt has no way of knowing that we're never going to compile anything in this image again.

What we can do instead is use a Docker multi-stage build. This lets us compile all of the artifacts we need in one image, then copy just those files into a new, clean image. In our case, this means creating our Python environment with all of its packages in one image, then copying the conda installation into a fresh image. Everything conda installs lives under its own directory, so this is safe to do.

Since we aren't copying anything we installed with apt-get into the final image, we no longer need to ask it to clean up after itself. Two things to note: the COPY statement for the environment file goes in the first stage, while the one for the application goes in the second; and ENV values do not carry across stages, so the deploy stage needs its own CONDA_DIR. Our new Dockerfile looks like this:

FROM debian:stable-slim AS build

RUN apt-get update && apt-get install -y -q build-essential wget

ENV CONDA_DIR=/opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/miniconda.sh && /bin/bash /opt/miniconda.sh -b -p /opt/conda

COPY env.yaml /opt/app/

RUN /opt/conda/bin/conda env create -n app --file=/opt/app/env.yaml \
    && /opt/conda/bin/conda clean -afy \
    && find /opt/conda/ -follow -type f -name '*.a' -delete \
    && find /opt/conda/ -follow -type f -name '*.pyc' -delete \
    && find /opt/conda/ -follow -type d -name '__pycache__' -delete

FROM debian:stable-slim AS deploy

ENV CONDA_DIR=/opt/conda
COPY --from=build /opt/conda /opt/conda
COPY app.py /opt/app/

ENV PATH=$CONDA_DIR/envs/app/bin:$PATH

WORKDIR /opt/app/
ENTRYPOINT [ "python", "-m", "app" ]

How did we do?

REPOSITORY     TAG       IMAGE ID       CREATED          SIZE
docker-conda   2         719cc144a7f9   2 hours ago      1.33GB
docker-conda   1         76d0e38d2ad8   2 hours ago      2.09GB
docker-conda   0         d80558a36b24   2 hours ago      3.12GB

Another 700MB gone! But the image is still over a gigabyte, so we have some work left to do 😔.

3. Delete conda

In the last step, we started with a fresh image and copied over the conda directory. But we don't actually need all of it anymore. Most of what's in there is machinery for talking to upstream repositories, solving package dependencies, etc., and we won't be installing anything into the service after it has been deployed.

Conda itself can also cause other issues, like a very very very outdated version of Flask in its test suite causing Vanta to flag a critical security CVE. This is important if you, for example, are maintaining SOC2 compliance. Even if you aren't, it's a good idea in general to limit your services to having only the software they need to do their jobs properly, in the same way that you want to limit the ports they have open to only the ones that are necessary.

So what we'll do for our third attempt is only copy the environment from the build image, and leave the rest of conda behind. Here's the new Dockerfile:

FROM debian:stable-slim AS build

RUN apt-get update && apt-get install -y -q build-essential wget

ENV CONDA_DIR=/opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/miniconda.sh && /bin/bash /opt/miniconda.sh -b -p /opt/conda

COPY env.yaml /opt/app/

RUN /opt/conda/bin/conda env create -n app --file=/opt/app/env.yaml \
    && /opt/conda/bin/conda clean -afy \
    && find /opt/conda/ -follow -type f -name '*.a' -delete \
    && find /opt/conda/ -follow -type f -name '*.pyc' -delete \
    && find /opt/conda/ -follow -type d -name '__pycache__' -delete

FROM debian:stable-slim AS deploy

ENV CONDA_DIR=/opt/conda
COPY --from=build /opt/conda/envs/app /opt/conda/envs/app
COPY app.py /opt/app/

ENV PATH=$CONDA_DIR/envs/app/bin:$PATH

WORKDIR /opt/app/
ENTRYPOINT [ "python", "-m", "app" ]

How did we do this time?

REPOSITORY     TAG       IMAGE ID       CREATED             SIZE
docker-conda   3         d52ceae93fde   2 hours ago         1.07GB
docker-conda   2         719cc144a7f9   2 hours ago         1.33GB
docker-conda   1         76d0e38d2ad8   2 hours ago         2.09GB
docker-conda   0         d80558a36b24   2 hours ago         3.12GB

Another 300MB gone! But we're still over 1GB in size, so there is one more step 🫤

4. Install pyarrow from pip instead of conda

This image is still a bit larger than we would like, so to figure out why, we're going to use du to find out where the disk space is going.
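
One way to do that without modifying the image is to point docker run's --entrypoint at du; the first number below came from something like:

docker run --rm --entrypoint du docker-conda:3 -sh /opt/conda/envs/app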

997M    /opt/conda/envs/app

The first thing we see is that, out of our 1.07GB image, 997MB is our application environment -- so even if we got rid of the entire Debian distribution underneath it, the savings would be minimal. We can do better.

The next thing we see is that some of our Python packages have shipped with their test suites!
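
(A filter along these lines, run inside the container, surfaces them -- GNU sort's -h flag understands the human-readable sizes that du prints:)

du -h /opt/conda/envs/app | grep -E '/tests$' | sort -rh | head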

26M /opt/conda/envs/app/lib/python3.9/site-packages/pandas/tests
5.1M    /opt/conda/envs/app/lib/python3.9/site-packages/numpy/core/tests
1.5M    /opt/conda/envs/app/lib/python3.9/site-packages/numpy/lib/tests
1.0M    /opt/conda/envs/app/lib/python3.9/site-packages/numpy/typing/tests

There are arguments on both sides of this debate, but one reason to keep tests in a separate directory is so that people who install your package don't have to install them too. Another reason, pointed out by Hynek, is that packaging tests alongside your code makes it easy to accidentally test against your dev environment instead of the package as it will actually be installed by your users: https://hynek.me/articles/testing-packaging/.

If you scan through all the directory listings, you'll also see that we're giving up space to libraries for GCP and AWS:

4.0M    /opt/conda/envs/app/include/google
23M /opt/conda/envs/app/include/aws

The conda-forge feedstock for arrow-cpp includes both the GCP and AWS C++ SDKs in every installation. I don't know for sure why, but I would guess it has to do with conda's limited support for optional dependencies.

Because pip allows optional dependencies at install time (a double-edged sword, as anyone who has forgotten to install openpyxl knows), we can install pyarrow from PyPI, which does not include those extras by default. So we'll make one little change to our environment file:

name: docker-example
channels:
  - conda-forge
  - defaults
dependencies:
  - aiobotocore>=2.5.0
  - boto3
  - fsspec
  - nomkl
  - numpy<2.0.0
  - pandas<2.0.0
  - pip
  - python=3.9
  - s3fs>=2023.5.0
  - pip:
    - pyarrow

leave the Dockerfile the same, and try again:

REPOSITORY     TAG       IMAGE ID       CREATED             SIZE
docker-conda   4         22bb71256a3e   2 hours ago         543MB
docker-conda   3         d52ceae93fde   2 hours ago         1.07GB
docker-conda   2         719cc144a7f9   2 hours ago         1.33GB
docker-conda   1         76d0e38d2ad8   2 hours ago         2.09GB
docker-conda   0         d80558a36b24   2 hours ago         3.12GB

Another 500MB, and we've finally gotten the image below a gigabyte 🤤. There is some more cleanup we could do here (like deleting all of those test directories -- see the sketch below), but I am running Docker Desktop on an M2 Mac, and building these five images has already taken an entire day 🙄.
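
For completeness, here is roughly what that extra cleanup would look like, as one more line in the build stage's RUN chain -- untested here, so verify that your application still imports everything it needs afterwards:

    && find /opt/conda/envs/app -follow -type d -name 'tests' -prune -exec rm -rf {} +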

So, to summarize: to get your conda-based Docker images down to a reasonable size, you need to do three things:

  1. Use a multi-stage build
  2. Delete absolutely everything you can, including conda itself
  3. Install packages with lots of optional dependencies via pip and not conda