Tiny Docker Images for Scala Native with Multi-Stage Builds

I've been really excited by the rapid progress of Scala Native since its initial release just a few months ago. As a systems-oriented Scala hacker, I'm eager to use my favorite language for small, standalone tools, without some of the downsides of the JVM.

In this post, I'll explain how to build tiny Docker images with Scala Native--in the example repo here, we can reduce the size of a running image from 680MB down to 16MB. To do this, we'll use Alpine Linux, multi-stage Docker builds, and some "fun" Linux binary and symbol-table hacking.

I'd also like to take this opportunity to thank my colleague Justin Nauman for pointing me toward the multi-stage build technique, and Alex Ellis for his excellent blog post on the topic.

Native Binaries and Dynamic Linking

A Scala Native build is relatively complex, and has several stages:

.scala source code is compiled into .nir (Native Intermediate Representation) files
.nir files are compiled into .ll (LLVM) files
.ll files are compiled into .o (native object) files
.o files are linked into a final executable

However, this process isn't necessarily enough to create a true self-contained binary, because of dynamic linking. Essentially, the linker marks the output binary executable file with references to shared library files that must be present on the system for the program to execute -- typically, these libraries have extensions like .dylib, .so, and the notorious .dll. Then, at run time, the dynamic program loader will load the executable and link in the shared libraries. Unsurprisingly, this can be quite error-prone.

To see this in action, you can use the Linux ldd utility to print out the dynamic libraries of any binary program. For example, here are the dynamic libraries for git on Alpine 3.3:

$> ldd /usr/bin/git
   /lib/ld-musl-x86_64.so.1 (0x55c2e5177000)
   libpcre.so.1 => /usr/lib/libpcre.so.1 (0x7f513d16d000)
   libz.so.1 => /lib/libz.so.1 (0x7f513cf57000)
   libcrypto.so.38 => /lib/libcrypto.so.38 (0x7f513cbad000)
   libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x55c2e5177000)

This tells us that git links to the shared libraries for PCRE, zlib, crypto, and the standard libc. If you look closely, you'll see that each entry also points at a specific version at a concrete file path. This is a huge obstacle to building portable Unix binaries. Even worse, the version of libc in use here, musl, isn't fully compatible with the more common glibc.
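
If you want just the name-to-path mapping without the load addresses, a small filter over the ldd output is enough. This is a minimal sketch, assuming the "name => path (address)" line format shown above; resolved_libs is a name I made up for illustration, not part of ldd.

```shell
# resolved_libs: hypothetical helper that reads `ldd` output on stdin and
# prints "name path" for each library the loader resolved to a real file.
resolved_libs() {
  # Lines look like: "libz.so.1 => /lib/libz.so.1 (0x7f51...)".
  # Lines without "=>" (the loader itself) and entries with no path
  # after "=>" (such as linux-vdso) are skipped.
  awk '$2 == "=>" && $3 ~ /^\// { print $1, $3 }'
}

# Usage: ldd /usr/bin/git | resolved_libs
```

The path column is the part that matters for portability: every file listed there has to exist on whatever system runs the binary.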

In contrast, here's the output of the same command on Ubuntu 17.04:

$> ldd /usr/bin/git
   linux-vdso.so.1 =>  (0x00007ffd4a5ed000)
   libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007f38be92c000)
   libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f38be710000)
   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f38be4f2000)
   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f38be2ea000)
   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f38bdf23000)
   /lib64/ld-linux-x86-64.so.2 (0x000056370d47a000)

We can see here that we have a different set of dependencies -- including pthread, librt, and the special vdso -- in different versions. As a result, if we try to run the Ubuntu executable on the Alpine system, this happens:

$> ./git.ubuntu 
/bin/sh: ./git.ubuntu: not found

That error is completely unhelpful, but if we use ldd we can see what's happening:

$> ldd git.ubuntu
Error loading shared library libpcre.so.3: No such file or directory (needed by git.ubuntu)
      libz.so.1 => /lib/libz.so.1 (0x7f18208e4000)
      libpthread.so.0 => /lib64/ld-linux-x86-64.so.2 (0x56519fa3b000)
      librt.so.1 => /lib64/ld-linux-x86-64.so.2 (0x56519fa3b000)
      libc.so.6 => /lib64/ld-linux-x86-64.so.2 (0x56519fa3b000)
Error relocating git.ubuntu: __snprintf_chk: symbol not found
Error relocating git.ubuntu: __vfprintf_chk: symbol not found
Error relocating git.ubuntu: __read_chk: symbol not found
Error relocating git.ubuntu: pcre_maketables: symbol not found
Error relocating git.ubuntu: _obstack_begin: symbol not found
Error relocating git.ubuntu: pcre_exec: symbol not found
Error relocating git.ubuntu: __memmove_chk: symbol not found
Error relocating git.ubuntu: __memcpy_chk: symbol not found
Error relocating git.ubuntu: obstack_free: symbol not found
Error relocating git.ubuntu: pcre_compile: symbol not found
Error relocating git.ubuntu: __open64_2: symbol not found
Error relocating git.ubuntu: __vsnprintf_chk: symbol not found
Error relocating git.ubuntu: _obstack_newchunk: symbol not found
Error relocating git.ubuntu: __printf_chk: symbol not found
Error relocating git.ubuntu: __fread_chk: symbol not found
Error relocating git.ubuntu: pcre_study: symbol not found
Error relocating git.ubuntu: __memset_chk: symbol not found
Error relocating git.ubuntu: __fprintf_chk: symbol not found
Error relocating git.ubuntu: pcre_free: symbol not found

As you can see, it's failing to load the PCRE library, and then choking on all the missing symbols, i.e., functions.
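
With this much noise, it can help to boil the output down to the two things that matter: which libraries are missing, and which symbols went unresolved. The sketch below assumes the musl-style error messages shown above (glibc's ldd words these differently), and both function names are hypothetical.

```shell
# missing_libs: hypothetical helper that reads musl `ldd` output on stdin and
# prints each library the loader could not find, one name per line.
missing_libs() {
  # Matches lines such as:
  #   Error loading shared library libpcre.so.3: No such file or directory (needed by git.ubuntu)
  awk '/^Error loading shared library / { sub(/:$/, "", $5); print $5 }'
}

# unresolved_symbols: hypothetical helper that prints the symbol name from
# each "Error relocating <file>: <symbol>: symbol not found" line.
unresolved_symbols() {
  awk '/^Error relocating / && /symbol not found/ { sub(/:$/, "", $4); print $4 }'
}

# Usage (2>&1 in case the errors are written to stderr):
#   ldd git.ubuntu 2>&1 | missing_libs
#   ldd git.ubuntu 2>&1 | unresolved_symbols
```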

Static Linking and Platform Lock-in

At this point, you may be wondering how it's possible to build portable binary software distributions for UNIX at all. The standard technique is static linking, which essentially copies the body of the shared library code into the output executable. The catch is that the process to perform this static linking is itself incredibly fiddly and platform-dependent, especially if you consider UNIX variants like Mac OS.

For this reason, very few language toolchains can do static linking in a platform-neutral way. Most, including Golang, Rust, and indeed Scala Native, simply give you a hook for passing platform-specific linker options. The catch, though, is that as soon as you start writing platform-specific linker flags into your build, you've either locked yourself into a single platform or committed yourself to maintaining N complex build files.

The Minimum Viable Docker Image, and Build-time vs. Run-time Dependencies

So, to avoid the quagmire of platform-specific build config, we're going to try to create a minimal, reproducible environment that can build and run our program. We'll start with Alpine Linux, which is an awesome minimal base distribution in just 2MB. We'll then install the tools we need and build our software. To achieve portability, we'll rely on Docker to fully virtualize the filesystem and library dependencies.

Looking at Dockerfile.alpine.big in our repo, before we compile our program, we need:

java
scala
sbt
C build tools
LLVM
git
wget
all shared libraries
source code for RE2

These dependencies add up: even though the final executable is just 5.3MB, this image weighs in at 680MB(!), which is definitely in the undesirable range. How do we trim this down?

The traditional way to do this is to use two Dockerfiles: one to perform the build and a slimmer one for run time, plus a shell script or two to stitch them together. This works, but it can be error-prone to maintain, especially for large projects with complex dependencies and multiple subsystems. However, recent updates to Docker give us an alternative approach: multi-stage builds. To use them, you'll need Docker 17.05 or later, which for now means a very recent distribution from the Edge channel.

Multi-stage Docker Builds

Multi-stage builds are a new Docker feature that lets you fully automate complex build workflows in a single Dockerfile. The linked article demonstrates how to build a super-slim Go executable like this:

FROM golang:1.7.3 as builder
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html  
COPY app.go    .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .

FROM alpine:latest  
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]

Essentially, the second FROM directive lets your build start over from a clean slate, and then copy in files from previous stages with the COPY --from=builder directive. This is basically what we'll do for our app, just with more steps:

# Start from Alpine and Java 8, and name this stage "build"
FROM openjdk:8u121-jre-alpine AS build
# Install C libraries and build tools
RUN echo "installing dependencies" && \
    apk --update add gc-dev clang musl-dev libc-dev build-base git && \
    apk add libunwind-dev --update-cache --repository http://nl.alpinelinux.org/alpine/edge/main
# Install re2 from source for clang++ compatibility
RUN git clone https://github.com/google/re2.git && cd re2 && \
    CXX=clang++ make && make install

# Install SBT
ENV SBT_VERSION 0.13.15
ENV SBT_HOME /usr/local/sbt
ENV PATH ${PATH}:${SBT_HOME}/bin
RUN echo "installing SBT $SBT_VERSION" && \
    apk add --no-cache --update bash wget && mkdir -p "$SBT_HOME" && \
    wget -qO - --no-check-certificate "https://dl.bintray.com/sbt/native-packages/sbt/$SBT_VERSION/sbt-$SBT_VERSION.tgz" | tar xz -C $SBT_HOME --strip-components=1 && \
    echo -ne "- with sbt $SBT_VERSION\n" >> /root/.built && \
    sbt sbtVersion

# Set up the directory structure for our project
RUN mkdir -p /root/project-build/project
WORKDIR /root/project-build

# Resolve all our dependencies and plugins to speed up future compilations
ADD ./project/plugins.sbt project/
ADD ./project/build.properties project/
ADD build.sbt .
RUN sbt update

# Add and compile our actual application source code
ADD . /root/project-build/
RUN sbt clean nativeLink

# Copy the binary executable to a consistent location
RUN cp ./target/scala-2.11/*-out ./project-build-out

# Start over from a clean Alpine image
FROM alpine:3.3

# Copy in C libraries from the earlier "build" stage
COPY --from=build \
   /usr/lib/libunwind.so.8 \
   /usr/lib/libunwind-x86_64.so.8 \
   /usr/lib/libgc.so.1 \
   /usr/lib/libstdc++.so.6 \
   /usr/lib/libgcc_s.so.1 \
   /usr/lib/
COPY --from=build \
   /usr/local/lib/libre2.so.0 \
   /usr/local/lib/libre2.so.0

# Copy in the executable from "build"
COPY --from=build \
   /root/project-build/project-build-out .

# Finally run the thing!
CMD ["./project-build-out"]
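
The list of libraries in the COPY instructions doesn't have to be worked out by hand: it's what ldd reports for the linked executable inside the "build" stage. Here's a rough sketch that turns ldd output into candidate COPY lines. It assumes that anything under /lib already ships with the Alpine base image, and copy_lines is a hypothetical helper name, so treat the result as a starting point to review rather than a finished list.

```shell
# copy_lines: hypothetical helper that reads `ldd` output on stdin and prints
# one Dockerfile COPY instruction per resolved library outside /lib.
# Assumption: libraries under /lib (musl itself, for example) are already
# present in the Alpine base image and do not need to be copied.
copy_lines() {
  awk '$2 == "=>" && $3 ~ /^\// && $3 !~ /^\/lib\// {
    print "COPY --from=build " $3 " " $3
  }'
}

# Usage, from a shell inside the "build" stage:
#   ldd ./project-build-out | copy_lines
```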

Whew! Once we build this Dockerfile with docker build -t scala-native-alpine ., we can check out the resulting image:

$> docker images | grep scala-native-alpine
scala-native-alpine   latest              32b1e4dec1ac        46 seconds ago       16MB

Just 16MB, which is roughly a 42x reduction in size from 680MB!
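
As a quick sanity check on that ratio, here's the division spelled out, using the two image sizes quoted in this post. The size_ratio helper is a made-up name for illustration.

```shell
# size_ratio: hypothetical helper that prints how many times smaller the
# second size is than the first, to one decimal place.
size_ratio() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

# Usage: size_ratio 680 16   # prints 42.5
```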

Finally, we're ready to run the binary, which should take about a minute:

$> docker run -it -v $PWD:/project-build/ scala-native-alpine
Rendering (8 spp) 100.00%

You can look at the code for the example Scala Native project, but this is essentially doing a bunch of C-level math operations to render an 800 x 600 image file.

Next Steps

If you've followed along this far, you've learned how to build tiny Scala Native images with Docker and Alpine Linux. The big benefit of this approach is that the whole build lifecycle is encapsulated in a single Dockerfile. This will help us out immensely when building more complex applications.

In a future post, I'd like to write about Dinosaur, a simple CGI-based web framework for Scala Native, and some of the quirks of multi-process web programming from a systems perspective. In the meantime, we'd love to hear from you on Twitter if you've found this post useful--reach us at @spantreellc and @RichardWhaling!