Reproducible Builds
What are Reproducible Builds?
The first tricky problem in this area is that the term "reproducible builds" means different things to different people. The simplest scenario is that you need to be able to re-run a build at some point in the future and have it succeed. The project already has mechanisms to provide the basics for this, such as source mirroring support, so that if upstream sources disappear the build can still run. We also have technology such as sstate, which means that when you rebuild, prebuilt artefacts can be used where possible instead of rebuilding from scratch.
To others, reproducibility is also about the level of determinism in the builds. Releases prior to pyro didn't have "recipe specific sysroots", which meant that builds could potentially be contaminated depending on build ordering. From pyro onwards the sysroots are isolated per recipe to avoid this issue. Perfect determinism in this area would mean task specific sysroots, since multiple tasks from the same recipe can currently run in parallel; however, that is not something we've felt the project needs right now.
There are other factors which affect determinism and reproducibility, for example:
- compressing files with different levels of parallelism could result in different output
- dates, times and build paths can be embedded in the built objects
- dates and times of build objects change depending on when the build was run and which tasks happened in parallel
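To make the embedded timestamp problem concrete, here is a minimal sketch using Python's standard gzip module. It is only an illustration and not part of the build system: the helper name gzip_bytes and the sample timestamps are made up for the example. The gzip header stores a modification time, so the same payload compressed on two different days produces different bytes unless the time is pinned.

```python
import gzip
import io


def gzip_bytes(data: bytes, mtime: float) -> bytes:
    """Compress data, writing an explicit timestamp into the gzip header."""
    buf = io.BytesIO()
    # filename="" keeps the optional original-name field out of the header.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"hello reproducible world\n" * 100

# Same input, "built" one day apart (illustrative timestamps).
built_monday = gzip_bytes(payload, mtime=1_500_000_000)
built_tuesday = gzip_bytes(payload, mtime=1_500_086_400)

# Same input, timestamp pinned to a fixed value.
pinned_a = gzip_bytes(payload, mtime=0)
pinned_b = gzip_bytes(payload, mtime=0)

print("unpinned outputs identical:", built_monday == built_tuesday)
print("pinned outputs identical:", pinned_a == pinned_b)
```

Both unpinned archives decompress to the same payload; only the header bytes differ, which is enough to make a byte-for-byte comparison fail.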
Our current intent is to model the best practices available and have tools and techniques to fix deltas where we can.
It is worth noting that there are things it would be sensible to do outside the build system to help with reproducibility over the long term too; it's not a problem the build system can solve alone. Depending on your circumstances and requirements, it may help to:
- build on the same distro version with the same installed packages
- build in the same path
- use the same build hardware
The system is designed to factor out these deltas but, for example, a build system which works today on Ubuntu X is not likely to work on Ubuntu X+5 years, as the compilers will be totally different and probably unable to build today's codebase. If you need to run Ubuntu X in 5 years' time, you may need older hardware to do that. There are solutions with containers, and the build system can even self host inside its own image, but there are limits to every approach and the exact circumstances of each individual situation need to be considered.
Current Status and Planned Development
The Yocto Project aims to have builds which are entirely reproducible. That is, if you run a build today, then run that same configuration at some time X in the future, the binaries you get out of the build should be binary identical.
This implies that the host system you run the build on and the path you run the build in should not affect the target system output.
Our build system doesn't produce binary reproducible builds today, but we are actively working towards that goal and fixing issues as we identify them. We also plan to improve our testing to help find reproducibility issues.
The design of the system lends itself very well to producing reproducible builds, as we provide a reproducible build environment with minimal dependencies on the host/build OS.
We have several technologies in place which aid in reproducibility:
Our shared state (sstate) mechanism bases our builds on hashes of input metadata, reusing the outputs if the inputs are the same.
As our sstate files need to be reusable regardless of build path and can be target or native binaries, we have mechanisms for working around various issues such as hard-coded paths (though we'd prefer to remove the need for them entirely).
A problem related to sstate is that of knowing, when the input has changed, whether the output has changed too. This is useful in the context of package feeds, amongst other things, to know whether we should update them or not. We have binary build comparison tools at an early stage of development which allow us to reduce unnecessary churn in package feeds; however, further developments are planned in this area using the tools from the reproducible-builds project.
Our SDKs need to be relocatable and run anywhere they are installed to. We include our own C library to do this in a "run anywhere" scenario and are able to generate fully relocatable toolchains.
Our recipe specific sysroots, new in Yocto Project 2.3 (Pyro), ensure that the output of a recipe doesn't change depending on whether other recipes have already been built and in which order, even when the software tries to autodetect available features.
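The shared state idea described above (hash the inputs, reuse the output when the hash matches) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how BitBake actually computes task signatures: the names input_signature, OutputCache and fake_build are hypothetical, and the "inputs" are a plain dictionary rather than real recipe metadata.

```python
import hashlib
import json


def input_signature(inputs: dict) -> str:
    """Hash the inputs in a canonical form so that key order does not matter."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class OutputCache:
    """Hypothetical in-memory stand-in for a shared state directory."""

    def __init__(self):
        self._store = {}
        self.builds_run = 0

    def get_or_build(self, inputs: dict, build):
        sig = input_signature(inputs)
        if sig not in self._store:
            # Cache miss: the inputs changed, so the task has to run.
            self.builds_run += 1
            self._store[sig] = build(inputs)
        return self._store[sig]


def fake_build(inputs: dict) -> bytes:
    # Stand-in for a real task: the output depends only on the declared inputs.
    return ("built " + json.dumps(inputs, sort_keys=True)).encode("utf-8")


cache = OutputCache()
first = cache.get_or_build({"name": "zlib", "version": "1.2.11", "cflags": "-O2"}, fake_build)
again = cache.get_or_build({"cflags": "-O2", "version": "1.2.11", "name": "zlib"}, fake_build)
changed = cache.get_or_build({"name": "zlib", "version": "1.2.11", "cflags": "-O3"}, fake_build)

print("builds actually run:", cache.builds_run)
```

The second request reuses the first output because the inputs hash to the same value; only the request with different flags triggers another build. The scheme is only sound if the build really is a function of its declared inputs, which is why the determinism issues discussed here matter for sstate.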
At this point the project does not intend to target timestamp levels of reproducibility, so whilst the binary content should be the same, file timestamps may not be, and this means package manager tarballs would not be binary identical due to timestamp differences. There are likely some ways we can fix this too and we are working on it, but it's a lower priority than some other determinism issues.
Why do we want reproducible builds?
The many benefits of reproducible builds as listed by the reproducible-builds project, not least the ability to verify that a built output matches the source, are key motivations for our work on reproducible builds. Additional benefits specific to the Yocto Project include:
- ability to reduce the churn in package feeds
- ability to improve reliability of allarch recipes and enable wider use of them
Related Work
Some key bugs that are important in our reproducibility efforts are:
- #1560 Enable recipe specific sysroots (DONE for 2.3)
- #5866 Reproducible builds: identical binaries
- #10813 Replace build-compare with tools from the reproducible-builds project
- #11176 Add optional command to rootfs-postprocess to remove non-determinism from rootfs
- Integrate changes to improve determinism in OE-Core:
- #11177 Generate archives with deterministic metadata
- #11178 Make use of SOURCE_DATE_EPOCH, most likely with patches from Debian to ensure tools are using it.
- #11179 Deterministic timezone and locale settings
- Develop tests to catch issues with reproducibility
- Reproducibility analysis tool which will run diffoscope on all binary artefacts and produce a report that lists files with issues and points to the package and recipe the file came from.
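As an illustration of what "archives with deterministic metadata" and SOURCE_DATE_EPOCH mean in practice, here is a minimal sketch using Python's standard tarfile module. It is not the approach taken in OE-Core, only an example of the general technique: the function name deterministic_tar and the sample file names are made up, and falling back to 0 when SOURCE_DATE_EPOCH is unset is an assumption made for the example. Members are added in sorted order with a fixed timestamp, owner and mode, so the host filesystem's metadata never leaks into the archive.

```python
import io
import os
import tarfile


def deterministic_tar(files: dict, epoch: int) -> bytes:
    """Build an uncompressed tar whose bytes depend only on names, contents and epoch."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = epoch  # fixed timestamp instead of the file's mtime
            info.mode = 0o644  # fixed permissions
            info.uid = info.gid = 0  # fixed numeric owner
            info.uname = info.gname = "root"  # fixed owner names
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# SOURCE_DATE_EPOCH is a count of seconds since the Unix epoch; 0 is an
# arbitrary fallback chosen for this example.
epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))

a = deterministic_tar({"b.txt": b"two\n", "a.txt": b"one\n"}, epoch)
b = deterministic_tar({"a.txt": b"one\n", "b.txt": b"two\n"}, epoch)

print("archives identical:", a == b)
```

Compressing such an archive afterwards brings back the concerns about compressor timestamps and parallelism, so the compression step needs the same care.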
Verification
To ensure that our builds are reproducible, we need to implement a verification infrastructure covering the build system's outputs: images, SDKs, binary packages, etc.
The idea is to run diffoscope over the target shared state (populate_sysroot), because it is the first output, from which the other artifacts (images, SDKs, binary packages) are generated. The build output will be generated by running bitbake world on two different host systems; this can be done using two virtual machines as Autobuilder workers, which then launch the comparison process (diffoscope) against the two build outputs.
A script exists to call diffoscope against shared states; the AB needs to use something similar to get the comparison results [1]. There is an example of the output here [2].
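A cheap first pass before running a detailed tool such as diffoscope is to hash every file in the two build outputs and list only the paths that differ. The following is a minimal sketch of that idea and is not the script referenced above: the function names and the sample tree contents are hypothetical, and the two temporary directories stand in for the outputs of the two build hosts.

```python
import hashlib
import os
import tempfile


def tree_digests(root: str) -> dict:
    """Map each file's path relative to root to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            with open(full, "rb") as f:
                digests[os.path.relpath(full, root)] = hashlib.sha256(f.read()).hexdigest()
    return digests


def compare_trees(root_a: str, root_b: str) -> dict:
    """Report files missing on either side and files whose contents differ."""
    a, b = tree_digests(root_a), tree_digests(root_b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "differing": sorted(p for p in set(a) & set(b) if a[p] != b[p]),
    }


def _write(root: str, rel: str, data: bytes) -> None:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


# Two made-up build outputs: one file is identical, one embeds the build path.
with tempfile.TemporaryDirectory() as out_a, tempfile.TemporaryDirectory() as out_b:
    _write(out_a, os.path.join("usr", "lib", "libfoo.so"), b"same bytes")
    _write(out_b, os.path.join("usr", "lib", "libfoo.so"), b"same bytes")
    _write(out_a, os.path.join("usr", "bin", "foo"), b"built in /home/alice")
    _write(out_b, os.path.join("usr", "bin", "foo"), b"built in /home/bob")
    report = compare_trees(out_a, out_b)

print("files needing a closer look:", report["differing"])
```

Only the files listed under "differing" would then need to be handed to diffoscope, which keeps the expensive comparison focused on the artefacts that are actually not reproducible.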
There is also an Autobuilder development branch for reproducible builds [3].
To preserve the results, a web system needs to be implemented (the AB will push the results to it). This will also serve to publish something like www.yoctoproject.org/reproduciblebuilds, showing the current status and which recipes need work in order to be 100% reproducible. An initial implementation exists with database models only [4].