Reproducible Builds: Difference between revisions
Revision as of 22:44, 23 August 2017
What are Reproducible Builds?
The first tricky problem in this area is that the term "reproducible builds" means different things to different people. The simplest scenario is that you need to be able to re-run a build at some point in the future and have it succeed. The project already has mechanisms to provide the basics for this, such as source mirroring support, so that if upstream sources disappear the build can still run. We also have technology such as sstate, which means that when you rebuild, prebuilt artefacts can be used where possible instead of rebuilding from scratch.
For others, reproducibility is also about the level of determinism in the builds. Releases prior to pyro didn't have "recipe specific sysroots", which meant that builds could potentially be contaminated depending on build ordering. From pyro onwards the sysroots are isolated per recipe to avoid this issue. Perfect determinism in this area would mean task specific sysroots, since multiple tasks from the same recipe can currently run in parallel; however, that is not something we've felt the project needs right now.
There are other factors which affect determinism and reproducibility, for example:
- compressing files with different levels of parallelism could result in different output
- dates, times and build paths can be embedded in the built objects
- dates and times of build objects change depending on when the build was run and which tasks happened in parallel
Our current intent is to model the best practises available and have tools and techniques to fix deltas where we can.
It is worth noting that there are things it would be sensible to do outside the build system to help with reproducibility over the long term too; it's not a problem the build system can solve alone. Depending on your circumstances and requirements, it may help to:
- build on the same distro version with the same installed packages
- build in the same path
- use the same build hardware
The system is designed to factor out these deltas. However, the build system which works on Ubuntu X today, for example, is not likely to work on Ubuntu X+5 years, as the compilers will be totally different and probably unable to build today's codebase. If you need to run Ubuntu X in 5 years' time, you may need older hardware to do that. There are solutions with containers, and the build system can even self host inside its own image, but there are limits to every approach and the exact circumstances of each individual situation need to be considered.
Current Status and Planned Development
The Yocto Project aims to have builds which are entirely reproducible. That is, if you run a build today and then run that same configuration at some time X in the future, the binaries you get out of the build should be binary identical.
This implies that the host system you run the build on and the path you run the build in should not affect the target system output.
Our build system doesn't produce binary reproducible builds today, but we are actively working towards that goal and fixing issues as we identify them. We also plan to improve our testing to help find reproducibility issues.
The design of the system lends itself very well to producing reproducible builds, as we provide a reproducible build environment with minimal dependencies on the host/build OS.
We have several technologies in place which aid in reproducibility:
Our shared state (sstate) mechanism bases our builds on hashes of input metadata, reusing the outputs if the inputs are the same.
As our sstate files need to be reusable regardless of build path and can be target or native binaries, we have mechanisms for working around various issues such as hard-coded paths (though we'd prefer to remove the need for them entirely).
A problem related to sstate is that of knowing, when the input has changed, whether the output has changed too. This is useful in the context of package feeds, amongst other things, to know whether we should update them or not. We have binary build comparison tools at an early stage of development which allow us to reduce unnecessary churn in package feeds; however, further developments are planned in this area using the tools from the reproducible-builds project.
Our SDKs need to be relocatable and run anywhere they are installed to. We include our own C library to do this in a "run anywhere" scenario and are able to generate fully relocatable toolchains.
Recipe specific sysroots, new in Yocto Project 2.3 (Pyro), ensure that the output of a recipe doesn't change depending on whether other recipes have already been built and in which order, even when the software tries to autodetect available features.
At this point the project does not intend to target timestamp levels of reproducibility. Whilst the binary content should be the same, file timestamps may not be, which means package manager tarballs would not be binary identical due to timestamp differences. There are likely some ways we can fix this too and we are working on it, but it's a lower priority than some other determinism issues.
Why do we want reproducible builds?
The many benefits of reproducible builds as listed by the reproducible-builds project, not least the ability to verify that a built output matches the source, are key motivations for our work on reproducible builds. Additional benefits specific to the Yocto Project include:
- ability to reduce the churn in package feeds
- ability to improve reliability of allarch recipes and enable wider use of them
Related Work
Some key bugs that are important in our reproducibility efforts are:
- #1560 Enable recipe specific sysroots (DONE for 2.3)
- #5866 Reproducible builds: identical binaries
- #10813 Replace build-compare with tools from the reproducible-builds project
- #11176 Add optional command to rootfs-postprocess to remove non-determinism from rootfs
- Integrate changes to improve determinism in OE-Core:
- #11177 Generate archives with deterministic metadata
- #11178 Make use of SOURCE_DATE_EPOCH, most likely with patches from Debian to ensure tools are using it.
- #11179 Deterministic timezone and locale settings
- Develop tests to catch issues with reproducibility
- Reproducibility analysis tool which will run diffoscope on all binary artefacts and produce a report that lists files with issues and points to the package and recipe the file came from.
Verification
To ensure that our builds are reproducible, we need to implement a verification infrastructure covering the build system's outputs (images, SDKs, binary packages, etc.).
The idea is to run diffoscope over the target shared states (populate_sysroot), because these are the first outputs, from which the other artifacts (images, SDKs, binary packages) are generated. The build output will be generated by running bitbake world on two different host systems. This can be done using two virtual machines as Autobuilder workers and then launching the comparison process (diffoscope) against the two build outputs.
A script exists to call diffoscope against shared states; the AB needs to use something similar to get the comparison results [1]. There is an example of the output here [2].
An Autobuilder development branch for reproducible builds exists [3].
To preserve the results, a web system needs to be implemented (the AB will push the results to it). This will also serve to publish something like www.yoctoproject.org/reproduciblebuilds, showing the actual status and which recipes need work in order to be 100% reproducible. An initial implementation, containing only the database models, is available at https://github.com/alimon/reproducible-builds [4].
Current Development
Yocto builds are well suited to provide reproducible builds, as there is minimal reliance on host tools. Most tools, including various toolchains, are built from scratch. Hence builds on different machines, but using the same build recipes, will generally use the same toolchain for cross-compiling. Before we dive into pesky details, we need to define what we mean by "binary reproducible". Yocto's final binaries are placed in the folder "deploy", so the most natural definition is that two "deploy" folders should be binary identical. This is the actual goal we aim for. In the process, we need to deal with various well-known sources of binary differences:
1. Embedded paths
Usually a result of the C macro __FILE__, popular in asserts, error messages etc. These are actually very easy to deal with using the Yocto GCC compiler. The compiler has a special command line argument that remaps a hard-coded __FILE__ path to an arbitrary string. This feature allowed the whole Linux kernel and all kernel modules to build reproducibly without making any changes to the Linux source tree.
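The effect of such a remap can be illustrated with a small Python sketch. It does not invoke the compiler: the function name `remap_prefix` and the example paths are hypothetical, and the code only mimics how a build-directory prefix could be swapped for a fixed string so that two builds in different folders embed the same path.

```python
def remap_prefix(path, old_prefix, new_prefix):
    """Replace old_prefix at the start of path with new_prefix.

    Illustrative only: this mimics the idea of a compiler prefix map,
    it is not how the compiler implements it.
    """
    old = old_prefix.rstrip("/")
    # Only remap whole path components, so /build1 does not match /build10.
    if path == old or path.startswith(old + "/"):
        return new_prefix + path[len(old):]
    return path


# Two builds of the same source in different (hypothetical) folders.
first = remap_prefix("/home/alice/build1/work/foo/main.c",
                     "/home/alice/build1", "/usr/src/debug")
second = remap_prefix("/srv/bob/build2/work/foo/main.c",
                      "/srv/bob/build2", "/usr/src/debug")
print(first == second)
```

After remapping, both builds would embed the same string, which is why the build folder no longer shows up as a binary difference.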
2. Timestamps
Timestamps are dealt with in the usual way, via the environment variable SOURCE_DATE_EPOCH. Yocto takes advantage of multi-core systems by running builds in parallel, so we need to set a recipe specific SOURCE_DATE_EPOCH in each recipe environment for the various tasks. One way would be to modify all recipes one-by-one to specify SOURCE_DATE_EPOCH explicitly, but that is not realistic as there are hundreds (probably thousands) of recipes in various meta-layers. So this is done automatically instead, in the reproducible_binaries.bbclass. After sources are unpacked but before they are patched, we try to determine the value for SOURCE_DATE_EPOCH.
There are 4 ways to determine SOURCE_DATE_EPOCH:
1. Use the value from the __source-date-epoch.txt file if this file exists. This file was most likely created in a previous build by one of the methods 2, 3 or 4 below. (In principle, it could also be provided by a recipe via SRC_URI.)
If the file does not exist:
2. Use the timestamp of the last commit in .git (git does not preserve file timestamps when checking out files)
3. Use "known" files such as NEWS, CHANGELOG, ...
4. Use the youngest file of the source tree.
Once the value of SOURCE_DATE_EPOCH is determined, it is stored in the recipe ${WORKDIR} in a text file "__source-date-epoch.txt". If this file is found by another recipe task, the value is placed in the SOURCE_DATE_EPOCH variable in the task environment. This is done in an anonymous python function, so SOURCE_DATE_EPOCH is guaranteed to exist for all tasks.
In some cases, the timestamp may be embedded in code that does not respect SOURCE_DATE_EPOCH. Those cases are dealt with on an individual basis.
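The fallback order described above can be sketched in plain Python. This is not the code from reproducible_binaries.bbclass: the function names, the exact list of "known" file names, the use of file mtimes for methods 3 and 4, and the final fallback to 0 are all assumptions made for illustration. Only the file name __source-date-epoch.txt and the order of the four methods come from the text.

```python
import os
import subprocess

EPOCH_FILE = "__source-date-epoch.txt"         # file name taken from the text
KNOWN_FILES = ("NEWS", "CHANGELOG", "ChangeLog")  # assumed candidate names


def epoch_from_file(workdir):
    """Method 1: reuse a previously stored value."""
    path = os.path.join(workdir, EPOCH_FILE)
    if os.path.isfile(path):
        with open(path) as f:
            return int(f.read().strip())
    return None


def epoch_from_git(srcdir):
    """Method 2: committer timestamp of the last git commit."""
    if not os.path.isdir(os.path.join(srcdir, ".git")):
        return None
    try:
        out = subprocess.run(["git", "log", "-1", "--pretty=%ct"], cwd=srcdir,
                             capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return int(out.strip()) if out.strip() else None


def epoch_from_known_files(srcdir):
    """Method 3: mtime of the first "known" file found (assumed behaviour)."""
    for name in KNOWN_FILES:
        path = os.path.join(srcdir, name)
        if os.path.isfile(path):
            return int(os.path.getmtime(path))
    return None


def epoch_from_youngest_file(srcdir):
    """Method 4: newest mtime of any file in the source tree."""
    newest = None
    for root, dirs, files in os.walk(srcdir):
        dirs[:] = [d for d in dirs if d != ".git"]  # skip git metadata
        for name in files:
            mtime = int(os.lstat(os.path.join(root, name)).st_mtime)
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def determine_source_date_epoch(workdir, srcdir):
    """Try the four methods in order and store the result in workdir."""
    epoch = epoch_from_file(workdir)
    if epoch is None:
        for method in (epoch_from_git, epoch_from_known_files,
                       epoch_from_youngest_file):
            epoch = method(srcdir)
            if epoch is not None:
                break
        if epoch is None:
            epoch = 0  # assumed fallback for an empty source tree
        with open(os.path.join(workdir, EPOCH_FILE), "w") as f:
            f.write(str(epoch))
    return epoch
```

A task wrapper could then export the result, for example with `os.environ["SOURCE_DATE_EPOCH"] = str(determine_source_date_epoch(workdir, srcdir))`, before running the tools that honour the variable.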
3. Compression
Various packing/compression utilities (gzip, tar, etc.) place timestamps or file mtimes in the archives. When we run into these issues, we modify the arguments to prevent this. There may be build hosts with older versions of tar and cpio that do not support these options. In this case we can provide a natively built replacement.
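The same idea can be shown with Python's standard library rather than the command-line tools. The sketch below is only an illustration of what "deterministic archive metadata" means: the function name `deterministic_targz`, the choice of root ownership and the default timestamp of 0 are assumptions, and this is not how the Yocto classes build their archives.

```python
import gzip
import io
import os
import tarfile


def deterministic_targz(srcdir, mtime=0):
    """Return the bytes of a .tar.gz of srcdir with normalized metadata."""

    def normalize(info):
        # Drop everything that depends on the build host or build time.
        info.mtime = mtime
        info.uid = info.gid = 0
        info.uname = info.gname = "root"
        return info

    # Collect entries first so they can be added in a stable, sorted order.
    entries = []
    for root, dirs, files in os.walk(srcdir):
        for name in dirs + files:
            entries.append(os.path.relpath(os.path.join(root, name), srcdir))

    buf = io.BytesIO()
    # filename="" and a fixed mtime keep the gzip header constant.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=mtime) as gz:
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
            for rel in sorted(entries):
                tar.add(os.path.join(srcdir, rel), arcname=rel,
                        recursive=False, filter=normalize)
    return buf.getvalue()
```

Two trees with identical content and permissions should then produce identical bytes regardless of when the files were written or in which order the file system lists them.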
4. Build host reference leaks
Sometimes packages may contain references to the host build system. This happens most frequently with various ptest and debug packages. This is very easy to detect with diffoscope. The remedy is to modify the corresponding recipes and filter out the leakage. Typical offenders are HOSTTOOLS_DIR, DEBUG_PREFIX_MAP, RECIPE_SYSROOT_NATIVE and STAGING_DIR_TARGET. This is mostly a straightforward fix.
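As a rough complement to diffoscope, a leak of this kind can also be spotted by searching the packaged files for the build directory itself. The following sketch is a hypothetical helper, not an existing Yocto tool: `find_host_path_leaks` and the idea of passing in the expanded values of the offending variables are assumptions for illustration.

```python
import os


def find_host_path_leaks(tree, build_paths):
    """Return sorted (relative file, leaked path) pairs found under tree.

    build_paths is a list of host-specific strings to search for, e.g. the
    expanded values of variables such as those named above.
    """
    needles = [p.encode() for p in build_paths]
    leaks = []
    for root, _dirs, files in os.walk(tree):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # only inspect regular file content
            with open(path, "rb") as f:
                data = f.read()
            for needle in needles:
                if needle in data:
                    leaks.append((os.path.relpath(path, tree), needle.decode()))
    return sorted(leaks)
```

Reading each file fully into memory keeps the sketch short; a real check on large images would probably want to read in chunks.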
5. Sorting
Parallel builds can build packages/images in a non-deterministic order. This may become an issue in some cases. For reproducibility, we need to assign the UID/GID values consistently. There is a proper way to do this in Yocto: http://www.yoctoproject.org/docs/latest/mega-manual/mega-manual.html#ref-classes-useradd
6. Rootfs
There are numerous other issues: file mtimes in rootfs, pre-link time... All of these will get a timestamp as specified by the variable REPRODUCIBLE_TIMESTAMP_ROOTFS. This is a catch-all timestamp, used in the final step of building an image.
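What such a catch-all timestamp does to file mtimes can be sketched as follows. This is only an illustration of the idea, not the image class code: the function name `stamp_rootfs` is hypothetical, symlinks are skipped for simplicity, and the value passed in stands for whatever REPRODUCIBLE_TIMESTAMP_ROOTFS is set to.

```python
import os


def stamp_rootfs(rootfs, timestamp):
    """Set atime and mtime of every file and directory under rootfs."""
    for root, dirs, files in os.walk(rootfs, topdown=False):
        for name in files + dirs:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # skipped here to stay portable across platforms
            os.utime(path, (timestamp, timestamp))
    os.utime(rootfs, (timestamp, timestamp))
```

After this step, two root file systems with the same content no longer differ merely because their files were written at different times.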
7. Package type
Yocto supports the following package types: RPM, Debian, IPK. At this point the focus is on Debian. Making RPM and IPK packages reproducible as well is on the back burner at the moment due to time constraints. We don't expect too much resistance, though. The issues are mostly file mtimes and dependencies on file system directory inode size.
8. Dates and times embedded in the built objects
This usually requires patching the offending source code. However, a patch may already exist; Debian does a good job here. If not, it may be necessary to create a new patch and to contact the maintainer.
This is still a work in progress, with patches being sent to the OpenEmbedded Core mailing list, the latest one being: https://patchwork.openembedded.org/series/6504/
Recent results (using the above patch set while building core-image-minimal, two clean builds, same machine/OS, same date, two different folders, at two different times):
Same:
core-image-minimal-initramfs-qemux86
bzImage-qemux86.bin
vmlinux.gz-qemux86.bin
Comparing Debian packages in tmp/deploy/deb:
Same: 4005
Different: 38
Total: 4043
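A tally like the one above could be produced by comparing the two deploy folders file by file. The sketch below is a simplified stand-in, not the script that generated these numbers: it compares SHA-256 digests instead of running diffoscope, the function names are hypothetical, and a file present in only one tree is counted as different.

```python
import hashlib
import os


def file_digests(tree):
    """Map each file's path relative to tree to its SHA-256 hex digest."""
    digests = {}
    for root, _dirs, files in os.walk(tree):
        for name in files:
            path = os.path.join(root, name)
            sha = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    sha.update(chunk)
            digests[os.path.relpath(path, tree)] = sha.hexdigest()
    return digests


def compare_trees(tree_a, tree_b):
    """Return sorted lists of identical and differing relative paths."""
    a, b = file_digests(tree_a), file_digests(tree_b)
    names = sorted(set(a) | set(b))
    same = [n for n in names if a.get(n) is not None and a.get(n) == b.get(n)]
    different = [n for n in names if n not in same]
    return {"same": same, "different": different, "total": len(names)}
```

The files listed under "different" would then be the natural candidates to hand to diffoscope for a detailed report.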