Reproducible Builds: Difference between revisions

From Yocto Project
Latest revision as of 18:17, 9 February 2021

Reproducible Build Test Results

The Yocto Project publishes a summary of the latest reproducible build results at https://www.yoctoproject.org/reproducible-build-results/.

What are Reproducible Builds?

The first tricky problem in this area is that the term "reproducible builds" means different things to different people. The simplest scenario is that you need to be able to re-run a build at some point in the future and have it succeed. The project already has mechanisms that provide the basics for this, such as source mirroring support, so that if upstream sources disappear, the build can still run. We also have technology such as sstate, which means that when you rebuild, prebuilt artefacts can be used where possible instead of rebuilding from scratch.

For others, reproducibility is also about the level of determinism in the builds. Releases prior to pyro didn't have "recipe specific sysroots", which meant that builds could potentially be contaminated depending on build ordering. From pyro onwards the sysroots are isolated per recipe to avoid this issue. Perfect determinism in this area would mean task specific sysroots, since multiple tasks from the same recipe can currently run in parallel; however, that is not something we've felt the project needs right now.

There are other factors which affect determinism and reproducibility, for example:

  • compressing files with different levels of parallelism could result in different output
  • dates, times and build paths can be embedded in the built objects
  • dates and times of build objects change depending on when the build was run and which tasks happened in parallel

Our current intent is to model the best practices available and have tools and techniques to fix deltas where we can.

It is worth noting that there are sensible things to do outside the build system to help with reproducibility over the long term too; it's not a problem the build system can solve alone. Depending on your circumstances and requirements, it may help to:

  • build on the same distro version with the same installed packages
  • build in the same path
  • use the same build hardware

The system is designed to factor out these deltas. However, a build system which works today on Ubuntu X is not likely to work on Ubuntu X+5 years, as the compilers will be totally different and probably unable to build today's codebase. If you need to run Ubuntu X in 5 years' time, you may need older hardware to do that. There are solutions with containers, and the build system can even self-host inside its own image, but there are limits to every approach and the exact circumstances of each individual situation need to be considered.

Current Status and Planned Development

The Yocto Project aims to have builds which are entirely reproducible. That is, if you run a build today and then run that same configuration at some point in the future, the binaries you get out of the two builds should be identical.

This implies that the host system you run the build on and the path you run the build in should not affect the target system output.

Our build system doesn't produce binary reproducible builds today, but we are actively working towards that goal and fixing issues as we identify them. We also plan to improve our testing to help find reproducibility issues.

The design of the system lends itself very well to producing reproducible builds, as we provide a reproducible build environment with minimal dependencies on the host/build OS.

We have several technologies in place which aid reproducibility:

Our shared state (sstate) mechanism bases our builds on hashes of input metadata, reusing the outputs if the inputs are the same.

As our sstate files need to be reusable regardless of build path and can be target or native binaries, we have mechanisms for working around various issues such as hard-coded paths (though we'd prefer to remove the need for them entirely).

A problem related to sstate is that of knowing, when the input has changed, whether the output has changed too. This is useful in the context of package feeds, amongst other things, to know whether we should update them or not. We have binary build comparison tools at an early stage of development that allow us to reduce unnecessary churn in package feeds; however, further developments are planned in this area using the tools from the reproducible-builds project.

Our SDKs need to be relocatable and run anywhere they are installed to. We include our own C library to do this in a "run anywhere" scenario and are able to generate fully relocatable toolchains.

Recipe specific sysroots, new in Yocto Project 2.3 (Pyro), ensure that the output of a recipe doesn't change depending on whether other recipes have already been built and in which order, even when the software tries to autodetect available features.

At this point the project does not intend to target timestamp levels of reproducibility. Whilst the binary content should be the same, file timestamps may not be, and this means package manager tarballs would not be binary identical due to timestamp differences. There are likely some ways we can fix this too and we are working on it, but it's a lower priority than some other determinism issues.

Why do we want reproducible builds?

The many benefits of reproducible builds listed by the reproducible-builds project, not least the ability to verify that a built output matches the source, are key motivations for our work on reproducible builds. Additional benefits specific to the Yocto Project include:

  • ability to reduce the churn in package feeds
  • ability to improve reliability of allarch recipes and enable wider use of them

Related Work

Some key bugs that are important in our reproducibility efforts are:

  • #1560 Enable recipe specific sysroots — DONE for 2.3
  • #5866 Reproducible builds: identical binaries
  • #10813 Replace build-compare with tools from the reproducible-builds project
  • #11176 Add optional command to rootfs-postprocess to remove non-determinism from rootfs
  • Integrate changes to improve determinism in OE-Core:
    • #11177 Generate archives with deterministic metadata
    • #11178 Make use of SOURCE_DATE_EPOCH, most likely with patches from Debian to ensure tools are using it.
    • #11179 Deterministic timezone and locale settings
  • Develop tests to catch issues with reproducibility
  • Reproducibility analysis tool which will run diffoscope on all binary artifacts and produce a report that lists files with issues and points to the package and recipe the file came from.

As the goal is to achieve 100% reproducibility in building packages and the number of packages built that are binary different is relatively small, we started filing individual bug entries for each of them.

Verification

To ensure that our builds are reproducible, we need to implement an infrastructure for verifying the build system's outputs: images, SDKs, binary packages, etc.

The idea is to run diffoscope over the target shared state (populate_sysroot), because it is the first output from which the other artifacts (images, SDKs, binary packages) are generated. The build output will be generated by running bitbake world on two different host systems. This can be done using two virtual machines as Autobuilder workers, and then launching the comparison process (diffoscope) against the two build outputs.

A script exists to call diffoscope against shared states; the AB needs to use something similar to get the comparison results [1]. There is an example of the output here [2].

An Autobuilder development branch for reproducible builds exists [3].

To preserve the results, a web system needs to be implemented (the AB will push the results). This will also serve to publish something like www.yoctoproject.org/reproduciblebuilds, showing the actual status and which recipes need work in order to be 100% reproducible. An initial implementation exists with database models only [4].

Current Development

Yocto builds are well suited to provide reproducible builds, as there is minimal reliance on host tools. Most tools, including various toolchains, are built from scratch. Hence builds on different machines, but using the same build recipes, will generally use the same toolchain for cross-compiling. Before we dive into pesky details, we need to define what we mean by "binary reproducible". Yocto final binaries are placed in the folder "deploy", so the most natural definition is that two "deploy" folders should be binary identical. This is the actual goal we aim for. In the process, we need to deal with various well-known sources of binary differences:

1. Embedded paths

Usually a result of the C macro __FILE__, popular in asserts, error messages etc. These are actually very easy to deal with using the Yocto GCC compiler. The compiler has a special command line argument that remaps a hard-coded __FILE__ path to an arbitrary string. This feature allowed the whole Linux kernel and all kernel modules to build reproducibly without making any changes to the Linux source tree.
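The text above does not name the compiler argument. GCC provides -fmacro-prefix-map=OLD=NEW (GCC 8 and later) for __FILE__ expansions and -fdebug-prefix-map=OLD=NEW for paths in debug information, and we assume here that options of this kind are what is meant. The sketch below only assembles such flags; the function name and the replacement path are hypothetical.

```python
def prefix_map_flags(build_dir, replacement="/usr/src/debug"):
    """Return GCC options that rewrite build_dir wherever the compiler embeds it.

    `replacement` is an arbitrary, illustrative string; any value works as
    long as it is identical for every build.
    """
    mapping = f"{build_dir}={replacement}"
    return [
        f"-fmacro-prefix-map={mapping}",  # affects __FILE__ expansions
        f"-fdebug-prefix-map={mapping}",  # affects paths stored in debug info
    ]


if __name__ == "__main__":
    print(" ".join(prefix_map_flags("/home/builder/poky/build/tmp/work")))
```

Two builds in different directories then pass different OLD values but the same NEW value, so the embedded strings end up identical.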

2. Timestamps

Timestamps are dealt with the usual way via the environment variable SOURCE_DATE_EPOCH. The simplest way is to set

export SOURCE_DATE_EPOCH="some value"

The actual timestamp value is not really important. However, recipe-specific values for SOURCE_DATE_EPOCH may be more desirable. Yocto takes advantage of multi-core systems by running builds in parallel, so we need to set a recipe-specific SOURCE_DATE_EPOCH in each recipe environment for the various tasks. One way would be to modify all recipes one by one to specify SOURCE_DATE_EPOCH explicitly, but that is not realistic, as there are hundreds (probably thousands) of recipes in various meta-layers. So this is done automatically instead in the reproducible_binaries.bbclass. After sources are unpacked but before they are patched, we try to determine the value for SOURCE_DATE_EPOCH.

There are 4 ways to determine SOURCE_DATE_EPOCH:

1. Use the value from the __source-date-epoch.txt file if this file exists. This file was most likely created in a previous build by one of the following methods 2, 3 or 4. (But, in principle, it could actually be provided by a recipe via SRC_URI.)

If the file does not exist:

2. Use .git last commit date timestamp (git does not allow checking out files and preserving their timestamps)

3. Use "known" files such as NEWS, CHANGELOG, ...

4. Use the youngest file of the source tree.

Once the value of SOURCE_DATE_EPOCH is determined, it is stored in the recipe ${WORKDIR} in a text file "__source-date-epoch.txt". If this file is found by another recipe task, the value is placed in the SOURCE_DATE_EPOCH variable in the task environment. This is done in an anonymous Python function, so SOURCE_DATE_EPOCH is guaranteed to exist for all tasks.

In some cases, the timestamp may be embedded by code that does not respect SOURCE_DATE_EPOCH. Those cases are dealt with on an individual basis.
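The four methods and the caching step can be sketched in plain Python as follows. This is an illustration of the described order only, not the code of reproducible_binaries.bbclass: the function names, the KNOWN_FILES list and the fallback value 0 are assumptions.

```python
import os
import subprocess

EPOCH_FILE = "__source-date-epoch.txt"
KNOWN_FILES = ("NEWS", "ChangeLog", "Changelog", "CHANGES")  # illustrative list


def from_git(srcdir):
    """Method 2: commit time of the last git commit, or None."""
    if not os.path.isdir(os.path.join(srcdir, ".git")):
        return None
    try:
        out = subprocess.run(
            ["git", "log", "-1", "--pretty=%ct"],
            cwd=srcdir, capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None
    return int(out) if out else None


def from_known_files(srcdir):
    """Method 3: newest mtime among well-known top-level files, or None."""
    times = [
        int(os.path.getmtime(os.path.join(srcdir, name)))
        for name in KNOWN_FILES
        if os.path.isfile(os.path.join(srcdir, name))
    ]
    return max(times) if times else None


def from_youngest_file(srcdir):
    """Method 4: newest mtime of any file in the tree, or None if empty."""
    newest = None
    for root, dirs, files in os.walk(srcdir):
        dirs[:] = [d for d in dirs if d != ".git"]  # ignore git metadata
        for name in files:
            mtime = int(os.lstat(os.path.join(root, name)).st_mtime)
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def get_source_date_epoch(srcdir, workdir):
    """Apply methods 1 to 4 in order and cache the result in workdir."""
    cache = os.path.join(workdir, EPOCH_FILE)
    if os.path.isfile(cache):                      # method 1: cached value
        with open(cache) as f:
            return int(f.read().strip())
    epoch = (from_git(srcdir) or from_known_files(srcdir)
             or from_youngest_file(srcdir) or 0)   # 0 is an assumed fallback
    with open(cache, "w") as f:
        f.write(str(epoch))
    return epoch
```

Because the result is written to the cache file on the first call, later tasks of the same recipe see the same value even if files in the source tree are touched in between.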

3. Compression

Various packing/compression utilities (gzip, tar, etc.) place timestamps or file mtimes in the archives. When we run into these issues, we modify the arguments to prevent this. There may be build hosts with older versions of tar and cpio that do not support these options. In this case we can provide a natively built replacement.

4. Build host reference leaks

Sometimes packages may contain references to the host build system. This happens most frequently with various ptest and debug packages. This is very easy to detect with diffoscope. The remedy is to modify the corresponding recipes and filter out the leakage. Typical offenders are HOSTTOOLS_DIR, DEBUG_PREFIX_MAP, RECIPE_SYSROOT_NATIVE, STAGING_DIR_TARGET. This is mostly a straightforward fix.
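On the command line this typically means options such as gzip -n, or GNU tar's --mtime, --sort=name, --owner and --group. The same idea can be illustrated with Python's standard tarfile and gzip modules. The sketch below is not part of the Yocto tooling, and the function name is hypothetical: it fixes the member order, the member metadata and the gzip header timestamp so that the output depends only on the file contents.

```python
import gzip
import io
import tarfile


def deterministic_targz(files, mtime=0):
    """Pack {name: bytes} into a .tar.gz whose bytes depend only on the inputs."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in sorted(files):           # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime               # fixed timestamp, not the file mtime
            info.mode = 0o644
            info.uid = info.gid = 0          # no build-user ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    out = io.BytesIO()
    # Passing mtime keeps the current time out of the gzip header.
    with gzip.GzipFile(fileobj=out, mode="wb", mtime=mtime) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()
```

In a real build the mtime argument would normally be SOURCE_DATE_EPOCH rather than 0.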
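Besides diffoscope, a very simple scan for known host paths can flag many such leaks before two builds are even compared. The following is a minimal sketch under that assumption; the function name and the example path /home/builder are hypothetical, and the needles would normally be the expanded values of the variables listed above.

```python
import os


def find_path_leaks(tree, needles):
    """Return sorted (relative_path, needle) pairs for files containing a needle."""
    encoded = [(n, n.encode()) for n in needles]
    leaks = []
    for root, _dirs, files in os.walk(tree):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue                      # only inspect regular file contents
            with open(path, "rb") as f:
                data = f.read()               # binary read: works for ELF and text alike
            for needle, raw in encoded:
                if raw in data:
                    leaks.append((os.path.relpath(path, tree), needle))
    return sorted(leaks)


if __name__ == "__main__":
    for rel, needle in find_path_leaks(".", ["/home/builder"]):
        print(f"{rel}: contains {needle}")
```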

5. Sorting

Parallel builds can build packages/images in a non-deterministic order. This may become an issue in some cases. For reproducibility, we need to assign the UID/GID values consistently. There is a proper way to do this in Yocto: http://www.yoctoproject.org/docs/latest/mega-manual/mega-manual.html#ref-classes-useradd

6. Rootfs

There are numerous other issues: file mtimes in rootfs, pre-link time... All of these will get a timestamp as specified by the variable REPRODUCIBLE_TIMESTAMP_ROOTFS. This is a catch-all timestamp, used in the final step of building an image.
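The effect of a catch-all timestamp on file mtimes can be pictured with the following sketch. It is an assumption about the general technique, not the actual rootfs post-processing code, and the function name is hypothetical: every directory and regular file under the rootfs gets the same modification time.

```python
import os


def stamp_tree(rootfs, timestamp):
    """Set atime and mtime of every directory and file under rootfs.

    Returns the number of entries touched, including rootfs itself.
    """
    count = 0
    for root, dirs, files in os.walk(rootfs):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue                      # symlinks are left alone in this sketch
            os.utime(path, (timestamp, timestamp))
            count += 1
    os.utime(rootfs, (timestamp, timestamp))
    return count + 1
```

For example, stamp_tree("/path/to/rootfs", 1520598896) would apply the timestamp used later in this page to a hypothetical rootfs directory.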

7. Package type

Yocto supports the following package types: RPM, Debian, IPK. Until recently, only Debian packages supported reproducible builds. However, due to very recent development, we can now build RPM and IPK packages reproducibly as well.

8. Dates and times embedded in the built objects

This usually requires patching the offending source code. However, a patch may already exist; Debian does a good job here. If not, creating a new patch may be necessary, while also contacting the maintainer.


Hands On: Building Reproducible Binaries

While this is still a work in progress, with patches being sent to the OpenEmbedded Core mailing list, we can already build binary identical images. We need to introduce two new variables:

BUILD_REPRODUCIBLE_BINARIES

Setting BUILD_REPRODUCIBLE_BINARIES="1" will remove/disable certain features that were intentionally performed by default. Sometimes choosing reproducibility can override certain features originally meant to increase security, so this is the way to specify you prefer reproducibility. For example, prelink will not use random addresses for libraries if BUILD_REPRODUCIBLE_BINARIES="1".

REPRODUCIBLE_TIMESTAMP_ROOTFS

When building packages, various timestamps can be controlled by SOURCE_DATE_EPOCH. This, however, does not work for building images. Images contain various scattered timestamps, and as a consequence, two builds of otherwise identical images will differ. The purpose of this variable is to use the value as a "catch-all" rootfs image timestamp and build all images with identical timestamps. The value for this variable is always up to the developers/image builders.

In addition, four environment variables need to be exported with consistent values. For example, placing the following in your local.conf will allow you to get started:

BUILD_REPRODUCIBLE_BINARIES = "1"
export PYTHONHASHSEED = "0"
export PERL_HASH_SEED = "0"
export TZ = 'UTC'
export SOURCE_DATE_EPOCH ??= "1520598896"
REPRODUCIBLE_TIMESTAMP_ROOTFS ??= "1520598896"

Note that the values for REPRODUCIBLE_TIMESTAMP_ROOTFS and SOURCE_DATE_EPOCH are rather arbitrary and not really very important. Feel free to use your own favorite dates; a converter such as https://www.epochconverter.com/ helps with choosing one.
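Both values are plain Unix timestamps (seconds since 1970-01-01 00:00:00 UTC), so they can also be converted locally. The helper names below are hypothetical; the sketch only uses Python's standard datetime module.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch):
    """Render a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc).isoformat()


def utc_to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return the Unix timestamp of a UTC date and time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


if __name__ == "__main__":
    print(epoch_to_utc("1520598896"))            # 2018-03-09T12:34:56+00:00
    print(utc_to_epoch(2018, 3, 9, 12, 34, 56))  # 1520598896
```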

Comparing two clean builds (same host, same date, different build folder, different time) of core-image-minimal will result in something similar to:


            DEB    RPM    IPK
-----------------------------
Same:      3949   3949   3949
Different:    2      2      2
Total:     3951   3951   3951

More complex images, for example core-image-sato-sdk-ptest-multilib-corei7, will produce results similar to:

            IPK    RPM    DEB
-----------------------------
Same:      8771   8772   8772
Different:   76     75     79
Total:     8847   8847   8851

The numbers above will vary, as they reflect the current master, which is a moving target. They do not include a few packages that differ due to being built on two different days. The important point here is that there is only a finite number of remaining packages that are binary different, and the goal of reducing them to zero is quite realistic. As always, you can scrutinize the differences using diffoscope.
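How the tables above were produced is not stated here. One simple way to obtain "same" and "different" counts for two deploy folders is to compare file hashes, as in the following sketch. It is an assumed approach with hypothetical function names, and it only says whether files differ; diffoscope is still needed to see why.

```python
import hashlib
import os


def hash_tree(top):
    """Map each file path (relative to top) to the SHA-256 of its contents."""
    result = {}
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            result[os.path.relpath(path, top)] = digest
    return result


def compare_trees(first, second):
    """Summarise which files are identical, different, or present in one tree only."""
    ha, hb = hash_tree(first), hash_tree(second)
    common = sorted(set(ha) & set(hb))
    different = [p for p in common if ha[p] != hb[p]]
    return {
        "same": len(common) - len(different),
        "different": different,
        "only_in_one": sorted(set(ha) ^ set(hb)),
    }
```

Running compare_trees on the deploy/ipk, deploy/rpm and deploy/deb directories of two builds would give one column of such a table each.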