Entropy on Autobuilders

From Yocto Project
Revision as of 18:06, 10 November 2017 by Rpurdie (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

Entropy in our image testing on our autobuilders

When you run "bitbake <image> -c testimage", the system boots that image under qemu and runs tests against it. We've noticed occasional hangs of these images. A while ago we realised there was an entropy issue, particularly on qemuppc but also sometimes on qemux86 and others.

We therefore made sure that:

a) We enabled CONFIG_HW_RANDOM_VIRTIO in the kernels
b) We add "-device virtio-rng-pci" to the qemu commandline
c) We add rng-tools to the qemu images

This means that entropy from the host can pass through to the VMs. That entropy is used more heavily at machine startup, by the kernel itself and for the generation of ssh keys for the image amongst other things.

We were still seeing occasional hangs.

Useful notes:

a) Setting VERBOSE=very in /etc/default/rcS makes the boot logs more useful
b) The kernel has two rng init messages, "random: fast init done" happens first. "random: crng init done" happens later once the system has at least 128 bits of entropy.
c) You can check available entropy with 
  'cat /proc/sys/kernel/random/entropy_avail'.
d) You can check if the virtio rng is being used with 
 'cat /sys/class/misc/hw_random/rng_current'
e) You can eat all the entropy on the host with 
 "dd if=/dev/random of=/dev/null bs=1"

Using the dd command whilst booting an image in parallel I could make the images hang. They appear to hang not at host key generation but when launching rngd from rng-tools. This is because it opens /dev/hwrng and reads from it to perform some tests of randomness mandated by FIPS. The process will not deamonise until this has completed. Until that happens it blocks the boot, meaning we don't reach the login prompt which is shown later in the init process.

I also tweaked a kernel to add the entropy value to printk. To do this, I added:

int random_get_ent_count(void) { int ent_count = ENTROPY_BITS(&input_pool); return ent_count; }

to drivers/char/random.c and then tweaked the code in kernel/printk/printk.c to:

extern int random_get_ent_count(void); [...] if (!buf) return snprintf(NULL, 0, "[%5lu.000000] %d ", (unsigned long)ts, random_get_ent_count());

return sprintf(buf, "[%5lu.%06lu] %d ", (unsigned long)ts, rem_nsec / 1000, random_get_ent_count());

which is a touch evil but effective.

With this change I could clearly see the crng message isn't shown until the system entropy raises. I could also see that the rng-tools userspace isn't needed to pass the hwrng to the kernel as its clearly being effective much earlier in the boot process. This is done by the "hwrng" kernel thread which you can see in "ps".

I found that the kernel will block early in the boot process when entropy is restricted, e.g.:

[    2.137765] 47 00:06: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[    2.731526] 47 tsc: Refined TSC clocksource calibration: 2194.931 MHz
[    2.732341] 47 clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fa37e3a27f, max_idle_ns: 440795212633 ns
[   73.870555] 47 Linux agpgart interface v0.103
[   73.905820] 47 brd: module loaded
[   74.013546] 47 loop: module loaded
[   74.017805] 47 Uniform Multi-Platform E-IDE driver
[...]
[   76.464840] 47 Freeing unused kernel memory: 1496K
[   76.588870] 47 random: fast init done

so early boot hangs can be an entropy restriction problem until the first "fast init done" message appears.

The "crng init done" message controls when /dev/random becomes accessible. That is used by connman/wpa_supplicant in our images.

The conclusion I've reached from all of this is that the hosts we run the builds on are actually where we're running out of entropy and we need run haveged or rng-tools there.

With a larger pool of entropy on the hosts, we should be better placed for more predictable builds. I also believe we could drop rng-tools from our images.

The delay at clocksource when entropy is low is caused by the rng_virtio driver:

[    3.844026] 3 initcall mod_init+0x0/0x188 returned -19 after 1 usecs
[    3.844580] 3 calling  mod_init+0x0/0x49 @ 1
[    3.844993] 3 initcall mod_init+0x0/0x49 returned -19 after 0 usecs
[    3.845543] 3 calling  virtio_rng_driver_init+0x0/0x12 @ 1
[   23.211133] 3 initcall virtio_rng_driver_init+0x0/0x12 returned 0 after 18911181 usecs
[   23.212170] 3 calling  agp_init+0x0/0x27 @ 1
[   23.212609] 3 Linux agpgart interface v0.103
[   23.213080] 3 initcall agp_init+0x0/0x27 returned 0 after 459 usecs
[   23.213659] 3 calling  agp_intel_init+0x0/0x2a @ 1 


(setting initcall_debug and looking at KERN_DEBUG messages as per init/main.c in the kernel)