TipsAndTricks/DebuggingHardQemuFailures: Difference between revisions
Revision as of 15:08, 13 December 2017
Sometimes we hit hard-to-debug failures on the autobuilder infrastructure. This page details some of the thinking and tricks used when debugging such failures.
SSH access to the autobuilders often helps and is available to those needing it to debug failures. There are some simple rules/steps:
- Pause the autobuilder (from https://autobuilder.yocto.io/buildslaves/ , e.g. https://autobuilder.yocto.io/buildslaves/fedora26.yocto.io)
- Let RP/Joshua/Halstead/Ross know that debugging is taking place (so we don't accidentally reboot or re-enable it)
- "sudo -iu pokybuild" so there are no permissions issues
- cd to the directory in the failing build (shown at the top of the failing build log)
- make sure auto.conf matches the failing configuration (subsequent builds may have reset it)
- source oe-init-build-env as usual and then build/debug away
When the failure is intermittent this adds extra complexity. The first step is often to build an environment where you can reproduce the issue at will. This means reducing the time taken to trigger the issue and perhaps brute forcing it by running many items in parallel. RP developed a script, "runqemu-parallel", which boots qemu using runqemu, waits for it to reach a login prompt, then immediately starts a new qemu. Around 60 of these can be run at once until failures appear. The script was optimised so that runqemu didn't need to call into bitbake, which improved the speed at which it could cycle processes.
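The brute-force idea can be sketched in a few lines of shell. This is not RP's runqemu-parallel script, only a minimal illustration under stated assumptions: run_parallel is a hypothetical helper name, and the command it is given is assumed to exit non-zero when the failure you are hunting occurs.

```shell
#!/bin/sh
# run_parallel COUNT CMD [ARGS...]   (hypothetical helper, not runqemu-parallel)
# Start COUNT copies of CMD in the background, wait for all of them,
# and report how many exited non-zero.
run_parallel() {
    count=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &                      # launch one copy in the background
        pids="$pids $!"             # remember its PID so we can wait on it
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        # wait returns the exit status of that particular child
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed of $count runs failed"
    [ "$failed" -eq 0 ]             # overall status: success only if none failed
}
```

Something like "run_parallel 60 ./boot-once.sh" would then start 60 copies of a (hypothetical) wrapper that boots one qemu and checks for the login prompt. Wrapping the call in a loop keeps cycling until a failure shows up.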
In one case RP ended up building a similar kernel version locally and then copying in the modules and kernel from the broken autobuilder. For Ubuntu this is something like:
sudo apt install git build-essential kernel-package fakeroot libncurses5-dev libssl-dev ccache
git clone -b linux-4.11.y git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
cp /boot/config-`uname -r` .config
make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-custom
and then copying in /boot/XXX and /lib/modules/XXX to the system and running update-grub. The kernel command line can be tweaked in /etc/default/grub.
Sometimes it's memory fragmentation that causes the issue, particularly if the issue occurs after sustained uptime and builds. Running "echo 1 > /proc/sys/vm/drop_caches" will often make this kind of issue disappear again. If that is the case, http://rpsys.net/fragment.tgz is a version of https://oss.oracle.com/projects/codefragments/src/trunk/fragment-slab/README which works on 4.13-4.15 kernels. Loading it (insmod fragment.ko) and then running "echo 900000 > /proc/temp" will fragment the system memory enough to reproduce fragmentation issues. You'll need to experiment to find suitable numbers for your memory size. /proc/pagetypeinfo and /proc/slabinfo contain useful information about how much memory is available at each allocation size order.
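To get a quick view of how fragmented memory is, the per-order free counts in /proc/pagetypeinfo can be summed up. The following is a rough sketch, assuming the usual layout where the "Free pages count per migrate type" rows look like "Node 0, zone Normal, type Movable" followed by one count per order. free_blocks_by_order is a hypothetical helper name.

```shell
#!/bin/sh
# free_blocks_by_order   (hypothetical helper)
# Read /proc/pagetypeinfo-style text on stdin and print the number of free
# blocks at each order, summed over all nodes, zones and migrate types.
free_blocks_by_order() {
    awk '
        # Only the per-migrate-type rows have "type" as the fifth field;
        # the "Number of blocks type" rows further down are skipped.
        $1 == "Node" && $5 == "type" {
            seen = 1
            for (i = 7; i <= NF; i++) total[i - 7] += $i
            if (NF - 7 > max) max = NF - 7
        }
        END {
            if (seen)
                for (o = 0; o <= max; o++) printf "order %d: %d\n", o, total[o]
        }
    '
}
```

Usage would be something like "cat /proc/pagetypeinfo | free_blocks_by_order" (reading the file may need root). Plenty of free blocks at low orders with next to nothing at the higher orders suggests fragmentation. Comparing the output before and after drop_caches shows whether that helped.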
Tracing kernel functions can be done with ftrace. This is handy for production kernels where you might not easily be able to rebuild the live kernel, or where you have a system which is in a "broken" state and you want to debug the problem. The magic incantations are something like:
$ trace-cmd record -b 20000 -T -e kmem
which logs all the kernel memory allocation requests. The -T option includes callgraph information. The trace.dat file generated can be huge, e.g. 19.6GB for multiple image tests running in parallel. To find memory allocations that failed (return pointer was NULL):
$ trace-cmd report | grep 'kmalloc.*ptr=[^0]'
You could analyse by CPU to easily find the backtrace that corresponds to the allocation failure:
$ trace-cmd report --cpu 27 | grep 'kmalloc.*ptr=[^0]'
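To decide which CPU to look at, it can help to count the failed allocations per CPU first. This is a small sketch using the same pattern as the grep above. It assumes each report line carries the CPU number as a bracketed field such as [027]; failures_per_cpu is a hypothetical helper name.

```shell
#!/bin/sh
# failures_per_cpu   (hypothetical helper)
# Read "trace-cmd report" text on stdin and print, for each CPU, how many
# kmalloc events match the failed-allocation pattern used above.
failures_per_cpu() {
    awk '
        /kmalloc.*ptr=[^0]/ {
            # The CPU is assumed to be the first field of the form [NNN].
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[0-9]+\]$/) { count[$i]++; break }
        }
        END { for (c in count) print c, count[c] }
    ' | sort
}
```

Piping the report through it ("trace-cmd report | failures_per_cpu") prints one line per CPU that saw a failure. The busiest CPU can then be passed to --cpu as shown above.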
Other random tips:
- https://elixir.free-electrons.com/linux/latest/source provides a nice way to navigate kernel source code.