TipsAndTricks/DebuggingHardQemuFailures
Sometimes we hit hard-to-debug failures on the autobuilder infrastructure. This page details some of the thinking and tricks used when debugging such failures.
SSH access to the autobuilders often helps and is available to those needing it to debug failures. There are some simple rules/steps:
- Pause the autobuilder (from https://autobuilder.yocto.io/buildslaves/ , e.g. https://autobuilder.yocto.io/buildslaves/fedora26.yocto.io)
- Let RP/Joshua/Halstead/Ross know that debugging is taking place (so we don't accidentally reboot or re-enable it)
- "sudo -iu pokybuild" so there are no permissions issues
- cd to the directory in the failing build (shown at the top of the failing build log)
- make sure auto.conf matches the failing configuration (subsequent builds may have reset it)
- source oe-init-build-env as usual and then build/debug away
When the failure is intermittent this adds extra complexity. The first step is often to build an environment where you can reproduce the issue at will. This means reducing the time taken to trigger the issue and perhaps brute-forcing it by running many items in parallel.
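As a rough illustration of the brute-force approach, the sketch below runs the same command several times in parallel and prints how many of the runs failed. The function name run_parallel is hypothetical and not part of any autobuilder tooling; substitute whatever command triggers the issue.

```shell
# Hypothetical helper: run the given command N times in parallel and print
# the number of runs that exited with a non-zero status.
# Usage: run_parallel N command [args...]
run_parallel() {
    count=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Start each run in the background, discarding its output.
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    failures=0
    for pid in $pids; do
        # wait returns the exit status of the given background job.
        if ! wait "$pid"; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}
```

Note that this sketch throws the output of each run away; when chasing a real failure you would more likely redirect each run to its own log file so the failing ones can be inspected afterwards.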
Tracing kernel functions can be done with ftrace. This is handy for production kernels where you might not easily be able to rebuild the live kernel, or where you have a system which is in a "broken" state and you want to debug the problem. The magic incantations are something like:
$ trace-cmd record -b 20000 -T -e kmem
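which logs all the kernel memory allocation requests (the kmem events). The -b option sets the size of each per-CPU ring buffer in kilobytes, and the -T option records a stack trace for each event, which gives the callgraph information. The trace.dat file generated can be huge, e.g. 19.6GB for multiple image tests running in parallel. To find memory allocations that failed (return pointer was NULL):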
$ trace-cmd report | grep 'kmalloc.*ptr=[^0]'
Restricting the report to a single CPU makes it easier to find the backtrace that corresponds to an allocation failure:
$ trace-cmd report --cpu 27 | grep 'kmalloc.*ptr=[^0]'
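To decide which CPU to look at, one option is to count the matching lines per CPU first. The following is a minimal sketch, not a tested recipe: the function name failed_allocs_by_cpu is hypothetical, it reuses the grep pattern from above, and it assumes each event line in the report carries the CPU number in square brackets (e.g. [027]) before any other bracket on the line.

```shell
# Hypothetical helper: read "trace-cmd report" output on stdin and print,
# for each CPU, the number of kmalloc lines matching the failure pattern.
# Output is "<cpu> <count>", with the highest count first.
failed_allocs_by_cpu() {
    grep 'kmalloc.*ptr=[^0]' |
        # Assumption: the first [...] on the line holds the CPU number.
        sed -n 's/^[^[]*\[\([0-9][0-9]*\)].*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}
```

It would be used as "trace-cmd report | failed_allocs_by_cpu", after which the CPU with the most hits can be passed to --cpu as shown above.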
Other random tips:
- https://elixir.free-electrons.com/linux/latest/source provides a nice way to navigate kernel source code.