Yocto Build Failure Swat Team
Overview
The Yocto Project SWAT team exists to tackle, in a timely manner, urgent technical problems that break builds on the master branch or major release branches, thereby maintaining the stability of master and the release branches. The SWAT team includes volunteers and appointed members of the Yocto Project team; community members can also volunteer to join.
Scope of Responsibility
Whenever a build (nightly build, weekly build, release build) fails, the SWAT team is responsible for ensuring the necessary debugging occurs and organizing resources to solve the issue and ensure successful builds. If resolving the issues requires schedule or resource adjustment, the SWAT team should work with program and development management to accommodate the change in the overall planning. If resolving the issues requires access to the autobuilder, please contact either Beth Flanagan or Michael Halstead for access rights.
Build failures are reported on the yocto-build mailing list.
Please review the Media:Swat.odp presentation.
Members
- Elizabeth Flanagan (US) (Autobuilder Maintainer)
- Paul Eggleton (UK)
- Saul Wold (US) (Autobuilder Administrator)
- Ross Burton (UK)
- Cristian Iorga (RO)
- Randy Witt (US)
Chair
The chairperson role rotates among team members every other week. The Chairperson should monitor build status for the entire two weeks. Whenever a build breaks, the Chairperson should do the necessary debugging and organize resources to solve the problem in time to meet the overall project and release schedule. The Chairperson also serves as the SWAT team's focal point for external people such as program managers or development managers.
Rotation Process
The Chairperson rotation takes place during the bi-weekly technical project meeting (every other Tuesday at 8:00 AM PT). Usually this follows a simple round-robin order. If the next person cannot take the role due to a tight schedule, vacation, or some other reason, the role passes to the person after them.
Debugging BKMs
When looking at a failure, the first question is what the baseline was and what changed. If there was a recent known-good build, it helps narrow down the set of changes that could have caused the failure. It's also useful to note whether the build was from scratch or reused existing sstate files; you can tell by checking which "setscene" tasks ran in the log.
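As a quick check, you can count the setscene tasks in a saved build log; a nonzero count means the build reused sstate rather than starting from scratch. This is a minimal sketch (the log excerpt and file name are hypothetical; real bitbake logs name sstate-reuse tasks with a "_setscene" suffix):

```shell
# Hypothetical log excerpt standing in for a real autobuilder log.
printf '%s\n' \
  'NOTE: Running setscene task 12 of 340 (gcc:do_populate_sysroot_setscene)' \
  'NOTE: Running task 1 of 500 (glibc:do_fetch)' > build.log

# A nonzero count means sstate was reused for at least part of the build.
grep -c '_setscene' build.log
```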
The primary responsibility is to ensure that any failures are categorized correctly and that the right people get to know about them.
It's important that *someone* is then tasked with fixing it. Image failures are particularly tricky since it's likely some component of the image failed, and the question then becomes whether that component changed recently, whether some core functionality is at fault, and so on.
Ideally we want the failure reported to the person who knows something about the area and can come up with a fix without it distracting them too much. As a secondary responsibility, it's often helpful to triage the failure. This might mean documenting a way to reproduce the failure outside a full build and/or documenting how the failure happens, and maybe even proposing a fix.
To fulfill the primary responsibility, it's suggested that bugs are opened on the bugzilla for each type of failure. This way, appropriate people can be brought into the discussion and a specific owner of the failure can be assigned. Replying to the build failure with the bug ID and also bringing the bug to the attention of anyone you suspect was responsible for the problem are also good practices.
When filing the bug, please cut and paste the relevant error into the bug comment, and include the log file as an attachment. This ensures the assignee and triage team can quickly assess the issue.
Every build failure should be responded to. If it is a known issue, a response with a single line containing "Known Issue" is sufficient. This assures others that the failure has been looked at and is being worked on.
Autobuilder BKMs
Sometimes failures are difficult to understand and require direct ssh access to the autobuilder so the issue can be debugged passively on the system, for example by examining the contents of files. If doing this, ensure you don't change the file system in any way, for example by adding files that couldn't then be deleted by the autobuilder when it rebuilds.
Rarely, "live" debugging might be needed, where you su to the pokybuild user and run a build manually to see the failure in real time. If doing this, ensure you only create files as the pokybuild user and that you are careful not to generate sstate packages which shouldn't be present, or any other bad state that might get reused. This can be escalated to RP/Saul/Beth if needed.
Live debugging is generally something we try to avoid doing. It should only occur if an issue can only be reproduced on the autobuilder.
Autobuilder Overview
Infrastructure Overview
ab01: The yocto master autobuilder. This runs one low-utilization slave which does the universe fetch, package index, and bitbake self-test, builds the adt-installer, and generally acts as the release mechanism for the Yocto Project. It also acts as the trigger parent for our full nightly build. This nightly build is essentially what builds our release, minus release notes.
ab02, ab04, ab05, ab06, ab10: Generic nightly slaves. These run three slaves apiece. ab10 also runs our eclipse plugin build.
Build Targets
Nightly is a "dummy" buildset that does relatively few things and is only ever run on ab01. It mainly does the universe fetch, builds adt-installer, and builds the eclipse plugin. Its main function is to trigger the nightly-${ARCH} builds and wait until they're done. ab02, ab04, ab05, and ab06 are used to run this pool of nightly arch builds.
NOTE: Just because nightly-* ran on ab04 last time does not mean it will again; scheduling is semi-random. To find out which host you need to log into, look for the buildstep that says:
Building on autobuilder04 Linux autobuilder04 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27 +0200 x86_64 x86_64 x86_64 GNU/Linux
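If you save a copy of that buildstep's stdio, the host name can be pulled out mechanically. This is a sketch; the file name is hypothetical and the "Building on ..." line stands in for the real buildstep output quoted above:

```shell
# Hypothetical saved stdio mirroring the buildstep output.
printf 'Building on autobuilder04 Linux autobuilder04 2.6.37.6-0.9-default x86_64 GNU/Linux\n' > stdio.log

# The third field of the "Building on" line is the slave host to ssh into.
awk '/^Building on/ {print $3}' stdio.log
```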
Build "gotchas"
Currently, we share sstate-cache and downloads between these slaves via NAS. For the moment we also split sstate and lsb-sstate; they are currently stored in /srv/www/vhosts/autobuilder.yoctoproject.org/pub/[sstate|lsb-sstate]. After M2 this will change, combining them into one directory: /srv/www/vhosts/autobuilder.yoctoproject.org/pub/sstate
TMPDIR is not shared between distros (poky and poky-lsb). Poky's $TMPDIR ends up being moved to ~pokybuild/yocto-autobuilder/yocto-slave/nightly-${ARCH}/build/build/nonlsb-tmp, while poky-lsb's is left in that path's tmp.
Live Debugging Process
If you need to do live debugging on the autobuilder, you want to:
- Check that nothing is running on the builder:
http://autobuilder.yoctoproject.org:8010/buildslaves
- If nothing is running, remove the buildslave from the pool. Please let either Beth or sgw know if you're planning on doing this. Email/IRC is fine.
Keep in mind that we are currently running two autobuilders. One is kept just for bugzilla reference (logs and whatnot); the other is production. There have been instances of people not knowing where the running autobuilder lives.
The new autobuilder lives in ~/pokybuild/yocto-autobuilder-new. This will eventually change when I EOL the old autobuilder. However, when in doubt about where to find the base dir of the slave, always check the Create BBLayers Configuration step of the build you want. From this you can derive the base dir.
Example:
http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101
Looking at http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101/steps/CreateAutoConf/logs/stdio shows that we've not moved TMPDIR:
BBLAYERS += " \
    /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta \
    /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-yocto \
    /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-yocto-bsp \
    /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-qt3 \
    "
This indicates that the layers all exist in the slave's build dir for that build set, which means that TMPDIR is most likely in: /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/build/tmp
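That derivation can be sketched in shell. The layer path is taken from the example log above; the build/build/tmp layout is an assumption based on this page:

```shell
# Take any layer path from the Create BBLayers Configuration step.
layer=/srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta

# The layers live directly in the slave's build dir...
base=$(dirname "$layer")

# ...so TMPDIR is most likely one level further down, in build/tmp.
echo "$base/build/tmp"
```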
    sudo -i -u pokybuild
    cd yocto-autobuilder-new
    . ./yocto-autobuilder-setup
    ./yocto-stop-autobuilder slave
This will ensure that the directory you are working in doesn't disappear out from under you. After you are done, please make sure you restart the slave:
    sudo -i -u pokybuild
    cd yocto-autobuilder-new
    . ./yocto-autobuilder-setup
    ./yocto-start-autobuilder slave
Things to never do
- NEVER clean sstate (cleanall, cleansstate). sstate is shared across builders, so you do not want it wiped like this. If you need to toss sstate, let Beth/sgw/RP know. We try not to remove sstate because it speeds up build times dramatically, and because it's fairly large and takes a while to wipe.
- NEVER stop ab01's master/slave. If you need to debug something on ab01, let sgw, RP and Beth know. As we're the only three who can kick off builds, it's really important they all know so they don't kick off a build and tromp on live debugging. If you need to work on ab01, one of them must know about it *and* have given the OK.
- NEVER create a file as yourself under ~pokybuild/yocto-autobuilder/*. This can cause future builds to fail and is frustrating to debug.