Yocto Build Failure Swat Team

Overview

The Yocto Project SWAT team exists to tackle urgent technical problems that break builds on the master branch or major release branches in a timely manner, and thereby to maintain the stability of master and the release branches. The SWAT team includes volunteers and appointed members of the Yocto Project team; community members can also volunteer to be part of the SWAT team.

Scope of Responsibility

Whenever a build (nightly build, weekly build, release build) fails, the SWAT team is responsible for ensuring that the necessary debugging occurs, organizing resources to solve the issue, and ensuring successful builds. If resolving the issues requires schedule or resource adjustments, the SWAT team should work with program and development management to accommodate the change in the overall planning. If resolving the issues requires access to the autobuilder, please contact either Beth Flanagan or Michael Halstead for access rights.

In general, priority should go first to major release candidates and second to master failures.

Point releases (yocto-1.X.x) should have minimal problems in the first place. In addition, stable branch maintainers should be paying attention to their own point release candidate builds.

Build failures are reported on the yocto-build mailing list.

Please review the Media:Swat.odp presentation.

Members

  • Elizabeth Flanagan (IE) (Autobuilder Maintainer)
  • Saul Wold (US) (Autobuilder Administrator)
  • Paul Eggleton (UK)
  • Ross Burton (UK)
  • Cristian Iorga (RO)
  • Randy Witt (US)
  • Benjamin Esquivel (MX)
  • Juro Bystricky (US)
  • Anibal Limon (MX)
  • Tracy Graydon (US)
  • Alejandro Hermandez (MX)
  • Jussi Kukkonen (FI)

Chair

The chairperson role rotates among team members each week on Friday. The Chairperson should monitor the build status for the entire week. Whenever a build is broken, the Chairperson should do the necessary debugging and organize resources to solve the problems in a timely manner to meet the overall project and release schedule. The Chairperson serves as the focal point of the SWAT team for external people such as program managers or development managers.

Rotation Process

The Chairperson rotation takes place each week when the Friday morning status report is sent. Usually, this follows a simple round-robin order. If the next person cannot take the role due to a tight schedule, vacation or some other reason, the role passes to the person after them.

Process

The wiki page BuildLog lists why a build was triggered and what the expectations of that build are. For each build failure that occurs, the expectation is that a bug is opened for each issue found or, if there is already a bug for the issue, that the new failure is appended to that bugzilla entry. There are some exceptions to this:

  • If the build is a master-next or mut build, then an alternative is to reply to the unmerged patch causing the problem on the mailing list with a link to the failure.
  • If the BuildLog mentions that bugs are not to be filed, there is no need.
  • If someone has sent out a patch for the issue already.

You can always check with the person who triggered the build, but if in doubt, file a bug. Failures on master should always have corresponding bug entries.

Whatever the outcome, you should add a note to the BuildLog page explaining which action was taken for each failure.

The primary responsibility is to ensure that any failures are categorized correctly and that the right people get to know about them. It's important that *someone* is then tasked with fixing each one. To fulfill the primary responsibility, bugs are opened in bugzilla for each type of failure. This way, the appropriate people can be brought into the discussion and a specific owner of the failure can be assigned. Replying to the build failure with the bug ID, and bringing the bug to the attention of anyone you suspect was responsible for the problem, are also good practices.

Ideally we want the failure reported to a person who knows something about the area and can come up with a fix without it distracting them too much. As a secondary responsibility, it's often helpful to triage the failure. This might mean documenting a way to reproduce the failure outside a full build and/or documenting how the failure is happening, and maybe even proposing a fix. The SWAT team is not responsible for debugging the failure, though, only for ensuring it is reported and that someone is found to look at the issue.

When filing the bug, please cut and paste the relevant error into the bug comment, and include the log file as an attachment. This ensures the assignee and triage team can quickly assess the issue. Do not post links to any Autobuilder log in the bug report; the logs are non-persistent, so the bug report would eventually end up with a dead link. Sometimes failures occur on autobuilders on private company networks. Do not post links to these in bugzilla either; it's pointless, as nobody else can access them.

Every build failure should be responded to. If it is a known issue, a response with a single line containing "Known Issue" is sufficient. This assures others that the failure has been looked at and is being worked on.

Debugging BKMs

When looking at a failure, the first question is what the baseline was and what changed. If there were recent known good builds, it helps to narrow down the number of changes that were likely responsible for the failure. It's also useful to note whether the build was from scratch or from existing sstate files; you can tell by seeing which "setscene" tasks run in the log.
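
For example, one rough way to check is to search a saved copy of the build log for setscene tasks (the log filename here is only an example):

# Count the setscene tasks that ran; a large count suggests the build
# reused existing sstate rather than building everything from scratch.
grep -c "_setscene" build.log

# Show a few of the matching lines for context.
grep "_setscene" build.log | head -n 20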

Image failures are particularly tricky since it's likely some component of the image failed, and the question is then whether that component changed recently, whether some kind of core functionality is at fault, and so on.

If a build fails, you can check which branch the build failure occurred on in the error log; for example, the log contains:

branch : master-next
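
A quick way to pull that line out of a saved copy of the error log (the filename is only an example):

# Show which branch the failing build was using.
grep "branch" error.log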

Autobuilder BKMs

Sometimes failures are difficult to understand and can require direct ssh access to the autobuilder so the issue can be debugged passively on the system, examining the contents of files and so forth. If doing this, ensure you don't change the file system in any way, for example by adding files that could not then be deleted by the autobuilder when it rebuilds.
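
A minimal sketch of this kind of read-only inspection, assuming you already have ssh access (the host name and the paths below are examples only):

# Log into the builder; do not modify anything once you are there.
ssh autobuilder04.yoctoproject.org

# Look around without writing: list the slave's build directory and page
# through a task log. Avoid redirecting output into the build tree.
ls -la /srv/home/pokybuild/yocto-autobuilder/yocto-slave/nightly/build/build
less /srv/home/pokybuild/yocto-autobuilder/yocto-slave/nightly/build/build/tmp/work/*/*/*/temp/log.do_compile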

Rarely, "live" debugging might be needed where you'd su to the pokybuild user and run a build manually to see the failure in real time. If doing this, ensure you only create files as the pokybuild user and you are careful not to generate sstate packages which shouldn't be present or any other bad state that might get reused. In general its recommended not to do "live" debugging. This can be escalated to RP/Saul/Beth if needed.

Live debugging is generally something we try to avoid doing. It should only occur if an issue can only be reproduced on the autobuilder.

Autobuilder Overview

Infrastructure Overview

ab01: The yocto master autobuilder. This runs one low-utility slave which does the universe fetch, package index and bitbake self test, builds the adt-installer, and generally acts as the release mechanism for the Yocto Project. It also acts as a trigger parent for our full nightly build. This nightly build is essentially what builds our release, minus release notes.

ab02, ab04, ab05, ab06, ab10: Generic nightly slaves. These run three slaves apiece. ab10 also runs our eclipse plugin build.

Build Targets

Nightly is a "dummy" buildset that does relatively few things and is only ever run on ab01. It mainly does the universe fetch, builds the adt-installer and builds the eclipse plugin. Its main function is to trigger nightly-${ARCH} and wait until they're done. ab02, ab04, ab05 and ab06 are used to run this pool of nightly arch builds.

NOTE: Just because nightly-* ran on ab04 the last time does not mean it will again; it's semi-random. In order to find out which host you need to log into, look for the buildstep that says:

Building on autobuilder04 Linux autobuilder04 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27 +0200 x86_64 x86_64 x86_64 GNU/Linux
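
If you have saved the stdio for that buildstep, a simple way to find the host (the filename is only an example):

# Find the host the nightly-* build actually ran on.
grep "Building on" nightly-stdio.log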

Build "gotchas"

Currently, we share sstate-cache and downloads between these slaves via NAS. For the moment we are also splitting up sstate and lsb-sstate; they are currently stored in /srv/www/vhosts/autobuilder.yoctoproject.org/pub/[sstate|lsb-sstate]. This will change after M2, when they will be combined into one directory: /srv/www/vhosts/autobuilder.yoctoproject.org/pub/sstate
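
A quick read-only check on those shared directories (paths as given above; note they may move after M2):

# See how large each shared sstate tree currently is.
du -sh /srv/www/vhosts/autobuilder.yoctoproject.org/pub/sstate \
       /srv/www/vhosts/autobuilder.yoctoproject.org/pub/lsb-sstate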

TMPDIR is not shared between distros (poky and poky-lsb). $TMPDIR for poky ends up being moved to ~pokybuild/yocto-autobuilder/yocto-slave/nightly-${ARCH}/build/build/nonlsb-tmp, while poky-lsb's is left in the above path's tmp.
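
To confirm which tmp directory a given build used (read-only, and using nightly-x86 purely as an example of nightly-${ARCH}):

# poky's TMPDIR is moved aside to nonlsb-tmp ...
ls ~pokybuild/yocto-autobuilder/yocto-slave/nightly-x86/build/build/nonlsb-tmp
# ... while poky-lsb's stays in tmp.
ls ~pokybuild/yocto-autobuilder/yocto-slave/nightly-x86/build/build/tmp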


Live Debugging Process

If you need to do live debugging on the autobuilder, you want to:

  • Check that nothing is running on the builder:

https://autobuilder.yoctoproject.org/main/buildslaves

  • If nothing is running, remove the buildslave from the pool. Please let either Beth or sgw know if you're planning on doing this. Email/IRC is fine.

Keep in mind that we are currently utilizing two autobuilders. One is just for bugzilla reference (logs and whatnot); the other is production. There have been instances of people not knowing where the running autobuilder lives.

The new autobuilder lives in ~pokybuild/yocto-autobuilder-new. This will eventually change when I EOL the old autobuilder. However, when in doubt about where to find the base dir of the slave, always check the Create BBLayers Configuration step of the build you want. From this you can derive the base dir.

Example:

http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101

Looking at http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101/steps/CreateAutoConf/logs/stdio shows that we've not moved TMPDIR.

Looking at: http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101/steps/Create%20BBLayers%20Configuration/logs/stdio

"BBLAYERS += " \ /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta \ /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-yocto \ /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-yocto-bsp \ /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-qt3 \ "

indicates that the layers all exist in the slave's build dir for that build set, which means that TMPDIR is most likely in: /srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/build/tmp

Before working in that directory, stop the build slave:

# Become the pokybuild user.
sudo -i -u pokybuild
cd yocto-autobuilder-new
# Source the autobuilder setup script.
. ./yocto-autobuilder-setup
# Stop the build slave.
./yocto-stop-autobuilder slave

This will ensure that the directory you are working in doesn't disappear out from under you. Please make sure that after you are done, you restart the slave:

# Become the pokybuild user again.
sudo -i -u pokybuild
cd yocto-autobuilder-new
# Source the autobuilder setup script.
. ./yocto-autobuilder-setup
# Restart the build slave.
./yocto-start-autobuilder slave
Things to never do
  • NEVER clean sstate (cleanall, cleansstate). As sstate is shared across builders, you do not want it wiped like this. If you need to toss sstate, let Beth/sgw/RP know. We try not to remove sstate because it speeds up build times dramatically, and because it's fairly large and takes a while to wipe.
  • NEVER stop ab01's master/slave. If you need to debug something on ab01, let sgw, RP and Beth know. As we're the only three who can kick builds off, it's really important they all know so they don't kick off a build and tromp on live debugging. If you need to work on ab01 one of them must know about it *and* have given the ok.
  • NEVER create a file as yourself under ~pokybuild/yocto-autobuilder/*. This can cause future builds to fail and is frustrating to debug.
  • NEVER post links to any Autobuilder log in bug reports. The logs are non-persistent and hence the bug report will eventually end up with a dead link.