Yocto Build Failure Swat Team: Difference between revisions

From Yocto Project
(Clarify process for SWAT reboot)
Line 1: Line 1:
==Overview==
{| style="color:black; background-color:#b8ddff" width="100%" cellpadding="10" class="wikitable"
|'''Note''': The SWAT process has changed. Please read the new process information (up to, and including, section 6). If you're already au-fait with the new process you may want the [[#Process_summary|summary bullets]].
|}


The Yocto Project SWAT team exists mainly to tackle, in a timely manner, urgent technical problems that break the build on the master branch or major release branches, and thereby to maintain the stability of the master and release branches. The SWAT team includes volunteers or appointed members of the Yocto Project team. Community members can also volunteer to be part of the SWAT team.
== Overview ==


==Scope of Responsibility==
The role of the SWAT team is to monitor the autobuilder and investigate all failures to ensure they are logged and brought to the attention of a suitable owner.


Whenever a build (nightly build if master or master-next, weekly build, release build) fails, the SWAT team is responsible for ensuring the necessary debugging occurs and organizing resources to solve the issue and ensure successful builds. If resolving the issues requires schedule or resource adjustment, the SWAT team should work with program and development management to accommodate the change in the overall planning. If resolving the issues requires access to the autobuilder, please contact either [[User:Eflangan| Beth Flanagan]] or [[User:Mhalstead| Michael Halstead]] for access rights.
== Scope ==


In general, priority should always go first towards major release candidates and secondly to master failures.  
All builds run on the public autobuilder are important for the Yocto Project, whether they be a post-merge validation run (for master or a release branch) or a pre-merge test build (for master-next, ross/mut and others). Any build should be monitored by the SWAT team unless the [[BuildLog]] entry for that build indicates otherwise. That is, SWAT is opt-out by whoever triggers a build on the Autobuilder, not opt-in.


Point releases (yocto-1.X.x) should have minimal problems in the first place. As well, stable branch maintainers should be paying attention to their own point release candidate builds.
== Pre-triage ==


Build failures are reported on the [https://lists.yoctoproject.org/listinfo/yocto-builds yocto-builds mailing list].
SWAT isn't responsible for resolving issues encountered on the Autobuilder. Their focus is on performing minimal analysis of a failure in order to ensure that it is logged and brought to the attention of a suitable owner, a process we'll refer to as pre-triage.


Please review the [[Media:Swat.odp]] (Darren, 2012) and [[Media:YP_Swat.pdf]] (Benjamin, 2016) presentations.
== Rotation Process ==


==Members==
The active member rotation takes place weekly at the end of Friday. Usually, this follows a simple round-robin order through the members list. If the next person cannot take the role because of a tight schedule, vacation or some other reason, the role passes to the person after them.


* Saul Wold (US) (Autobuilder Administrator)
=== Roles ===
* Paul Eggleton (NZ)
'''Active member''': the currently active member of the SWAT team is expected to monitor the Autobuilder and pre-triage failures in a timely fashion. Team members are active for one week at a time.
* Ross Burton (UK)
* Randy Witt (US)
* Leo Sandoval (MX)
* Juro Bystricky (US)
* Anibal Limon (MX)
* Tracy Graydon (US)
* Alejandro Hernandez (MX)
* Jussi Kukkonen (FI)
* Maxin John (FI)
* Joshua Lock (UK) (Autobuilder Maintainer)
* Armin Kuster (US)


==Chair==
'''SWAT Chair''': the SWAT chair provides backup cover for the active member and is a first point of contact for SWAT. [[User:Tracy_Graydon| Tracy Graydon]] is the current SWAT Chair.
A chairperson role will be rotated among team members each week on Friday. The Chairperson should monitor the build status for the entire week. Whenever a build is broken, the Chairperson should do necessary debugging and organize resources to solve the problems in a timely manner to meet the overall project and release schedule. The Chairperson serves as the focal point of the SWAT team to external people such as program managers or development managers.


==Rotation Process==
'''SWAT Facilitator''': the SWAT facilitator is responsible for managing the rotation process. [[User:Stephen_K._Jolley| Stephen Jolley]] is the current SWAT Facilitator.
The Chairperson rotation takes place weekly, when the Friday morning status report is sent. Usually, this follows a simple round-robin order. If the next person cannot take the role because of a tight schedule, vacation or some other reason, the role passes to the person after them.


==Process==
== Process ==


The wiki page [[BuildLog]] will list why a build has been triggered and what the expectations of that build are. For each build failure that occurs, the expectation is a bug is opened for each issue found, or, if there is already a bug for the issue, that the new failure is appended to that bugzilla entry. There are some exceptions though.
The [[BuildLog]] wiki page is automatically updated by the autobuilder when a new build is triggered. An entry should include a reason for triggering the build (as entered in the "Reason" field of the autobuilder "Force build" page when triggering a build) and may also include detail on what the expectations for the build are.
For each build failure that occurs the active SWAT member is responsible for pre-triaging the failure.


===Exceptions for Appending New Failures to Bugs===
The pre-triage of failures takes two forms:


* If the build is a master-next or mut build, then an alternative is to reply to the unmerged patch causing the problem on the mailing list with a link to the failure
* for builds against a master or release branch of the poky repo any issues observed should be [[#Filing_bugs | filed in bugzilla]].
* If the BuildLog mentions that bugs are not to be filed, there is no need.
* for builds against other branches (master-next, ross/mut, -next branches for stable releases, etc.), where an issue is caused by a patch not in the master branch, the relevant unmerged patch causing the problem should be replied to on the mailing list.
* If someone has sent out a patch for the issue already.
** When it isn't obvious which patch caused the failure, file an issue in bugzilla and alert the branch owner (CC or assignment on the bug should suffice).
** '''If in doubt, file a bug'''. ''All'' observed errors must be actioned unless a patch has already been sent for the issue (in which case please make note of this in the [[BuildLog]]).
** Infrastructure issues can be filed in bugzilla, where they will be assigned to [[User:Halstead| Michael Halstead]] (''halstead'' on irc) by default.
** Autobuilder logic bugs also go into bugzilla, where they will be assigned to [[User:Joshua_Lock| Joshua Lock]] by default.


The results of pre-triage for an issue should be added to the corresponding entry in the [[BuildLog]], including a link to the pre-triage outcome (bugzilla entry, mailing list post in the archive, etc) and a brief summary of the issue.


You can always check with the person who triggered the build but if in doubt file a bug. Failures on master should always have corresponding bug entries.
Every build failure should be addressed in the [[BuildLog]]. If it is a known issue, an entry with a single line containing "Known Issue" is sufficient (a link to further detail is, of course, much better). This assures others that the failure has been looked at and is being worked on.


Whatever the outcome, you should add a note to the [[BuildLog]] page explaining which action was taken for each failure.
=== Filing bugs ===


The primary responsibility is to ensure that any failures are categorized correctly and that the right people get to know about them. It's important *someone* is then tasked with fixing it. To fulfill the primary responsibility, bugs are opened in [https://bugzilla.yoctoproject.org Bugzilla ] for each type of failure. This way, appropriate people can be brought into the discussion and a specific owner of the failure can be assigned. Replying to the build failure with the bug ID and also bringing the bug to the attention of anyone you suspect was responsible for the problem are also good practices.
When filing the bug, please:
* cut and paste the relevant error in the bug comment, and include the log file as an attachment
* include the log from the ''CreateAutoConf'' step as an attachment (this ensures the assignee and triage team can quickly assess the issue)
* include a pointer to the [https://errors.yoctoproject.org ErrorLog] page associated with the failure (as ErrorLog)


Ideally we want to get the failure reported to the person who knows something about the area and can come up with a fix without it distracting them too much.
{| style="color:black; background-color:#b8ddff" width="100%" cellpadding="10" class="wikitable"
As a secondary responsibility, it's often helpful to triage the failure. This might mean documenting a way to reproduce the failure outside a full build and/or documenting how the failure is happening and maybe even propose a fix. The SWAT team is not responsible for debugging the failure though, only ensuring it is reported and that someone is found to look at the issue.
|'''Note''': Autobuilder logs are non-persistent. Feel free to include a link to the log in a bug report, but be sure to ''also'' attach a copy of the log and include relevant sections copy/pasted into the bug.
|}
{| style="color:black; background-color:#b8ddff" width="100%" cellpadding="10" class="wikitable"
|'''Note''': Sometimes, failures occur on autobuilders on private company networks. Do not post links into the bugzilla for these failures as nobody else can access them.
|}


When filing the bug, please cut and paste the relevant error in the bug comment, and include the log file as an attachment. Also include the log from the ''CreateAutoConf'' step. This ensures the assignee and triage team can quickly assess the issue.
=== Process summary ===
'''In the bug report, do not post links to any Autobuilder log. The logs are non-persistent and hence the bug report will eventually end up with a dead link.'''
'''Sometimes, failures occur on autobuilders on private company networks. Do not post links into the bugzilla for these failures; it's pointless, as nobody else can access them.''' <span style="color: red;">Do include a pointer to the [https://errors.yoctoproject.org ErrorLog] page associated with the failure (as ErrorLog)</span>


Every build failure should be responded to. If it is a known issue, a response with a single line containing "Known Issue" is sufficient. This assures others that the failure has been looked at and is being worked on.
* Monitor builds via one (or more) of:
** the autobuilder [https://autobuilder.yoctoproject.org/main/tgrid tgrid], [https://autobuilder.yoctoproject.org/main/grid grid], [https://autobuilder.yoctoproject.org/main/waterfall waterfall]
** the [[BuildLog]] wiki page
** the [https://lists.yoctoproject.org/listinfo/yocto-builds yocto-builds] mailing list
* Pre-triage each failure:
** File a bugzilla ticket ''OR'' respond to a patch ''OR'' note known issues
** Update the [[BuildLog]] with the result of pre-triage, linking to issues/mail archives when possible


==How to use [[BuildLog]]==
== Questions / Contact ==
All the builds listed at [[BuildLog]] should come from https://autobuilder.yoctoproject.org/main/builders/nightly. The [https://autobuilder.yoctoproject.org/main/builders/nightly nightly] page contains two important pieces of information:
* Build # - This # should correspond to the number listed for the build on [[BuildLog]].
* Revision(or BuildID) - Ideally this will also be listed in the [[BuildLog]] if the person starting the build was nice. In either case, this ID corresponds to the first field in the table in the [https://autobuilder.yoctoproject.org/main/tgrid?length=20 T-grid] output from the AutoBuilder. So all the builds in the row for that ID correspond to the Nightly build number that matches the ID.


In the simple case, the Build ID will be listed in the [[BuildLog]] entry. Then you just go to the [https://autobuilder.yoctoproject.org/main/tgrid?length=20 T-grid] and find the corresponding ID in the first field. The failures listed in the same row as the Build ID are the failures for the corresponding nightly build listed on [[BuildLog]].
If you have queries about the SWAT process you may reach out to the SWAT Facilitator [[User:Stephen_K._Jolley| Stephen Jolley]] and the SWAT Chair [[User:Tracy_Graydon| Tracy Graydon]].


If the Build ID is not listed, an extra step is required. You must first go to the https://autobuilder.yoctoproject.org/main/builders/nightly page and find the corresponding ''Build #''. The ''Revision'' for that ''Build #'' is the Build ID to be used when visiting the [https://autobuilder.yoctoproject.org/main/tgrid?length=20 T-grid] page. Feel free to be helpful and add the Build ID to the [[BuildLog]] entry to help save time for other users.
== Members ==


==Debugging BKMs==
* Saul Wold (US)
 
* Paul Eggleton (NZ)
When looking at a failure, the first question is what the baseline was and what changed. If there were recent known good builds it helps to narrow down the number of changes that were likely responsible for the failure. It's also useful to note if the build was from scratch or from existing sstate files. You can tell by seeing what "setscene" tasks run in the log.
* Ross Burton (UK)
 
* Randy Witt (US)
Image failures are particularly tricky, since it's likely that some component of the image failed. The question is then whether that component changed recently, whether some kind of core functionality was at fault, and so on.
* Leo Sandoval (MX)
 
* Juro Bystricky (US)
If a build fails, you can check which branch the build failure occurred on in the error log, i.e. the log contains:
* Anibal Limon (MX)
 
* [[User:Tracy_Graydon| Tracy Graydon]] (US) (SWAT Chair)
''branch : master-next''
* Alejandro Hernandez (MX)
 
* Jussi Kukkonen (FI)
==Autobuilder BKMs==
* Maxin John (FI)
 
* [[User:Joshua_Lock| Joshua Lock]] (UK) (Autobuilder Maintainer)
Sometimes failures are difficult to understand and can require direct ssh access to the autobuilder, so that the issue can be debugged passively on the system by examining the contents of files and so forth. If doing this, ensure you don't change anything on the file system, for example by adding files that the autobuilder couldn't then delete when it rebuilds.
* Armin Kuster (US)
 
Rarely, "live" debugging might be needed, where you'd su to the pokybuild user and run a build manually to see the failure in real time. If doing this, ensure you only create files as the pokybuild user and that you are careful not to generate sstate packages which shouldn't be present or any other bad state that might get reused. In general it's recommended not to do "live" debugging. This can be escalated to RP/Saul/Beth if needed.
 
Live debugging is generally something we try to avoid doing. It should only occur if an issue can only be reproduced on the autobuilder.
 
===Autobuilder Overview===
 
====Infrastructure Overview====
ab01: The yocto master autobuilder. This runs one low utility slave which does universe fetch, package index and bitbake self test, builds the adt-installer and generally acts as the release mechanism for the Yocto Project. It also acts as a trigger parent for our full nightly build. This nightly build is essentially what builds our release, minus release notes.
 
ab02, ab04, ab05, ab06, ab10: Generic nightly slaves. These run three slaves apiece. ab10 also runs our eclipse plugin build.
 
====Build Targets====
Nightly is a "dummy" buildset that does relatively few things and is only ever run on ab01. It mainly does universe fetch, building adt-installer and building the eclipse plugin. Its main function is to trigger the nightly-${ARCH} builds and wait until they're done. ab02, ab04, ab05 and ab06 are used to run this pool of nightly arch builds.
 
NOTE: Just because nightly-* ran on ab04 the last time does not mean it will again. It's semi random. In order to find out what host you need to log into, please look for the buildstep that says:
 
Building on
autobuilder04
Linux autobuilder04 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27
+0200 x86_64 x86_64 x86_64 GNU/Linux
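
The host lookup described above can be sketched in a few lines of Python. This is only an illustration, assuming the text of the "Building on" step has already been saved locally; the function name <code>build_host</code> is hypothetical, and the sketch relies only on the fact that the second field of the <code>uname -a</code> line is the host name.

```python
import re
from typing import Optional


def build_host(step_output: str) -> Optional[str]:
    """Return the host name from the uname line of a "Building on" step.

    The uname line starts with the kernel name ("Linux") followed by
    the node name, so the second field is the host to log into.
    """
    match = re.search(r"^Linux\s+(\S+)", step_output, re.MULTILINE)
    return match.group(1) if match else None


# Sample text copied from the build step shown above.
sample = (
    "Building on\n"
    "autobuilder04\n"
    "Linux autobuilder04 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27\n"
    "+0200 x86_64 x86_64 x86_64 GNU/Linux\n"
)
print(build_host(sample))
```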
 
====Build "gotchas"====
Currently, we share sstate-cache and downloads between these slaves via NAS. For the moment we are also splitting up sstate and lsb-sstate. They are currently stored in /srv/www/vhosts/autobuilder.yoctoproject.org/pub/[sstate|lsb-sstate]. This will change after M2 to be combined into one directory; /srv/www/vhosts/autobuilder.yoctoproject.org/pub/sstate
 
TMPDIR between distros (poky and poky-lsb) is not shared. $TMPDIR ends up being moved to ~pokybuild/yocto-autobuilder/yocto-slave/nightly-${ARCH}/build/build/nonlsb-tmp and poky-lsb is left in the above path's tmp.
 
 
====Live Debugging Process====
If you need to do live debugging on the autobuilder, you want to:
 
* Check that nothing is running on the builder:
https://autobuilder.yoctoproject.org/main/buildslaves
 
* If nothing is running, remove the buildslave from the pool. Please let either Beth or sgw know if you're planning on doing this. Email/IRC is fine.
 
Keep in mind that we are currently utilizing two autobuilders. One is just for bugzilla reference (logs and whatnot), the other is production. There have been instances of people not knowing where the running autobuilder lives.
 
The new autobuilder lives in ~/pokybuild/yocto-autobuilder-new. This will eventually change when I EOL the old autobuilder. However, when in doubt about where to find the base dir of the slave, always check the Create BBLayers Configuration step of the build you want. From this you can derive the base dir.
 
Example:
 
http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101
 
Looking at: http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101/steps/CreateAutoConf/logs/stdio shows that we've not moved TMPDIR
 
Looking at: http://autobuilder.yoctoproject.org:8011/builders/nightly/builds/101/steps/Create%20BBLayers%20Configuration/logs/stdio
 
"BBLAYERS += " \
/srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta \
/srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-yocto \
/srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-yocto-bsp \
/srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/meta-qt3 \
"
 
indicates that the layers all exist in the slave's build dir for that build set. Which means that TMPDIR is most likely in:
/srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build/build/tmp
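
As a rough sketch of the derivation above, the following Python takes the text of the Create BBLayers Configuration step and guesses where TMPDIR lives. The function name <code>guess_tmpdir</code> is hypothetical, and the result is only a guess: it assumes TMPDIR has not been moved, which should be confirmed from the CreateAutoConf log as described above.

```python
import posixpath
import re


def guess_tmpdir(bblayers_output: str) -> str:
    """Guess TMPDIR from the layer paths printed by the BBLayers step.

    Assumes every layer sits directly inside the slave's build dir and
    that TMPDIR was left at its default of <build dir>/build/tmp.
    """
    # Collect each absolute path that starts a line of the BBLAYERS value.
    layers = re.findall(r"^\s*(/\S+)", bblayers_output, re.MULTILINE)
    if not layers:
        raise ValueError("no layer paths found in the step output")
    # The directory shared by all layers is the slave's build dir.
    build_dir = posixpath.commonpath([posixpath.dirname(p) for p in layers])
    return posixpath.join(build_dir, "build", "tmp")


base = "/srv/home/pokybuild/yocto-autobuilder-new/yocto-slave/nightly/build"
sample = "\n".join([
    '"BBLAYERS += " \\',
    base + "/meta \\",
    base + "/meta-yocto \\",
    base + "/meta-yocto-bsp \\",
    base + "/meta-qt3 \\",
    '"',
])
print(guess_tmpdir(sample))
```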
 
<nowiki>
sudo -i -u pokybuild
cd yocto-autobuilder-new
. ./yocto-autobuilder-setup
./yocto-stop-autobuilder slave
</nowiki>
 
This will ensure that the directory you are working in doesn't disappear out from under you. Please make sure that after you are done, you restart:
 
<nowiki>
sudo -i -u pokybuild
cd yocto-autobuilder-new
. ./yocto-autobuilder-setup
./yocto-start-autobuilder slave
</nowiki>
 
=====Things to never do=====
* NEVER clean sstate (cleanall, cleansstate). As sstate is shared across builders, you do not want it wiped like this. If you need to toss sstate, let Beth/sgw/RP know. We try not to remove sstate as it speeds up build times dramatically. As it's fairly large and takes a while to wipe, we try to avoid this.
* NEVER stop ab01's master/slave. If you need to debug something on ab01, let sgw, RP and Beth know. As we're the only three who can kick builds off, it's really important they all know so they don't kick off a build and tromp on live debugging. If you need to work on ab01 one of them must know about it *and* have given the ok.
* NEVER create a file as yourself under ~pokybuild/yocto-autobuilder/*. This can cause future builds to fail and is frustrating to debug.
*  NEVER post links to any Autobuilder log in bug reports. The logs are non-persistent and hence the bug report will eventually end up with a dead link.

Revision as of 16:18, 15 November 2016

Note: The SWAT process has changed. Please read the new process information (up to, and including, section 6). If you're already au-fait with the new process you may want the summary bullets.

Overview

The role of the SWAT team is to monitor the autobuilder and investigate all failures to ensure they are logged and brought to the attention of a suitable owner.

Scope

All builds run on the public autobuilder are important for the Yocto Project, whether they be a post-merge validation run (for master or a release branch) or a pre-merge test build (for master-next, ross/mut and others). Any build should be monitored by the SWAT team unless the BuildLog entry for that build indicates otherwise. That is, SWAT is opt-out by whoever triggers a build on the Autobuilder, not opt-in.

Pre-triage

SWAT isn't responsible for resolving issues encountered on the Autobuilder. Their focus is on performing minimal analysis of a failure in order to ensure that it is logged and brought to the attention of a suitable owner, a process we'll refer to as pre-triage.

Rotation Process

The active member rotation takes place weekly at the end of Friday. Usually, this follows a simple round-robin order through the members list. If the next person cannot take the role because of a tight schedule, vacation or some other reason, the role passes to the person after them.
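
The rotation rule can be illustrated with a minimal Python sketch. It is not a tool used by the project: the function name <code>next_active_member</code> and the idea of passing in a set of unavailable people are assumptions made for the example, and the members list is shortened.

```python
from typing import AbstractSet, Sequence


def next_active_member(
    members: Sequence[str],
    current: str,
    unavailable: AbstractSet[str] = frozenset(),
) -> str:
    """Pick the next active member in round-robin order.

    Walks the list starting after `current`, wrapping around at the end,
    and skips anyone who cannot take the role this week.
    """
    start = members.index(current)
    for offset in range(1, len(members) + 1):
        candidate = members[(start + offset) % len(members)]
        if candidate not in unavailable:
            return candidate
    raise ValueError("no available member in the rotation")


members = ["Saul Wold", "Paul Eggleton", "Ross Burton", "Randy Witt"]
print(next_active_member(members, "Saul Wold"))
print(next_active_member(members, "Saul Wold", {"Paul Eggleton"}))
```

If everyone else is unavailable, the sketch falls back to the current member staying on for another week; whether that matches real practice is a decision for the SWAT Facilitator.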

Roles

Active member: the currently active member of the SWAT team is expected to monitor the Autobuilder and pre-triage failures in a timely fashion. Team members are active for one week at a time.

SWAT Chair: the SWAT chair provides backup cover for the active member and is a first point of contact for SWAT. Tracy Graydon is the current SWAT Chair.

SWAT Facilitator: the SWAT facilitator is responsible for managing the rotation process. Stephen Jolley is the current SWAT Facilitator.

Process

The BuildLog wiki page is automatically updated by the autobuilder when a new build is triggered. An entry should include a reason for triggering the build (as entered in the "Reason" field of the autobuilder "Force build" page when triggering a build) and may also include detail on what the expectations for the build are. For each build failure that occurs the active SWAT member is responsible for pre-triaging the failure.

The pre-triage of failures takes two forms:

  • for builds against a master or release branch of the poky repo any issues observed should be filed in bugzilla.
  • for builds against other branches (master-next, ross/mut, -next branches for stable releases, etc.), where an issue is caused by a patch not in the master branch, the relevant unmerged patch causing the problem should be replied to on the mailing list.
    • When it isn't obvious which patch caused the failure, file an issue in bugzilla and alert the branch owner (CC or assignment on the bug should suffice).
    • If in doubt, file a bug. All observed errors must be actioned unless a patch has already been sent for the issue (in which case please make note of this in the BuildLog).
    • Infrastructure issues can be filed in bugzilla, where they will be assigned to Michael Halstead (halstead on irc) by default.
    • Autobuilder logic bugs also go into bugzilla, where they will be assigned to Joshua Lock by default.
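
The decision rules above can be summarised as a small Python sketch. The names (<code>Action</code>, <code>pretriage_action</code>) and the three boolean inputs are assumptions made for illustration; the active member still has to judge each input by hand from the failure and the BuildLog.

```python
from enum import Enum


class Action(Enum):
    FILE_BUG = "file a bug in bugzilla"
    REPLY_TO_PATCH = "reply to the unmerged patch on the mailing list"
    NOTE_IN_BUILDLOG = "note the already-sent patch in the BuildLog"


def pretriage_action(
    on_master_or_release: bool,
    culprit_patch_known: bool = False,
    fix_already_sent: bool = False,
) -> Action:
    """Choose the pre-triage outcome for a single failure."""
    if fix_already_sent:
        # A patch for the issue exists, so only the BuildLog note is needed.
        return Action.NOTE_IN_BUILDLOG
    if on_master_or_release:
        # Issues on master or a release branch always go to bugzilla.
        return Action.FILE_BUG
    if culprit_patch_known:
        # Other branches: reply to the unmerged patch that caused the issue.
        return Action.REPLY_TO_PATCH
    # Not obvious which patch is at fault: if in doubt, file a bug
    # (and alert the branch owner via CC or assignment).
    return Action.FILE_BUG


print(pretriage_action(on_master_or_release=False, culprit_patch_known=True).value)
```

Whatever the outcome, the result still has to be recorded against the build's BuildLog entry.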

The results of pre-triage for an issue should be added to the corresponding entry in the BuildLog, including a link to the pre-triage outcome (bugzilla entry, mailing list post in the archive, etc) and a brief summary of the issue.

Every build failure should be addressed in the BuildLog. If it is a known issue, an entry with a single line containing "Known Issue" is sufficient (a link to further detail is, of course, much better). This assures others that the failure has been looked at and is being worked on.

Filing bugs

When filing the bug, please:

  • cut and paste the relevant error in the bug comment, and include the log file as an attachment
  • include the log from the CreateAutoConf step as an attachment (this ensures the assignee and triage team can quickly assess the issue)
  • include a pointer to the ErrorLog page associated with the failure (as ErrorLog)

Note: Autobuilder logs are non-persistent. Feel free to include a link to the log in a bug report, but be sure to also attach a copy of the log and include relevant sections copy/pasted into the bug.

Note: Sometimes, failures occur on autobuilders on private company networks. Do not post links into the bugzilla for these failures as nobody else can access them.

Process summary

  • Monitor builds via one (or more) of:
    • the autobuilder tgrid, grid, waterfall
    • the BuildLog wiki page
    • the yocto-builds mailing list
  • Pre-triage each failure:
    • File a bugzilla ticket OR respond to a patch OR note known issues
    • Update the BuildLog with the result of pre-triage, linking to issues/mail archives when possible

Questions / Contact

If you have queries about the SWAT process you may reach out to the SWAT Facilitator Stephen Jolley and the SWAT Chair Tracy Graydon.

Members

  • Saul Wold (US)
  • Paul Eggleton (NZ)
  • Ross Burton (UK)
  • Randy Witt (US)
  • Leo Sandoval (MX)
  • Juro Bystricky (US)
  • Anibal Limon (MX)
  • Tracy Graydon (US) (SWAT Chair)
  • Alejandro Hernandez (MX)
  • Jussi Kukkonen (FI)
  • Maxin John (FI)
  • Joshua Lock (UK) (Autobuilder Maintainer)
  • Armin Kuster (US)