Yocto Build Failure Swat Team: Difference between revisions

From Yocto Project
(186 intermediate revisions by 29 users not shown)
==Overview==
== Overview ==


The Yocto Project SWAT team exists mainly to tackle, in a timely manner, urgent technical problems that break builds on the master branch or major release branches, and so to maintain the stability of those branches. The SWAT team includes volunteers or appointed members of the Yocto Project team. Community members can also volunteer to be part of the SWAT team.
All builds that are run on the public autobuilder are important for the Yocto Project, whether they be routine validation runs or pre-integration test builds. Random failures, if ignored, accumulate and can result in a significant number of builds failing.


==Scope of Responsibility==
The role of the Bug Swat Team is to monitor the autobuilder and do preliminary investigation of failures, to ensure that they are logged and brought to the attention of the appropriate owner.


Whenever a build (nightly build, weekly build, release build) fails, the SWAT team is responsible for ensuring the necessary debugging occurs and organizing resources to solve the issue and ensure successful builds. If resolving the issues requires schedule or resource adjustment, the SWAT team should work with program and development management to accommodate the change in the overall planning. If resolving the issues requires access to the autobuilder, please contact either [[User:Eflangan| Beth Flanagan]] or [[User:Mhalstead| Michael Halstead]] for access rights.
Importantly, the Swat Team '''isn't responsible for resolving issues''' encountered on the autobuilder. It only needs to do enough analysis that each issue can be logged for later investigation and, ideally, brought to the attention of the right people.


Build failures are reported on the [https://lists.yoctoproject.org/listinfo/yocto-builds yocto-build mailing list].
Each week a different member of the team is on call. Every build that fails on the autobuilder should be monitored unless stated otherwise. The rotation happens at the end of Friday (deliberately vague); any failures over the weekend should be triaged by the incoming member on Monday.


Please review the [[Media:Swat.odp]] presentation.
The Swat Chairs are the primary contact for the Swat Team. The current Swat Chairs are [[User:RossBurton | Ross Burton]] and [[User:Rpurdie | Richard Purdie]]. The Chairs are assisted by Stephen K. Jolley who handles the rotation process.  If the person currently on call, or about to be on call, can no longer perform their duty then they should contact Stephen to arrange a replacement.


==Members==
== Process ==


* Nitin Kamble (US)
The SWAT process now uses a dedicated tool, [https://swatbot.yoctoproject.org/ swatbot]. Swatbot has a filter that lists all the pending issues that need to be triaged, available through [https://swatbot.yoctoproject.org/mainindex/swat/ this link] or the "SWAT Pending Builds" link in the left-hand menu. Each issue has links to the autobuilder logs for the failing step (usually the stdio and warning/error logs).
* Elizabeth Flanagan (US) (Autobuilder Maintainer)
* Paul Eggleton (UK)
* Jessica Zhang (US)
* Saul Wold (US) (Autobuilder Administrator)
* Richard Purdie (UK) (Autobuilder Administrator)
* Ross Burton (UK)
* Ioana Grigoropol (RO)
* Cristian Iorga (RO)
* Andrei Dinu (RO)
* Cristiana Voicu (RO)
* Radu Moisan (RO)
* Constantin Musca (RO)
* Bogdan Marinescu (RO)
* Laurentiu Palcu (RO)


==Chair==
The builds are shown in a tree-like structure, with the parent build and any child builds under it. The builds are edited as a group under the parent, as quite often a failure is common to the child builds. Each failure does need to be triaged individually, although multiple builds can be changed at once to a given resolution.
A chairperson role will be rotated among team members each week. The Chairperson should monitor the build status for the entire week. Whenever a build is broken, the Chairperson should do necessary debugging and organize resources to solve the problems in a timely manner to meet the overall project and release schedule. The Chairperson serves as the focal point of the SWAT team to external people such as program managers or development managers.


==Rotation Process==
Swatbot can filter pending issues to be triaged using the [https://swatbot.yoctoproject.org/mainindex/swat/ SWAT Pending Builds] link. Once you have selected an issue to triage, you will have to take the correct reporting action and finally edit the entry to indicate what was done.
The Chairperson rotation takes place during the weekly technical project meeting (Tuesdays at 8:00 AM PT). Usually, this will take a simple round robin order. In case the next person cannot take the role due to tight schedule, vacation or some other reasons, the role will be passed to the next person.


The current Chairperson's full name and email address will be published on the project status wiki page: https://wiki.yoctoproject.org/wiki/Yocto_Project_v1.2_Status under "Current SWAT team Chairperson" section.
You can also be notified when a build fails by subscribing to the [https://lists.yoctoproject.org/g/yocto-builds yocto-builds] mailing list. The list sends a mail for each failed build, including direct links to the [https://autobuilder.yoctoproject.org/ autobuilder job summary] and the [https://errors.yoctoproject.org/Errors/Latest/Autobuilder/ Error Reporting Service]. The mail also states whether the build is expected to be triaged by Swat, so check this to see if the build can be ignored because the owner is taking full responsibility. Currently, swatbot does not give you this information, so you have to get it from the autobuilder build entry (the <tt>Build properties</tt> tab should have: <tt>swat_monitor true</tt>), the autobuilder API, or the notification email.


==Debugging BKMs==
Another tool that can be used to monitor builds is the [https://autobuilder.yoctoproject.org/typhoon/#/console Autobuilder 'Yocto Console View'] which is an overview of the top-level builds (''a-full'' and ''a-quick'') and the sub-builds they trigger.


When looking at a failure, the first question is what the baseline was and what changed. If there were recent known good builds it helps to narrow down the number of changes that were likely responsible for the failure. It's also useful to note if the build was from scratch or from existing sstate files. You can tell by seeing what "setscene" tasks run in the log.
Both the top-level build entry and the mail notification will include notes from the build owner, so check these for any useful context. For example, the notes may request that failures are reported directly to a specific person instead of being filed as bugs, or state that particular failures are expected.


The primary responsibility is to ensure that any failures are categorized correctly and that the right people get to know about them.
=== Report ===


It's important *someone* is then tasked with fixing it. Image failures are particularly tricky, since it's likely that some component of the image failed, and the question is then whether that component changed recently, whether some kind of core functionality was at fault, and so on.
There are two categories of builds that Swat will be monitoring: official branches and staging branches. The official branches are the primary top-level branches in Poky, that is master and all of the release branches (gatesgarth, dunfell, etc).  The staging branches are where patches are held for testing, such as master-next, stable/dunfell-nut, or ross/mut.


Ideally we want to get the failure reported to the person who knows something about the area and can come up with a fix without it distracting them too much.
Communication is important: if the build owner is on IRC then it's always worth discussing issues with them first as they may have further context and directions. Also, if the build owner triages the build failures then they must update the swatbot entries so that Swat doesn't duplicate the work.
As a secondary responsibility, it's often helpful to triage the failure. This might mean documenting a way to reproduce the failure outside a full build and/or documenting how the failure is happening and maybe even propose a fix.


To fulfill the primary responsibility, it's suggested that bugs are opened on the bugzilla for each type of failure. This way, appropriate people can be brought into the discussion and a specific owner of the failure can be assigned. Replying to the build failure with the bug ID and also bringing the bug to the attention of anyone you suspect was responsible for the problem are also good practices.
When reporting an issue, be it in a mailing list post or a new bug, the following information should be included:
* Relevant details about the build configuration. For example: did the failure happen just once, or in all PowerPC builds? Was it specific to multilib configurations?  Look across the entire build run and identify any patterns.
* The error itself. Trim the log down to just the error and any relevant context in the bug description.
* A link to the build failure.  Either a link to the [http://errors.yoctoproject.org/ error reports] page (such as http://errors.yoctoproject.org/Errors/Details/199667/) or a link to the autobuilder build log (such as https://autobuilder.yoctoproject.org/typhoon/#/builders/34/builds/168).


When filing the bug, please cut and paste the relevant error in the bug comment, and include the log file as an attachment. This ensures the assignee and triage team can quickly assess the issue.
When filing bugs, always search Bugzilla first to see if the issue is already known.  For example, there are some bugs that occur intermittently and are already filed with ''AB-INT'' in the whiteboard field. They are listed here: [https://bugzilla.yoctoproject.org/buglist.cgi?quicksearch=whiteboard%3AAB-INT&list_id=640327 Autobuilder issues]


Every build failure should be responded to. If it is a known issue, a response with a single line containing "Known Issue" is sufficient. This assures others that the failure has been looked at and is being worked on.
The exact process depends on whether the branch is an official branch or a staging branch.


==Autobuilder BKMs==
==== Staging Branches ====


Sometimes failures are difficult to understand and can require direct ssh access to the autobuilder so the issue can be debugged passively on the system, for example by examining the contents of files. If doing this, ensure you don't change anything in the file system, for example by adding files that the autobuilder couldn't then delete when it rebuilds.
For builds against staging branches which contain patches under test for integration (such as master-next, stable/dunfell-nut, ross/mut, etc), first attempt to identify if there is a patch in the branch that is likely to be responsible for the failure. For example, if <tt>wget</tt> fails with <tt>libgnutls</tt> errors and there is a GnuTLS upgrade in the branch, then that is a likely candidate.  If a patch can be identified that hasn't yet been merged into an official branch, then reply to the patch on the mailing list with the details. If it isn't obvious which patch is responsible for the failure, or a patch can be identified but it has already been merged to the release branch, then file a bug and ensure the branch maintainer (see the [[Releases]] page for names) is on the CC list.


Rarely, "live" debugging might be needed where you'd su to the pokybuild user and run a build manually to see the failure in real time. If doing this, ensure you only create files as the pokybuild user and you are careful not to generate sstate packages which shouldn't be present or any other bad state that might get reused. In general it's recommended not to do "live" debugging. This can be escalated to RP/Saul/Beth if needed.
Most failures will be on staging branches, as master-next is the most heavily tested branch. However, it is rebased quite frequently, so it is not always easy to find which patches were included. In that case, you have to get the actual commit hash, either from the build properties (the variable is <tt>yp_build_revision</tt>) or from the build configuration at the beginning of the stdio log. For example, this qemux86 build [https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/3120] was master-next at revision 47482eff9897ccde946e9247724babc3a586d318.
With that information, you can then clone poky (or any other layer of interest) and fetch the proper commit and see the git log:


Live debugging is generally something we try to avoid doing. It should only occur if an issue can only be reproduced on the autobuilder.
<pre>
$ git clone git://git.yoctoproject.org/poky
$ cd poky
$ git fetch origin 47482eff9897ccde946e9247724babc3a586d318
$ git log FETCH_HEAD
</pre>


===Autobuilder Overview===


====Infrastructure Overview====
'''If in doubt, file a bug'''. All errors must be taken care of.
ab01: The yocto master autobuilder. This runs two low utility slaves which do the eclipse build, universe fetch and build adt-installer. It also acts as a trigger parent for our full nightly build.


ab02, ab04, ab05, ab06: Generic nightly slaves. These run two slaves a piece. On trigger from nightly, these slaves
If the issue is in the infrastructure or autobuilder itself then file a bug against "Infrastructure: Autobuilder". Infrastructure bugs should be assigned to [[User:Halstead| Michael Halstead]] and autobuilder logic bugs to [[User:Rpurdie | Richard Purdie]].


==== Official Branches ====


For builds of official branches, that is master or a release branch, '''all failures or warnings are critical''' and must be [[#Filing_bugs | filed in Bugzilla]]. Remember to check that the issue isn't already filed. Where an issue is already filed, please do add a comment so we can assess how frequently different issues are occurring.


====Build Targets====
=== Update ===
Nightly is a "dummy" buildset that does relatively few things and is only ever run on ab01. It mainly does universe fetch, building
adt-installer and building the eclipse plugin. Its main function is to trigger nightly-${ARCH} and wait until they're done. ab02, ab04,
ab05, ab06 are what is used to run this pool of nightly arch builds.


NOTE: Just because nightly-* ran on ab04 the last time does not mean it will again. It's semi random. In order to find out what host you need to log into, please look for the buildstep that says:
Finally the swatbot build entry must be updated with a summary of the outcome.  Three different resolutions are available:
* Mail sent: used when you replied to the problematic patch directly on the mailing list.
* Bug Opened: used when a new bug has been opened or a new comment has been added to a bug. Please add the bug number in the notes.
* Handled (other): used when the maintainer is already aware of the issue and is working on its resolution, or a patch has already been sent to solve the issue.


Building on
You need an account for this step. If it hasn't been provided, please ask on the swat mailing list.
autobuilder04
Linux autobuilder04 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27
+0200 x86_64 x86_64 x86_64 GNU/Linux


====Build "gotchas"====
'''Every issue that is dealt with must be annotated''', so it is easy to see which issues have been handled. This includes filing new bugs, finding existing bugs, contacting the mailing list, contacting the maintainer directly on IRC, or identifying that a patch has already been sent to fix the issue.
Currently, we share sstate-cache and downloads between these slaves via NAS. For the moment we are also splitting up sstate and lsb-sstate. They are currently stored in /srv/www/vhosts/autobuilder.yoctoproject.org/pub/[sstate|lsb-sstate]. This will change after M2 to be combined into one directory; /srv/www/vhosts/autobuilder.yoctoproject.org/pub/sstate


TMPDIR between distros (poky and poky-lsb) is not shared. $TMPDIR ends up being moved to ~pokybuild/yocto-autobuilder/yocto-slave/nightly-${ARCH}/build/build/nonlsb-tmp and poky-lsb is left in the above path's tmp.
== Tips ==


An issue will quite often repeat itself across multiple builds. It is worth looking for those repetitions as swatbot will allow you to select many builds and update them all at once.


====Live Debugging Process====
In the stdio logs window, clicking on the looking glass icon will load the full log in the browser, allowing you to use the search feature of the browser. Quite often, you'll start by looking for "error:".
If you need to do live debugging on the autobuilder, you want to:


* Check that nothing is running on the builder:
Be sure to check all log files, especially the testimage logs that are available for qemu_boot_log (for example in: /tmp/work/qemux86_64-poky-linux/core-image-sato-sdk/1.0-r0/testimage/qemu_boot_log.20210902120413)
http://autobuilder.yoctoproject.org:8010/buildslaves


* If nothing is running, remove the buildslave from the pool. Please let either Beth or sgw know if you're planning on doing this. Email/IRC is fine.
== Handoff ==


<nowiki>
At the end of the week, the outgoing person on Swat should email swat@lists.yoctoproject.org summarising the week and noting anything that the incoming person on Swat next week should be aware of. For example, noting that there's a new intermittent bug to watch for.
sudo -i -u pokybuild
cd yocto-autobuilder
. ./yocto-autobuilder-setup
./yocto-stop-autobuilder slave
</nowiki>


This will ensure that the directory you are working in doesn't disappear out from under you. Please make sure that after you are done, you restart:
== Members ==


<nowiki>
* [[User:RossBurton | Ross Burton]]
sudo -i -u pokybuild
cd yocto-autobuilder
. ./yocto-autobuilder-setup
./yocto-start-autobuilder slave
</nowiki>


=====Things to never do=====
* [[User:Leonardo_Sandoval | Leo Sandoval]]
* NEVER clean sstate (cleanall, cleansstate). As sstate is shared across builders, you do not want it wiped like this. If you need to toss sstate, let Beth/sgw/RP know. We try not to remove sstate as it speeds up build times dramatically. As it's fairly large and takes a while to wipe, we try to avoid this.
 
* NEVER stop ab01's master/slave. If you need to debug something on ab01, let sgw, RP and Beth know. As we're the only three who can kick builds off, it's really important they all know so they don't kick off a build and tromp on live debugging. If you need to work on ab01 one of them must know about it *and* have given the ok.
* [[User:Anibal Limon | Anibal Limon]]
* NEVER create a file as yourself under ~pokybuild/yocto-autobuilder/* This can cause future builds to fail and is frustrating to debug.
 
* [[User:Köry maincent | Köry Maincent]]
 
* [[User:Thomas Perrot | Thomas Perrot]]
 
* [[User:SaulWold | Saul Wold]]
 
* [[User:Oleksiy Obitotskyy | Oleksiy Obitotskyy]]
 
* [[User:Alejandro Enedino Hernandez Samaniego | Alejandro Hernandez Samaniego]]
 
* [[User:PaulEggleton | Paul Eggleton]]
 
* [[User:Naveen Kumar Saini | Naveen Saini]]
 
* [[User:Alexandre Belloni | Alexandre Belloni]]
 
* [[User:Kergoth | Christopher Larson]]
 
* [[User:Lee_chee_yang | Lee Chee Yang]]
 
* [[User:Jon Mason | Jon Mason]]
 
* [[User:Minjae Kim | Minjae Kim]]
 
* [[User:Jagadheesan | Jaga]]
 
* [[User:Valerii Chernous | Valerii Chernous]]

Latest revision as of 10:35, 5 October 2021

Overview

All builds that are run on the public autobuilder are important for the Yocto Project, whether they be routine validation runs or pre-integration test builds. Random failures, if ignored, accumulate and can result in a significant number of builds failing.

The role of the Bug Swat Team is to monitor the autobuilder and do preliminary investigation of failures, to ensure that they are logged and brought to the attention of the appropriate owner.

Importantly, the Swat Team isn't responsible for resolving issues encountered on the autobuilder. It only needs to do enough analysis that each issue can be logged for later investigation and, ideally, brought to the attention of the right people.

Each week a different member of the team is on call. Every build that fails on the autobuilder should be monitored unless stated otherwise. The rotation happens at the end of Friday (deliberately vague); any failures over the weekend should be triaged by the incoming member on Monday.

The Swat Chairs are the primary contact for the Swat Team. The current Swat Chairs are Ross Burton and Richard Purdie. The Chairs are assisted by Stephen K. Jolley who handles the rotation process. If the person currently on call, or about to be on call, can no longer perform their duty then they should contact Stephen to arrange a replacement.

Process

The SWAT process now uses a dedicated tool, swatbot. Swatbot has a filter that lists all the pending issues that need to be triaged, available through this link or the "SWAT Pending Builds" link in the left-hand menu. Each issue has links to the autobuilder logs for the failing step (usually the stdio and warning/error logs).

The builds are shown in a tree-like structure, with the parent build and any child builds under it. The builds are edited as a group under the parent, as quite often a failure is common to the child builds. Each failure does need to be triaged individually, although multiple builds can be changed at once to a given resolution.

Swatbot can filter pending issues to be triaged using the SWAT Pending Builds link. Once you have selected an issue to triage, you will have to take the correct reporting action and finally edit the entry to indicate what was done.

You can also be notified when a build fails by subscribing to the yocto-builds mailing list. The list sends a mail for each failed build, including direct links to the autobuilder job summary and the Error Reporting Service. The mail also states whether the build is expected to be triaged by Swat, so check this to see if the build can be ignored because the owner is taking full responsibility. Currently, swatbot does not give you this information, so you have to get it from the autobuilder build entry (the Build properties tab should have: swat_monitor true), the autobuilder API, or the notification email.
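
As a rough sketch of the "autobuilder API" route, the helper below turns a build URL copied from the web UI into a Buildbot REST API URL that requests the swat_monitor property. The function name build_api_url is hypothetical, and the sketch assumes the autobuilder serves the standard Buildbot /api/v2 endpoints under the same /typhoon prefix as the UI, which is not confirmed by this page.

```shell
#!/bin/sh
# Convert an autobuilder web UI build URL into a Buildbot REST API URL that
# asks for the swat_monitor build property.
# Assumption: the API lives under the same prefix as the UI, at /api/v2.
build_api_url() {
    # Expected UI URL shape: <base>/#/builders/<builderid>/builds/<number>
    printf '%s\n' "$1" |
        sed -E 's|/#/builders/([0-9]+)/builds/([0-9]+)$|/api/v2/builders/\1/builds/\2?property=swat_monitor|'
}

# Example (needs network access, so it is left commented out):
# curl -s "$(build_api_url 'https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/3120')"
```

If the endpoint is available, the JSON reply should carry the requested property alongside the build details. The Build properties tab remains the reference if the two disagree.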

Another tool that can be used to monitor builds is the Autobuilder 'Yocto Console View' which is an overview of the top-level builds (a-full and a-quick) and the sub-builds they trigger.

Both the top-level build entry and the mail notification will include notes from the build owner, so check these for any useful context. For example, the notes may request that failures are reported directly to a specific person instead of being filed as bugs, or state that particular failures are expected.

Report

There are two categories of builds that Swat will be monitoring: official branches and staging branches. The official branches are the primary top-level branches in Poky, that is master and all of the release branches (gatesgarth, dunfell, etc). The staging branches are where patches are held for testing, such as master-next, stable/dunfell-nut, or ross/mut.

Communication is important: if the build owner is on IRC then it's always worth discussing issues with them first as they may have further context and directions. Also, if the build owner triages the build failures then they must update the swatbot entries so that Swat doesn't duplicate the work.

When reporting an issue, be it in a mailing list post or a new bug, the following information should be included:

  • Relevant details about the build configuration. For example: did the failure happen just once, or in all PowerPC builds? Was it specific to multilib configurations? Look across the entire build run and identify any patterns.
  • The error itself. Trim the log down to just the error and any relevant context in the bug description.
  • A link to the build failure. Either a link to the error reports page (such as http://errors.yoctoproject.org/Errors/Details/199667/) or a link to the autobuilder build log (such as https://autobuilder.yoctoproject.org/typhoon/#/builders/34/builds/168).

When filing bugs, always search Bugzilla first to see if the issue is already known. For example, there are some bugs that occur intermittently and are already filed with AB-INT in the whiteboard field. They are listed here: Autobuilder issues

The exact process depends on whether the branch is an official branch or a staging branch.

Staging Branches

For builds against staging branches which contain patches under test for integration (such as master-next, stable/dunfell-nut, ross/mut, etc), first attempt to identify if there is a patch in the branch that is likely to be responsible for the failure. For example, if wget fails with libgnutls errors and there is a GnuTLS upgrade in the branch, then that is a likely candidate. If a patch can be identified that hasn't yet been merged into an official branch, then reply to the patch on the mailing list with the details. If it isn't obvious which patch is responsible for the failure, or a patch can be identified but it has already been merged to the release branch, then file a bug and ensure the branch maintainer (see the Releases page for names) is on the CC list.
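
The decision described above can be written down as a small helper. This is only an illustration of the rule. The function name staging_report_action and its yes/no arguments are hypothetical and are not part of swatbot or the autobuilder.

```shell
#!/bin/sh
# Decide how to report a failure seen on a staging branch.
# $1: "yes" if a likely responsible patch was identified, otherwise "no"
# $2: "yes" if that patch is already merged into an official branch, otherwise "no"
staging_report_action() {
    if [ "$1" = "yes" ] && [ "$2" = "no" ]; then
        # A patch still under test: tell its author where they will see it.
        echo "reply to the patch on the mailing list"
    else
        # No obvious culprit, or the culprit is already merged.
        echo "file a bug and CC the branch maintainer"
    fi
}

staging_report_action yes no
```

Only the case of an identified, not yet merged patch leads to a mailing list reply. Every other combination ends in Bugzilla.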

Most failures will be on staging branches, as master-next is the most heavily tested branch. However, it is rebased quite frequently, so it is not always easy to find which patches were included. In that case, you have to get the actual commit hash, either from the build properties (the variable is yp_build_revision) or from the build configuration at the beginning of the stdio log. For example, this qemux86 build (https://autobuilder.yoctoproject.org/typhoon/#/builders/59/builds/3120) was master-next at revision 47482eff9897ccde946e9247724babc3a586d318. With that information, you can then clone poky (or any other layer of interest) and fetch the proper commit and see the git log:

$ git clone git://git.yoctoproject.org/poky
$ cd poky
$ git fetch origin 47482eff9897ccde946e9247724babc3a586d318
$ git log FETCH_HEAD
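
The full log of FETCH_HEAD also contains everything already merged. To narrow it to the patches that were still under test, one option is a revision range against the official branch. The sketch below assumes the clone has an up-to-date origin/master and that the tested revision was based on it. The function name patches_under_test is hypothetical.

```shell
#!/bin/sh
# List the commits that are in the tested revision but not in the base branch,
# i.e. the patches that were still under test in that build.
# $1: revision taken from yp_build_revision
# $2: base ref to compare against (defaults to origin/master)
patches_under_test() {
    git log --oneline --no-merges "${2:-origin/master}..$1"
}

# Usage inside the poky clone, after the fetch shown above:
# patches_under_test 47482eff9897ccde946e9247724babc3a586d318
```

For a stable staging branch, pass the matching release branch as the second argument instead of relying on the default.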


If in doubt, file a bug. All errors must be taken care of.

If the issue is in the infrastructure or autobuilder itself then file a bug against "Infrastructure: Autobuilder". Infrastructure bugs should be assigned to Michael Halstead and autobuilder logic bugs to Richard Purdie.

Official Branches

For builds of official branches, that is master or a release branch, all failures or warnings are critical and must be filed in Bugzilla. Remember to check that the issue isn't already filed. Where an issue is already filed, please do add a comment so we can assess how frequently different issues are occurring.

Update

Finally the swatbot build entry must be updated with a summary of the outcome. Three different resolutions are available:

  • Mail sent: used when you replied to the problematic patch directly on the mailing list.
  • Bug Opened: used when a new bug has been opened or a new comment has been added to a bug. Please add the bug number in the notes.
  • Handled (other): used when the maintainer is already aware of the issue and is working on its resolution, or a patch has already been sent to solve the issue.

You need an account for this step. If it hasn't been provided, please ask on the swat mailing list.

Every issue that is dealt with must be annotated, so it is easy to see which issues have been handled. This includes filing new bugs, finding existing bugs, contacting the mailing list, contacting the maintainer directly on IRC, or identifying that a patch has already been sent to fix the issue.

Tips

An issue will quite often repeat itself across multiple builds. It is worth looking for those repetitions as swatbot will allow you to select many builds and update them all at once.

In the stdio logs window, clicking on the looking glass icon will load the full log in the browser, allowing you to use the search feature of the browser. Quite often, you'll start by looking for "error:".
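
The same search can be done on a log saved locally, which is handy when the log is too large for the browser. A minimal sketch, assuming the stdio log has been downloaded to a file. The function name show_log_errors is hypothetical.

```shell
#!/bin/sh
# Print every line matching "error:" (case-insensitive) from a saved stdio log,
# with line numbers and a few lines of surrounding context.
# $1: path to the downloaded log
# $2: number of context lines before and after each match (defaults to 3)
show_log_errors() {
    grep -n -i -C "${2:-3}" 'error:' "$1"
}

# Usage:
# show_log_errors stdio.log 5
```

The case-insensitive match catches both BitBake's "ERROR:" prefix and lowercase compiler "error:" messages.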

Be sure to check all log files, especially the testimage logs that are available for qemu_boot_log (for example in: /tmp/work/qemux86_64-poky-linux/core-image-sato-sdk/1.0-r0/testimage/qemu_boot_log.20210902120413)
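
When you have shell access to a build tree, or have reproduced the build locally, the boot logs can be located with find. A minimal sketch, assuming the logs sit in testimage directories as in the example path above. The function name find_qemu_boot_logs is hypothetical.

```shell
#!/bin/sh
# List qemu_boot_log files kept in testimage directories below a directory,
# sorted by path so that logs from the same image appear in timestamp order.
# $1: top of the tree to search, for example the build's tmp/work directory
find_qemu_boot_logs() {
    find "$1" -type f -path '*/testimage/qemu_boot_log.*' | sort
}

# Usage:
# find_qemu_boot_logs tmp/work
```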

Handoff

At the end of the week, the outgoing person on Swat should email swat@lists.yoctoproject.org summarising the week and noting anything that the incoming person on Swat next week should be aware of. For example, noting that there's a new intermittent bug to watch for.

Members