Release Automation – Part 1: Bootstrap
February 7th, 2012
One of the first tasks I had as a full-time employee of Mozilla was getting the Bootstrap Release framework working with Firefox 3.0 Beta releases. Now, just over 4 years later, our Release Automation has changed dramatically in many ways: primary language, supported platforms, scope and extent, reliability, and versatility. I thought it might be interesting to trace the path from there to here, and talk about what’s in store for the future, too. Throughout all of this work there have been two overarching goals: 1) Lower the time it takes to go from “go to build” to “updates available for testing” – which we call “end2end time”, and 2) Reduce the number of machines we have to log into, commands we have to run, and active time we have to spend on a release – known as “manual touchpoints”. I’ll be referencing these a lot throughout this series.
This post will talk about what I know of Bootstrap and my work porting it to Firefox 3.0.
In its earliest form Bootstrap was a simple scripted version of much of the previously manual release process. The processes for tagging VCS repositories, creating deliverables (source packages, en-US and localized builds, updates), and some verifications were encapsulated into its scripts. This was a big improvement over the 100% manual, cut+paste-from-a-wiki process. Instead of logging into many machines and running many commands, the release engineer had to log in to many machines and run a few, very simple commands. The very first release that was Bootstrap-aided was Firefox 1.5.0.9, built on December 6th, 2006. This was before my time, but a former release engineer, Rob Helmer, told me that the end2end time back then could be multiple days, with countless touchpoints.
Over time, more parts of the release process were automated with Bootstrap, further reducing the burden on the release engineer. Even with these big improvements some classes of things were still not codified: which machines to run which commands on, when and in what order to run things, who to notify about what. Enter: Buildbot. Integrating Bootstrap into Buildbot was the next logical step in the process. It would handle scheduling and status, while Bootstrap would remain responsible for all of the implementation. With this, the release engineer only had to log in to a few machines and run a few, very simple commands. Another big improvement! The first release to benefit from this was Firefox 2.0.0.8, built on October 10th, 2007. This work was largely done by Rob Helmer.
Around this time we were gearing up to start shipping the first Firefox 3.0 Beta release and had never tested Bootstrap against that development branch. I was tasked with making whatever changes were necessary to Bootstrap and our Buildbot to make it work. The Buildbot side was largely simple, because it sits at such a high abstraction layer, but back in these days we still had single purpose Buildbot masters, so it involved adding several hundred lines of config code.
The Bootstrap side was far more interesting. Until this point, there were a lot of built-in assumptions based on what the 1.8 branch looked like, including:
- Releases are done from CVS branches (explicitly _not_ trunk)
- Windows build machines run Cygwin
- Linux packages are in .gz format
- The crash reporting system Talkback is always shipped
By themselves, none of these things are too challenging to deal with, but as a very new hire, the combination took me about a month to find solutions to and fully test, with many rounds of feedback and guidance along the way. With all of that done and landed, we managed to use the new automation to build Firefox 3.0b2 on December 10, 2007. At this point, the end2end time was around 24h and there were about 20 manual touchpoints.
Over the next 8 months or so there were a few major improvements of note. Firstly, Nick Thomas fixed bug 409394 (Support for long version names), which allowed us to start shipping releases with nicer looking filenames like “Firefox Setup 3.0 Beta 4”. Not a crucial thing, but much nicer from the user perspective. Secondly, bug 422235 (enable fast patcher for release automation) was a massive improvement in update generation, written by schrep. With this work, we went from taking 6-8 hours to generate updates, down to ~1h, an incredible savings in time. Finally, bug 428063 (Support major releases & quit using rc in overloaded ways) (also fixed by Nick) enabled us to build RCs with Bootstrap. While it may sound simple, there are a lot of things in release automation that depend on filename, and catching them all can be difficult. As well as making it possible to build these, this bug also renamed the internal “rc” notion to “build”, to avoid situations where we’d have things like “3.0 RC1 rc1”, which was utterly confusing.
So, in the early days there were tons of improvements in quick succession: Bootstrap itself sped things up and lowered the possibility of error by reducing manual touchpoints. Buildbot + Bootstrap did so again, through the same methods. We also had pure speed-ups through things such as fast patcher. Having these things allowed us to maintain the 2.0.0.x and 3.0.x branches more easily, and get chemspill releases out quickly and simultaneously. All of this work had to be done incrementally too, because we had to continue shipping releases while the work was happening. It’s hard to find good data for releases done with this version of the automation, but I guesstimate that the end2end time was around 12-14 hours and the number of manual touchpoints was still around 20 for a release without major issues.
Next up….Release Automation on Mercurial, v1.
Removed symlinks for dead branches on FTP (Firefox only)
October 5th, 2011
Since the Firefox 2.0 days we’ve had “latest-X.Y” symlinks on FTP for all major versions of Firefox. With rapid release, this has quickly caused an explosion in the number of them, cluttering things up. In bug 689936 I removed all of the ones for dead branches (2.0, 3.0, 3.5), and also all for rapid releases (4.0, 5.0, 6.0, 7.0). From now on, there will be no new branch based symlinks, simply a “latest” symlink that points to the latest rapid release.
New tests coming to opt builds and l10n repacks
March 25th, 2011
For a couple of years now we’ve been building Firefox release builds using slightly different packaging targets than nightly builds. This has streamlined our release automation in some ways, but has had the unfortunate side effect of release build packaging targets being largely untested. This week, we will start correcting that. When bug 600838 lands we will start testing the release build packaging code (“MOZ_PKG_PRETTYNAMES”). These packages will not be uploaded anywhere, but any build that fails in one of these targets will constitute a test failure, and turn the overall build orange. By doing so, we can ensure that any bustages to them will be caught at commit time, rather than during a release.
As of now, these tests are running on Linux (32 and 64 bit) and Windows opt en-US builds only, across all branches (including Try). Sometime next week these tests will be turned on for the remaining opt builds and l10n repacks, except the Mac ones on 1.9.1 and 1.9.2, which fail for unknown reasons, and aren’t worth debugging due to their limited life going forward.
New tests coming to Linux and Windows opt builds
March 21st, 2011
For a couple of years now we’ve been building Firefox release builds using slightly different packaging targets than nightly builds. This has streamlined our release automation in some ways, but has had the unfortunate side effect of release build packaging targets being largely untested. This week, we will start correcting that. When bug 600832 lands we will start testing the release build packaging code (“MOZ_PKG_PRETTYNAMES”). These packages will not be uploaded anywhere, but any build that fails in one of these targets will constitute a test failure, and turn the overall build orange. By doing so, we can ensure that any bustages to them will be caught at commit time, rather than during a release.
For now, these tests will be run on all Linux (32 and 64 bit) and Windows en-US opt builds, including nightlies, but will make their way to Mac builds and l10n repacks shortly.
Purple is a fruit (…and also a colour on Tinderbox!)
September 2nd, 2010
Today I successfully landed the first part of bug 505512, which lays the groundwork for catching all sorts of build problems and turning them purple, instead of red. As part of this initial work we’ll now be catching most problems when cloning Mercurial repositories, turning the builds purple, and automatically retrying them.
In the next week or two I’m going to add similar behaviour for at least the following:
- Graph Server post failures
- Slave disconnections
- Sendchange failures
- out of disk issues
- CVS checkout failures (yes, we still use CVS….)
If there are other things people can think of that should be flagged as infra problems, or that should cause builds to be retried, please add them to this Etherpad: http://etherpad.mozilla.com:9000/build-infra-errors. Bonus points if you write the regular expression that catches it :-).
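For anyone wondering what such a regular expression check might look like, here is a minimal sketch in Python. It is not the code that landed in bug 505512: the pattern list, the function names, and the exact log messages matched are all assumptions made for illustration, and real logs may word these errors differently.

```python
import re

# Hypothetical patterns, for illustration only. The messages our
# infrastructure actually emits may differ from these.
INFRA_PATTERNS = {
    "hg clone failure": re.compile(
        r"abort: .*(timed out|connection refused)", re.IGNORECASE
    ),
    "out of disk": re.compile(r"No space left on device"),
    "cvs checkout failure": re.compile(r"cvs \[checkout aborted\]"),
}


def classify(log_text):
    """Return the name of the first infra pattern found in the log, else None."""
    for line in log_text.splitlines():
        for name, pattern in INFRA_PATTERNS.items():
            if pattern.search(line):
                return name
    return None


def colour_for_failed_build(log_text):
    """Pick a colour for a build that has already failed.

    Purple means "infrastructure problem, worth retrying";
    red means an ordinary build failure.
    """
    return "purple" if classify(log_text) else "red"


if __name__ == "__main__":
    print(colour_for_failed_build("make: ***\nNo space left on device\n"))
    print(colour_for_failed_build("error: expected ';' before '}' token\n"))
```

Keeping the patterns in one table means a new entry from the Etherpad is a one-line addition, and the name attached to each pattern can be reported alongside the purple result.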
Currently, the purpleness is only visible on plain Tinderbox, but once bug 592340 is resolved, TBPL will support it as well.
More than 3 months worth of machine time has been saved by *you*
August 24th, 2010
Ever since bug 541364 landed 4 months ago it’s been possible to selectively disable platforms on try by overriding specific mozconfigs. Since that time, roughly 2321 hours (that’s 96 days or 13 weeks or ~3 months) of machine time have been saved through this — and that calculation is only compile time, even more than that has been saved on the test side. I just want to say a huuuuuuge THANKS! on behalf of RelEng. Taking the time to disable unneeded things on a push makes a noticeable difference in the time it takes us to turn around a full set of tests, especially during busier times.
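For the curious, the unit conversion behind those figures can be checked with a few lines of Python. This is only a sketch of the arithmetic; the 30-day month is my assumption, and the numbers quoted above are rounded down.

```python
def convert_hours(hours):
    """Convert machine hours into days, weeks and (30-day) months."""
    days = hours / 24
    weeks = days / 7
    months = days / 30  # assumption: a month is treated as 30 days
    return days, weeks, months


if __name__ == "__main__":
    days, weeks, months = convert_hours(2321)
    print(f"{days:.1f} days, {weeks:.1f} weeks, {months:.1f} months")
    # prints "96.7 days, 13.8 weeks, 3.2 months"
```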
More stats:
- The most common platforms disabled were Mac 64-bit (115 times), Maemo (4: 115, 5 gtk: 106, 5 qt: 105), and Android (120 times)
- The least common platforms disabled were Windows (58 times), Linux 32-bit (85 times), and Mac 32-bit (96 times)
My week of buildduty
August 20th, 2010
I’ve been on buildduty this week, which means I’ve been subject to countless interrupts to look at infrastructure issues, start Talos runs, cancel try server builds, etc. It’s not shocking to me that I got very little of what I planned done this week, but I’m still a bit surprised when I look at the full list of what I *have* done:
- Supervise 4.0b4 release
- Deal with test masters getting really backed up
- RFT (Request for Talos) / try build canceling (~15 times)
- Disable merging for tests on try builds
- Help IT debug issues with Try repos
- Help Axel fix issues with l10n nightly builds on trunk
- Disable all nightly builds after omnijar bustage
- Schedule/manage downtime
- Sign BYOB builds
- Stage MozillaBuild 1.5.1
- Add Camino 2.0.4 to bouncer
….and my Friday is only half over!
Which build infrastructure problems do you see the most?
August 13th, 2010
I’m hoping to tackle bug 505512 (Make infrastructure related problems turn the tree a color other than red) in the next few weeks. Most of the groundwork for it is laid, which means that most of what I’ll be doing is parsing logs for infrastructure errors.
So, what errors do you see most from our build infrastructure? Are there other things that you would classify as infrastructure issues? Please add any suggestions you have to this Etherpad: http://etherpad.mozilla.com:9000/build-infra-errors
Update on recent Tinderbox issues
August 12th, 2010
My last post talked about the issues we’ve been having with load on the Tinderbox server and some ways we could fix it. I’m happy to report that two things were completed yesterday that should keep the load under control for the foreseeable future.
One of the things mentioned in my previous post, splitting incoming build processing from the rest of Tinderbox (bug 585691), was completed very late last night. Additionally, Nick Thomas discovered that we had lost the cronjob that takes care of cleaning out old builds from Tinderbox’s memory. That script was re-enabled and a one time clean up removed 64GB of old build data. Both of these were completed around 4am PDT this morning and load is looking much better.
Especially because we’re now running cleanup scripts on a regular basis again, I believe that this should get us back to a good state.
Everyone should feel free to send justdave their thanks for staying up reaaaaallllly late last night to get us back to a good state.
Recent Tinderbox issues
August 10th, 2010
As many of you know there have been numerous times lately that Tinderbox has become unresponsive, sometimes to the point of going down completely for a period of time. This post will attempt to summarize the issues and what’s being done about them.
The biggest issue is load (surprise!). In a period of a few years we’ve gone from a few active trees with tens of columns between them to tens of active trees with hundreds of columns between them. Unsurprisingly, this has made the Tinderbox server a lot busier. The biggest load items are:
- showlog.cgi – Shows a log file for a specific build
- showbuilds.cgi – Shows the main page for a tree
- processbuilds.pl – Processes incoming “build complete” mail
A bit of profiling has also been done in bug 585814 to try to find specific hotspots.
We’ve already done a few things to help with Tinderbox load:
Other ways we’re looking at improving the situation:
- bug 585691 – Split up Tinderbox data processing from display. This wouldn’t reduce overall load, but it should segregate it enough to keep the Tinderbox display up.
- bug 390341 – Pregenerate brief and full logs. This would eliminate the need for showlog.cgi to uncompress logs in most cases.
- bug 530318 – Put full logs on FTP server; stop serving them from Tinderbox.