Anatomy of an SDK update

October 2nd, 2009

Over the course of the past week or so I’ve been working on rolling out the Windows 7 SDK to our build machines. Doing so presented two challenges: getting the SDK to deploy silently and properly, and updating the appropriate build configurations to use it. Neither of these may sound very challenging, and indeed they didn’t to me either, but because of a combination of factors this ended up becoming a week-long ordeal. In this post I will attempt to untangle everything that happened.

Let’s start with the actual SDK installation. Unlike most other reasonable packages, the Windows 7 SDK is not distributed as an MSI package, but rather as a collection of MSIs wrapped in an EXE. Unfortunately, this EXE doesn’t let you do a customized, silent install, which is precisely what we need. Vainly, I thought I could figure out the proper order and magic options to install the enclosed MSIs myself. Needless to say, this failed. To work around this I fell back on an AutoIt script that clicks through the interactive installer for me. It took some fuss, but not too much difficulty, to get that working.

Now, the fun part (of deployment). We use a piece of software called OPSI to schedule and perform software installations across our farm of 80 or so Windows VMs. OPSI runs very early in the Windows start-up process, and actually executes as the SYSTEM user. Well, it turns out that the Windows 7 SDK must be installed by a full user, not the SYSTEM account. This seems unnecessary, as we’ve deployed other SDKs through OPSI in the past without issue. After trying to fake it out by setting various environment variables I turned to the OPSI forums for some help. (As an aside, the OPSI developers have been fantastic in their support of our installation, many thanks to them.) It turns out that I’m not the first person to hit problems like this. They pointed me to a template for a script that works around such an issue. The solution ends up being:

  1. Copy installation files to the slave
  2. Create a new user in the Administrators group, set that user to log in automatically at next boot
  3. Reboot, and run the package installation at login
  4. Restore the original automatic login, reboot
  5. Clean up (delete installation files, remove the created user)

This is obviously quite hacky, but it gets the job done.
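For the curious, the five steps above can be sketched roughly as follows. This is not the OPSI template itself (I'm not reproducing its syntax here), just a small Python illustration that builds the stock Windows commands each phase would need without running them. The function names, user names, and paths are all hypothetical, and the automatic login is assumed to be driven by the usual Winlogon registry values.

```python
# Hypothetical sketch: builds command strings only, nothing is executed.
WINLOGON_KEY = r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"


def autologon_commands(user, password):
    """Commands that make `user` log in automatically at next boot."""
    values = [
        ("AutoAdminLogon", "1"),
        ("DefaultUserName", user),
        ("DefaultPassword", password),
    ]
    return [
        f'reg add "{WINLOGON_KEY}" /v {name} /t REG_SZ /d "{data}" /f'
        for name, data in values
    ]


def install_plan(temp_user, temp_password, orig_user, orig_password,
                 source_dir, staging_dir, installer_cmd):
    """Return the workaround as three ordered phases.

    Phases 1 and 2 each end with a reboot; phase 3 is the final cleanup.
    """
    reboot = "shutdown /r /t 0"
    phase1 = [
        # Step 1: copy installation files to the slave.
        f'xcopy "{source_dir}" "{staging_dir}" /E /I /Y',
        # Step 2: temporary administrator that logs in automatically.
        f"net user {temp_user} {temp_password} /add",
        f"net localgroup Administrators {temp_user} /add",
        *autologon_commands(temp_user, temp_password),
        # Step 3 begins with this reboot.
        reboot,
    ]
    phase2 = [
        # Step 3: run the installer as the temporary (full) user.
        installer_cmd,
        # Step 4: restore the original automatic login, then reboot.
        *autologon_commands(orig_user, orig_password),
        reboot,
    ]
    phase3 = [
        # Step 5: delete installation files, remove the created user.
        f'rmdir /S /Q "{staging_dir}"',
        f"net user {temp_user} /delete",
    ]
    return [phase1, phase2, phase3]
```

Calling `install_plan(...)` returns three lists of command strings, one per boot, which makes the reboot boundaries explicit. Note that this approach leaves passwords in the registry in plain text between reboots, which is part of why it counts as hacky.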

So! With that in hand (and in repo) we set the SDK to deploy over the course of Wednesday night and Thursday morning. Overall, this went smoothly. For a reason I haven’t yet figured out, some of the slaves needed some kicking to do the installation properly.

Remember how I said part 2 of this was updating the build configurations? I had planned to do this on Friday, and even posted a patch in preparation. Well, it turns out that MozillaBuild likes to be smart and find the most recent SDK and compiler for you. This completely slipped my mind while I was doing the deployment and, as a result, all builds from Thursday (yesterday) morning to Friday (today) morning, including those on mozilla-1.9.1, were done with the Windows 7 SDK. This went unnoticed for most of Thursday, until I was doing a final test of my build configuration patch.

Here’s where the fun starts for this part. After discovering I’d accidentally changed the SDK for everything I went into a bit of a panic and rapidly started testing some fixes out in our staging environment. During the course of this I discovered that things were worse than I thought. Most builds were using the Windows 7 SDK, but not the “unit test” ones. So we weren’t even using the same SDK for all the builds for a given branch! Getting all of that sorted out was compounded by all of the iterations of path styles (c:/ vs. c:\ vs. /c/) I had to try before I found the magic combination. In the end, I discovered a few things:

  • If you’re specifying LIB/INCLUDE/SDKDIR in a mozconfig, you must use Windows-style paths
  • If you’re specifying PATH in a mozconfig, you CANNOT use Windows-style paths – you must use MSYS style
  • You can’t test for these things properly without clobbering
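To make the two path styles concrete, here is a small Python sketch that converts between them. It is only an illustration of the c:\ versus /c/ distinction above and is not part of MozillaBuild or our configs. The function names are made up, and I'm assuming a bare single-letter first component (like /c) always means a drive, which is not true of every MSYS path.

```python
import re

# Hypothetical helpers; names are for illustration only.
_DRIVE = re.compile(r"^([A-Za-z]):[\\/]*")
_MSYS_DRIVE = re.compile(r"^/([A-Za-z])(?=/|$)")


def to_msys(path):
    """Convert c:\\foo\\bar or c:/foo/bar to /c/foo/bar."""
    match = _DRIVE.match(path)
    if match is None:
        # No drive letter: just normalize the separators.
        return path.replace("\\", "/")
    rest = path[match.end():].replace("\\", "/")
    return f"/{match.group(1).lower()}/{rest}"


def to_windows(path):
    """Convert /c/foo/bar to c:\\foo\\bar (backslash flavour)."""
    match = _MSYS_DRIVE.match(path)
    if match is None:
        return path.replace("/", "\\")
    rest = path[match.end():].lstrip("/").replace("/", "\\")
    return f"{match.group(1).lower()}:\\{rest}"


def to_msys_path_list(windows_path_list):
    """Convert a ';'-separated Windows PATH value to a ':'-separated MSYS one."""
    entries = [entry for entry in windows_path_list.split(";") if entry]
    return ":".join(to_msys(entry) for entry in entries)
```

So a value destined for PATH would go through `to_msys_path_list`, while values for LIB, INCLUDE, or SDKDIR would stay in (or be converted back to) the Windows form. The sketch emits backslashes for the Windows form; that choice is mine for the example.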

As I write this the first set of builds that all use the correct SDK are finishing up, and this deployment from hell appears to be nearly over. I want to express a special thanks to the OPSI developers, who were very helpful, and to Nick Thomas and Chris AtLee, for their patience with my countless iterations of build configuration patches. As a final note, let me state explicitly which SDK is being used where:

  • Windows Vista SDK (6.0a): mozilla-1.9.1 builds
  • Windows 7 SDK (7.0): mozilla-central, mozilla-1.9.2, TraceMonkey, Electrolysis, and Places builds

WinCE and WinMO builds are unaffected by this deployment.

It’s the start of a new week and the start of my first shift as RelEng sheriff. Do you suspect rando-orange? Talk to me. Do you suspect machine problems? Talk to me. I’ll be watching dev.tree-management, the releng triage queue, and #developers for issues.

During this time we will be deploying all of the changes listed on this page as well as two Talos bugs: bug 379233 and bug 458093

This will affect the following trees: Firefox (mozilla-central), Firefox3.1 (mozilla-1.9.1), Tracemonkey (tracemonkey), and Mobile (mobile-browser).

As a heads up please note that the Talos changes could affect numbers and will change the layout of the results on the Tinderbox Waterfall.

If there is any reason why we shouldn’t go ahead with this please e-mail release@mozilla.com.

Firefox 3.1 Branching Schedule

November 19th, 2008

(Please direct responses to mozilla.dev.planning)

Hi All,

To follow up on my post yesterday, here’s the when and how of branching:
* Tag mozilla-central and l10n repositories with GECKO_1_9_1_BASE so we know where we branched going forward
* File IT bugs to have mozilla-central, l10n repositories cloned and a Firefox 3.1 Tinderbox created.
* While this is happening RelEng will do as much setup of infrastructure as possible.
* Once the clones are finished we will do some tests in our staging environment.
* When we’re satisfied with those results we will start pushing those infrastructure changes to production.

ETA on being fully up and running (that is to say: parity with mozilla-central) is early in the PST day Monday, November 24th. We do expect to have infrastructure up this week but some baking time, especially for Talos, is important.

Other important information:
* mozilla-central will NOT be used for trunk development until after we tag for Firefox 3.1b2. Once that happens we will bump mozilla-central versions to 1.9.2a1pre/3.2a1pre.
* The new mozilla-1.9.1 repository will be rebranded to Shiretoko to avoid confusion

https://bugzilla.mozilla.org/show_bug.cgi?id=464640 is the tracking bug for branching, for those interested in tracking the blow-by-blow.

During this time we will be upgrading both the Firefox 3 and Firefox 3.1 Unit test Buildbots to a newer version (0.7.9). In order to avoid interrupting running builds we will be closing the tree at 4am PDT and stopping any new builds from being scheduled. Once all builds have finished we will perform the upgrade and open the tree again. Depending on the timing this could take anywhere from 20 minutes to a few hours. The tree should be open again no later than 7am PDT.

If there is any reason why we shouldn’t go ahead with this please e-mail release@mozilla.com

Useful symbols for OS X builds!

September 8th, 2008

Some of you may have noticed bug 448616. It turns out that we have not had proper breakpad symbols for OS X 3.1 builds *ever*. This caused our OS X crash reports to be significantly less useful. We’ve had a heck of a time isolating the problem. It took quite a few manual builds before we could even reproduce it. There were a few rounds of diagnosis involved before the cause became clear: running ‘make package’ before ‘make buildsymbols’ breaks OS X symbols. It seems that ‘make package’ on Mac strips objdir/ppc/dist/universal, which is where we generate our symbols from!

With that knowledge in hand I deployed a fix... and unwittingly broke Linux nightlies. As part of my fix, I had changed the order in which we do symbols, updates, and packaging. Doing so tripped another weirdness in our build system: on Linux, ‘make -C tools/update-packaging’ (creating a full mar) *must* be done after ‘make package’. On Mac and Windows, however, we’re able to create a mar at any point.

So, the correct order to do post-build processing in is thus: Symbols, Packaging, Updates.
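Here is a rough Python sketch of that ordering with the two constraints written down explicitly. It is not our actual Buildbot configuration. The three make invocations come from the description above, while the function names, the `runner` parameter, and the constraint table are made up for illustration.

```python
import subprocess

# Make invocations mentioned in this post, keyed by a hypothetical step name.
COMMANDS = {
    "symbols": ["make", "buildsymbols"],
    "packaging": ["make", "package"],
    "updates": ["make", "-C", "tools/update-packaging"],
}

# (earlier, later) pairs that must hold for every platform to work.
CONSTRAINTS = [
    # On Mac, 'make package' strips the files symbols are generated from.
    ("symbols", "packaging"),
    # On Linux, the full mar must be created after 'make package'.
    ("packaging", "updates"),
]


def is_valid_order(order):
    """True if `order` satisfies every (earlier, later) constraint."""
    position = {name: index for index, name in enumerate(order)}
    return all(position[a] < position[b] for a, b in CONSTRAINTS)


def run_post_build(objdir, order=("symbols", "packaging", "updates"),
                   runner=subprocess.run):
    """Run the post-build steps in `order` from inside `objdir`."""
    if not is_valid_order(order):
        raise ValueError(f"unsafe post-build order: {order}")
    for name in order:
        # check=True stops the sequence at the first failing step.
        runner(COMMANDS[name], cwd=objdir, check=True)
```

With only these two constraints, Symbols, Packaging, Updates is the one ordering of the three steps that passes `is_valid_order`, which matches the conclusion above.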

This fix is now deployed in production and nightly builds have been re-spun. The latest Mac nightly of 3.1 should have proper breakpad symbols (and in turn, make our crash reports useful). (Unfortunately, due to bug 454198 line numbers are only available in the raw dump, but still, this is a big improvement.) Big thanks to Lukas and Ted for all their help here. Any regressions due to this should be filed in mozilla.org:Release Engineering.

As a side note, we ended up upgrading to Xcode 3.1 in a failed attempt to fix this bug. Included with it is GCC 4.2, which sadly bails out very early in our build process.

During this time we will be landing two things:
* Test reporting to graphs.mozilla.org for codesighs and leak tests on mozilla-central (bug 433710)
* Support for pushing try server builds to an hg.m.o repository (details later) (bug 448014)

No downtime is expected.

If there is any reason why we shouldn’t go ahead with this please e-mail release@mozilla.com