All about the RelEng sheriff

October 26th, 2009

Since February of this year we’ve had a rotating RelEng “sheriff” available. We started it to make a couple of things better:

  • Improve response time on critical issues
  • Avoid having the whole team distracted with infrastructure issues

By and large, this has been an improvement for us and, we think, for developers as well. Serious issues are dealt with more quickly; developers and the developer sheriff have someone specific to go to with acute issues that come up. Internally, this has helped us focus more, too. With the RelEng sheriff dealing with triage and other acute issues, the rest of us are able to focus on our other work without distraction.


What is the RelEng sheriff responsible for?


Who is the RelEng sheriff?

The RelEng sheriff is rotated weekly. You can find out who the current RelEng sheriff is by looking at the schedule.


How to get a hold of the RelEng sheriff

The best place to find them is on IRC, in #build or #developers. They should be wearing a ‘|buildduty’ tag at the end of their nick. You can also get our attention in other ways, if IRC doesn’t work for you:

Bugs and IRC pokes are the preferred methods but any will work. Also note that the RelEng sheriff is only around during their normal working day, which can be PDT/PST, EDT/EST, or NZDT/NZST. If a RelEng sheriff isn’t around, someone can be reached in #build.


What can your sheriff do for you?

The on-duty RelEng sheriff would be more than happy to do any of the following for you:

  • Trigger any sort of build or test run you need, including:
    • Extra unit test or Talos runs of any given build
    • Retriggering builds that fail for spurious reasons
  • Deal with any nightly updates that fail
  • Help debug possible build machine issues
  • Help debug test issues that you cannot reproduce yourself
  • Answer questions you may have about build or test infrastructure

The RelEng sheriff is also a good first-contact point for any other random things. They may be able to help you directly but if not, they can certainly point you to the person who can.

After reading this, I hope you have a better understanding of the who, what, and why of the RelEng sheriff. If anything is unclear or absent I’m happy to clarify.

Anatomy of an SDK update

October 2nd, 2009

Over the course of the past week or so I’ve been working on rolling out the Windows 7 SDK to our build machines. Doing so presented two challenges: getting the SDK to deploy silently and properly, and updating the appropriate build configurations to use it. Neither of these may sound very challenging, and indeed, they didn’t to me either, but because of a combination of factors this ended up becoming a week-long ordeal. In this post I will attempt to untangle everything that happened.

Let’s start with the actual SDK installation. Unlike most other reasonable packages, the Windows 7 SDK is not distributed as an MSI package, but rather as a collection of MSIs wrapped in an EXE. Unfortunately, this EXE doesn’t enable you to do a customized, silent install – the precise thing we need. Vainly, I thought I could figure out the proper order and magic options to install the enclosed MSIs properly. Needless to say, this failed. To work around this I fell back on an AutoIt script that would click through the interactive installer for me. It took some fuss, but not too much difficulty, to get that working.

Now, the fun part (of deployment). We use a piece of software called OPSI to schedule and perform software installations across our farm of 80 or so Windows VMs. OPSI runs very early in the Windows start-up process, and actually executes as the SYSTEM user. Well, it turns out that the Windows 7 SDK must be installed by a full user, not the SYSTEM account. This seems unnecessary, as we’ve deployed other SDKs through OPSI in the past without issue. After trying to fake it out by setting various environment variables I turned to the OPSI forums for some help. (As an aside, the OPSI developers have been fantastic in their support of our installation, many thanks to them.) It turns out that I’m not the first person to hit problems like this. They pointed me to a template for a script that works around such an issue. The solution ends up being:

  1. Copy installation files to the slave
  2. Create a new user in the Administrators group, set that user to log in automatically at next boot
  3. Reboot, and run the package installation at login
  4. Restore the original automatic login, reboot
  5. Clean up (delete installation files, remove the created user)

This is obviously quite hacky, but it gets the job done.

So! With that in hand (and in repo) we set the SDK to deploy over the course of Wednesday night and Thursday morning. Overall, this went smoothly. For some reason I haven’t yet figured out, some of the slaves needed some kicking to do the installation properly.

Remember how I said part 2 of this was updating the build configurations? I had planned to do this on Friday, and even posted a patch in preparation. Well, it turns out that MozillaBuild likes to be smart and find the most recent SDK and compiler for you. This completely slipped my mind while I was doing the deployment and, as a result, all builds from Thursday (yesterday) morning to Friday (today) morning, including those on mozilla-1.9.1, were done with the Windows 7 SDK. This went unnoticed for most of Thursday, until I was doing a final test of my build configuration patch.
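To illustrate why simply installing a newer SDK changed the builds without any configuration change, here is a minimal sketch of "pick the most recent SDK" logic. This is not MozillaBuild's actual code: the function name `pick_latest_sdk` and the way versions are compared are assumptions made purely for illustration.

```python
def pick_latest_sdk(installed_versions):
    """Return the highest version string from a list such as ["6.0a", "7.0"].

    Hypothetical helper: compares the numeric parts of each version first,
    then falls back to the raw string as a tie-breaker.
    """
    def sort_key(version):
        numeric = []
        for part in version.split("."):
            digits = "".join(ch for ch in part if ch.isdigit())
            numeric.append(int(digits) if digits else 0)
        return (numeric, version)

    return max(installed_versions, key=sort_key)


# Before the deployment only the Vista SDK is present.
print(pick_latest_sdk(["6.0a"]))
# After the deployment the newer SDK wins, even though no mozconfig changed.
print(pick_latest_sdk(["6.0a", "7.0"]))
```

With auto-detection like this, the only way to keep a branch on the older SDK is to name that SDK explicitly in its build configuration.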

Here’s where the fun starts for this part. After discovering I’d accidentally changed the SDK for everything I went into a bit of a panic and rapidly started testing some fixes out in our staging environment. During the course of this I discovered that things were worse than I thought. Most builds were using the Windows 7 SDK, but not the “unit test” ones. So we weren’t even using the same SDK for all the builds for a given branch! Getting all of that sorted out was compounded by all of the iterations of path styles (c:/ vs. c:\ vs. /c/) I had to try before I found the magic combination. In the end, I discovered a few things:

  • If you’re specifying LIB/INCLUDE/SDKDIR in a mozconfig, you must use Windows-style paths
  • If you’re specifying PATH in a mozconfig, you CANNOT use Windows-style paths – you must use MSYS style
  • You can’t test for these things properly without clobbering
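The two path rules above are easy to mix up, so here is a small sketch of converting between the styles. The helper names and the example directories are hypothetical, and the sketch assumes "Windows-style" means a drive letter followed by backslashes; it is not code from our build configurations.

```python
import re


def windows_to_msys(path):
    """Convert 'c:\\sdk\\bin' or 'c:/sdk/bin' to the MSYS form '/c/sdk/bin'."""
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", path)
    if not match:
        raise ValueError(f"not a drive-letter path: {path!r}")
    drive, rest = match.groups()
    rest = rest.replace("\\", "/")
    return f"/{drive.lower()}/{rest}" if rest else f"/{drive.lower()}"


def msys_to_windows(path):
    """Convert the MSYS form '/c/sdk/lib' to 'c:\\sdk\\lib'."""
    match = re.match(r"^/([A-Za-z])(?:/(.*))?$", path)
    if not match:
        raise ValueError(f"not an MSYS drive path: {path!r}")
    drive, rest = match.group(1), match.group(2) or ""
    return f"{drive.lower()}:\\" + rest.replace("/", "\\")


# PATH entries need the MSYS form (hypothetical directory).
print(windows_to_msys("c:\\sdk\\bin"))
# LIB/INCLUDE/SDKDIR need the Windows form (hypothetical directory).
print(msys_to_windows("/c/sdk/lib"))
```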

As I write this the first set of builds that all use the correct SDK are finishing up, and this deployment from hell appears to be nearly over. I want to express a special thanks to the OPSI developers, who were very helpful, and to Nick Thomas and Chris AtLee, for their patience with my countless iterations of build configuration patches. As a final note, let me state explicitly which SDK is being used where:

  • Windows Vista SDK (6.0a): mozilla-1.9.1 builds
  • Windows 7 SDK (7.0): mozilla-central, mozilla-1.9.2, TraceMonkey, Electrolysis, and Places builds

WinCE and WinMO builds are unaffected by this deployment.

Because of the major version bump in mozilla-central, all users of mozilla-central nightlies will be bumped to mozilla-1.9.2 nightlies today. If you want to continue to track the Firefox 3.6 / Gecko 1.9.2 builds no action is required. If you want to track the post-1.9.2 version or absolute “trunk” of Firefox/Gecko you will need to download today’s mozilla-central nightly build, found in the nightly area of the ftp server.

This morning I landed bug 486567 – which cleaned up the try server code significantly. There’s still more to be done there, particularly running unittests on packaged builds once its production counterpart lands (bug 383136). Both of these things help us keep the Try Server in sync with the rest of the world – which has always been a problem.

Looking ahead a little, I’m planning to land a patch on Tuesday that enables e-mail notification for try server builds and unit tests. With this patch, every try submission would result in 6 e-mails to the submitter (1 per platform/build type combination). Here’s what they’ll look like:
Build:
Your Try Server build (try-1c170baeac1) was successfully completed on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1

Visit http://hg.mozilla.org/MozillaTry to view the full logs.

Unit test:
Your Try Server unit test (try-1c170baeac1) completed with warnings on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1

Summary of unittest results:
check: 2/0

Visit http://hg.mozilla.org/MozillaTry to view the full logs.

(The unittest e-mails will have the full results listed, of course.)
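As a rough sketch of where the 6 e-mails per submission come from, the snippet below generates one message per platform/build type combination. It is not the patch itself: the function name is hypothetical, the wording is simplified from the samples above, and the platform names other than "linux" are assumptions.

```python
from itertools import product

# Only "linux" appears in the samples above; the other two names are assumed.
PLATFORMS = ["linux", "macosx", "win32"]
KINDS = ["build", "unit test"]


def notifications(identifier, submitter):
    """Return one notification message per platform/build type combination."""
    url = f"http://build.mozilla.org/tryserver-builds/{submitter}-{identifier}"
    messages = []
    for platform, kind in product(PLATFORMS, KINDS):
        messages.append(
            f"Your Try Server {kind} ({identifier}) finished on {platform}. "
            f"It should be available for download at {url}"
        )
    return messages


mails = notifications("try-1c170baeac1", "bhearsum@mozilla.com")
print(len(mails))
print(mails[0])
```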

E-mail notification has been an oft-requested feature so I’m really excited that this will be landing soon.

I’m happy to announce that I’ve finally updated the publicly available version of our CentOS 5.0 build reference platform. There are many changes to it since the last released version, most notably a Scratchbox installation and Mercurial. For all the details you can have a look at the reference platform wiki page. Everything up to Version 17 is included in the released version.

You can get it here: ftp://ftp.mozilla.org/pub/mozilla/VMs/

…but only those to RelEng infrastructure.

We in Release Engineering always love it when people take the time and effort to fix a bug, clean up some code, or otherwise enhance our infrastructure. However, it’s often difficult to take outside patches because they are generally untested. Because of the nature of our code and systems – how tightly they’re tied to infrastructure, limited access to said infrastructure, and how many different systems a single change can touch – it’s nearly impossible for outside contributors to do more than the most basic testing. We don’t have a Try Server that can test patches to RelEng code, and proper testing requires many different machines and can be very time consuming – especially if you’ve never done it.

It’s always unfortunate to see patches sitting for days, weeks, or in rare cases, months, before they get landed. However, we don’t like to land untested patches because it can lead to unnecessary build bustage.

I want to fix this, so over the next few months I’m going to be prioritizing testing your patches every Monday. I will set aside my normal work for a day to help test and get ready to land contributed patches. High priority things such as releases or infrastructure problems will take precedence over this, but that shouldn’t be a common occurrence.

I’ll be keeping an eye out for things, but if you want to ping me directly about a bug please feel free to do so.

Consider this in an experimental state. I’ll be tweaking the process along the way and am very open to improvements here.

It’s the start of a new week and the start of my first shift as RelEng sheriff. Do you suspect rando-orange? Talk to me. Do you suspect machine problems? Talk to me. I’ll be watching dev.tree-management, the releng triage queue, and #developers for issues.

During this time we will be deploying all of the changes listed on this page as well as two Talos bugs: bug 379233 and bug 458093.

This will affect the following trees: Firefox (mozilla-central), Firefox3.1 (mozilla-1.9.1), Tracemonkey (tracemonkey), and Mobile (mobile-browser).

As a heads up please note that the Talos changes could affect numbers and will change the layout of the results on the Tinderbox Waterfall.

If there is any reason why we shouldn’t go ahead with this please e-mail release@mozilla.com.

If you are crashing after updating to the latest mozilla-1.9.1 nightly (20090218) try flipping javascript.options.jit.chrome to false – that should fix it. I’ve filed this as bug 479053 but flipping that pref should fix you up in the meantime.

During this time we will be deploying all of the changes listed on this page. Most notably, clobber support for all Mercurial-based builds will be landed tomorrow. More on that soon from Chris AtLee.

This will affect all trees: Firefox (mozilla-central), Firefox3.1 (mozilla-1.9.1), Tracemonkey (tracemonkey), and Mobile (mobile-browser).

One of these requires a full Buildbot master restart – so expect some spurious burning. We’ll do our best to minimize it and get things green ASAP.

If there is any reason why we shouldn’t go ahead with this please e-mail release@mozilla.com.