Do you have a random or permanent orange bug that you want to debug? Are you having trouble reproducing it? If so, please file a bug in mozilla.org:Release Engineering and we’ll get you access to a machine. If the failure is on Linux you can even bypass this, and download a copy of our ref VM here. Anyone with time and effort is welcome to this offer — you do not need to be an employee of Mozilla.

Serendipity

March 12th, 2010

This week I’ve been continuing to work on issues related to the new Linux build machines mentioned in my last post. I’m hoping to resolve a few test failures on them before they get put back into production. Yesterday my morning was focused on this task and by lunch I was tearing my hair out. Over lunch with some MoTo co-workers we ended up chatting about these and the wonderful Ehsan ended up volunteering to look at one of them. By the end of the day 1/3 of the failures were fixed, and a previously unknown bug was resolved!

Big thanks to Ehsan for spending his valuable time on this!

GRUB, the MBR, BIOS bugs?

March 9th, 2010

Recently we ordered a set of server-class Linux machines to supplement our pool of VMs. They are lightning fast, especially compared to VMs, but it’s been a bit of a bumpy ride getting them ready to go to production. Most notably we’ve had a mysterious problem where they would occasionally refuse to boot, halting at a “GRUB _” dialogue. It took a while, but we believe we have this fixed now.

This problem first occurred on 2 out of 25 slaves. Catlee quickly discovered that it could be fixed with a simple re-installation of GRUB, so that’s what we did, and moved on. The thought at the time was that the MBR somehow got partially overwritten or otherwise corrupted. A day later, 2 more slaves hit the same issue. Since it was the second time we hit the issue there was some more speculation and digging. We had made a few changes to the machines, including:
* Changing the hard disk controller from “IDE” mode to “AHCI”.
* Changing the kernel to a PAE version.

Both of those were pretty quickly dismissed as the causes. It seemed very unlikely that the kernel version could cause an issue with the bootloader, and the problem didn’t occur instantly after changing the disk controller mode, so that seemed unlikely too. With other important things happening we again moved on.

The next day I did some more googling, this time about GRUB in general, and came across a page detailing the GRUB boot process. In it, it talks about how to dump the contents of the MBR and view it as hex. Seeing that made me *very* eager to compare a working slave vs. a busted one. Unfortunately there was no longer a busted machine to look at.

After 5 or so days without issues, and after all other setup and configuration issues were taken care of, we decided to move them to production and deal with the GRUB problems if they arose. As luck would have it, 2 machines refused to boot as they were being moved to production. After booting from a rescue disk and dumping the MBR I found that bytes 0x40 through 0x49 differed from those of a working slave. I also noticed that the MBR of a busted slave was identical to one that had *never* broken, and thus, never had GRUB re-installed. This seemed to rule out MBR corruption.
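For anyone who wants to repeat this kind of comparison, here is a minimal sketch of diffing two MBR dumps byte by byte. It is not the tool used above. The function names are made up for illustration, and it assumes each dump is a file containing at least the first 512-byte sector of the disk (for example, one produced with something like `dd if=/dev/sda of=mbr.bin bs=512 count=1`, where the device name is only an example).

```python
from typing import List, Tuple

MBR_SIZE = 512  # the MBR is the first 512-byte sector of the disk


def read_mbr(path: str) -> bytes:
    """Read the first sector from a dump file (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(MBR_SIZE)


def diff_mbr(working: bytes, broken: bytes) -> List[Tuple[int, int, int]]:
    """Return (offset, working_byte, broken_byte) for each differing byte."""
    if len(working) < MBR_SIZE or len(broken) < MBR_SIZE:
        raise ValueError("each dump must contain a full 512-byte sector")
    return [
        (offset, w, b)
        for offset, (w, b) in enumerate(zip(working[:MBR_SIZE], broken[:MBR_SIZE]))
        if w != b
    ]


def format_diff(diffs: List[Tuple[int, int, int]]) -> str:
    """Render the differences as one hex line per offset."""
    return "\n".join(
        f"0x{offset:03x}: working=0x{w:02x} broken=0x{b:02x}"
        for offset, w, b in diffs
    )
```

Calling `print(format_diff(diff_mbr(read_mbr("working.bin"), read_mbr("broken.bin"))))` on two dumps (the file names are placeholders) lists only the offsets that differ, which is much easier to scan than two full hex dumps side by side.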

With some more information in my hands I looked for some help or pointers from the GRUB developers, on Freenode. One of them pointed me to this section of the GRUB Manual which documents some key bytes of the MBR. Notably, byte 0x40 is described as “The boot drive. If it is 0xFF, use a drive passed by BIOS.” On a working slave this was set to 0xFF. On a broken one, it was set to 0x80 (which I was told means “first hard drive”). That certainly sounds like something that could affect bootability!
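A minimal sketch of checking that byte in a dumped MBR follows. The 0xFF meaning is the one quoted from the GRUB Manual above, and 0x80 as “first hard drive” is what I was told. Treating other values of 0x80 and up as later hard drives, and values below 0x80 as floppy drives, follows the usual BIOS drive numbering and is an assumption here rather than something the manual excerpt states. The function name is made up for illustration.

```python
BOOT_DRIVE_OFFSET = 0x40  # offset of the "boot drive" byte described above


def describe_boot_drive(mbr: bytes) -> str:
    """Explain what the boot drive byte in a GRUB MBR dump asks for."""
    if len(mbr) <= BOOT_DRIVE_OFFSET:
        raise ValueError("dump is too short to contain the boot drive byte")
    value = mbr[BOOT_DRIVE_OFFSET]
    if value == 0xFF:
        return "0xff: use the drive passed by the BIOS"
    if value >= 0x80:
        # Assumption: usual BIOS numbering, 0x80 is the first hard drive.
        return f"0x{value:02x}: hard-coded BIOS hard drive #{value - 0x80 + 1}"
    # Assumption: values below 0x80 are BIOS floppy drive numbers.
    return f"0x{value:02x}: hard-coded BIOS floppy drive #{value + 1}"
```

A check like this makes it quick to tell whether a given slave relies on the BIOS-supplied drive or has a drive number baked into its MBR.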

After thinking it over a few times I came to the conclusion that *somehow* 0x80 must end up being the wrong device to boot from. I also realized that no slave which had had GRUB re-installed had failed again. With all of that I became confident that re-installing GRUB would fix the problem permanently. I ran all of this by Catlee who told me that GRUB developers had told him that the BIOS could be re-ordering drives semi-randomly. That piece of information seems to fill in the last bit of the puzzle and I’m more confident than ever that re-installing GRUB will permanently fix the problem.

It’s still a mystery to me why the BIOS would be re-ordering the drives at random. There’s a “BIOSBugs” page on the GRUB wiki which describes a problem where the BIOS sends the *wrong* boot device. Since relying on the BIOS to send the boot device has fixed our problem I don’t think it’s the same thing. I haven’t been able to find any information on this specific issue, or how to find out what boot device the BIOS is sending the bootloader, which makes it difficult to truly confirm our fix. If anyone has hit this, or knows how to get at this kind of information, I’d love to hear from you.

All about the RelEng sheriff

October 26th, 2009

Since February of this year we’ve had a rotating RelEng “sheriff” available. We started it to make a couple of things better:

  • Improve response time on critical issues
  • Avoid having the whole team distracted with infrastructure issues

By and large, this has been an improvement for us and, we think, for developers as well. Serious issues are dealt with more quickly; developers and the developer sheriff have someone specific to go to with acute issues that come up. Internally, this has helped us focus more, too. With the RelEng sheriff dealing with triage and other acute issues the rest of us are able to focus on our other work without distraction.


What is the RelEng sheriff responsible for?


Who is the RelEng sheriff?

The RelEng sheriff is rotated weekly. You can find out who the current RelEng sheriff is by looking at the schedule.


How to get a hold of the RelEng sheriff

The best place to find them is on IRC, in #build or #developers. They should be wearing a ‘|buildduty’ tag at the end of their nick. You can also get our attention in other ways, if IRC doesn’t work for you:

Bugs and IRC pokes are the preferred methods but any will work. Also note that the RelEng sheriff is only around during their normal working day, which can be PDT/PST, EDT/EST, or NZDT/NZST. If a RelEng sheriff isn’t around, someone can be reached in #build.


What can your sheriff do for you?

The on-duty RelEng sheriff would be more than happy to do any of the following for you:

  • Trigger any sort of build or test run you need, including:
    • Extra unit test or Talos runs of any given build
    • Retriggering builds that fail for spurious reasons
  • Deal with any nightly updates that fail
  • Help debug possible build machine issues
  • Help debug test issues that you cannot reproduce yourself
  • Answer questions you may have about build or test infrastructure

The RelEng sheriff is also a good first-contact point for any other random things. They may be able to help you directly but if not, they can certainly point you to the person who can.

After reading this, I hope you have a better understanding of the who, what, and why of the RelEng sheriff. If anything is unclear or absent I’m happy to clarify.

Anatomy of an SDK update

October 2nd, 2009

Over the course of the past week or so I’ve been working on rolling out the Windows 7 SDK to our build machines. Doing so presented two challenges: getting the SDK to deploy silently and properly, and updating the appropriate build configurations to use it. Neither of these may sound very challenging, and indeed, they didn’t to me either, but because of a combination of factors this ended up becoming a week-long ordeal. In this post I will attempt to untangle everything that happened.

Let’s start with the actual SDK installation. Unlike most other reasonable packages, the Windows 7 SDK is not distributed as an MSI package, but rather as a collection of MSIs wrapped in an EXE. Unfortunately, this EXE doesn’t enable you to do a customized, silent install, which is precisely what we need. Vainly, I thought I could figure out the proper order and magic options to install the enclosed MSIs properly. Needless to say, this failed. To work around this I fell back on using an AutoIt script that would click through the interactive installer for me. It took some fuss, but not too much difficulty, to get that working.

Now, the fun part (of deployment). We use a piece of software called OPSI to schedule and perform software installations across our farm of 80 or so Windows VMs. OPSI runs very early in the Windows start-up process, and actually executes as the SYSTEM user. Well, it turns out that the Windows 7 SDK must be installed by a full user, not the SYSTEM account. This seems unnecessary, as we’ve deployed other SDKs through OPSI in the past without issue. After trying to fake it out by setting various environment variables I turned to the OPSI forums for some help. (As an aside, the OPSI developers have been fantastic in their support of our installation, many thanks to them.) It turns out that I’m not the first person to hit problems like this. They pointed me to a template for a script that works around such an issue. The solution ends up being:

  1. Copy installation files to the slave
  2. Create a new user in the Administrators group, set that user to automatically log in at next boot
  3. Reboot, and run the package installation at login
  4. Restore the original automatic login, reboot
  5. Clean up (delete installation files, remove the created user)

This is obviously quite hacky, but it gets the job done.

So! With that in hand (and in repo) we set the SDK to deploy over the course of Wednesday night and Thursday morning. Overall, this went smoothly. For some reason I haven’t yet figured out, some of the slaves needed some kicking to do the installation properly.

Remember how I said part 2 of this was updating the build configurations? I had planned to do this on Friday, and even posted a patch in preparation. Well, it turns out that MozillaBuild likes to be smart and find the most recent SDK and compiler for you. This completely slipped my mind while I was doing the deployment and as a result, all builds from Thursday (yesterday) morning to Friday (today) morning, including those on mozilla-1.9.1, were done with the Windows 7 SDK. This went unnoticed most of Thursday until I was doing a final test of my build configuration patch.

Here’s where the fun starts for this part. After discovering I’d accidentally changed the SDK for everything I went into a bit of a panic and rapidly started testing some fixes out in our staging environment. During the course of this I discovered that things were worse than I thought. Most builds were using the Windows 7 SDK, but not the “unit test” ones. So we weren’t even using the same SDK for all the builds for a given branch! Getting all of that sorted out was compounded by all of the iterations of path styles (c:/ vs. c:\ vs. /c/) I had to try before I found the magic combination. In the end, I discovered a few things:

  • If you’re specifying LIB/INCLUDE/SDKDIR in a mozconfig, you must use Windows-style paths
  • If you’re specifying PATH in a mozconfig, you CANNOT use Windows-style paths – you must use MSYS style
  • You can’t test for these things properly without clobbering
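To make the difference between those path styles concrete, here is a small sketch that converts between them. It is only an illustration: the function names and example paths are made up, and it assumes “Windows-style” means a drive letter followed by backslashes (`c:\...`) and “MSYS style” means `/c/...`. It does not try to settle whether the forward-slash `c:/` form is acceptable for LIB/INCLUDE/SDKDIR.

```python
import re

# "c:\foo\bar" or "c:/foo/bar"
_DRIVE_PATH = re.compile(r"^([A-Za-z]):[\\/](.*)$")
# "/c/foo/bar"
_MSYS_PATH = re.compile(r"^/([A-Za-z])/(.*)$")


def to_msys(path: str) -> str:
    """Convert c:\\foo or c:/foo to /c/foo; leave other paths untouched."""
    match = _DRIVE_PATH.match(path)
    if not match:
        return path
    drive, rest = match.groups()
    return "/" + drive.lower() + "/" + rest.replace("\\", "/")


def to_windows(path: str) -> str:
    """Convert /c/foo to c:\\foo; leave other paths untouched."""
    match = _MSYS_PATH.match(path)
    if not match:
        return path
    drive, rest = match.groups()
    return drive + ":\\" + rest.replace("/", "\\")
```

With helpers like these, a PATH entry could be written once and passed through `to_msys`, while LIB, INCLUDE and SDKDIR entries go through `to_windows`, instead of hand-editing each style.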

As I write this the first set of builds that all use the correct SDK are finishing up, and this deployment from hell appears to be nearly over. I want to express a special thanks to the OPSI developers, who were very helpful, and to Nick Thomas and Chris AtLee, for their patience with my countless iterations of build configuration patches. As a final note, let me state explicitly which SDK is being used where:

  • Windows Vista SDK (6.0a): mozilla-1.9.1 builds
  • Windows 7 SDK (7.0): mozilla-central, mozilla-1.9.2, TraceMonkey, Electrolysis, and Places builds

WinCE and WinMO builds are unaffected by this deployment.

Because of the major version bump in mozilla-central, all users of mozilla-central nightlies will be bumped to mozilla-1.9.2 nightlies today. If you want to continue to track the Firefox 3.6 / Gecko 1.9.2 builds no action is required. If you want to track the post-1.9.2 version or absolute “trunk” of Firefox/Gecko you will need to download today’s mozilla-central nightly build, found in the nightly area of the ftp server.

This morning I landed bug 486567, which cleaned up the try server code significantly. There’s still more to be done there, particularly running unittests on packaged builds once its production counterpart lands (bug 383136). Both of these things help us keep the Try Server in sync with the rest of the world, which has always been a problem.

Looking forward a little bit, I’m looking to land a patch that enables e-mail notification for try server builds and unit tests on Tuesday. With this patch, every try submission would result in 6 e-mails to the submitter (1 per platform/build type combination). Here’s what they’ll look like:
Build:
Your Try Server build (try-1c170baeac1) was successfully completed on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1

Visit http://hg.mozilla.org/MozillaTry to view the full logs.

Unit test:
Your Try Server unit test (try-1c170baeac1) completed with warnings on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1

Summary of unittest results:
check: 2/0

Visit http://hg.mozilla.org/MozillaTry to view the full logs.

(The unittest e-mails will have the full results listed, of course).
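For illustration only, the message layout above can be reproduced with a few lines of Python. This is a sketch of the format, not the code in the patch: the function name and parameters are made up, and the only things taken from the samples are the wording and the two URLs.

```python
from typing import Dict, Optional

DOWNLOAD_BASE = "http://build.mozilla.org/tryserver-builds"
LOGS_URL = "http://hg.mozilla.org/MozillaTry"


def format_try_email(
    kind: str,        # "build" or "unit test"
    identifier: str,  # e.g. "try-1c170baeac1"
    status: str,      # e.g. "was successfully completed"
    platform: str,    # e.g. "linux"
    submitter: str,   # e-mail address of the submitter
    summary: Optional[Dict[str, str]] = None,  # suite name -> result string
) -> str:
    """Build the body of one notification e-mail in the layout shown above."""
    lines = [
        f"Your Try Server {kind} ({identifier}) {status} on {platform}. "
        f"It should be available for download at "
        f"{DOWNLOAD_BASE}/{submitter}-{identifier}",
        "",
    ]
    if summary:
        lines.append("Summary of unittest results:")
        lines.extend(f"{suite}: {result}" for suite, result in summary.items())
        lines.append("")
    lines.append(f"Visit {LOGS_URL} to view the full logs.")
    return "\n".join(lines)
```

Passing `summary={"check": "2/0"}` with `kind="unit test"` gives the second sample; leaving `summary` out gives the build sample.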

E-mail notification has been an oft requested feature so I’m really excited that this will be landing soon.

I’m happy to announce that I’ve finally updated the publicly available version of our CentOS 5.0 build reference platform. There are many changes to it since the last released version, most notably a Scratchbox installation and Mercurial. For all the details you can have a look at the reference platform wiki page. Everything up to Version 17 is included on the released version.

You can get it here: ftp://ftp.mozilla.org/pub/mozilla/VMs/

…but only those to RelEng infrastructure.

We in Release Engineering always love it when people take the time and effort to fix a bug, clean up some code, or otherwise enhance our infrastructure. However, it’s often difficult to take outside patches because they are generally untested. Because of the nature of our code and systems – how tightly it’s tied to infrastructure, limited access to said infrastructure, and how many different systems a single change can touch – it’s nearly impossible for outside contributors to do more than the most basic testing. We don’t have a Try Server that can test patches to RelEng code, and proper testing requires many different machines and can be very time consuming – especially if you’ve never done it.

It’s always unfortunate to see patches sitting for days, weeks, or in rare cases, months, before they get landed. However, we don’t like to land untested patches because it can lead to unnecessary build bustage.

I want to fix this, so over the next few months I’m going to be prioritizing testing your patches every Monday. I will set aside my normal work for a day to help test and get ready to land contributed patches. High priority things such as releases or infrastructure problems will take precedence over this, but that shouldn’t be a common occurrence.

I’ll be keeping an eye out for things, but if you want to ping me directly about a bug please feel free to do so.

Consider this in an experimental state. I’ll be tweaking the process along the way and am very open to improvements here.

It’s the start of a new week and the start of my first shift as RelEng sheriff. Do you suspect rando-orange? Talk to me. Do you suspect machine problems? Talk to me. I’ll be watching dev.tree-management, the releng triage queue, and #developers for issues.