GRUB, the MBR, BIOS bugs?

March 9th, 2010

Recently we ordered a set of server-class Linux machines to supplement our pool of VMs. They are lightning fast, especially compared to VMs, but it’s been a bit of a bumpy ride getting them ready to go to production. Most notably we’ve had a mysterious problem where they would occasionally refuse to boot, halting at a “GRUB _” prompt. It took a while, but we believe we have this fixed now.

This problem first occurred on 2 out of 25 slaves. Catlee quickly discovered that it could be fixed with a simple re-installation of GRUB, so that’s what we did, and moved on. The thought at the time was that the MBR somehow got partially overwritten or otherwise corrupted. A day later, 2 more slaves hit the same issue. Since it was the second time we hit the issue there was some more speculation and digging. We had made a few changes to the machines, including:
* Changing the hard disk controller from “IDE” mode to “AHCI”.
* Changing the kernel to a PAE version.

Both of those were pretty quickly dismissed as the causes. It seemed very unlikely that the kernel version could cause an issue with the bootloader, and the problem didn’t occur instantly after changing the disk controller mode, so that seemed unlikely too. With other important things happening we again moved on.

The next day I did some more googling, this time about GRUB in general, and came across a page detailing the GRUB boot process. In it, it talks about how to dump the contents of the MBR and view it as hex. Seeing that made me *very* eager to compare a working slave vs. a busted one. Unfortunately there was no longer a busted machine to look at.

After 5 or so days without issues, and after all other setup and configuration issues were taken care of, we decided to move them to production and deal with the GRUB problems if they arose. As luck would have it, 2 machines refused to boot as they were being moved to production. After booting from a rescue disk and dumping the MBR I found that bytes 0x40 through 0x49 differed from those of a working slave. I also noticed that the MBR of a busted slave was identical to one that had *never* broken, and thus, never had GRUB re-installed. This seemed to rule out MBR corruption.
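The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually used: the function names are made up for this example, and the two dumps compared at the bottom are synthetic stand-ins rather than data from our slaves. Reading a real device (rather than a saved dump file) generally requires root.

```python
MBR_SIZE = 512  # the MBR occupies the first 512-byte sector of the disk


def read_mbr(path):
    """Read the first sector from a disk device or a saved dump file.

    `path` is whatever you have at hand; any specific device or file
    name is an assumption of the caller, not something from this post.
    """
    with open(path, "rb") as f:
        data = f.read(MBR_SIZE)
    if len(data) != MBR_SIZE:
        raise ValueError("expected %d bytes, got %d" % (MBR_SIZE, len(data)))
    return data


def differing_offsets(a, b):
    """Return the byte offsets at which two equal-length dumps differ."""
    if len(a) != len(b):
        raise ValueError("dumps must be the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Synthetic stand-ins for a "working" and a "busted" dump (illustrative only).
working = bytearray(MBR_SIZE)
working[0x40] = 0xFF
busted = bytearray(MBR_SIZE)
busted[0x40] = 0x80

for offset in differing_offsets(bytes(working), bytes(busted)):
    print("0x%02X: working=0x%02X busted=0x%02X"
          % (offset, working[offset], busted[offset]))
```

With real dumps you would call `read_mbr()` once per machine and pass both results to `differing_offsets()`.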

With some more information in my hands I looked for some help or pointers from the GRUB developers, on Freenode. One of them pointed me to this section of the GRUB Manual which documents some key bytes of the MBR. Notably, byte 0x40 is described as “The boot drive. If it is 0xFF, use a drive passed by BIOS.” On a working slave this was set to 0xFF. On a broken one, it was set to 0x80 (which I was told means “first hard drive”). That certainly sounds like something that could affect bootability!
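To make the meaning of that byte concrete, here is a small Python sketch that interprets it. The 0xFF case comes straight from the manual text quoted above, and 0x80 as “first hard drive” is what I was told; the rest of the mapping (0x80 and up are hard drives, lower values are floppies) is the usual BIOS drive-numbering convention and is my assumption about how GRUB treats other values. The function name is made up for illustration.

```python
BOOT_DRIVE_OFFSET = 0x40  # offset of the boot drive byte described above
USE_BIOS_DRIVE = 0xFF     # "use a drive passed by BIOS"


def describe_boot_drive(mbr):
    """Describe the boot drive byte of an MBR dump (hypothetical helper)."""
    if len(mbr) <= BOOT_DRIVE_OFFSET:
        raise ValueError("dump is too short to contain the boot drive byte")
    value = mbr[BOOT_DRIVE_OFFSET]
    if value == USE_BIOS_DRIVE:
        return "use the drive passed by the BIOS"
    if value >= 0x80:
        # Assumption: standard BIOS numbering, 0x80 is the first hard drive.
        return "hard drive number %d (0x%02X)" % (value - 0x80, value)
    # Assumption: values below 0x80 are floppy drives in BIOS numbering.
    return "floppy drive number %d (0x%02X)" % (value, value)


# Synthetic dumps matching the two values we saw (illustrative only).
working = bytearray(512)
working[BOOT_DRIVE_OFFSET] = 0xFF
busted = bytearray(512)
busted[BOOT_DRIVE_OFFSET] = 0x80

print(describe_boot_drive(bytes(working)))
print(describe_boot_drive(bytes(busted)))
```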

After thinking it over a few times I came to the conclusion that *somehow* 0x80 must end up being the wrong device to boot from. I also realized that no slave which had had GRUB re-installed had failed again. With all of that I became confident that re-installing GRUB would fix the problem permanently. I ran all of this by Catlee who told me that GRUB developers had told him that the BIOS could be re-ordering drives semi-randomly. That piece of information seems to fill in the last bit of the puzzle and I’m more confident than ever that re-installing GRUB will permanently fix the problem.

It’s still a mystery to me why the BIOS would be re-ordering the drives at random. There’s a “BIOSBugs” page on the GRUB wiki which describes a problem where the BIOS sends the *wrong* boot device. Since relying on the BIOS to send the boot device has fixed our problem I don’t think it’s the same thing. I haven’t been able to find any information on this specific issue, or how to find out what boot device the BIOS is sending the bootloader, which makes it difficult to truly confirm our fix. If anyone has hit this, or knows how to get at this kind of information, I’d love to hear from you.

All about the RelEng sheriff

October 26th, 2009

Since February of this year we’ve had a rotating RelEng “sheriff” available. We started it to make a couple of things better:

  • Improve response time on critical issues
  • Avoid having the whole team distracted with infrastructure issues

By and large, this has been an improvement for us and, we think, for developers as well. Serious issues are dealt with more quickly; developers and the developer sheriff have someone specific to go to with acute issues that come up. Internally, this has helped us focus more, too. With the RelEng sheriff dealing with triage and other acute issues, the rest of us are able to focus on our other work without distraction.


What is the RelEng sheriff responsible for?


Who is the RelEng sheriff?

The RelEng sheriff is rotated weekly. You can find out who the current RelEng sheriff is by looking at the schedule.


How to get a hold of the RelEng sheriff

The best place to find them is on IRC, in #build or #developers. They should be wearing a ‘|buildduty’ tag at the end of their nick. You can also get our attention in other ways, if IRC doesn’t work for you:

Bugs and IRC pokes are the preferred methods but any will work. Also note that the RelEng sheriff is only around during their normal working day, which can be PDT/PST, EDT/EST, or NZDT/NZST. If a RelEng sheriff isn’t around, someone can be reached in #build.


What can your sheriff do for you?

The on-duty RelEng sheriff would be more than happy to do any of the following for you:

  • Trigger any sort of build or test run you need, including:
    • Extra unit test or Talos runs of any given build
    • Retriggering builds that fail for spurious reasons
  • Deal with any nightly updates that fail
  • Help debug possible build machine issues
  • Help debug test issues that you cannot reproduce yourself
  • Answer questions you may have about build or test infrastructure

The RelEng sheriff is also a good first-contact point for any other random things. They may be able to help you directly but if not, they can certainly point you to the person who can.

After reading this, I hope you have a better understanding of the who, what, and why of the RelEng sheriff. If anything is unclear or absent I’m happy to clarify.

Anatomy of an SDK update

October 2nd, 2009

Over the course of the past week or so I’ve been working on rolling out the Windows 7 SDK to our build machines. Doing so presented two challenges: getting the SDK to deploy silently and properly, and updating the appropriate build configurations to use it. Neither of these may sound very challenging, and indeed, they didn’t to me either, but because of a combination of factors this ended up becoming a week-long ordeal. In this post I will attempt to untangle everything that happened.

Let’s start with the actual SDK installation. Unlike most other reasonable packages, the Windows 7 SDK is not distributed as an MSI package, but rather as a collection of MSIs wrapped in an EXE. Unfortunately, this EXE doesn’t enable you to do a customized, silent install – the precise thing we need. Vainly, I thought I could figure out the proper order and magic options to install the enclosed MSIs properly. Needless to say, this failed. To work around this I fell back on using an AutoIt script that would click through the interactive installer for me. It took some fuss, but not too much difficulty, to get that working.

Now, the fun part (of deployment). We use a piece of software called OPSI to schedule and perform software installations across our farm of 80 or so Windows VMs. OPSI runs very early in the Windows start-up process, and actually executes as the SYSTEM user. Well, it turns out that the Windows 7 SDK must be installed by a full user, not the SYSTEM account. This seems unnecessary, as we’ve deployed other SDKs through OPSI in the past without issue. After trying to fake it out by setting various environment variables I turned to the OPSI forums for some help. (As an aside, the OPSI developers have been fantastic in their support of our installation, many thanks to them.) It turns out that I’m not the first person to hit problems like this. They pointed me to a template for a script that works around such an issue. The solution ends up being:

  1. Copy installation files to the slave
  2. Create a new user in the Administrators group, and set that user to automatically log in at next boot
  3. Reboot, and run the package installation at login
  4. Restore the original automatic login, reboot
  5. Clean up (delete installation files, remove the created user)

This is obviously quite hacky, but it gets the job done.

So! With that in hand (and in repo) we set the SDK to deploy over the course of Wednesday night and Thursday morning. Overall, this went smoothly. For some reason (which I haven’t yet figured out) some of the slaves needed some kicking to do the installation properly.

Remember how I said part 2 of this was updating the build configurations? I had planned to do this on Friday, and even posted a patch in preparation. Well, it turns out that MozillaBuild likes to be smart and find the most recent SDK and compiler for you. This completely slipped my mind while I was doing the deployment and as a result, all builds from Thursday (yesterday) morning to Friday (today) morning, including those on mozilla-1.9.1, were done with the Windows 7 SDK. This went unnoticed most of Thursday until I was doing a final test of my build configuration patch.

Here’s where the fun starts for this part. After discovering I’d accidentally changed the SDK for everything I went into a bit of a panic and rapidly started testing some fixes out in our staging environment. During the course of this I discovered that things were worse than I thought. Most builds were using the Windows 7 SDK, but not the “unit test” ones. So we weren’t even using the same SDK for all the builds for a given branch! Getting all of that sorted out was compounded by all of the iterations of path styles (c:/ vs. c:\ vs. /c/) I had to try before I found the magic combination. In the end, I discovered a few things:

  • If you’re specifying LIB/INCLUDE/SDKDIR in a mozconfig, you must use Windows-style paths
  • If you’re specifying PATH in a mozconfig, you CANNOT use Windows-style paths – you must use MSYS style
  • You can’t test for these things properly without clobbering
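Since the path styles were the source of most of the pain, here is a rough Python sketch of converting between them. It is not part of our build configuration; the function names and the example paths are made up for illustration. The first two bullets above only say “Windows-style”, so whether a given variable wants c:\ or c:/ is left as a parameter rather than something this sketch claims to know.

```python
import re


def to_msys(path):
    """Convert c:\\dir\\sub or c:/dir/sub into the MSYS form /c/dir/sub."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not match:
        raise ValueError("not a drive-letter path: %r" % path)
    drive, rest = match.groups()
    return "/" + drive.lower() + "/" + rest.replace("\\", "/")


def to_windows(path, sep="\\"):
    """Convert the MSYS form /c/dir/sub into c:\\dir\\sub (or c:/dir/sub).

    `sep` selects the separator; which one a given variable needs is an
    assumption left to the caller.
    """
    match = re.match(r"^/([A-Za-z])/(.*)$", path)
    if not match:
        raise ValueError("not an MSYS-style path: %r" % path)
    drive, rest = match.groups()
    return drive + ":" + sep + rest.replace("/", sep)


# Hypothetical paths, for illustration only.
print(to_msys(r"C:\sdk\bin"))              # a style usable in PATH
print(to_windows("/c/sdk/lib"))            # a style for LIB/INCLUDE/SDKDIR
print(to_windows("/c/sdk/lib", sep="/"))   # the c:/ variant
```

Whatever helper you use, the third bullet still applies: clobber before trusting the result.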

As I write this the first set of builds that all use the correct SDK are finishing up, and this deployment from hell appears to be nearly over. I want to express a special thanks to the OPSI developers, who were very helpful, and to Nick Thomas and Chris AtLee, for their patience with my countless iterations of build configuration patches. As a final note, let me state explicitly which SDK is being used where:

  • Windows Vista SDK (6.0a): mozilla-1.9.1 builds
  • Windows 7 SDK (7.0): mozilla-central, mozilla-1.9.2, TraceMonkey, Electrolysis, and Places builds

WinCE and WinMO builds are unaffected by this deployment.

This morning I landed bug 486567 – which cleaned up the try server code significantly. There’s still more to be done there, particularly running unittests on packaged builds once its production counterpart lands (bug 383136). Both of these things help us keep the Try Server in sync with the rest of the world – which has always been a problem.

Looking forward a little bit, I’m looking to land a patch on Tuesday that enables e-mail notification for try server builds and unit tests. With this patch, every try submission would result in 6 e-mails to the submitter (1 per platform/build type combination). Here’s what they’ll look like:
Build:
Your Try Server build (try-1c170baeac1) was successfully completed on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1

Visit http://hg.mozilla.org/MozillaTry to view the full logs.

Unit test:
Your Try Server unit test (try-1c170baeac1) completed with warnings on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1

Summary of unittest results:
check: 2/0

Visit http://hg.mozilla.org/MozillaTry to view the full logs.

(The unittest e-mails will have the full results listed, of course).

E-mail notification has been an oft requested feature so I’m really excited that this will be landing soon.

I’m happy to announce that I’ve finally updated the publicly available version of our CentOS 5.0 build reference platform. There are many changes to it since the last released version, most notably a Scratchbox installation and Mercurial. For all the details you can have a look at the reference platform wiki page. Everything up to Version 17 is included on the released version.

You can get it here: ftp://ftp.mozilla.org/pub/mozilla/VMs/

Bug 358845 pointed out that the ‘mZ’ we report for Codesighs tests is meaningless for Firefox. As such, we have stopped running it. This is just a quick note to let people know not to panic, it’s fine! The ‘Z’ number is still being reported and is still valid.

During this time we will be upgrading both the Firefox 3 and Firefox 3.1 Unit test Buildbots to a newer version (0.7.9). In order to avoid interrupting running builds we will be closing the tree at 4am PDT and stopping any new builds from being scheduled. Once all builds have finished we will perform the upgrade and open the tree again. Depending on the timing this could take anywhere from 20 minutes to a few hours. The tree should be open again no later than 7am PDT.

If there is any reason why we shouldn’t go ahead with this please e-mail release@mozilla.com

Buildbot 0.7.9 was released a couple of weeks ago. Today I will be importing it into our CVS tree, upgrading the aging 0.7.7 code. Before doing the import I will branch the current tip of trunk as BUILDBOT_0_7_7_BRANCH. This will let us easily check out 0.7.7 code and, if necessary, land 0.7.7-specific fixes.

Any checkouts of mozilla/tools/buildbot on HEAD will update to the new code with ‘cvs up’. If you don’t want this to happen you should delete your checkout and get the BUILDBOT_0_7_7_BRANCH (‘cvs co -r BUILDBOT_0_7_7_BRANCH’).

Here’s some of the more exciting fixes and enhancements:

  • The /buildslaves page now highlights disconnected slaves, making it easier to see that information at-a-glance
  • Build properties can now be passed via a Scheduler
  • Build properties can be set with ShellCommands through the new ‘SetProperty’ BuildStep

Don’t panic. If all goes well nothing will change from a nightly-build-user perspective. We are going to be moving to a more sane system for the generation of nightly .mar files and AUS2 snippets. Details below, but first, some background.

Rob Helmer talked a lot about AUS and updates, mostly regarding releases. What he did not mention is the silly path that our nightly updates take.

All of our nightly updates currently take the following path:

  1. Nightly build happens – this includes a complete MAR and complete AUS snippet
  2. Cronjob on a specific build machine performs some magic and generates a partial MAR and partial AUS snippet

Once these changes land our 3.1 builds will take the following path:

  1. Nightly build happens – this includes complete AND partial MAR, complete AND partial AUS snippet

(We could probably support Tinderbox driven builds (2.x, 3.x) pretty easily too, if someone wants to write the patch.)

I think it’s pretty obvious that this is a more sensible way to do things. As it stands now, if we lose the VM that generates partials we lose all nightly updates. The hidden benefit here is that updates for releases and nightlies will have fewer differences between them. This means that problems with that system will be caught in the nightlies and *not* during a live release. (NB: We will still need to do snippets for builds older than n-1, e.g. 3.0->3.0.2, during the release process.)

I’m still doing testing of this but I hope to land these changes in mozilla-central/tools/update-packaging by the end of next week.

Special note for build folks from SeaMonkey and Thunderbird: I will be replacing the CreateCompleteUpdateSnippet and snippet uploading ShellCommands with a couple Makefile targets – you may want to do the same.

Useful symbols for OS X builds!

September 8th, 2008

Some of you may have noticed bug 448616. It turns out that we have *never* had proper breakpad symbols for OS X on 3.1 builds. This caused our OS X crash reports to be significantly less useful. We’ve had a heck of a time isolating the problem. It took quite a few manual builds before we could even reproduce it. There were a few rounds of diagnosis involved before the problem became clear: running ‘make package’ before ‘make buildsymbols’ breaks OS X symbols. It seems that ‘make package’ on Mac strips objdir/ppc/dist/universal, which is where we generate our symbols from!

With that knowledge in hand I deployed a fix….and unwittingly broke Linux nightlies. As part of my fix, I had changed around the order in which we do symbols, updates, and packaging. Doing so had tripped another weirdness in our build system: On Linux, ‘make -C tools/update-packaging’ (creating a full mar) *must* be done after ‘make package’. On Mac and Windows we’re able to create a mar at any point, however.

So, the correct order to do post-build processing in is thus: Symbols, Packaging, Updates.
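That order falls out of the two constraints described above, which a short Python sketch can make explicit. This is just an illustration and not how our Buildbot configuration expresses it; the dictionary and function name are made up for the example, and it applies the Linux-only and Mac-only constraints together, since we want a single order that works everywhere. It relies on `graphlib`, which is in the Python standard library from 3.9 onward.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must run before it.
# - 'make package' strips the files symbols are generated from (seen on Mac),
#   so symbols must come first.
# - On Linux, the full mar must be created after 'make package'.
MUST_RUN_AFTER = {
    "make buildsymbols": set(),
    "make package": {"make buildsymbols"},
    "make -C tools/update-packaging": {"make package"},
}


def post_build_order(constraints):
    """Return one ordering of the steps that satisfies every constraint."""
    return list(TopologicalSorter(constraints).static_order())


for step in post_build_order(MUST_RUN_AFTER):
    print(step)
```

Because the constraints form a single chain, there is only one valid order: symbols, then packaging, then updates.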

This fix is now deployed in production and nightly builds have been re-spun. The latest Mac nightly of 3.1 should have proper breakpad symbols (and in turn, make our crash reports useful). (Unfortunately, due to bug 454198 line numbers are only available in the raw dump, but still, this is a big improvement.) Big thanks to Lukas and Ted for all their help here. Any regressions due to this should be filed in mozilla.org:Release Engineering.

As a side note, we ended up upgrading to Xcode 3.1 in a failed attempt to fix this bug. Included with it is GCC 4.2, which sadly bails out very early in our build process.