DNS cache poisoning - is your name server patched?

As many of you have probably read, there is a lot of buzz about the recent multi-vendor DNS vulnerability. The details have to do with weak transaction IDs used by caching name servers and the ability to modify those cached DNS records if you can predict the transaction ID. Patches to all the major DNS systems are out and Mozilla DNS servers have been patched for some time, even though our publicly accessible name servers are not recursive or caching name servers.

While we have done all we can by patching our systems, you should check/yell/complain to your upstream DNS provider and apply pressure to get their servers patched, as they most likely cache name records for you. There are a lot of tools out there to check if your favorite caching name server is vulnerable - http://www.doxpara.com/ and http://entropy.dns-oarc.net/test/ are two that I have seen used.
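To get a feel for why a small or predictable transaction ID space matters, here is a rough back-of-the-envelope sketch in Python. It assumes an attacker who blindly sends forged responses, each one an independent uniform guess over the possible (transaction ID, source port) combinations. The function name and the figure of roughly 64,000 usable source ports are illustrative assumptions, not details taken from the advisory.

```python
def spoof_success_probability(forged_responses: int,
                              id_bits: int = 16,
                              source_ports: int = 1) -> float:
    """Chance that at least one of N blind forged responses matches.

    Simplification: every guess is independent and uniform over the
    whole (transaction ID x source port) search space.
    """
    search_space = (2 ** id_bits) * source_ports
    miss_one_guess = 1.0 - 1.0 / search_space
    return 1.0 - miss_one_guess ** forged_responses


# Fixed source port: only the 16-bit transaction ID has to be guessed.
print(f"fixed port:       {spoof_success_probability(1000):.4%}")

# Randomized source port (assumed ~64,000 usable ports): far larger space.
print(f"randomized ports: {spoof_success_probability(1000, source_ports=64000):.6%}")
```

The point of the sketch is only the relative difference: anything that enlarges the space an attacker has to guess makes a blind spoof much less likely to land before the real answer arrives.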

Build storage issues - resolved!

This is a very technical and detailed debrief. For those who want the short version - it’s fixed :-) Other people, read on.

As many of you already know - we had some pretty serious issues over the past weeks with the storage system that supports the build/unit test environment. We have resolved the issues and wanted to give everyone a rundown of the issues that we found, what we have done to resolve them and what open tasks are left.

The issue manifested itself in a few ways. We saw slow transfers, scsi aborts, reservation failures and VM guest level corruption. This started as a very rare occurrence and over time became more and more frequent to the point that we could not keep a small number of i/o intensive VMs up for 1 hour and had trouble getting them off. We started troubleshooting the issue a few weeks ago, and finally came to a total resolution early this week. Here is a summary of the issues, how they came to be and how we resolved them:

* http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=226424 (A filer may exhibit poor performance due to WAFL holding on to too many network buffers and not releasing them in a timely fashion.)
To fix this, we had to do an upgrade to 7.2.4 - that has been completed.

* NetApp LUNs were created with the wrong LUN type.
This was caused by an error in the LUN creation workflow, which left the LUN type at the default value (Solaris). 3 out of 4 LUNs were of type Solaris, causing blocks to be written inefficiently to the disk (the 4k VMWare blocks were written offset to the true disk geometry). Reading from the LUN would cause many read aheads and at times overwhelm the filer due to the inefficient layout on disk. To remedy this we migrated the data off, re-created all of the LUNs and migrated the data back.

* NetApp igroups set to the wrong type.
Initially NetApp advised that the linux igroup type (the igroup is what maps the LUN to various hosts) was OK for use with VMWare. This was incorrect, causing improper scsi reservations and iscsi timeouts. NetApp is updating their internal documentation to reflect this change.

* Network setup issues
Initial setup from NetApp advised us to set up the network in a specific configuration (one link to each upstream switch with a virtual interface bonding them). After further investigation, I found this is *not* the best practice and was in fact causing issues with dead HBA paths. To correct this temporarily, we disabled one of the links, leaving single uplinks (still with redundant heads).

All of these issues created major performance degradation and block level access/corruption problems. They have all been resolved at this point. We still need to adjust the network interfaces to be more redundant.
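The LUN type problem above (4k blocks written offset to the true disk geometry) is easier to see with a little arithmetic. The sketch below is a simplified illustration in Python, not a model of how the filer actually lays out data. It counts how many 4k storage blocks a single 4k guest block overlaps for a given starting offset. The 63-sector (63 x 512 byte) offset is used only as a familiar example of a misaligned start and is an assumption, as are the function and constant names.

```python
BLOCK_SIZE = 4096  # 4k blocks on both the guest and the storage side


def storage_blocks_touched(guest_block: int, start_offset: int,
                           block_size: int = BLOCK_SIZE) -> int:
    """How many storage blocks one guest block overlaps."""
    first_byte = start_offset + guest_block * block_size
    last_byte = first_byte + block_size - 1
    return last_byte // block_size - first_byte // block_size + 1


# Aligned start (a multiple of 4096): each guest block maps to one storage block.
print(storage_blocks_touched(guest_block=0, start_offset=32768))

# Misaligned start (63 sectors of 512 bytes): each guest block straddles two.
print(storage_blocks_touched(guest_block=0, start_offset=63 * 512))
```

When every guest block straddles two storage blocks, each read or write touches roughly twice as many blocks as it should, which fits the extra read activity and load we saw on the filer.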

Special thanks to the release engineering team, who have been *incredibly* patient with us as we worked through this. I know how frustrating it was, and they kept a smile (well, kind of) through the situation - it really helped us keep pushing forward to a solution. Thanks also to mrz for the amazing amount of work he put into this…very dedicated to finding a solution no matter what time it was.

Virtual utility computing, finally a reality (mostly)

For some time vendors of all kinds have been pitching the promise of utility computing for heterogeneous systems/infrastructure. “Just throw more servers in the pool, no need to think! It’s easy!” - yea, right. It’s an easier task if all your systems are running the same application - great load balancers have been around a long time to help with this, but not when they are running different applications and have different load profiles. The promise as pitched is quite compelling - if your compute pool is running hot, just add more resources to a large pool of cpu/memory/disk, allowing the “system” to load balance, handle failures, and alert of issues. The issue has been that the tool sets to implement, manage and maintain have been lacking - weak partitioning tools, HA schemes that didn’t work in all situations, difficult migrations or downtime events to add more resources.

Recently mrz took on the task of upgrading our VMWare infrastructure to the latest update - 3.5. As part of that upgrade, he investigated and subsequently implemented Distributed Resource Scheduler, or DRS. Once up, he moved over all the ops/IT virtual machines (VMs) to the resource pool.

After seeing how it works, I think true utility computing may finally have arrived. Once migrated to the resource pool, the system took inventory of all the VMs, inspected the load (both cpu and memory) of all the esx hosts, then made recommendations on how to load balance them - but that’s not all. With *no* downtime, it automatically performed these migrations via vmotion, with no user interaction (which in itself is pretty amazing). Super slick. It keeps tabs on how loaded the esx systems are, will perform VM migrations automatically if cpu load or memory usage on servers changes, allows for reservation of cpu/memory and gives partitioning functionality in case you want to make sure 2 VMs aren’t ever on the same esx host.
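To make the idea concrete, here is a toy load balancer in Python. It is emphatically *not* VMWare's algorithm, just a minimal greedy sketch of the behavior described above: look at per-host load, move a VM from the busiest host to the least busy one when that narrows the gap, and never put two VMs that must be kept apart on the same host. All names (`rebalance`, `keep_apart`, the host and VM names, the load numbers) are made up for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def rebalance(placement: Dict[str, Set[str]],
              vm_load: Dict[str, float],
              keep_apart: List[FrozenSet[str]],
              max_moves: int = 10) -> List[Tuple[str, str, str]]:
    """Greedily migrate VMs from the busiest host to the idlest host.

    Mutates `placement` and returns the moves as (vm, from_host, to_host).
    """
    moves: List[Tuple[str, str, str]] = []
    for _ in range(max_moves):
        loads = {host: sum(vm_load[vm] for vm in vms)
                 for host, vms in placement.items()}
        busiest = max(loads, key=loads.get)
        idlest = min(loads, key=loads.get)
        gap = loads[busiest] - loads[idlest]

        # A move only helps if the VM's load is smaller than the gap, and it
        # is only allowed if no "keep apart" partner already lives on the target.
        candidates = [
            vm for vm in placement[busiest]
            if 0 < vm_load[vm] < gap
            and not any(vm in group and group & placement[idlest]
                        for group in keep_apart)
        ]
        if not candidates:
            break

        # Prefer the VM whose load is closest to half the gap (ties by name).
        vm = min(candidates, key=lambda name: (abs(vm_load[name] - gap / 2), name))
        placement[busiest].remove(vm)
        placement[idlest].add(vm)
        moves.append((vm, busiest, idlest))
    return moves


hosts = {"esx1": {"web", "db", "build"}, "esx2": {"mail"}}
load = {"web": 30, "db": 40, "build": 20, "mail": 10}
print(rebalance(hosts, load, keep_apart=[frozenset({"db", "mail"})]))
print(hosts)
```

The real system also weighs memory, reservations and migration cost; the sketch only shows why "add hosts to the pool and let it sort itself out" is a tractable problem once live migration is cheap.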

After working with the system a bit - it became quite apparent that the idea of utility computing for heterogeneous machines is finally here, center stage. No longer do we need to manually balance pools of VMs, worry about downtime for upgrades, etc - it’s all handled. Quite literally, as we need more capacity we simply add more hosts to the resource pool.

As far as it’s come, I still think there are some areas for improvement (which may have already been solved, I just haven’t heard about them):
* Storage - still have to allocate a VM to an array of some kind. While there are ways to move the VM storage around while live, it’s not tied into the DRS system and is a manual process. I’d love to treat storage (both space and iops) as a resource pool and have dynamic allocation of storage resources.
* Management tools - I know of some third party tools that help with this, but I think this should be a focus point for VMWare. They have the most information on the system and really should make metrics a top priority.

Regardless, it’s come a long way, and using virtualization for the right areas of your infrastructure can really pay off. It’s made a real difference to our users and admins in the areas of management, power usage and overall server utilization. Now, if someone could *please* compete with VMWare - some competition (no, Xen is not there yet, but it is up and coming) would truly be welcomed.

Network outage report - 3/18/08, 8:01pm PDT - 9:25 pm PDT

We had a network outage at our San Jose datacenter tonight from 8:01 pm PDT until 9:25 pm PDT on March 18. From initial investigation, it appears that one of the switches in a blade server chassis had a software issue, causing a network-wide broadcast storm. Overall effect was that the switching fabric for our San Jose datacenter was unusable.

To mitigate this issue going forward, we have made two changes.

  • Modified the port-channels connecting the core switches to downstream switches to better handle a port-channel member failure.
  • We also further tuned broadcast storm protection on every switch port to limit the amount of broadcast & multicast traffic any one device is allowed to send.

Furthermore, we have a priority case open with the vendor to determine the cause of the issue as we did capture debug logs. This was in no way related to the scheduled downtime we were in, it just happened to coincide. We apologize for any inconvenience this may have caused. We’ll continue to follow up with the vendor to make sure this issue does not happen again.

Call out for Mirrors

One of Mozilla’s biggest assets is our mirror network. It allows us to update over 100 million users in under 48 hours with security updates, host and push extensions, and much more - all with donated server space and bandwidth, giving us the ability to focus our efforts on supporting the development community and making all the Mozilla products as reliable, secure and feature-rich as possible.

We’d like to build up our mirror network to be even stronger! I am making a call to the community to help us find other mirror sources. Already Paul Vixie from the Internet Software Consortium has stepped up and donated 3gb/s of mirror peak capacity (!). Details on what is required can be found here: http://www.mozilla.org/mirroring.html. While we are always happy to take any mirror donation, we are specifically looking for mirrors which can handle in excess of 100mb/s during peak traffic times. Please contact me directly if you have any ideas of people/organizations/companies that might be willing to donate either bandwidth or mirror space.

Bugzilla improvement project update

As you may know from my last post on Bugzilla, a lot of improvements/fixes are in the works. Wanted to give everyone an update on what has gone live on bugzilla.mozilla.org so far:

- Send mail in the background after confirming to the user instead of waiting for the mail to be sent while the user waits (related to bug 284184 - local backport)
- Fix for a regression from our last upgrade involving mid-air collision detection (bug 413258/415490)
- Fix for a problem with the OpenSearch plugin (bug 411844)
- Allow searching for ‘—’ in versions and milestones (bug 362436)
- Fix for Subject lines in emails being improperly line-broken and erratically spaced (bug 411544)
- Add a References header to notification mails to assist with threading in mail clients (bug 376453)

All in all, a very good beginning, but we have much more in store. Work has already begun on phase 1 of our project, scheduled to be complete in Q1 of 2008. We are very excited about the Bugzilla improvements, hoping it really helps improve productivity for the project!

Related, we are always looking for people to help out. If you are interested in working on making Bugzilla better for Mozilla and the rest of the OSS community, please get in contact with me. The more people we have working on this, the faster the improvements come :-)

Bugzilla Improvements

Bugzilla basically runs Mozilla - it’s core to almost everything we do from tracking core Firefox bugs to tracking Marketing events to operating as our IT ticket queue. Quite simply, we wouldn’t be who we are today without it. With all its greatness, there are quite a few things that don’t quite fit the workflow that is Mozilla, and other bugs that are simply annoying.

So, Schrep asked me to kick off a project to address some of the issues we have with Bugzilla and really invest some time and effort to improve Bugzilla for Mozilla, and the rest of the community. I’ve started by rounding up an initial set of improvements after talking to some of the heavy users within Mozilla, asking for their top complaints and suggestions to improve efficiency in using Bugzilla. Here is what I have come up with.

They gave me plenty of things to work on, but I wanted to open it up to others. I’ve added a section to the bottom of the wiki asking for suggestions - please keep your edits there. If you want to vote up another’s suggestion, just add a +1 to their line. I’ll take the top suggestions/defects and add them into the schedule. Keep in mind we won’t be able to do everything, and are limited in terms of capacity but we are throwing some full time weight behind this to help get this moving.

All our changes are planned to first be applied to BMO, then ported to Bugzilla trunk, so all the code will show up in upstream versions of Bugzilla. We hope to make a difference and move Bugzilla forward in ease of use, performance and innovation.

On a side note, I am looking for community Bugzilla members to help - if you are a Bugzilla developer or know someone who would be willing to help, we’ll take all the help we can get! Contact me at justin at mozilla dot com.

Open source for the OpenVPN win

I was reminded of the power of open source software yet again this weekend. A little background:

We here at Mozilla are big fans of OpenVPN. When we rebuilt our datacenter, we did a large search for the right VPN solution. Mozilla’s requirements were somewhat specific:

* Had to work with all three platforms (mac, linux, windows)
* Needed to work with our LDAP infrastructure (i.e. not AD)
* Needed to work through NAT
* We needed to be able to give each user granular per-host access
* We wanted a solution that would allow just Mozilla traffic to traverse the VPN rather than forcing all traffic through the VPN
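The last requirement is what is usually called split tunneling: only traffic bound for specific networks goes through the VPN, and everything else uses the normal default route. As a rough illustration in Python (this is not OpenVPN configuration, and the network ranges below are placeholders rather than our real address space), the routing decision boils down to a membership test:

```python
import ipaddress

# Hypothetical internal ranges that should be reached through the VPN.
VPN_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.50.0/24"),
]


def route_for(destination: str) -> str:
    """Return "vpn" for destinations inside a VPN network, else "default"."""
    address = ipaddress.ip_address(destination)
    if any(address in network for network in VPN_NETWORKS):
        return "vpn"
    return "default"


print(route_for("10.2.3.4"))      # inside a VPN network
print(route_for("203.0.113.10"))  # everything else stays off the VPN
```

In practice the operating system's routing table performs this check, based on the routes the VPN client installs when it connects.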

We looked at many options, most of which were commercial closed-source solutions (given the lack of options). A client-less, SSL-based solution would have been ideal, but it was clear Firefox (!) and Mac support was not ready. We decided on OpenVPN as it met all of our requirements and had the added benefit of being open source and free!

We’ve been happily using openvpn with TunnelBlick as our mac client. Justdave even created a custom installer for our users (pretty slick Dave :-) ). But along comes Leopard - with changes such that the low level network drivers don’t function anymore (along with other issues in the GUI). With some research, mrz found that an OS X tuntap development team had just released new drivers which support Leopard. Still, openvpn wouldn’t connect, TunnelBlick wouldn’t run, etc, so this weekend I set out to fix the issues. After 3-4 hours of figuring out how the TunnelBlick build setup works, fixing some bugs and adding in the new drivers, I have a working version of TunnelBlick, openvpn and tuntap drivers on Leopard.

What’s the point of this rant? I could have *never* fixed this with a closed source VPN client. I’d be hamstrung by Cisco (yes, Cisco John) or some other network vendor while they gave me the normal story that Mac is not a large enough platform to dedicate resources to (never mind that 90+% of Mozilla engineers use Mac hardware). Being able to look at the source, build system and composition of each of these apps made it possible to figure out what the issue was, fix it, and post this build for anyone else who needs it.

Makes me remember why what we do here at Mozilla is so important. So, if you need a Leopard version of TunnelBlick (with tuntap drivers and openvpn 2.0.9 with lzo support), here you go.

go go gadget funnelcake

We recently ran an experiment, code named funnelcake (see polvi’s blog post for more details) - this was an interesting project from IT’s perspective for a few reasons.

First a little background - for one 24 hour period, we would need to serve *all* en-US and de downloads which originate from our website - not a small number. We estimate ~500k downloads a day overall, with a large percentage being en-US and de. Why would we want to host the downloads when we have an excellent mirror network setup, happily serving up our bits? We were interested in gathering statistics on how many people started, aborted or completed the downloads. We could do some of this by adding an FTP server of our own into bouncer, but it is much more interesting to get an idea of the behavior by seeing *all* the traffic. Also, we can correlate the logs later to number of active users and website behavior. Plus 24 hours won’t kill my 95th percentile bandwidth bills :-)

Second, seeing all of the traffic allows us to get a great view of the diversity, amount and frequency of downloads. As you’ll see below, it was quite an increase in our normal traffic.

Third, it’s a great test to stress test our infrastructure, verifying we don’t have any unexpected bottlenecks or performance issues. The good news here is the systems passed with flying colors.
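A quick aside on the 95th percentile comment above, for anyone who hasn't dealt with burstable bandwidth billing. A common scheme samples usage every five minutes over the month, throws away the top 5% of samples and bills on the highest one left. The Python sketch below assumes that scheme, a 30 day month, and invented traffic numbers (100 Mb/s normal, 400 Mb/s during the experiment); none of those figures are our actual numbers.

```python
SAMPLES_PER_DAY = 24 * 60 // 5  # one sample every five minutes


def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile: the highest sample after dropping the top 5%."""
    ordered = sorted(samples_mbps)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def month_of_samples(baseline_mbps, burst_mbps, burst_days, days_in_month=30):
    """A month of flat baseline traffic with one burst of the given length."""
    burst_samples = int(burst_days * SAMPLES_PER_DAY)
    normal_samples = days_in_month * SAMPLES_PER_DAY - burst_samples
    return [baseline_mbps] * normal_samples + [burst_mbps] * burst_samples


# A 24 hour burst falls entirely inside the discarded top 5% of samples.
print(percentile_95(month_of_samples(100, 400, burst_days=1)))

# A two day burst does not, and the billed rate jumps to the burst rate.
print(percentile_95(month_of_samples(100, 400, burst_days=2)))
```

Under these assumptions, 5% of a 30 day month is 36 hours, so a one day experiment fits inside the part of the month that gets thrown away.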

Our setup was pretty simple - we built out three download servers with the archive.mozilla.org nfs share mounted. Slapped apache on them, added them to bouncer and we were off to the races. Here are the traffic graphs (you can probably tell when we switched things over):

Furthermore, Apache really impressed me. The servers were pushing upwards of 80mbs each off nfs, with a load of… 0.00 and cpu hovering around 5%. We sometimes got the occasional 0.10 spike, but all in all, pretty amazing. Graphs from one of the machines:


All in all, I was very happy with the lack of impact on the systems and the continued good performance.

China, similar to you and me

I’m about midway through my first trip to China (in Beijing) - first time to the far east for that matter, and I have to say it’s a pretty interesting place. I’ve been all over Europe and North America, but what has really struck me is how similar Beijing is to many other major international cities I’ve been to. Sure it’s got its unique attractions, food, people and activities - but it isn’t so different that I can’t function or don’t know how to fit in - in fact quite the opposite.

Now let me preface this by saying I am in the outer section of the city in a tech park, and haven’t had time to go into the heart of the city (which I hope to do). But on the 12 hour (!) plane ride over, I had this notion that coming to China would be extremely exotic with very different ways of doing things.

Sure, the Internet access is not the best (i.e. Great Firewall, international congestion, etc), food can be…adventurous (chicken neck, frog, snail, turtle, donkey, and others were all on the menu at tonight’s restaurant), the weather & pollution aren’t the best, politics aren’t in line with what I’d vote for, but all in all - it’s just a city, and a great one at that. People eat and hang out a lot, get work done in similar fashions and live their lives.

I think the differences in how people work, live, and interact in different cultures are incredibly interesting - which is why I think I am enjoying my time here so much. The trip has really highlighted that while there are a lot of differences in the way we choose to live, we often forget just how similar we all are :-)

More technical (read: nerdy) posts later on the Great Firewall, Internet access, colos, and more.