Network outage report - 3/18/08, 8:01pm PDT - 9:25 pm PDT

March 20th, 2008 by justin

We had a network outage at our San Jose datacenter from 8:01 pm PDT until 9:25 pm PDT on March 18. From initial investigation, it appears that one of the switches in a blade server chassis had a software issue, causing a network-wide broadcast storm. The overall effect was that the switching fabric for our San Jose datacenter was unusable.

To mitigate this issue going forward, we have made two changes.

  • Modified the port-channels connecting the core switches to downstream switches to better handle a port-channel member failure.
  • Further tuned broadcast storm protection on every switch port to limit the amount of broadcast & multicast traffic any one device is allowed to send.
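For readers who have not run into storm protection before, here is a minimal sketch of the idea in Python. It is not our switch configuration and not any vendor's implementation; the class name, the threshold and the per-interval counting are all assumptions made up for illustration. Each port gets a budget of broadcast/multicast frames per interval, and anything over the budget is dropped.

```python
from collections import defaultdict


class StormControl:
    """Toy per-port limiter for broadcast/multicast frames (illustration only)."""

    def __init__(self, max_frames_per_interval):
        # Hypothetical threshold; real switches express this differently.
        self.max_frames = max_frames_per_interval
        self.counts = defaultdict(int)  # port -> limited frames seen this interval

    def allow(self, port, is_broadcast_or_multicast):
        """Return True if the frame should be forwarded, False if dropped."""
        if not is_broadcast_or_multicast:
            return True  # unicast traffic is not limited in this sketch
        self.counts[port] += 1
        return self.counts[port] <= self.max_frames

    def next_interval(self):
        """Reset all counters at the start of a new measurement interval."""
        self.counts.clear()


limiter = StormControl(max_frames_per_interval=3)
results = [limiter.allow("port1", True) for _ in range(5)]
print(results)  # the first three frames pass, the rest are dropped
```

The point of limiting per port is that one misbehaving device runs out of budget without taking the rest of the fabric down with it.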

Furthermore, we have a priority case open with the vendor to determine the cause of the issue, as we did capture debug logs. This was in no way related to the scheduled downtime we were in; it just happened to coincide. We apologize for any inconvenience this may have caused. We’ll continue to follow up with the vendor to make sure this issue does not happen again.


Call out for Mirrors

February 19th, 2008 by justin

One of Mozilla’s biggest assets is our mirror network. It allows us to update over 100 million users in under 48 hours with security updates, host and push extensions, and much more - all with donated server space and bandwidth, giving us the ability to focus our efforts on supporting the development community and making all the Mozilla products as reliable, secure and feature-rich as possible.
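To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It only uses the two figures above and assumes the updates are spread evenly over the window, which real traffic certainly is not, so treat the result as a floor on the average rate rather than a measurement.

```python
users = 100_000_000   # "over 100 million users" from the paragraph above
window_hours = 48     # "in under 48 hours"

window_seconds = window_hours * 3600
avg_updates_per_second = users / window_seconds

print(f"{avg_updates_per_second:.0f} updates per second on average")
```

Since the real figures are "over" 100 million and "under" 48 hours, the true average is at least this high, and peaks will be well above it.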

We’d like to build up our mirror network to be even stronger! I am making a call to the community to help us find other mirror sources. Already Paul Vixie from the Internet Software Consortium has stepped up and donated 3gb/s of mirror peak capacity (!). Details on what is required can be found here: http://www.mozilla.org/mirroring.html. While we are always happy to take any mirror donation, we are specifically looking for mirrors which can handle in excess of 100mb/s during peak traffic times. Please contact me directly if you have any ideas of people/organizations/companies that might be willing to donate either bandwidth or mirror space.


Bugzilla improvement project update

February 12th, 2008 by justin

As you may know from my last post on Bugzilla, a lot of improvements/fixes are in the works. I wanted to give everyone an update on what has gone live on bugzilla.mozilla.org so far:

- Send mail in the background after confirming to the user instead of waiting for the mail to be sent while the user waits (related to bug 284184 - local backport)
- Fix for a regression from our last upgrade involving mid-air collision detection (bug 413258/415490)
- Fix for a problem with the OpenSearch plugin (bug 411844)
- Allow searching for ‘---’ in versions and milestones (bug 362436)
- Fix for Subject lines in emails being improperly line-broken and erratically spaced (bug 411544)
- Add a References header to notification mails to assist with threading in mail clients (bug 376453)
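To show what the References header in the last item buys you, here is a small sketch using Python's standard email library. This is not the Bugzilla code (Bugzilla is written in Perl); the function name, the Message-ID scheme and the bugs.example.org domain are hypothetical. The idea is simply that every follow-up mail for a bug points back at one root message, so mail clients can group them into a thread.

```python
from email.message import EmailMessage


def build_bug_mail(bug_id, comment_number, body, domain="bugs.example.org"):
    """Build a notification mail whose headers let clients thread by bug."""
    # Hypothetical ID scheme: one root ID per bug, one ID per later comment.
    root_id = f"<bug-{bug_id}@{domain}>"
    msg = EmailMessage()
    msg["Subject"] = f"[Bug {bug_id}] Update"
    if comment_number == 0:
        # The first mail for a bug is the root of the thread.
        msg["Message-ID"] = root_id
    else:
        msg["Message-ID"] = f"<bug-{bug_id}-comment-{comment_number}@{domain}>"
        # Follow-ups point back to the root so clients can group them.
        msg["In-Reply-To"] = root_id
        msg["References"] = root_id
    msg.set_content(body)
    return msg


first = build_bug_mail(12345, 0, "Bug filed.")
mail = build_bug_mail(12345, 3, "Patch attached.")
print(str(mail["References"]))
```

Without these headers, clients have to guess at threading from the Subject line, which breaks as soon as the summary of a bug is edited.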

All in all, a very good beginning, but we have much more in store. Work has already begun on phase 1 of our project, scheduled to be complete in Q1 of 2008. We are very excited about the Bugzilla improvements and hope they really help improve productivity for the project!

On a related note, we are always looking for people to help out. If you are interested in working on making Bugzilla better for Mozilla and the rest of the OSS community, please get in contact with me. The more people we have working on this, the faster the improvements come :-)


Bugzilla Improvements

December 5th, 2007 by justin

Bugzilla basically runs Mozilla - it’s core to almost everything we do from tracking core Firefox bugs to tracking Marketing events to operating as our IT ticket queue. Quite simply, we wouldn’t be who we are today without it. With all its greatness, there are quite a few things that don’t quite fit the workflow that is Mozilla, and other bugs that are simply annoying.

So, Schrep asked me to kick off a project to address some of the issues we have with Bugzilla and really invest some time and effort to improve Bugzilla for Mozilla, and the rest of the community. I’ve started by rounding up an initial set of improvements after talking to some of the heavy users within Mozilla, asking for their top complaints and suggestions to improve efficiency in using Bugzilla. Here is what I have come up with.

They gave me plenty of things to work on, but I wanted to open it up to others. I’ve added a section to the bottom of the wiki asking for suggestions - please keep your edits there. If you want to vote up another’s suggestion, just add a +1 to their line. I’ll take the top suggestions/defects and add them into the schedule. Keep in mind we won’t be able to do everything, and are limited in terms of capacity but we are throwing some full time weight behind this to help get this moving.

All our changes are planned to first be applied to BMO, then ported to Bugzilla trunk, so all the code will show up in upstream versions of Bugzilla. We hope to make a difference and move Bugzilla forward in ease of use, performance and innovation.

On a side note, I am looking for community Bugzilla members to help - if you are a Bugzilla developer or know someone who would be willing to help, we’ll take all the help we can get! Contact me at justin at mozilla dot com.


Open source for the OpenVPN win

October 13th, 2007 by justin

I was reminded of the power of open source software yet again this weekend. A little background:

We here at Mozilla are big fans of OpenVPN. When we rebuilt our datacenter, we did a large search for the right VPN solution. Mozilla’s requirements were somewhat specific:

* Had to work with all three platforms (mac, linux, windows)
* Needed to work with our LDAP infrastructure (i.e. not AD)
* Needed to work through NAT
* We needed to be able to give each user granular per-host access
* We wanted a solution that would allow just Mozilla traffic to traverse the VPN rather than forcing all traffic through the VPN
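The last requirement is usually called split tunneling. As a rough sketch of the routing decision (the network ranges and the function name below are made up for illustration, not our real ranges or anything from OpenVPN itself), the client only sends traffic through the tunnel when the destination falls inside one of the internal networks:

```python
import ipaddress

# Hypothetical internal ranges; the real ones are not listed in this post.
VPN_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.50.0/24"),
]


def goes_through_vpn(destination):
    """Return True if traffic to `destination` should use the VPN tunnel."""
    addr = ipaddress.ip_address(destination)
    return any(addr in net for net in VPN_NETWORKS)


print(goes_through_vpn("10.2.3.4"))       # internal host: use the tunnel
print(goes_through_vpn("93.184.216.34"))  # public host: use the normal route
```

Everything else keeps using the user's normal connection, which keeps personal traffic off our network and saves VPN bandwidth.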

We looked at many options, most of which were commercial closed-source solutions (given the lack of options). A client-less, SSL-based solution would have been ideal, but it was clear Firefox (!) and Mac support were not ready. We decided on OpenVPN as it met all of our requirements and had the added benefit of being open source and free!

We’ve been happily using openvpn with TunnelBlick as our mac client. Justdave even created a custom installer for our users (pretty slick Dave :-) ). But along comes Leopard - with changes such that the low level network drivers don’t function anymore (along with other issues in the GUI). With some research, mrz found that an OS X tuntap development team had just released new drivers which support Leopard. Still, openvpn won’t connect, TunnelBlick won’t run, etc, so this weekend I set out to fix the issues. After 3-4 hours of figuring out how the TunnelBlick build setup works, fixing some bugs and adding in the new drivers, I have a working version of TunnelBlick, openvpn and tuntap drivers on Leopard.

What’s the point of this rant? I could have *never* fixed this with a closed source VPN client. I’d be hamstrung by Cisco (yes, Cisco John) or some other network vendor while they gave me the normal story that Mac is not a large enough platform to dedicate resources to (never mind that 90+% of Mozilla engineers use Mac hardware). Being able to look at the source, build system and composition of each of these apps made it possible to figure out what the issue was, fix it, and post this build for anyone else who needs it.

Makes me remember why what we do here at Mozilla is so important. So, if you need a Leopard version of TunnelBlick (with tuntap drivers and openvpn 2.0.9 with lzo support), here you go.


go go gadget funnelcake

October 4th, 2007 by justin

We recently ran an experiment, code named funnelcake (see polvi’s blog post for more details) - this was an interesting project from IT’s perspective for a few reasons.

First a little background - for one 24 hour period, we would need to serve *all* en-US and de downloads which originate from our website - not a small number. We estimate ~500k downloads a day overall, with a large percentage being en-US and de. Why would we want to host the downloads when we have an excellent mirror network setup, happily serving up our bits? We were interested in gathering statistics on how many people started, aborted or completed the downloads. We could do some of this by adding an FTP server of our own into bouncer, but it is much more interesting to get an idea of the behavior seeing *all* the traffic. Also, we can correlate the logs later to number of active users and website behavior. Plus 24 hours won’t kill my 95th percentile bandwidth bills :-)

Second, seeing all of the traffic allows us to get a great view of the diversity, amount and frequency of downloads. As you’ll see below, it was quite an increase in our normal traffic.

Third, it’s a great test to stress test our infrastructure, verifying we don’t have any unexpected bottlenecks or performance issues. The good news here is the systems passed with flying colors.
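A quick aside on the 95th percentile remark, for anyone who hasn't dealt with bandwidth bills. The sketch below assumes a common billing scheme (5-minute samples over a 30-day month, top 5% of samples thrown away) and made-up traffic levels; our actual contract terms and numbers are not in this post. One day is about 3.3% of the month, so a 24 hour spike lands entirely in the discarded samples.

```python
def percentile_95(samples):
    """Return the billable rate: the highest sample left after dropping the top 5%."""
    ordered = sorted(samples)
    # Integer arithmetic avoids floating point surprises in the index.
    index = len(ordered) * 95 // 100 - 1
    return ordered[max(index, 0)]


SAMPLES_PER_DAY = 24 * 12            # assumed: one sample every 5 minutes
normal_mbps, spike_mbps = 100, 400   # hypothetical traffic levels

month = [normal_mbps] * (SAMPLES_PER_DAY * 30)
baseline = percentile_95(month)

# One day of experiment traffic: 1/30 of the samples, under the 5% cutoff.
one_day = [spike_mbps] * SAMPLES_PER_DAY + month[SAMPLES_PER_DAY:]
with_one_day = percentile_95(one_day)

# Two days would be about 6.7% of the samples and would raise the bill.
two_days = [spike_mbps] * (2 * SAMPLES_PER_DAY) + month[2 * SAMPLES_PER_DAY:]
with_two_days = percentile_95(two_days)

print(baseline, with_one_day, with_two_days)
```

In other words, under these assumptions the experiment had roughly a day and a half of headroom before it would have started costing real money.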

Our setup was pretty simple - we built out three download servers with the archive.mozilla.org nfs share mounted. Slapped apache on them, added them to bouncer and we were off to the races. Here are the traffic graphs (you can probably tell when we switched things over):

Furthermore, Apache really impressed me. The servers were pushing upwards of 80mbs each off nfs, with a load of… 0.00 and cpu hovering around 5%. We sometimes got the occasional 0.10 spike, but all in all, pretty amazing. Graphs from one of the machines:


All in all, I was very happy with the lack of impact on the systems and the continued good performance.


China, similar to you and me

August 28th, 2007 by justin

I’m about midway through my first trip to China (in Beijing) - first time to the Far East for that matter, and I have to say it’s a pretty interesting place. I’ve been all over Europe and North America, but what has really struck me is how Beijing is similar to many other major international cities I’ve been to. Sure it’s got its unique attractions, food, people and activities - but it isn’t so different that I can’t function or don’t know how to fit in - in fact quite the opposite.

Now let me preface this by saying I am in the outer section of the city in a tech park, and haven’t had time to go into the heart of the city (which I hope to do). But on the 12 hour (!) plane ride over, I had this notion that coming to China would be extremely exotic with very different ways of doing things.

Sure, the Internet access is not the best (i.e Great Firewall, international congestion, etc), food can be…adventurous (chicken neck, frog, snail, turtle, donkey, and others were all on the menu at tonight’s restaurant), the weather & pollution aren’t the best, politics aren’t in line with what I’d vote for, but all in all - it’s just a city, and a great one at that. People eat and hang out a lot, get work done in similar fashions and live their lives.

I think the differences in how people work, live, and interact in different cultures are incredibly interesting - which is why I think I am enjoying my time here so much. The trip has really highlighted that while there are a lot of differences in the way we choose to live, we often forget just how similar we all are :-)

More technical (read: nerdy) posts later on the Great Firewall, Internet access, colo’s, and more.


Yes sir, may I have another (update)?

June 29th, 2007 by justin

As many of you may know, we released a major update to offer 1.5.0.12 -> 2.0.0.4. This is significant to the infrastructure for a few reasons. First off, all of these updates will be *full* updates, i.e. full browser downloads - no 300k mar file. That puts a large load on our mirror network (nothing they can’t handle :-) ). Second, as people update from 1.5.0.12 -> 2.0.0.4, we anticipate people will need to update addons for compatibility reasons.

We released at 3pm PDT yesterday - here are some of the stats so far:

* Just under 1 million people have been updated from 1.5.0.12 -> 2.0.0.4
* We are updating people at a rate of 20-30 per second (FF2 download rate was about 30/second)
* Mirrors are seeing about a 2-3x increase in bandwidth
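As a sanity check on how the first two stats fit together, here is the arithmetic spelled out. It assumes the update rate stays constant at either end of the 20-30 per second range, which is a simplification since the real rate moves around over the day.

```python
updated_users = 1_000_000     # "just under 1 million" from the stats above
rates_per_second = (20, 30)   # reported update rate range

hours_needed = {}
for rate in rates_per_second:
    # Seconds to reach the total at this rate, converted to hours.
    hours_needed[rate] = updated_users / rate / 3600
    print(f"{rate}/s -> {hours_needed[rate]:.1f} hours to reach 1 million")
```

So at these rates it takes somewhere between roughly 9 and 14 hours to reach a million users, which lines up with a release at 3pm the previous day.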

I expect to see increased load throughout the day, but so far so good! Huge thanks to webdev for their help optimizing addons.mozilla.org - it’s been a huge win for this release.


Umbrellas make people happy

May 15th, 2007 by justin

Recently there have been a lot of cube moves at MoCo headquarters - IT included. We are back upstairs in building K, but the sun was creating an issue for a few of my guys. Enter Cosco to the rescue:

IMG_1134.jpg
IMG_1132.jpg

Kinda like their own IT oasis. Umbrellas make people happy - I mean, what else could get Aravind smiling like this :-)


New co-pilot :-)

May 6th, 2007 by justin

Not sure if people noticed given how well the team has been doing, but I’ve been a bit hard to reach for the past few weeks. The reason is there is a new addition to the Fitzhugh Family - Ryan Mason Fitzhugh born on 4/22/07 at 9:15am. The whole family is doing great, and while I’ve really enjoyed the time off, I miss my Mozilla friends :-)

I’ll be back at work on Monday - looking forward to going into the 07 Intern season.

IMG_0127.jpg

