Mozilla IT

Mozilla IT & Operations

This week in IT: early holiday shopping

Mozilla is growing, and we have to grow our infrastructure right along with it.  A lot of this just happens behind the scenes as we are buying and installing new gear every month.  For today’s post I thought I would give an insight into the broad range of servers we have in the ordering/shipping process right now.

if only it were this easy...

  • New mailbox store – Our last Zimbra hardware upgrade was back in April. We are at the point now where we need to shed some mailboxes off to a second server to help with load.  This is not so much due to growth in the number of mailboxes we host, but because most of our users have multiple clients hitting Zimbra all of the time. Desktop, laptop, cellphone, tablet... multiply that by the number of mailboxes, and we have to start spreading the love to an additional server before it becomes a problem.
  • New VMware servers – More VM space for QA, Labs, and more!
  • New load balance test box – We have run into issues with our blade-hosted load balancer boxes.  We will swap a couple out in one of our data centers for some beefier boxes and 10gig ethernet connections.  Depending on our results here we may be reworking our existing load balancing clusters to fit this new model.  In addition, this will provide guidance for our new data center in the spring.
  • New gear for addons.mozilla.org – a whole rack of blades, in fact!  We are also awaiting the arrival of some Fusion IO flash accelerator cards for the AMO database nodes.  This will greatly increase the speed of I/O for those database nodes, ending up in faster results and more capacity.  We did the same for the bugzilla databases recently with great success.
  • Elastic Search servers for Socorro – 8 1U servers (aka “pizza boxes”) are on their way for Socorro.
  • New servers and infrastructure for China – We are prepping infrastructure for a new data center in China that will provide plenty of room and power where our current data center is tight.  This allows us to source some sites locally there as well as cache others there for improved performance throughout the region.

On a logistical note, we are getting a new block of space in our Phoenix data center that all of this gear will go in once it is ready (except for the China servers).  We are close – just a few more pieces to that puzzle and we will start racking this new gear.

That’s all for now, it is time to get back to obsessing over FedEx trackers and logistical details.  Next week, we invent teleportation.

Traffic Distribution

We’ve recently been doing a lot of CDN work (see my last post on getpersonas.com), and out of that has come some interesting data as to our world-wide traffic distribution.

Here’s a breakdown of the traffic for getpersonas.com, by region:

 

The CDN traffic for http://mozilla.org/firefox is similar… although Asia ranks a bit higher (22%) and North America lower (30%).

This should tell you right away that if we just focus on North America and Europe (as many CDNs do), we’re going to cut out almost 1/3 of our visitors. Because of this, we’re spending a good bit of time researching CDN performance, trying to find platforms that will improve the user experience for our users in “the other 30%”.

Let’s drill down into South America a bit. This is a region that doesn’t get a lot of love, at least from a CDN perspective. Quite often the nearest CDN node ends up being in Miami or southern California, and is at least 250ms away… multiply that over many page elements, and it’s quite easy to end up with a 10 second page load time. Other times there might be a nearby node, but because of poor ISP peering it’s not actually reachable from very many networks. There’s also a vibrant Mozilla community here, and we’d like to be able to do a better job for them.
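To see how quickly round trips add up, here is a back-of-the-envelope sketch. It assumes every page element costs one full round trip and ignores DNS, connection setup and bandwidth; the element count of 40 and the `parallel` parameter are hypothetical values chosen for illustration, not measurements.

```python
import math


def estimated_load_time_s(rtt_ms: float, elements: int, parallel: int = 1) -> float:
    """Rough page load time if each element costs one round trip.

    `parallel` is how many elements are fetched at once (an assumption;
    real browsers vary).
    """
    rounds = math.ceil(elements / parallel)
    return rounds * rtt_ms / 1000.0


# 40 elements fetched one after another over a 250 ms round trip.
print(estimated_load_time_s(250, 40))              # 10.0
# The same page with 6 fetches in flight at a time.
print(estimated_load_time_s(250, 40, parallel=6))  # 1.75
```

Even with parallel fetches, the round-trip time dominates, which is why a nearer CDN node matters so much.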

 

This lines up fairly well with the overall World Internet Stats: http://www.internetworldstats.com/stats10.htm.

So if you’re in one of these regions, fear not! We do care about you, and things will get better soon. :)

 

RFO: SCL1 outage Oct 16, 2011

On October 13th at 1324 PST Nagios alerted the start of a network event affecting reachability to the SCL1 data center. SCL1 is configured with redundant internet links where a VPN traverses a redundant firewall at both ends. There is also a point-to-point (p2p) that connects directly to SJC1.

The running configuration had the VPN as the active path and the p2p disabled because of an ongoing issue (bug 680463).

Because the p2p path was disabled, SCL1 and all its services, primarily the release engineering and build infrastructure, experienced a complete outage.

Upon initial investigation, each VPN endpoint, fw1.scl1 and vpn1.sjc1, showed that the other was sending an incorrect response while renegotiating the tunnel. Standard non-destructive troubleshooting was attempted to reestablish the tunnel, with no success.

During troubleshooting, fw1.scl1 became unresponsive and on-site presence was required. Once fw1.scl1 had been restored on site, traffic was shifted from the VPN to the p2p, even though the p2p had not been confirmed fixed. Before traffic was moved, basic steps were taken to reseat optics and clean fiber patches.

A review of the available logs did not point to any specific reason why the VPN failed, nor why the methods used to recover it failed.

While traffic was being shifted to the p2p, the VPN recovered on its own. The decision was made to stay on the p2p and monitor it closely, being mindful of bug 680463, which has since been resolved.

Netops is investigating configurations to improve link-fault detection and automatic failover to the standby path, and will implement them at a later date.
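The failure mode above can be illustrated with a small sketch of priority-ordered path selection. This is not the actual firewall or router configuration; the `Path` class and the `enabled`/`healthy` flags are hypothetical names used only to show why a disabled standby path turns a single fault into a full outage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Path:
    name: str
    enabled: bool  # administratively enabled (assumed flag)
    healthy: bool  # currently passing a link/tunnel health check (assumed flag)


def select_path(paths: List[Path]) -> Optional[str]:
    """Return the first enabled, healthy path in priority order, or None."""
    for path in paths:
        if path.enabled and path.healthy:
            return path.name
    return None


# During the outage: the VPN is down and the p2p is administratively disabled.
outage = [Path("vpn", enabled=True, healthy=False),
          Path("p2p", enabled=False, healthy=True)]
print(select_path(outage))  # None, i.e. no usable path

# With the standby enabled, the same VPN failure fails over to the p2p.
with_standby = [Path("vpn", enabled=True, healthy=False),
                Path("p2p", enabled=True, healthy=True)]
print(select_path(with_standby))  # p2p
```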

Complete timeline:

13:24 Initial nagios alert.
13:34 Netops is paged.
13:56 Netops responds.
14:05 Escalation to dmoore (page)
14:13 Escalation to dmoore (phone call)
14:15 Escalation to ravi (page)
14:16 Escalation to ravi (page)
14:16 Ravi responds
14:48 fw1.scl1 becomes unresponsive
16:17 Nagios alerts begin to clear

Getpersonas.com CDN Changes

We have just made a change to a DNS record for getpersonas.com affecting its CDN usage, and thought we’d share some of our findings as to its worldwide traffic patterns.

Prior to now, getpersonas-cdn.mozilla.net pointed to a CDN aggregator service at 3crowd. We use this for several things. For example, aus2.mozilla.org (part of the Firefox Automatic Updater Service) goes through 3crowd. It works very well for this type of thing, where what we’re really after is distributing the load across our own infrastructure, in order to gain redundancy.

One of its limitations though is that it is inherently disconnected from the actual CDNs being used. That is, it has no way to know if a particular CDN is really good in one country, but really bad in another. It can’t make that type of determination on-the-fly. By default it just assigns a weight to each CDN, and doles out the traffic in whatever ratios we specify.
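As a rough illustration of static weighting (not 3crowd's actual implementation), the sketch below picks a backend purely from configured weights. The backend names and the equal weights are assumptions made for the example. Note that nothing about the user's location enters the decision.

```python
import random
from typing import Dict


def pick_cdn(weights: Dict[str, float], rng: random.Random) -> str:
    """Choose a backend at random, in proportion to its configured weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def traffic_split(weights: Dict[str, float], requests: int, seed: int = 0) -> Dict[str, int]:
    """Simulate `requests` lookups and count how many each backend receives."""
    rng = random.Random(seed)
    counts = {name: 0 for name in weights}
    for _ in range(requests):
        counts[pick_cdn(weights, rng)] += 1
    return counts


# Hypothetical backends with equal weights: each gets roughly a third of
# the traffic, wherever in the world the request comes from.
print(traffic_split({"cdn_a": 1, "cdn_b": 1, "own_node": 1}, 30000))
```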

For a normal website like getpersonas.com, with broad international appeal, this isn’t quite ideal. For folks in North America or Europe, it works just fine… generally CDNs have pretty good coverage there, and for the most part it won’t make a big difference which one you end up using. Outside of those regions though, CDN coverage is much more hit-and-miss. If you’re in Australia or Argentina, which CDN you get routed to might make a very big difference on your page load times.

For example, here’s a “before” graph of getpersonas.com load time from our Gomez test node in Sydney, Australia (on Telstra, to be precise):

As you can see, it’s quite spiky. If you squint just right you can identify 3 tiers of performance. Not surprisingly, they line up with the 3 different backend services configured in 3crowd. The worst tier is from a local node we control ourselves in San Jose, CA. Clearly, that’s a pretty terrible choice from Australia… and yet it’s getting 1/3 of the overall traffic, simply because 3crowd doesn’t know any better.

This morning I changed this record away from 3crowd towards a new service we’re demoing, Cedexis. This is a similar service, but they maintain a database of CDN provider response times around the world… all the way down to the individual ISP level. Using this database, they can intelligently choose which CDN to use for every single request, theoretically using the fastest one from the user’s location.
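Conceptually, the measurement-driven approach looks something like the sketch below. This is only an illustration of the idea, not Cedexis's actual algorithm or API. The measurement table, ISP names, CDN names and response times are all made up.

```python
from typing import Dict, Tuple

# Hypothetical measurements: (country, ISP) -> {CDN name: median response time in ms}
MEASUREMENTS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("BR", "example-isp-1"): {"cdn_a": 120.0, "cdn_b": 480.0},
    ("AU", "example-isp-2"): {"cdn_a": 310.0, "cdn_b": 95.0},
}


def fastest_cdn(country: str, isp: str,
                measurements: Dict[Tuple[str, str], Dict[str, float]],
                default: str) -> str:
    """Pick the CDN with the lowest measured response time for this network.

    Falls back to `default` when there is no data for the (country, ISP) pair.
    """
    timings = measurements.get((country, isp))
    if not timings:
        return default
    return min(timings, key=timings.get)


print(fastest_cdn("BR", "example-isp-1", MEASUREMENTS, default="cdn_a"))  # cdn_a
print(fastest_cdn("AU", "example-isp-2", MEASUREMENTS, default="cdn_a"))  # cdn_b
```

Unlike static weights, the same request can get a different answer depending on which network it comes from.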

Unfortunately it’s too soon to have a useful “after” graph to share. Instead, I’d like to share some information as to just how effective the new service might end up being.

“Global Village” is (as far as our traffic mix is concerned) the 4th biggest ISP in Brazil. They account for approximately 8% of Brazil’s traffic to getpersonas.com. Brazil as a whole is 45% of our South American traffic, which in turn is around 9% of our worldwide traffic. That’s the level to which we can drill down. Here’s a graph of the “decisions” made by Cedexis for this one ISP in Brazil, over the last 24 hours. Remember, before today it would have been 50/50.


This is 94% vs. 6%. This should give you some idea of just how important choosing the proper CDN could be… far more often than not, we were making the wrong decision for these users. Situations like this generally occur when an ISP has limited peering with upstream providers, and thus does not have a good route to some places. This particular ISP appears to have a good connection to CDNetworks, but a poor one to Edgecast. Obviously this type of thing is more prevalent in smaller ISPs, where they may not be able to get (or afford) more complete peering agreements. You generally won’t see this type of thing with very big ISPs.
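To put the drill-down in perspective, the shares quoted earlier for this ISP can simply be multiplied together. The variable names are illustrative; the percentages are the approximate ones given in this post.

```python
isp_share_of_brazil = 0.08            # this ISP's share of Brazil's traffic
brazil_share_of_south_america = 0.45  # Brazil's share of South American traffic
south_america_share_of_world = 0.09   # South America's share of worldwide traffic

isp_share_of_world = (isp_share_of_brazil
                      * brazil_share_of_south_america
                      * south_america_share_of_world)

print(f"{isp_share_of_world:.3%}")  # 0.324%
```

So this single ISP is roughly a third of one percent of worldwide getpersonas.com traffic, and decisions are being made at that granularity.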

Now aggregate this around the world, and you can quickly see that even if the worldwide traffic mix still comes out to approximately 50/50 (and it does… within a few percentage points), making better decisions locally can result in a much better experience for a large group of users.

Over the next few days we should get a good idea of how much this is actually helping… I expect small gains worldwide, and possibly large gains in certain regions. Especially outside of North America, I expect getpersonas.com will be substantially improved, with Asia-Pacific and South America seeing the biggest gains. When we have some data, I’ll post a graph showing the worldwide before-and-after.

New LDAP Infrastructure

This weekend, I rolled out a new LDAP infrastructure. Here are the details:

The why:

At Mozilla we depend heavily on our OpenLDAP-based[1] authentication system. As we’ve grown quite a bit over the past year or so, it has become apparent that our LDAP ecosystem wasn’t scaling accordingly. Until now, we’ve relied mostly on a single master server with a few slaves all behind an aging load balancer that has given us trouble in the past. Most of the slaves were no longer in the pool as their configurations have drifted over the years and a lack of documentation and consistency made it hard to add capacity, make changes or even ensure high availability in the event of a hardware failure. All the LDAP servers were actually servers primarily dedicated to other services, so extra load on those services would have the side effect of making the LDAP service unreliable. As we’ve added a few satellite offices and an extra datacenter that all relied on having some sort of authentication system in sync with our primary LDAP server in our San Jose data center, it was decided that we need to redo this setup with something a little more scalable and with better configuration management.

The what:

Over the past 6 months or so, I’ve done quite a bit of research learning how OpenLDAP works, how it behaves, how it scales and most of all, how it is set up and being used at Mozilla. Understanding the current setup was key to designing a better architecture. The first thing to do was to gather all the information, the configurations of all the existing LDAP servers and merge those into a centralized configuration management tool. As we’ve been using puppet[2], I wrote a module to manage our OpenLDAP infrastructure. This turned out to be a more difficult task than first anticipated, as it had to support managing SSL certificates, a master server, slave servers, intermediary servers (servers that act as a master to other slaves, but are slaves themselves, replicating from the main master), all of which have slightly different configuration directives but overall need a similar configuration that stays in sync. Furthermore, we have some applications that act as frontend management tools for LDAP. Some are used by our team to provision new user accounts, maintain access control groups, etc. but also our phonebook directory app, which allows users to edit their own LDAP entries through a web interface. All these things needed to be taken into account when re-architecting the infrastructure. All of the apps were hosted on a single server, which was also the master LDAP server among other things. If that single box failed, we’d be in a bad situation.

The how:

I designed a new infrastructure that splits the various components out a little. The phonebook app would live on a cluster of 6 machines (shared with other, similar apps – think: intranet). Our LDAP master would move to a dedicated server, not used as an authentication backend for other services, but rather have dedicated authentication slaves in each datacenter and office. As we have more capacity in our Phoenix data center, the master would live there along with the webservers serving the phonebook app. Starting out, there are two dedicated slaves replicating from the master in Phoenix, load balanced behind our Zeus load balancer[3] cluster to be used as an authentication backend for any services in our Phoenix datacenter. This setup makes it easy to add capacity as needed, and provides high availability in the event of a failure. There are also two dedicated slaves in our San Jose data center, but rather than have them both replicate from the master in Phoenix, it was decided to add an intermediary server in San Jose. The intermediary server would replicate from the master in Phoenix, and the San Jose slaves would replicate from it. Aside from reducing the cross-datacenter traffic, this provides better data consistency within a datacenter. A similar intermediary server was set up in our Mountain View office to provide a single point for all the other office LDAP servers to replicate from. Each individual Mozilla office has two LDAP slaves and instead of a load balancer, those use a virtual floating IP address that can move from one server to the other using keepalived. We balance our other services such as DNS in this fashion and it reduces the need for extra load balancing equipment in the lower traffic offices, where LDAP is primarily used for wireless connectivity. 
All servers would be set up completely using puppet, so that in the event of a hardware failure, or the need to add more slaves, adding a few lines to our puppet configurations can make this happen in a matter of minutes without much thought or effort involved. The other piece to the puzzle is our internal addressbook. We use OpenLDAP to provide addressbook lookups for mail clients. In the past, this was done using a single machine that was an LDAP slave. As we scale, we needed better redundancy there too. The addressbook would now live on two machines, also behind a load balancer. As an extra security measure, since this service is directly on the internet, it was decided to change the configuration to make the addressbook “slaves” be simple proxies that only allow the lookup of a few select attributes. A compromised addressbook server would not result in a compromised LDAP database. Win! Speaking of security, the entire LDAP infrastructure was moved to a more secure VLAN, completely inaccessible to or from the internet, with the addressbook being the only thing with any exposure to the internet. Also, all ACLs were audited and updated to provide the minimal access necessary, while at the same time using standardized templates with puppet, making it easy to add and remove ACLs as needed with proper version control and auditing in place.
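The replication layout described above can be pictured as a small tree. The sketch below is only a model for illustration: the host names are hypothetical, and it assumes the Mountain View intermediary replicates from the San Jose intermediary, which the description does not state explicitly.

```python
from typing import Dict, List, Optional

# Hypothetical host names; each server maps to the server it replicates from.
REPLICATES_FROM: Dict[str, Optional[str]] = {
    "master.phx": None,
    "slave1.phx": "master.phx",
    "slave2.phx": "master.phx",
    "intermediary.sjc": "master.phx",
    "slave1.sjc": "intermediary.sjc",
    "slave2.sjc": "intermediary.sjc",
    "intermediary.mtv": "intermediary.sjc",  # assumption, see above
    "slave1.office": "intermediary.mtv",
    "slave2.office": "intermediary.mtv",
}


def replication_path(host: str, topology: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of servers a change passes through, master first."""
    path = [host]
    while topology[path[-1]] is not None:
        path.append(topology[path[-1]])
    return list(reversed(path))


print(replication_path("slave1.sjc", REPLICATES_FROM))
# ['master.phx', 'intermediary.sjc', 'slave1.sjc']
print(len(replication_path("slave1.office", REPLICATES_FROM)) - 1)  # 3 replication hops
```

Each intermediary means only one server per site pulls changes across the wide-area link, at the cost of an extra hop before a change reaches the furthest slaves.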

The move:

First, I set up all the new hardware and set up the full configuration using puppet. After drawing diagrams and identifying all network flows needed for replication and authentication to work, I worked closely with our network operations team to make sure everything would work as expected. Then I set up our San Jose intermediary server to replicate from the old master in San Jose to ensure that the overall flow of replication would work as expected and began testing various LDAP queries. Meanwhile, I set up the phonebook application in Phoenix on a new cluster of SeaMicro servers and began testing it against the new master. All of this was unknown territory, as we were moving from using a local LDAP server to using a remote one, going from RHEL5 to RHEL6, moving from single-hosted to multi-hosted. It was a whole new environment. I worked with the webdev team and our own tools developer to update our LDAP apps to work in the new environment. As we were already rolling out our new offices in San Francisco, Toronto and Paris, I set those up to replicate from the new intermediary server in Mountain View, so that half of the infrastructure had already been in production for a while, which was actually crucial to the testing phase. Once I was satisfied that everything would work properly, I identified what was left to do to get completely off the old infrastructure and move to the new. At this point everything was set up, and it was essentially just a matter of moving the master database to the new master server in Phoenix, changing some DNS names to point at new IPs and making sure that all the clients still worked as expected. All of this needed to happen with minimal downtime as we rely on LDAP being available 24/7 for so many things, including mail, wi-fi, svn, mercurial, our intranet wikis, shell servers, etc. I decided to tackle the project on a Saturday when the fewest people would be affected. 
I was pretty confident that the move could be done in under two hours, since I spent months preparing for it and ironing out the details. This was mostly the case, and there were only a few brief downtimes during the two hour window where various services failed to authenticate. This was mostly when I had to re-sync the slaves to have them catch up to their new masters. I did run into a few problems though, and here is the postmortem on that:

The Postmortem:

Everything went as planned. I shut down the old master, copied the full database to the new server, set the intermediary slave in San Jose to replicate from the new master in Phoenix and reset synchronization on the slaves in Phoenix. At the same time I changed the DNS to point at the new load balanced IPs. Around 7am on Saturday, I started ensuring that the shell servers could talk to the new LDAP servers, fixing some clients that had hardcoded the old master as their LDAP server, and adding nagios checks to all the new slaves and the new master. At 10am, the maintenance window started, so I shut down the master and did the move and DNS changes then. By 10:30 San Jose was completely using only the new servers. Then I changed the DNS in Phoenix to point at the new servers and made sure the replication was working properly, both locally and remotely all the way to the remote offices through two intermediary servers. Everything went pretty well. By the end of the maintenance window at noon, I was just double and triple checking that replication was working. Around 1pm, I remembered that although I had made sure that our password reset app was working on our new cluster in Phoenix, I hadn’t ever tested changing my password and ensuring that the change would replicate properly. I tested it and found that it didn’t work. I worked with Rob Tucker, who happened to be online at the time, to troubleshoot why it wasn’t working. It turned out that there was one piece missing from our new master: a password check module that we’ve had in use for a few years, but that was completely undocumented. Rob helped me get the 64-bit version of the module compiled and installed in the new master and we finally got a password change to go through… but still not with the webapp that is user-facing for this purpose. We discovered that the app had one portion where it had hardcoded “localhost” as its LDAP server, rather than honoring the configuration directive. 
After patching the app, we finally had a successful password change.  At approximately 4pm, I was wrapping up and ready to leave for the evening, when a user e-mailed me to inform me that the phonebook app wouldn’t allow him to change anything in his profile. I couldn’t reproduce the issue, but realized this was because as an LDAP administrator, I have full admin rights to the LDAP database, whereas a regular user has a different set of ACLs that apply. This is when I realized that in all my testing, I neglected to test a normal user account. I hacked on the permissions a bit more on Saturday night to get that working and had the ACLs fixed by midnight (I took a break from 5pm – 11pm). On Sunday morning, I went through and verified that the little surprises I discovered the day before were documented and added to the puppet manifest. I checked the temporary patch to the password reset app into svn and tested it again to make sure there were no more glitches. I also fixed the nagios alerts for the Mountain View slaves, which were misconfigured and wouldn’t have alerted to a problem, if there had been one. I’m glad I came back to double check that. :)

What I learned:

Details are important. Now that we have the new infrastructure, my next priority is setting up a stage infrastructure that can be used for the phonebook app, the password reset app and other tools and in general as a place to test and stage changes to our LDAP infrastructure.

Testing is important. I should have tested the password reset app. I should have tested the phonebook with a normal user account.

People can be single points of failure. Before I started working on this, there was only one person with the full set of knowledge of our LDAP infrastructure. He left earlier this year to pursue other things and although he had documented most common issues and troubleshooting steps for our infrastructure, there was a huge amount of information that we didn’t have (the password check module for instance), and I had to learn a lot from scratch. I feel like I’ve learned a tremendous amount about how OpenLDAP works in the past 6 months and I enjoy working on it, but it is now extremely important that I don’t become a single point of functionality for our LDAP environment. I’ve worked on documenting all the bits and pieces I have about our infrastructure, gave a tech talk to our team about it and will continue involving other members of my team and documenting all the changes. I hope also that this blog post provides some insight into how it is all set up, to give an idea of what is involved with the setup, explain why when you change your password, it takes ten minutes before the wireless controller in San Francisco notices the change, etc.

The next steps:

Overall, I think our LDAP infrastructure is now infinitely better than it was. It is set up to scale now and it is easy to make documented version controlled changes. I’ve put in a lot of time over the past 6 months planning this out, learning LDAP, and it seems that I’ve been eating, sleeping and breathing OpenLDAP for a while now. However, I still don’t consider myself an expert on the subject and am continually learning. I think there are still a lot of improvements that can be made to the infrastructure and what we have now is pretty much a better scaled version of what we had. The basic configuration directives are the same though. There is likely some more tuning that can be done for better performance, more reliable replication and stability. These are all things I want to pursue, but with production services that are so integral to our infrastructure, it is crucial to take things one step at a time.

Special thanks to Rob Tucker, Fred Wenzel, Corey Shields, Phong Tran, Dumitru Gherman, Pete Fritchman, Michael Coates, Guillaume Destuynder, Adam Newman and the rest of the IT team for helping make this happen.

– Jabba

 

[1] http://www.openldap.org/

[2] http://puppetlabs.com/

[3] http://www.zeus.com/

New Etherpad!

About 3 hours ago we updated the Mozilla Etherpad installation to a more current version. This has been in the works for almost a full year, and has finally come to fruition.

https://etherpad.mozilla.org/

Here’s a short list of the cool features we’re getting with this upgrade:

  • More ports work. You can still connect to the old http://etherpad.mozilla.org:9000/ link, but the :9000 isn’t necessary anymore. The standard port 80 works just as well.
  • SSL Support! However you access the site, you’ll get redirected to https://etherpad.mozilla.org/. This is very cool, especially for the “SSL Everywhere” folks. :)
  • “Team Site” functionality. This is huge, and easily the biggest new feature… people have been asking for something like this for quite a while. Now Mozilla is a pretty open organization, but the reality is there are still some things that can’t be publicly discussed right away. A good example is domain name registrations… people have a habit of swiping them out from under us if we discuss a new domain name before it’s registered.
  • Team Site Pads can be public or private, and can even have their own password, just for that one pad. Let’s say you’ve got a team pad, and you need to let someone not on your team access it… but only that one pad, not any others. Simply make it a public pad, but set a password on it. Your team members can still access it, and now anyone you give the shared password to can also!
  • Team Site Pads can be deleted! This is a common request due to accidental information leaks (passwords, etc). Sadly this doesn’t extend to purely public sites, but it’s still a nice step forward.
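The access rules for Team Site pads described in the list above can be sketched roughly as follows. This is a reading of the description, not Etherpad's actual code; the `Pad` class and `can_access` function are hypothetical, and it assumes team members never need the per-pad password.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pad:
    public: bool                    # public vs. private team pad
    password: Optional[str] = None  # optional per-pad password


def can_access(pad: Pad, is_team_member: bool,
               supplied_password: Optional[str] = None) -> bool:
    """Decide access following the rules described in the list above."""
    if is_team_member:
        return True   # team members can always reach their team's pads
    if not pad.public:
        return False  # private pads are team-only
    # Public pad: open to all unless a password is set.
    return pad.password is None or supplied_password == pad.password


shared = Pad(public=True, password="s3cret")
print(can_access(shared, is_team_member=False, supplied_password="s3cret"))  # True
print(can_access(shared, is_team_member=False))                              # False
print(can_access(Pad(public=False), is_team_member=False))                   # False
```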

Within a couple hours of migrating to this (and on a Friday at 5pm), and despite a bug in the confirmation email preventing it from “just working”, we had 8 different team sites created for various groups… from apps to UX, jetpack to infra. I suspect we’ll see some cross-functional and community team sites eventually as well.

Sadly, there are some bugs still to be worked out, especially in the area of SSL certificates. I’ve created a wiki page, mostly dealing with features and bugs associated with the upgrade: https://wiki.mozilla.org/Etherpad. Feel free to add to it!

 

On a side note: there’s been talk recently about Etherpad Lite, and it’s definitely something we’re considering. We didn’t go with it this time because 1) most of this work was already done by the time we knew about that (this has been in the works a long time), and 2) Etherpad Lite lacks some of the functionality we’re getting here… specifically the Team Sites. It’s in their TODO list though, so I wouldn’t be surprised if we’re on Lite in the future.

 

Let us know how the new system works for you! We’d love to get some feedback on it.

 

- Jake

Working with IT: Bug submissions

During a recent Mozilla all-hands event Laura Thomson held a short presentation, titled “Working with IT”.  Laura was the right person to give it and the feedback that we have gathered is that we need to help people understand how to work with IT, and help you all understand how our infrastructure works.  Expect more brownbags and posts around this topic.

So, let’s start by talking about bugs.

Whether or not bugzilla is the right tool to track and manage IT projects and requests is up for debate. The benefit to using bugzilla is that it integrates with the rest of the project, since it is used for everything else at Mozilla.  That said, let’s talk about how IT works with bugs and how you can help us when you file bugs:

Please do not assume tribal knowledge in bugs.

In the past 30 days, the IT Systems and Ops teams have grown by 5 new sysadmins.  This is great, and they are all ramping up quickly. While we are throwing out numbers, we have seen 119 new bugs added to the Server Operations component in the past 7 days.  We want our new guys to help out in these bugs as quickly as they can.  When submitting a bug, please assume as little tribal knowledge as possible on the other side.  For instance, asking for a setting change in a production site without telling us which site you work on delays the bug while someone either asks for clarification or has to ask the team what you mean.  These are minor delays of course, but when this happens multiple times a day this becomes very inefficient.  If you have a doc to link to giving background on the request you are making, please do it.  If you know the system you are asking for a change on, please make note of it.

Where does my bug go?

The IT team is growing quickly, as is the need to sort our bugs into components lest we spin our wheels all working from one component.  Here is the layout of our components for bugs coming from you as it stands today (note the change in Web Operations):

  • Server Operations: Web Operations – this is where all web related bugs should go.  This is new, and is modified from the old “web content push” component to encompass web server problems, new web projects, and any general request regarding the serving of our websites.
  • Server Operations: Desktop Issues – this is where the desktop team currently works. Laptop issues, software license requests, and help with the office environment should all go here.
  • Server Operations: RelEng – Any issues regarding the release engineering build systems (aka “the build network”) should go here.
  • Server Operations: Netops – Network requests and issues should be filed here
  • Server Operations: Labs – Mozilla Labs IT requests go in here
  • Server Operations: ACL Request – Firewall requests for Netops
  • Server Operations – Everything else that did not fall into one of the above.

Priority and escalation

The default priority for our bugs is “normal”.  We will get to these as soon as we can, and by nature of your request we assume that you want them done as soon as possible.  If this is a request that does not fall under that assumption and you want it to fall under the “nice to have someday” category, mark it as an enhancement. Anything higher than normal demands attention soon.  Our SLA for addressing bugs higher than normal is as follows:

  • Major – 24 hours
  • Critical – 8 hours
  • Blocker – immediately

These timers work around the clock, and if a bug sits unaddressed beyond those times, our oncall is paged. Blocker IT bugs will page oncall immediately.  We cannot guarantee that the request will be resolved within this time (for example, if you file a critical bug for a new cluster of servers, it will take us time to procure them first), but we will have admins aware of it and start working on it.  In addition, we have our own internal prioritization of issues that come in.  If a critical bug in a dev site comes in, that may have to wait for work that we are doing on a production site.
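The paging rule can be expressed as a short sketch. This is not the actual Bugzilla or paging integration; the function names and the lowercase severity strings are assumptions, and only the time limits come from the SLA list above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Time a bug may sit unaddressed before oncall is paged.
SLA = {
    "blocker": timedelta(0),         # pages immediately
    "critical": timedelta(hours=8),
    "major": timedelta(hours=24),
}


def page_deadline(severity: str, filed_at: datetime) -> Optional[datetime]:
    """Return when oncall would be paged, or None for severities without an SLA."""
    limit = SLA.get(severity)
    return None if limit is None else filed_at + limit


def should_page(severity: str, filed_at: datetime, now: datetime, addressed: bool) -> bool:
    """True if the bug is still unaddressed once its SLA timer has run out."""
    deadline = page_deadline(severity, filed_at)
    if deadline is None or addressed:
        return False
    return now >= deadline


filed = datetime(2011, 10, 1, 9, 0)
print(should_page("critical", filed, filed + timedelta(hours=7), addressed=False))  # False
print(should_page("critical", filed, filed + timedelta(hours=8), addressed=False))  # True
print(should_page("blocker", filed, filed, addressed=False))                        # True
```

The timers run on wall-clock time, so a critical bug filed late on Friday would page early on Saturday.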

That was a lot to read..

And if you are still with me, thanks for taking the time to understand how we work in bugzilla. By getting bugs filed more efficiently we can spend less of our time refining the bugs and more time fixing them.

IT: General update

A lot has been going on lately in IT, and we haven’t had a chance to make a post about it. I thought it was about time to get out an update about the team and what we’ve been working on.

Personnel changes:

For starters, the IT team has grown quite a bit since I started, and that was only back in March. There’s plenty of new folks I haven’t met in person yet, and also some departures. There’s some training time of course, but our new hires are all smart folks, and are picking up the ropes very well. Some are digging in to old projects that have been back-burnered for a while, and others are diving in to current issues.

At the same time, it’s baby fever in IT! Just a month ago my wife and I had our first, our daughter Zoey. Jeremy is out on leave right now, with his new daughter Mira. He’s also a new parent. Rob will be taking leave any day now for *his* new daughter… just as soon as she’s born. He’s not a new parent, but I don’t expect the first couple weeks will be much easier on him than Jeremy or myself. Finally, Justin (jabba) is on track to have their first baby in late February.

 

Major Projects:

Something I’ve worked on myself: the www.mozilla.org and www.mozilla.com site merge! This is now largely completed, and we’re just in cleanup mode. There’s quite a major follow-up project to redesign this site into a nice Django / Playdoh Python app, instead of the old-and-crufty PHP that it is right now. I was the main IT lead on this, but it wouldn’t have been possible without a lot of work from webdev… especially James Long, Anthony Ricaud, and Fred Wenzel. All I did was deploy their code a few times and tweak some Apache configs… they had to make the two code bases actually play nicely together. :)

AMO has been moved to Phoenix! Huge project, and the IT credit goes to Jeremy Orem. On a related note, thanks to his efforts the AMO webdev team is now actually able to do their own code pushes, generally without any significant involvement from IT.

There’s been a lot of work on other new clusters in Phoenix… off the top of my head, there is a new Engagement cluster designed to host short-run sites (glow and twitterparty would have been here, webifyme is here, etc). There is also a new Generic cluster, designed to replace the existing one in SJC. It’s got a good bit more horsepower, as well as being based on RHEL6 instead of 5. Props to Corey Shields for leading both of those cluster rollouts.

We have started a pattern of rolling out “admin” nodes with each cluster. The admin nodes are responsible for pushing new content, running cronjobs, and generally managing the cluster as a whole. In the past we’ve centralized these things onto just a couple admin nodes, doing things for all of our clusters. This works, but gets convoluted fast and doesn’t scale as well as we’d like. So far the new Generic, Engagement, and Addons clusters in Phoenix are set up this way, with more to come. A lot of people have been involved with this, from the puppet modules to the servers themselves.

The puppet training from a couple months ago has come in very handy, and a good number of our classes and modules have been reworked. I’m already noticing it “feels” easier to find things now… not sure if it’s just me getting a handle on our puppet deployment or if it’s actually better, but either way the change is very good. I want to say Justin Dow is largely responsible for this, but honestly so many people have been committing to our puppet repo that it’s hard to keep up with.

Lots of work has gone into our internal inventory system. It is now actually possible to control DHCP allocations from within the inventory system, as well as to define your own key/value pairs for systems in it. This is pretty huge, and there are plans to expand this further, so that inventory becomes more and more of a single source of truth for our infrastructure. This is almost entirely due to the efforts of Rob Tucker.
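
To give a flavor of what key/value pairs in inventory make possible, here is a rough Python sketch that renders DHCP host entries from them. The key names (`dhcp.mac`, `dhcp.ip`), the hostnames, and the `render_dhcp_hosts` helper are all invented for illustration; they are not the inventory system’s real schema or API. The output follows the `host` declaration format used by ISC dhcpd.

```python
def render_dhcp_hosts(systems):
    """Render ISC dhcpd-style host entries from inventory key/value pairs."""
    entries = []
    for hostname in sorted(systems):
        kv = systems[hostname]
        # Skip systems that lack the (hypothetical) DHCP keys.
        if "dhcp.mac" not in kv or "dhcp.ip" not in kv:
            continue
        entries.append(
            "host %s {\n"
            "  hardware ethernet %s;\n"
            "  fixed-address %s;\n"
            "}" % (hostname, kv["dhcp.mac"], kv["dhcp.ip"])
        )
    return "\n".join(entries)


# Made-up inventory data: one system with DHCP keys, one without.
inventory = {
    "web1": {"dhcp.mac": "00:16:3e:00:00:01", "dhcp.ip": "10.0.0.11"},
    "db1": {"rack": "r101"},
}
print(render_dhcp_hosts(inventory))
```

The point of the sketch is the direction of flow: the inventory holds the data, and the DHCP configuration is generated from it rather than edited by hand.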

The Mozilla Developer Network has finally been upgraded to a much newer version of MindTouch. This has been on the plate since at least April, when I took it over, and I believe quite a while before that. Some of the MDN guys are referring to it as the most stable and fastest MDN they’ve had in a long, long time… thanks in large part to some good detective work by one of our technical contacts at MindTouch, Brian.

 

Upcoming projects that I can think of off the top of my head:

We are working towards merging our multiple Zeus LB clusters together, to form one super mega-cluster. This will let us improve our global load balancing capabilities, and potentially bring better global caching to a wider number of websites. No official timeline on this, but it’s on the plate… has been for a few months. In a way this is kinda like hosting our own mini-CDN.

We’re planning to move into a new datacenter very soon. Not sure how much info on this can be public, but as anyone could guess this will mean downtime for some systems, replacement for others, and generally things getting swapped around to make the migration as painless as possible. There’s an insanely complicated Gantt chart for this. Datacenter migrations are serious business, and we’ll do our best to minimize any disruptions.

 

I’m sure there’s many other IT projects I’ve forgotten… both recently completed and upcoming. Feel free to drop me a line if you know of any, and I’ll update this post. :)

 

- Jake

This week in IT: A tale of two updates..

Recently we saw instability in one of our admin nodes.  The hardware is a bit old, so it is easy to point to that as the cause.  This is an HP DL360, 4th generation (current is g7, with g8 right around the corner).  Yet, blaming the hardware should be a last resort.  Server problems are a cause and effect game.  Often the “cause” is a change on the server, and that was the case here.  We had updated to RHEL 5.6 (from 5.5) to pick up critical updates.

It is rare that an update like this will cause instability in a server.  The reason RHEL is such a popular choice in production systems these days is because of the hardware testing and vetting that goes on before they roll out releases and updates.  Yet it seemed that we fell victim to a bug.

Then as we were troubleshooting this issue on our admin node, people.mozilla.com died.  (sorry about that)  This is another DL360 g4 with recent RHEL updates.  It was now obvious that this would be a problem with all of our servers running this combination.

To make a long story short, RHEL was in fact vetted and stable on this particular hardware platform.  The difference in our case is that we were sitting on a firmware update that would have needed downtime to apply.  The combination of kernel and firmware version was a reported Red Hat bug, and the fix is to make sure both are updated.  We took emergency downtime on many hosts to perform these updates (bug 661420) before the bug was triggered elsewhere.
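
The lesson is that the risk came from a combination of versions, not from either one alone. A minimal Python sketch of how one might flag hosts in that state follows. The firmware version numbers, the host records, and the `at_risk` helper are placeholders for illustration; they are not the actual values from the Red Hat bug.

```python
def parse_version(text):
    """Turn a dotted version like '5.6' into a comparable tuple (5, 6)."""
    return tuple(int(part) for part in text.split("."))


def at_risk(host, os_with_bug="5.6", fixed_firmware="2.0"):
    """True when a host runs the updated OS on firmware older than the fix.

    The default firmware version is a placeholder for illustration only.
    """
    return (
        parse_version(host["os"]) >= parse_version(os_with_bug)
        and parse_version(host["firmware"]) < parse_version(fixed_firmware)
    )


# Made-up host records.
hosts = [
    {"name": "admin1", "os": "5.6", "firmware": "1.4"},
    {"name": "web1", "os": "5.6", "firmware": "2.0"},
    {"name": "old1", "os": "5.5", "firmware": "1.4"},
]
print([h["name"] for h in hosts if at_risk(h)])
```

Only the host with the newer OS and the stale firmware is flagged; the other two are either fully updated or have not yet picked up the OS update.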

This brings us to a common dilemma in IT: when to apply updates and what updates to apply?  OS level security updates are often non-impacting and require little to no downtime, so they are a no-brainer.  Plus, they need to be done to maintain a secure system.  On the other hand, firmware updates almost always require a reboot and some downtime to perform.  They can invoke memories of systems bricked by a failed update (which hardly happens nowadays). They rarely affect security, and as long as a system is running fine they fall under the argument of “if it is not broke, don’t fix it”.  In our case, it broke because we were not proactive in fixing a problem that sat dormant for us.

Moving forward, we will have to be more proactive about updates like these.  Downtime will need to be scheduled, and on nodes with no service redundancy (like people.mozilla.com) it will not be popular, but I hope that this illustrates the necessity.  It will be better for us to take systems down on our own terms with advance notice to the users rather than see them go down at random in the middle of the day.

That’s all for this week.  Next week, we attend a conference and talk to some geeks.

Last week in IT: Puppet Training

Here at Mozilla, we have grown to thousands of servers in a short period of time (and even more individual instances when you count virtual machines). Like most other organizations, we have to rely on tools that help the sysadmins keep their sanity at such scale.  We have picked Puppet as our tool of choice, and are still in the process of migrating older systems into our centralized management while making sure that future servers are built out “in puppet” first before they reach production.  For those of you unfamiliar with Puppet and similar management tools (like cfengine or chef), the idea is simple: you define the way a server should be configured and Puppet will make sure that it is always set up that way.  For instance, in not so simple terms you tell it that Apache should be installed with certain vhost configurations, and needs to be restarted after those configurations are put into place.  Puppet makes that happen.  Multiply that action by hundreds of webservers and Puppet shows its value.
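
For readers who think better in code, here is a toy Python model of that idea. It is not Puppet and does not use Puppet’s syntax or API; the `converge` function, the state dictionaries, and the vhost file name are all invented for illustration. It shows only the core loop: compare the desired state with the actual state, change what differs, and restart the service only if a config changed.

```python
def converge(actual, desired):
    """Bring `actual` in line with `desired`; return the actions taken."""
    actions = []
    # Install any package that should be present but is not.
    for pkg in sorted(desired["packages"] - actual["packages"]):
        actual["packages"].add(pkg)
        actions.append("install %s" % pkg)
    # Write any config file whose content differs from what we want.
    changed = False
    for path, content in sorted(desired["files"].items()):
        if actual["files"].get(path) != content:
            actual["files"][path] = content
            actions.append("write %s" % path)
            changed = True
    # Restart only when a config actually changed.
    if changed:
        actions.append("restart %s" % desired["service"])
    return actions


desired = {
    "packages": {"httpd"},
    "files": {"/etc/httpd/conf.d/example.conf": "<VirtualHost *:80>...</VirtualHost>"},
    "service": "httpd",
}
server = {"packages": set(), "files": {}}

print(converge(server, desired))  # first run makes changes
print(converge(server, desired))  # second run finds nothing to do
```

The second call returning an empty list is the property that matters: running the same definition again against a correctly configured server changes nothing, which is what makes it safe to apply continuously across hundreds of machines.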

A puppet. Courtesy of my kids.

Last week, most of the IT team spent a few days locked in the Holodeck conference room for training offered up by Puppet Labs.  Between IT and Release Engineering we filled the room with 25 people.

So what now?  We still have the daunting task of integrating our existing infrastructure and old servers into puppet.  This kind of work is easy to put on the backburner with the other projects and bugs at hand, but in the future this work pays off.  Some day those servers will need to be retired, will outgrow their current hardware capacity, or will need to be moved to a new data center.  Being able to define a server state with puppet makes all of these tasks a lot easier.  More work up front pays off in the long run.

The training brought unfamiliar admins up to speed with Puppet and gave us all new ideas for refactoring some of our current modules and manifests.  We’ll be working hard throughout the coming months to make those changes.  The end result is simple: everything in our infrastructure will be managed by Puppet.  Email, webservers, collaboration tools, development resources.  If it is on our network it needs to be centrally managed. We are making great progress toward this goal and it will lead to great payoffs in the future of Mozilla’s IT infrastructure.

That’s all for this week.  Next week: we resolve a bug!