Mozilla IT

Mozilla IT & Operations

Mozilla Scheduled Maintenance, Subversion (svn.mozilla.org) will be unavailable 01/28/2012 8pm-2am PST (0400 GMT)

We will have a scheduled maintenance window on Saturday, January 28th at 8pm-2am PST (0400 GMT). The following work will take place:

  • Migrate from San Jose to Phoenix
  • Implement fault tolerant infrastructure

During the maintenance period Subversion will be unavailable for both reading and writing. Because we’re switching to a newer version of Subversion (and changing the data store to FSFS), the data migration will require a time-consuming svnadmin dump / svnadmin load. We anticipate this step alone will take about 4 hours.
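
For the curious, the dump/load step looks roughly like the sketch below. This is an illustration rather than our actual migration script: the function name, the SVNADMIN override, and the repository paths are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch: copy a repository into a fresh FSFS-backed one via dump/load.
# SVNADMIN can be overridden for a dry run; all paths are hypothetical.
set -o pipefail   # fail if either side of the dump | load pipe fails
SVNADMIN=${SVNADMIN:-svnadmin}

migrate_repo() {
    local old_repo=$1 new_repo=$2
    # Create the empty destination repository with the FSFS backend.
    "$SVNADMIN" create --fs-type fsfs "$new_repo" || return 1
    # Stream every revision out of the old repository and into the new one.
    "$SVNADMIN" dump --quiet "$old_repo" | "$SVNADMIN" load --quiet "$new_repo"
}

# Usage (hypothetical paths):
#   migrate_repo /data/svn-old/repo /data/svn-new/repo
```

Every revision is replayed into the new repository one at a time, which is why this step dominates the maintenance window.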

 

Time: January 28th 8pm-2am PST
Scheduled downtime: 6 hours
Estimated actual downtime: 4.5 hours
Impact: All subversion related services will be unavailable including viewvc

December 2011 in IT

As you may have noticed, we’ve missed a few of our weekly updates. We’ve had a rather rough go of it lately, and it’s about time for the highlight reel. Two major events stick out in hindsight.

#1, No email for two days

First up on the plate was a major Zimbra outage. This was sparked by a RAID failure in one of our HP Storage Blades, but rapidly escalated into a data-loss situation due to what can only be described as poor backup planning.

The short version: backups were being made regularly, but were not being reliably shipped off of the server. It took a lot of effort from IT and patience from our users (that is: all of Mozilla’s paid staff… thank you!) to get back on track.

Remarkably, this had relatively minimal effect on development or release cadence- it didn’t delay the Rapid Release train or close the tree, and newsgroups were unaffected. It may have caused some features to develop more slowly than they otherwise would have, but the community and the company pulled together to get through the situation with more ease than anyone could have expected.

Fortunately, things on that front are much better now- our Zimbra infrastructure has improved, and the backup strategy has changed such that the same issue is largely negated. We’re better off now than we were 6 months ago, and we have plans to be significantly better yet in another 6 months.

#2, addons.mozilla.org & versioncheck.addons.mozilla.org

In mid-December we began having major performance issues with sites behind our Phoenix Zeus (load balancer) cluster. This cluster supports a number of production sites, including addons.mozilla.org and versioncheck.addons.mozilla.org.

The issue was caused by an unexpected and extended increase of traffic from Firefox 3.6 users upgrading to Firefox 8 (and the Addons version checks).

This also highlighted 2 architectural design limitations that we were aware of but had not expected to be a problem for some time.

This Zeus cluster sits behind a redundant pair of Juniper SRX firewalls. These protect the Zeus Linux hosts and provide a point to monitor for unwanted activity (IDS). Like any device, these firewalls are limited by the number of concurrent sessions and new sessions per second that they can handle. The additional traffic put us over the threshold, and they started to drop connections.

Moving Zeus from behind the SRX solved one problem while exposing another bottleneck. This time, we learned that the traffic for versioncheck.addons.mozilla.org was overwhelming the 1GbE interfaces on the Zeus cluster and we had to quickly spin up 10GbE Zeus nodes.

These are really just general scaling issues that we were going to need to deal with sooner or later. Unfortunately, we hit a level of scale that didn’t fit the way we had planned (a lot of these improvements were things we had planned on doing in the early part of 2012).

Among the fixes were:

  • TCP stack tuning
  • Upgrading to 10-gigabit Ethernet on the Zeus hosts
  • Routing / firewall changes (including removing the hardware firewall and switching to iptables)
  • Reducing our use of multicast VIPs, opting for multiple “normal” VIPs using DNS to send traffic to all load balancers
  • Segregating backend traffic onto separate load balancers in a different LB cluster
  • Sending traffic to other datacenters (notably: versioncheck.addons.mozilla.org, which is highly cache-able)

There’s more work still to come on this. We are currently experimenting with ScaleArc iDB, a database/SQL-aware load balancer, which would theoretically give us database query caching and query distribution. We presently do such load balancing through our main Zeus clusters, but they don’t understand any database-specific protocols… it’s just simple TCP-based proxying. A protocol-aware load balancer should provide the same performance benefits for database queries that an HTTP-caching-LB does for web content.

So, what’s the long term fix?

From these events we have drastically altered our approach to how we handle certain parts of our infrastructure. We are very excited to be working on what we are loosely dubbing the “Hyper-Critical Infrastructure” cluster, which is a completely standalone VMware / NetApp installation designed to be as autonomous as we can reasonably make it. For starters this will house Zimbra & LDAP. Sometime after that we will also likely migrate Mana – our internal documentation system based on Confluence – and intranet.mozilla.org. Other extremely critical apps are also fair game.

Zimbra

Specific to Zimbra, we’re spending some time to make sure we make it as scalable and reliable as we need it to be. VMware High Availability and NetApp storage get us freedom from most kinds of hardware failures, and Zimbra internally supports sharding to multiple servers for scale. We know Zimbra can do the job (it’s used by organizations much, much bigger than Mozilla), and this architecture will get us there.

Zeus Load Balancers

We’re continuing to replace the 1GbE Zeus cluster (HP BL460c) with 10GbE servers (HP DL360 G7) and are working on sharding our traffic into separate Zeus clusters where it makes sense to do so. We are also looking into separating database traffic away from Zeus altogether and onto a protocol-aware load balancer that can cache results.

On a higher level, we’re pushing out services like Cedexis to augment our geo- and performance-based global load balancing. This replaces another external service (3crowd), the discontinued “Zeus GLB” app, and the newer “Zeus Multi-Site Manager” app. This gives us a unified, convenient, and high-performing way to “front” Zeus and distribute traffic efficiently between multiple Zeus clusters.

It wasn’t all bad

Of course a number of good things happened in December as well:

  • BrowserID went to production
  • Ramped up Tegra capacity for Native UI & Android UI testing for Firefox Mobile
  • Our Inventory system got a nice overhaul and a migration to a new cluster
  • Lots of CDN work, including SSL CDN trials (Akamai, Highwinds) and Cedexis experimentation / implementation
  • Turn-up of a new 9-cabinet module in our PHX1 datacenter (fortunate, since it helped significantly with the load balancer issues)
  • A new ESX cluster in PHX1 (apart from the Hyper-Critical one above, which happened later)
  • Dozens of web content pushes – these are so reliable now we don’t even announce them anymore
  • BrowserID & LDAP integration work for the Mozilla Community Directory

It can be easy to forget about the wins, because in general you’re winning whenever something isn’t broken!

We’re hoping to get back on track with more frequent updates… look for more recent info very soon!

Jake

MXR Repo Changes and Additions

We’ve made a few additions and changes to the repos in MXR today, per Bugs 653424 and 675115.

The following repos have been added and are processed daily:

  • comm-aurora
  • comm-beta
  • comm-release
  • mozilla-release
  • l10n-mozilla-release

Additionally the comm-2.0 repo was added as a one-time processing job, so it is indexed and searchable now as well.

The following repos have been moved from daily processing to weekly processing. These seem to be receiving very infrequent updates (less than monthly), so I believe it will not significantly impact anyone.

  • bugzilla2.20
  • bugzilla3.0
  • bugzilla3.2
  • firefox2
  • fuel
  • incubator-central
  • mozilla1.8
  • mozilla1.8.0
  • mozilla1.9.1
  • mozilla2.0
  • l10n-mozilla1.8
  • l10n-mozilla1.8.0
  • l10n-mozilla1.9.1
  • l10n-mozilla2.0
  • tamarin-central
  • webtools
  • webtools-central

If you feel that any of these need more frequent MXR processing, please let me know about it by filing a bug with the IT Request form. You can open a bug in the main Webtools::MXR component as well, but since IT is currently managing the MXR repos it will take longer to make its way to us if you go that route. :)

Jake

Project: SCL3. What powers growth?

What drives Mozilla IT’s data center growth?

You do.

I can break down the machines in the data center into three sections (and this is a very simplistic view):

  1. Classic infrastructure. All the web sites and infrastructure to support Mozilla and Firefox.
  2. Release Engineering. All the infrastructure that builds and tests Firefox.
  3. Services. Right now mostly Firefox Sync but will soon include BrowserID & the App Store.

Each of these scales and grows at a different rate, but combined they come out to about 12kW/month.

To Tree or Not to Tree

This week, our Phoenix datacenter fell prey to a series of brief rolling outages which visibly impacted many of Mozilla’s public services.

I blame fox2mike

Generally speaking, our datacenter architectures are intentionally simple and spanning tree has served us well. However, as we have grown to meet demand, some of our more… venerable datacenters have become convoluted as new applications are shoehorned into old infrastructure.

Two weeks ago, we brought a new expansion online in Phoenix. Little did we suspect this would be the straw which broke the camel’s back. Minor spanning tree events which had previously gone unnoticed quickly escalated into very noticeable spanning tree cascades. Frustratingly, outages would often resolve themselves before netops personnel could log in to diagnose them. Cell phones vibrated at odd hours of the night. Unkind words were spoken.

Ultimately, we traced the fragility to an oversight in our spanning tree design. Although Juniper is our vendor of choice, we do rely on Cisco’s 3120 blade switch for our HP c7000 chassis. This multi-vendor network creates interesting challenges. In this case, we discovered Juniper’s VSTP mode is not entirely compatible with Cisco’s rapid-pvst mode. In JUNOS versions prior to 10.3, VSTP is unable to fully converge with rapid-pvst. For more information, see Juniper KB 18291 (Juniper support account required).

What did we learn?

  1. Be diligent about marking server trunk ports as spanning tree edge ports. Otherwise, these ports will generate topology changes when a server reboots.
  2. There’s no such thing as too much logging. Logging of spanning tree events can alert you to unexpected topology changes (See #1).
  3. Not all spanning tree protocols are created equal. Don’t blindly trust that spanning tree is doing the right thing.

How do we avoid this, moving forward?

We’re taking great pains to eliminate spanning tree entirely from our newest datacenter, SCL3. While we’re not quite ready to make the leap to a unified fabric architecture (such as Juniper’s QFabric or Cisco’s Nexus), modern multi-chassis technologies can still offer significant improvements. In our case, we’ll be deploying Juniper’s XRE line to enable virtual chassis support on our core EX8200 platform.

Juniper's XRE200

With virtual chassis at every level (core, aggregation, access), we no longer depend on spanning tree for layer 2 redundancy. Instead, we will be able to rely on a link aggregation protocol (such as LACP). This comes with several added benefits:

  • Improved utilization and load balancing of redundant links
  • Faster convergence
  • Capacity for growth
  • Not spanning tree

Once this architecture is vetted in SCL3, retrofitting PHX1 with the XRE devices will become a top priority.

Project SCL3: Containment. Hot or Cold?

The following is a guest post from Tim Guarnieri, Principal, Critical Facilities Practice at Q Builders, Inc. Tim’s been providing assistance and guidance to Mozilla in building out SCL3.

In laying out racks in a data center, you can either contain the cold air or contain the hot air (or do nothing). In this post, Tim talks about both options and why we’re using cold aisle containment.

What is “containment” and why is it important in a data center?

The containment of cold air supply or hot air return in data centers is a practice used primarily to maximize the efficiency of the cooling supply and heat rejection systems.  This helps minimize the cost of running and maintaining these systems. As this cost is being passed on to Mozilla, minimizing it as much as practicable is an important goal of this project.

Containment is also a means of ensuring that the cool air required by today’s high-density server installations is supplied in the correct volumes and at the correct temperature for these servers to operate as efficiently as possible. Since power is expensive, ensuring optimal server efficiency is also a very important goal of this project.

Why “cold aisle” containment?

The data center provider Mozilla chose, Vantage Data Centers, engineered the data center floor to deliver cold air from below and return hot air from above to the central cooling plant. Further, they provided the perforated tiles required to support a cold aisle installation. This saved Mozilla a good deal of money but, equally as important, it is the right solution for the power density Mozilla intends to have in each of its cabinets (i.e. up to 12 kW per cabinet).

Courtesy http://www.42u.com/cooling/cold-aisle-containment.htm

In high power density server installations (e.g. blade servers), it is imperative that cool air be delivered at the right volume and temperature to the gear. Cold aisle containment ensures that:

  • Hot and cold air aren’t mixed
  • Cold air temperature can be monitored and controlled
  • Cold air supply volumes can be monitored and controlled

As Mozilla’s needs grow and change over time, we can monitor and control the environment to meet the needs of not only today’s server and network infrastructure but tomorrow’s as well.

What about “hot aisle” containment?

Hot aisle containment would work best in a data center environment completely dedicated to Mozilla. In other words, containing hot air implies that the cold air supplied to the data center floor must be somehow “contained” by the boundaries of the data center itself (i.e. ceilings, walls, doors, floors, etc).

Courtesy http://www.42u.com/cooling/hot-aisle-containment.htm

In a data center environment shared with other tenants, as is the case with Mozilla’s space at Vantage (we’re sharing with two other tenants), Mozilla would not be able to monitor or control the temperature or volume of the cold air being supplied to the servers.

Further, Mozilla has no way of knowing (or influencing) the efficiency of other tenants’ designs and installations, so the ambient air temperature for the data center module serving all three tenants (we’re only separated by cage walls) is very likely to fluctuate and be higher than the temperature ranges recommended by ASHRAE [edit: ASHRAE helps define recommended data center temperature limits].

ps. Special thanks to http://www.42u.com/ for their hot & cold aisle graphics.

This week in IT: We rack some stuff

While all of the SCL3 work discussed in this blog goes on, we are still growing in other data centers as well.  As I mentioned last week we have a lot of gear going to Phoenix.  This week, 3 network engineers, 2 sysadmins, and 1 data center project manager descended on our Phoenix facility to get some gear up and running in a new pod.  Not much to say here, so this will mostly be a picture post:

New data center pod

We make this data center look goooood!

In a couple of days we fought electrical mismatches, delivery delays, and lack of sunlight to get some new servers and network gear online.  For starters we are working on a new VMWare environment which is being setup and tested now.

Switches and VMWare environment

In addition, we prepped a whole rack of blade chassis for more AMO gear, prepped power in additional racks, racked some Elastic Search gear for Socorro, and worked on out of band management for all gear involved.

MXR Improvements

Over the last few weeks we’ve been making a number of small changes to the MXR web tool (https://mxr.mozilla.org/), and it seems about time to highlight some of the bigger changes.

Background

MXR is the Mozilla Cross-Reference system. It’s basically a convenient way to search a whole lot of Mozilla code for certain things. I’m sure a proper Mozilla developer could tell you all about how awesome and terrible it is, and what it can and cannot do for you. As a sysadmin, I can tell you it’s a rather intensive, complicated set of scripts that have to handle a lot of different code. It pulls code written in many different languages, from multiple different revision control systems, and indexes it. Some code trees are processed more than once a day; the rest are done daily.

As you can imagine, this results in a lot of special cases. MXR breaks it down into basically 3 steps (conveniently split into 3 scripts) – update the tree, generate the cross-reference identifier database, and generate the searchable index.

  1. The first step (update-src.pl) is relatively straightforward. It consists basically of updating the code for whichever tree you pass it as an argument. It’s a lot of special cases, since most trees are handled slightly differently from one another, but all in all it’s about what you’d expect. It’s the equivalent of “hg pull” or “svn update”. The only trouble is some trees are not nearly that simple, and involve quite a bit more work to update properly. In fact some trees are actually nested source repos, which must be updated independently.
  2. The second step (update-xref.pl) is where most of the time is spent. The ‘genxref’ script is a giant mess of Perl regex’s to detect identifiers (function names, etc) in various programming languages. You give it a tree, and for every file in that tree it determines the file type and scans it appropriately.
  3. The third step (update-search.pl) is once again relatively simple. Searching is done via “glimpse”, so this is basically running “glimpseindex” on whichever tree you call it with.

The three steps are tied together with an overall calling shell script (update-full-onetree.sh). This is responsible for feeding each of these three scripts the appropriate arguments (the name of the tree), and reporting the overall output.
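
To make the flow concrete, here is a minimal sketch of what such a per-tree wrapper could look like. It is not the real update-full-onetree.sh: the RUNNER hook, the assumption that the steps are invoked through perl, and the choice to stop at the first failing step are all mine, made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a per-tree wrapper: run the three steps in order for one tree.
# STEPS holds the script names from the post; RUNNER is a hypothetical hook
# so that "echo" can be swapped in for a dry run.
STEPS=(update-src.pl update-xref.pl update-search.pl)
RUNNER=${RUNNER:-perl}

update_one_tree() {
    local tree=$1 step
    for step in "${STEPS[@]}"; do
        echo "[$tree] running $step"
        # Stop at the first failing step so later steps never run against
        # a half-updated tree (a design assumption of this sketch).
        if ! "$RUNNER" "./$step" "$tree"; then
            echo "[$tree] $step failed" >&2
            return 1
        fi
    done
    echo "[$tree] done"
}

# Dry run: RUNNER=echo update_one_tree mozilla-release
```

The dry run prints the three commands in order without touching any tree, which is a cheap way to check the argument passing.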

This calling script is in turn called by another calling script, which calls it for every tree. This is the final link in the chain, and this script goes in cron. There are actually 2 of these- one that cron calls every 4 hours, one that cron calls daily. They’re basically identical except for the list of trees they process.

Now that you know how it works, I can tell you about how it’s better now than it was a month ago. :)

Improvements

First things first: the daily script was no longer reporting its output. It was working, but not telling us about it. This was due to a newly-introduced “rsync” job, which was running with the verbose flag. This caused the output to be far larger than it should be, which broke the reporting. I removed the verbose flag, and all is back to normal.

With that out of the way, I could see that the “4-hour” job was taking approximately 2 hours to run, and the daily job as much as 30 hours in some cases, long enough to overlap with its own next run. That’ll typically make any sysadmin a bit squeamish, and indeed it is non-ideal. However, these jobs have no built-in parallelism, and this server has multiple CPU cores available… meaning the overlapping wasn’t normally a big problem.

At this point I wanted to improve the running time directly. Step 2 above is by far the longest-running (and hardest on CPU time), so I started there with the very nice NYTProf Perl profiler. This made it clear that most of the time was being spent in a few of the more complicated regex’s used in this script. I was able to make some very small improvements, but ultimately nothing appeared to be severely wrong- they weren’t doing anything terrible, and in fact appeared to have already been tuned by someone who is frankly better at it than I am. I did get some great suggestions on ways to improve the whole process, however, which may yet be implemented some day.

Having quickly scanned over this issue and found nothing of significance, I moved on to “update-search.pl”… the 3rd and final step. I skipped NYTProf here, because the usage is quite obviously tied up in “glimpseindex”. I was able to make a few small tweaks here, which I believe may ultimately result in faster searching. Specifically, I upgraded our indexes from the default “tiny” size to the mid-level “small” size. This roughly doubled our disk space used for indexes, but should have provided a nice boost to search performance. This also makes the indexing take longer, but it still ends up being much faster than the “step 2” above, so I consider it a good trade. Unfortunately as I mentioned, IT doesn’t have a lot of face-time with this app, so it’s hard to judge for myself just how much faster searching really is.
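
As a rough illustration of the index-size change, the sketch below assumes glimpseindex’s -o option (build a “small” index instead of the default “tiny”) and -H option (where to keep the index files). The function name, the GLIMPSEINDEX override, and the paths are hypothetical, and this is not the actual contents of update-search.pl.

```shell
#!/usr/bin/env bash
# Sketch: rebuild the glimpse index for one tree at the "small" size.
# Assumes glimpseindex's -o (small index) and -H (index directory) options;
# GLIMPSEINDEX and both paths are hypothetical.
GLIMPSEINDEX=${GLIMPSEINDEX:-glimpseindex}

index_tree() {
    local index_dir=$1 src_dir=$2
    # With no size flag glimpseindex builds the "tiny" index; -o asks for
    # the larger "small" one, trading disk space for search speed.
    "$GLIMPSEINDEX" -o -H "$index_dir" "$src_dir"
}

# Usage (hypothetical paths):
#   index_tree /data/mxr-index/mozilla-release /data/mxr-src/mozilla-release
```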

Finally, the biggest change- parallelization of MXR jobs. After some time looking around, I couldn’t see any reason why we could not process multiple trees simultaneously. So, I set out to do just that.

I wanted to start simple, just to get something in place to prove that it would work. Thus, I re-worked the 4-hour cron job into a simple loop over the list of 4-hour trees, and had the loop execute 4 tasks at once and background them, then “wait” for completion before continuing. This is extremely simple using nothing more than standard shell semantics:

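PARALLEL_JOBS=4   # batch size: how many trees to run at once
count=0
for i in $TREES; do   # $TREES is a whitespace-separated list of tree names
    echo -n "Starting $i at "; date
    # Process one tree in the background at the lowest CPU priority
    nice -n 19 ./update-full-onetree.sh -cron "$i" &
    count=$((count + 1))
    # Every PARALLEL_JOBS launches, block until the whole batch has finished
    [[ $((count % PARALLEL_JOBS)) -eq 0 ]] && wait
done
wait   # pick up the final, possibly partial, batch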

This works beautifully, for what it is. The problem is that it doesn’t always keep 4 jobs running. It starts 4 jobs, waits for all 4 to complete, then starts 4 more. This results in large gaps where we would ideally like to be running another tree, but instead just sit and wait. If the 4 jobs started at the same time all take about the same amount of time to run, it’s pretty good. But if one job takes an hour and the other 3 take only 5 minutes, you’ve got a lot of idle time. Effectively, in some cases it’s not much better than serial execution.
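
To put numbers on that idle time, here is a small self-contained simulation. It compares the batch-and-wait loop above with a scheduler that starts the next job as soon as any slot frees up. The job durations are made up for illustration, not measured from MXR, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Compare total wall-clock time for two ways of running jobs N at a time.
# Durations are in minutes and are invented for the example.

# Batch mode: start N jobs, wait for all N, repeat.
batch_makespan() {
    local n=$1; shift
    local total=0 batch_max=0 count=0 d
    for d in "$@"; do
        (( d > batch_max )) && batch_max=$d
        count=$((count + 1))
        if (( count % n == 0 )); then
            total=$((total + batch_max))   # a batch lasts as long as its slowest job
            batch_max=0
        fi
    done
    echo $((total + batch_max))            # add the final partial batch, if any
}

# Pool mode: keep N slots busy; each job goes to the slot that frees up first.
pool_makespan() {
    local n=$1; shift
    local -a slot
    local i d best finish=0
    for (( i = 0; i < n; i++ )); do slot[i]=0; done
    for d in "$@"; do
        best=0
        for (( i = 1; i < n; i++ )); do
            (( slot[i] < slot[best] )) && best=$i
        done
        slot[best]=$(( slot[best] + d ))
    done
    for (( i = 0; i < n; i++ )); do
        (( slot[i] > finish )) && finish=${slot[i]}
    done
    echo "$finish"
}

durations=(60 5 5 5 60 5 5 5)
echo "batch: $(batch_makespan 4 "${durations[@]}") min"   # batch: 120 min
echo "pool:  $(pool_makespan 4 "${durations[@]}") min"    # pool:  65 min
```

With two slow trees that land in different batches, batch mode pays for each of them in full, while the pool overlaps the second slow tree with the first.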

Still, this was enough to bring the running time on the 4-hour job down from 2 hours to about 45 minutes. A very nice win. I knew I would have to come back to this later.

You may notice this doesn’t actually change the amount of work being done, it just crams it all into a smaller amount of time. While that’s absolutely correct, I believe it still results in a performance win for searching- these update jobs are rather disk-intensive, and tend to obliterate the cache. By grouping them up, we effectively obliterate the cache for a shorter amount of time, meaning the *rest* of the time should benefit from somewhat less turbulent disk caching. Basically, longer periods of “good” and shorter periods of “bad”.

What I really wanted to do though was to actually do less work during an update cycle. One simple way to get this is to detect whether anything has changed since the last execution, and if nothing has, to skip as much as possible.

Clearly, we still need to do step 1, or we won’t actually know if anything has changed. Once that’s out of the way we can make a determination. Theoretically it should be possible to do this very efficiently… *if* the tree’s revision control system will tell you that. Unfortunately, this is a brick wall for a generic solution. I could write up specific checks for each individual tree, but I really wanted to have something more easily maintainable- I already knew it was a pain to do this, simply by working on the script for step 1. In the end, I settled for a simple “find” command, to look for any files modified more recently than the last time the tree was processed. It’s a lot more overhead, but “works every time”. The overhead is actually negligible compared to the runtime of step 2 and 3 anyway, so it actually ends up not mattering too much.
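
The check itself can be very small. The following is a sketch of the idea, not the code that runs on the MXR server: the function name and the stamp-file convention are assumptions, and -quit requires GNU or BSD find.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a tree needs reprocessing by comparing file
# modification times against a timestamp file left by the previous run.
# The function name and stamp-file location are hypothetical.

tree_changed() {
    local tree=$1 stamp=$2
    # No stamp file means the tree has never been processed: treat as changed.
    [[ -e $stamp ]] || return 0
    # -newer matches files modified after the stamp; -quit (GNU/BSD find)
    # stops at the first hit, so a changed tree is detected quickly.
    [[ -n $(find "$tree" -type f -newer "$stamp" -print -quit) ]]
}

# Intended use, after step 1 has updated the checkout:
#   if tree_changed "$tree_dir" "$stamp_file"; then
#       ...run steps 2 and 3...
#       touch "$stamp_file"
#   fi
```

An unchanged tree still costs a full directory walk, but find only reads metadata, never file contents.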

This results in a significant reduction in IOPS and CPU cycles consumed. It should also cause far less cache destruction, as most of the files are never actually read. Of course there are still some caching problems caused by this “find” command, but all in all it should be a vast reduction from actually reading each and every file.

With this in place, I was really starting to feel the inefficiency from the simple shell-based parallelization scheme. When a tree doesn’t get changed, its runtime gets very short- but this is a worst-case for that algorithm, so it devolves almost back into serial execution! Of course the overall runtime is still pretty good (30-45min, depending on which trees have or haven’t changed), but the efficiency is getting worse- there’s more dead time in between jobs. So I took the next step.

GNU Parallel to the Rescue

If you haven’t heard of GNU Parallel, I highly recommend it. There is an exceptionally good tutorial video on using parallel, here. Suffice it to say, I rewrote the above loop to look like this:

echo "$TREES" | parallel -j+0 'echo -n "{}: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron {}; echo -n "{}: Ending " && date'

I’ll admit it’s not nearly as pretty to look at, but it has one very nice improvement- it will make sure that there is always the right number of jobs executing. If a job finishes, it will start the next one right away.

With this in place, the 4-hour job is now reduced to a maximum of 30 minutes… and a minimum of eight. Compare that to the original time of about 2 hours, every time, and you can easily see this is a massive improvement.

I’m making the same set of changes to the ‘daily’ job, and although I expect some great things, it unfortunately has one significant weakness: one of the trees alone takes about 17 hours to process, and gets pretty constant usage. So while this will largely eliminate the situation where this job can ‘overrun’ itself, and will let the vast majority of the trees complete very quickly, the overall job will still be bounded by this one tree.

Future Improvements

There are quite a few places where MXR could be improved further, but unfortunately much of it may be beyond my capabilities and I’ll need to seek some outside assistance.

  • Update step 2 to be multi-threaded internally. This will speed up jobs like the 17-hour monster.
  • Update step 2 to intelligently skip files within a tree if they haven’t changed since last processed. This would cut down on the per-job runtime significantly. As you can imagine, most files don’t change every day- changes are concentrated into only a very small percentage of files. If we can quickly skip the others, we can save a whole lot of wasted effort. The trick will be to do this without relying on a revision-control system, since there’s so little uniformity in that area.
  • Update step 3 to be more intelligent about generating indexes. This will take some playing with “glimpseindex”, but the short version is I believe we are discarding the previous index every time and starting from scratch. Obviously this is non-ideal.
  • Move MXR to a better home. It should be quite possible for the web app to live on a separate machine from the backend processing, which should help with caching. For that matter the web frontend could likely be a normal load-balanced cluster, and gain improved query time through horizontal scaling. Even the processing could be split up across multiple machines (and GNU parallel in fact makes this incredibly straightforward).

New Trees Soon!

Lastly, there are at least 2 trees I’m looking very closely at adding in the very near future. I don’t want to spoil the surprise, but both have been requested more than once, and have very good reasons for being added. Having dug into MXR quite a bit recently, I’m in a much better position now to do this than in the past. Look for an update on this soon. :)

Project: SCL3

(The short version is over here. This is the long version.)

Quick Introduction

We’re building out a new 1MW data center in Santa Clara, CA at Vantage Data Centers!

Project: SCL3 is our latest data center build out. Ever since we started talking about this internally we knew we wanted to blog about it, to talk about how we’re doing it, why we’re doing it, and why we’re doing it this way or that way.

Be open about it.

This is just the first of many blog posts. Through this blog we’ll talk about this process. We’ll share with you what drove us here, why we picked the rack layout we did, why we provisioned the power we did.

We’ll look towards you, too, to give input on some of the directions we’ve taken. Some of you are better experts than we are.

What is “SCL3”?

We use a three letter code followed by a number to name our facilities, both Mozilla Spaces and data centers.  This code seems reasonably mnemonic and doesn’t collide with too many three letter airport/IATA codes.  You’ll see us use these codes when we talk about hardware or servers at various locations.

We have three data centers in the Bay Area and are building a fourth:

  1. SJC1: 55 So. Market Street, Market Post Tower (MPT)
  2. SCL1: Internap, Santa Clara
  3. SCL2: Layer42, Santa Clara
  4. SCL3: Vantage, Santa Clara

We also have others, such as PHX1 and AMS1.

Quick Background

Back in July 2006 Mozilla moved from a small collection of ten racks to its first data center at 55 So. Market Street, San Jose.

Within the next year Mozilla IT opened a small presence in Amsterdam and followed it up with a presence in Beijing (which, coincidentally, we’re expanding in Q4). In 2010 Mozilla IT opened up a location in Phoenix at i/o Data Centers.

Towards the end of 2009 we knew we were going to eventually run out of provisioned power in San Jose. Throughout 2010 and into 2011 Mozilla IT picked up additional data center space in the Bay Area with Layer42 & Internap to accommodate our continued growth.

In the Bay Area alone we’re consuming about 500kW. When we shopped around for what eventually became the Phoenix location, we knew that 500kW was around the tipping point to transition from retail to wholesale data center space.

As we started 2011, we started to think about “what do we do next?”. What do we need so we can move quickly, so we can handle the next generation of problems?

We knew:

  • We need power. Data centers are really about providing power (and cooling it). We’ll run out of power before space.
  • Operationally, managing three locations in the same area is harder. We want to consolidate.
  • The infrastructure, all the web servers that support Mozilla, keeps growing.
  • We hate downtime

Mozilla IT spent most of Q1/Q2 2011 searching around the Bay Area.  Sometime over the summer we decided to partner with Vantage Data Centers and have been extremely busy since then in lease negotiations and in getting started on the build-out of the space.

Over the next several weeks, we’ll talk about some of the designs, some of the reasons we needed more space/power and show pictures & video of the space.

ps. You’ve already joined the Mozilla Community Directory @ https://mozillians.org/ right?  Read this blog post if you forgot why it’s important!