<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Justin's Blog &#187; Infrastructure</title>
	<atom:link href="http://blog.mozilla.com/justin/category/infrastructure/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/justin</link>
	<description>Mozilla engineering operations...in brief</description>
	<lastBuildDate>Fri, 25 Jul 2008 03:15:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Build storage issues &#8211; resolved!</title>
		<link>http://blog.mozilla.com/justin/2008/06/16/build-storage-issues-resolved/</link>
		<comments>http://blog.mozilla.com/justin/2008/06/16/build-storage-issues-resolved/#comments</comments>
		<pubDate>Mon, 16 Jun 2008 13:52:44 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Colo]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Mozilla]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/?p=21</guid>
		<description><![CDATA[This is a very technical and detailed debrief.  For those who want the short version &#8211; it&#8217;s fixed    Other people, read on.
&#8211;
As many of you already know &#8211; we had some pretty serious issues over the past weeks with the storage system that supports the build/unit test environment.  We have [...]]]></description>
			<content:encoded><![CDATA[<p>This is a very technical and detailed debrief.  For those who want the short version &#8211; it&#8217;s fixed <img src='http://blog.mozilla.com/justin/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />   Other people, read on.</p>
<p>&#8211;</p>
<p>As many of you already know &#8211; we had some pretty serious issues over the past weeks with the storage system that supports the build/unit test environment.  We have resolved the issues and wanted to give everyone a rundown of what we found, what we have done to resolve it and what open tasks are left.</p>
<p>The issue manifested itself in a few ways.  We saw slow transfers, SCSI aborts, reservation failures and VM guest-level corruption.  This started as a very rare occurrence and over time became more and more frequent, to the point that we could not keep a small number of I/O-intensive VMs up for an hour and had trouble getting them off.  We started troubleshooting the issue a few weeks ago, and finally came to a total resolution early this week.  Here is a summary of the issues, how they came to be and how we resolved them:</p>
<p>*  http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&amp;Display=226424 (A filer may exhibit poor performance due to WAFL holding on to too many network<br />
buffers and not releasing them in a timely fashion.)<br />
To fix this, we had to do an upgrade to 7.2.4 &#8211; that has been completed.</p>
<p>*  NetApp LUNs were created with the wrong LUN type.<br />
This was caused by an error in the LUN creation workflow, which left the LUN type at the default value (Solaris).  3 out of 4 LUNs were of type Solaris, so blocks were not written efficiently to disk (the 4k VMWare blocks were written at an offset from the true disk geometry).  Reads from the LUN triggered many read-aheads and at times overwhelmed the filer because of the inefficient layout on disk.  To remedy this we migrated the data off, re-created all of the LUNs and migrated the data back.</p>
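<p>To make the alignment problem concrete, here is a rough sketch of the arithmetic.  It assumes a 4096-byte back-end block and uses two hypothetical partition start offsets (63 and 64 sectors of 512 bytes); neither offset comes from our actual setup, they are only there to show the effect.</p>

```python
# Sketch only: FILER_BLOCK and both offsets are illustrative assumptions.
FILER_BLOCK = 4096  # assumed back-end block size in bytes


def backend_blocks_touched(offset_bytes, io_size=4096):
    """Count how many back-end blocks a single guest I/O spans."""
    first = offset_bytes // FILER_BLOCK
    last = (offset_bytes + io_size - 1) // FILER_BLOCK
    return last - first + 1


aligned = backend_blocks_touched(64 * 512)     # starts on a 4k boundary
misaligned = backend_blocks_touched(63 * 512)  # starts 512 bytes short of one

print(aligned, misaligned)
```

<p>With the misaligned offset every 4k guest block straddles two back-end blocks, so each guest I/O costs roughly double the work on the filer.</p>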
<p>* NetApp igroups were set to the wrong type.<br />
Initially NetApp advised that the linux igroup type (the igroup is what maps a LUN to various hosts) was OK for use with VMWare.  This was incorrect, causing improper SCSI reservations and iSCSI timeouts.  NetApp is updating their internal documentation to reflect this change.</p>
<p>* Network setup issues<br />
The initial setup guidance from NetApp advised us to set up the network in a specific configuration (one link to each upstream switch with a virtual interface bonding them).  After further investigation, I found this is *not* the best practice and was in fact causing issues with dead HBA paths.  To correct this temporarily, we disabled one of the links, leaving single uplinks (still with redundant heads).</p>
<p>All of these issues created major performance degradation and block level access/corruption problems.  They have all been resolved at this point.  We still need to adjust the network interfaces to be more redundant.  </p>
<p>Special thanks to the release engineering team, who have been *incredibly* patient with us as we worked through this.  I know how frustrating it was, and they kept a smile (well, kind of) through the situation &#8211; it really helped us keep pushing forward to a solution.  Thanks also to mrz for the amazing amount of work he put into this&#8230;very dedicated to finding a solution no matter what time it was.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2008/06/16/build-storage-issues-resolved/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Network outage report &#8211; 3/18/08, 8:01pm PDT &#8211; 9:25 pm PDT</title>
		<link>http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/</link>
		<comments>http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/#comments</comments>
		<pubDate>Thu, 20 Mar 2008 23:39:56 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Colo]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Mozilla]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/</guid>
		<description><![CDATA[We had a network outage at our San Jose datacenter tonight from 8:01 pm PDT until 9:25 pm PDT on March 18.  From initial investigation, it appears that one of the switches in a blade server chassis had a software issue, causing a network-wide broadcast storm.  Overall effect was that the switching fabric [...]]]></description>
			<content:encoded><![CDATA[<p>We had a network outage at our San Jose datacenter tonight from 8:01 pm PDT until 9:25 pm PDT on March 18.  From initial investigation, it appears that one of the switches in a blade server chassis had a software issue, causing a network-wide broadcast storm.  Overall effect was that the switching fabric for our San Jose datacenter was unusable.</p>
<p>To mitigate this issue going forward, we have made two changes.</p>
<ul>
<li> Modified the port-channels connecting the core switches to downstream switches to better handle a port-channel member failure.</li>
<li> Further tuned broadcast storm protection on every switch port to limit the amount of broadcast &amp; multicast traffic any one device is allowed to send.</li>
</ul>
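<p>As a rough illustration of what per-port storm protection does, here is a minimal sketch.  The function name, the per-interval frame budget and the numbers are all hypothetical; real switches implement this in hardware and the vendor configuration looks nothing like this.</p>

```python
# Sketch only: storm_control and its budget are illustrative, not vendor behaviour.
def storm_control(frames_seen, budget):
    """Return (forwarded, dropped) broadcast frames for one port in one interval."""
    forwarded = min(frames_seen, budget)
    dropped = frames_seen - forwarded
    return forwarded, dropped


normal_port = storm_control(50, 1000)       # ordinary chatter passes untouched
storming_port = storm_control(250000, 1000)  # a misbehaving device is capped

print(normal_port, storming_port)
```

<p>The point is that a single misbehaving device can only push its own budget of broadcast traffic into the fabric, rather than all of it.</p>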
<p>Furthermore, we have a priority case open with the vendor to determine the cause of the issue, as we did capture debug logs.  This was in no way related to the scheduled downtime we were in; it just happened to coincide.  We apologize for any inconvenience this may have caused.  We&#8217;ll continue to follow up with the vendor to make sure this issue does not happen again.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Call out for Mirrors</title>
		<link>http://blog.mozilla.com/justin/2008/02/19/call-out-for-mirrors/</link>
		<comments>http://blog.mozilla.com/justin/2008/02/19/call-out-for-mirrors/#comments</comments>
		<pubDate>Wed, 20 Feb 2008 05:27:48 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Mozilla]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2008/02/19/call-out-for-mirrors/</guid>
		<description><![CDATA[One of Mozilla&#8217;s biggest assets is our mirror network. It allows us to update over 100 million users in under 48 hours with security updates, host and push extensions, and much more &#8211; all with donated server space and bandwidth, giving us the ability to focus our efforts on supporting the development community and making all [...]]]></description>
			<content:encoded><![CDATA[<p>One of Mozilla&#8217;s biggest assets is our mirror network. It allows us to update over 100 million users in under 48 hours with security updates, host and push extensions, and much more &#8211; all with donated server space and bandwidth, giving us the ability to focus our efforts on supporting the development community and making all the Mozilla products as reliable, secure and feature-rich as possible.</p>
<p>We&#8217;d like to build up our mirror network to be even stronger! I am making a call to the community to help us find other mirror sources. Already Paul Vixie from the <a href="http://www.isc.org">Internet Systems Consortium</a> has stepped up and donated 3 Gb/s of peak mirror capacity (!). Details on what is required can be found here: <a href="http://www.mozilla.org/mirroring.html">http://www.mozilla.org/mirroring.html</a>. While we are always happy to take any mirror donation, we are specifically looking for mirrors which can handle in excess of 100 Mb/s during peak traffic times. Please contact me directly if you have any ideas of people/organizations/companies that might be willing to donate either bandwidth or mirror space.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2008/02/19/call-out-for-mirrors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open source for the OpenVPN win</title>
		<link>http://blog.mozilla.com/justin/2007/10/13/open-source-for-the-openvpn-win/</link>
		<comments>http://blog.mozilla.com/justin/2007/10/13/open-source-for-the-openvpn-win/#comments</comments>
		<pubDate>Sat, 13 Oct 2007 18:54:36 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Infrastructure]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2007/10/13/open-source-for-the-openvpn-win/</guid>
		<description><![CDATA[I was reminded of the power of open source software yet again this weekend.  A little background:
We here at Mozilla are big fans of OpenVPN.  When we rebuilt our datacenter, we did a large search for the right VPN solution.  Mozilla&#8217;s requirements were somewhat specific:
*  Had to work with all three [...]]]></description>
			<content:encoded><![CDATA[<p>I was reminded of the power of open source software yet again this weekend.  A little background:</p>
<p>We here at Mozilla are big fans of <a href="http://openvpn.net/">OpenVPN</a>.  When we rebuilt our datacenter, we did a large search for the right VPN solution.  Mozilla&#8217;s requirements were somewhat specific:</p>
<p>*  Had to work with all three platforms (mac, linux, windows)<br />
*  Needed to work with our <a href="http://www.openldap.org/">LDAP infrastructure</a> (i.e. not AD)<br />
*  Needed to work through NAT<br />
*  We needed to be able to give each user granular per-host access<br />
*  We wanted a solution that would allow just Mozilla traffic to traverse the VPN rather than forcing all traffic through the VPN</p>
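<p>The last requirement (often called split tunneling) boils down to a routing decision per destination.  Here is a minimal sketch of that decision; the network ranges are made up for illustration, and this is not how OpenVPN itself implements it (there it is normally a matter of which routes the client installs).</p>

```python
import ipaddress

# Hypothetical internal ranges; these are NOT Mozilla's real networks.
VPN_NETWORKS = [
    ipaddress.ip_network("10.8.0.0/16"),
    ipaddress.ip_network("192.168.50.0/24"),
]


def uses_vpn(destination):
    """Return True if traffic to this address should go through the tunnel."""
    address = ipaddress.ip_address(destination)
    return any(address in network for network in VPN_NETWORKS)


print(uses_vpn("10.8.3.7"))       # internal host: goes through the VPN
print(uses_vpn("93.184.216.34"))  # everything else: normal default route
```

<p>Only traffic that matches one of the internal ranges enters the tunnel; everything else keeps using the user&#8217;s normal connection.</p>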
<p>We looked at many options, most of which were <a href="http://www.cisco.com/en/US/products/sw/secursw/ps2308/">commercial</a> <a href="http://www.juniper.net/products_and_services/ssl_vpn_secure_access/index.html">closed-source</a> <a href="http://www.fortinet.com/solutions/vpn.html">solutions</a> (given the lack of options).  A client-less, SSL-based solution would have been ideal, but it was clear Firefox (!) and Mac support was not ready.  We decided on OpenVPN as it met all of our requirements and had the added benefit of being open source and free!</p>
<p>We&#8217;ve been happily using OpenVPN with <a href="http://www.tunnelblick.net/">TunnelBlick</a> as our Mac client.  Justdave even created a custom installer for our users (pretty slick Dave <img src='http://blog.mozilla.com/justin/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  ).  But along comes Leopard &#8211; with changes such that the low-level network drivers don&#8217;t function anymore (along with other issues in the GUI).  With some research, mrz found that the <a href="http://www-user.rhrk.uni-kl.de/~nissler/tuntap/">OS X tuntap development team</a> had just released new drivers which support Leopard.  Still, OpenVPN wouldn&#8217;t connect, TunnelBlick wouldn&#8217;t run, etc, so this weekend I set out to fix the issues.  After 3-4 hours of figuring out how the TunnelBlick build setup works, fixing some bugs and adding in the new drivers, I have a working version of TunnelBlick, OpenVPN and tuntap drivers on Leopard.</p>
<p>What&#8217;s the point of this rant?  I could have *never* fixed this with a closed source VPN client.  I&#8217;d be hamstrung by Cisco (yes, Cisco John) or some other network vendor while they gave me the normal story that Mac is not a large enough platform to dedicate resources to (never mind that 90+% of Mozilla engineers use Mac hardware).  Being able to look at the source, build system and composition of each of these apps made it possible to figure out what the issue was, fix it, and post this build for anyone else who needs it.</p>
<p>Makes me remember why what we do here at Mozilla is so important.  So, if you need a Leopard version of TunnelBlick (with tuntap drivers and openvpn 2.0.9 with lzo support), <a href="http://people.mozilla.com/~justin/Tunnelblick-Leopard-3.0b5.dmg">here</a> you go.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2007/10/13/open-source-for-the-openvpn-win/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>go go gadget funnelcake</title>
		<link>http://blog.mozilla.com/justin/2007/10/04/go-go-gadget-funnelcake/</link>
		<comments>http://blog.mozilla.com/justin/2007/10/04/go-go-gadget-funnelcake/#comments</comments>
		<pubDate>Thu, 04 Oct 2007 19:00:36 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Infrastructure]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2007/10/04/go-go-gadget-funnelcake/</guid>
		<description><![CDATA[We recently ran an experiment, code named funnelcake (see polvi&#8217;s blog post for more details) &#8211; this was an interesting project from IT&#8217;s perspective for a few reasons.  
First a little background &#8211; for one 24 hour period, we would need to serve *all* en-US and de downloads which originate from our website &#8211; [...]]]></description>
			<content:encoded><![CDATA[<p>We recently ran an experiment, code named funnelcake (see <a href="http://blog.mozilla.com/metrics/2007/09/12/the-funnel/">polvi&#8217;s blog post</a> for more details) &#8211; this was an interesting project from IT&#8217;s perspective for a few reasons.  </p>
<p>First a little background &#8211; for one 24 hour period, we would need to serve *all* en-US and de downloads which originate from our website &#8211; not a small number.  We estimate ~500k downloads a day overall, with a large percentage being en-US and de.  Why would we want to host the downloads when we have an excellent mirror network setup, happily serving up our bits?  We were interested in gathering statistics on how many people started, aborted or completed the downloads.  We could do some of this by adding an FTP server of our own into bouncer, but it is much more interesting to get an idea of the behavior by seeing *all* the traffic.  Also, we can correlate the logs later to number of active users and website behavior.  Plus 24 hours won&#8217;t kill my 95th percentile bandwidth bills <img src='http://blog.mozilla.com/justin/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Second, seeing all of the traffic allows us to get a great view of the diversity, amount and frequency of downloads.  As you&#8217;ll see below, it was quite an increase in our normal traffic.</p>
<p>Third, it&#8217;s a great way to stress test our infrastructure, verifying we don&#8217;t have any unexpected bottlenecks or performance issues.  The good news here is the systems passed with flying colors.</p>
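<p>For anyone wondering about the 95th percentile remark above, here is a quick sketch of why a single day of extra traffic does not move the bill.  It assumes 5-minute samples over a 30-day month and made-up rates of 100 and 500 Mb/s; real billing details vary by provider.</p>

```python
# Sketch only: sample interval, month length and rates are assumptions.
def percentile_95(samples):
    """Return the sample at the 95th percentile (the top 5% are discarded)."""
    ordered = sorted(samples)
    index = (len(ordered) * 95 + 99) // 100 - 1  # ceil(0.95 * n) - 1
    return ordered[index]


SAMPLES_PER_DAY = 24 * 12  # one sample every 5 minutes

one_day_burst = [100] * (SAMPLES_PER_DAY * 29) + [500] * SAMPLES_PER_DAY
two_day_burst = [100] * (SAMPLES_PER_DAY * 28) + [500] * (SAMPLES_PER_DAY * 2)

print(percentile_95(one_day_burst))  # burst fits inside the discarded 5%
print(percentile_95(two_day_burst))  # burst no longer fits
```

<p>With these assumptions the top 5% of a month is 36 hours of samples, so a 24 hour burst is thrown away entirely while a 48 hour one sets the billed rate.</p>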
<p>Our setup was pretty simple &#8211; we built out three download servers with the archive.mozilla.org nfs share mounted.  Slapped apache on them, added them to bouncer and we were off to the races.  Here are the traffic graphs (you can probably tell when we switched things over):</p>
<p><img src="http://people.mozilla.com/~justin/total.png"></p>
<p>Furthermore, Apache really impressed me.  The servers were pushing upwards of 80mbs each off nfs, with a load of&#8230; 0.00 and cpu hovering around 5%.  We sometimes got the occasional 0.10 spike, but all in all, pretty amazing.  Graphs from one of the machines:</p>
<p><img src="http://people.mozilla.com/~justin/dm-download04-traffic.png"><br />
<img src="http://people.mozilla.com/~justin/dm-download04-cpu.png"></p>
<p>All in all, I was very happy with the lack of impact on the systems and the continued good performance.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2007/10/04/go-go-gadget-funnelcake/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yes sir, may I have another (update)?</title>
		<link>http://blog.mozilla.com/justin/2007/06/29/yes-sir-may-i-have-another-update/</link>
		<comments>http://blog.mozilla.com/justin/2007/06/29/yes-sir-may-i-have-another-update/#comments</comments>
		<pubDate>Fri, 29 Jun 2007 12:56:42 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Infrastructure]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2007/06/29/yes-sir-may-i-have-another-update/</guid>
		<description><![CDATA[As many of you may know, we released a major update to offer 1.5.0.12 -&#62; 2.0.0.4.  This is significant to the infrastructure for a few reasons.  First off, all of these updates will be *full* updates, i.e. full browser downloads &#8211; no 300k mar file.  That puts a large load on our [...]]]></description>
			<content:encoded><![CDATA[<p>As many of you may know, we released a major update to offer 1.5.0.12 -&gt; 2.0.0.4.  This is significant to the infrastructure for a few reasons.  First off, all of these updates will be *full* updates, i.e. full browser downloads &#8211; no 300k mar file.  That puts a large load on our mirror network (nothing they can&#8217;t handle <img src='http://blog.mozilla.com/justin/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  ).  Second, as people update from 1.5.0.12 -&gt; 2.0.0.4, we anticipate people will need to update addons for compatibility reasons.</p>
<p>We released at 3pm PDT yesterday &#8211; here are some of the stats so far:</p>
<p>*  Just under 1 million people have been updated from 1.5.0.12 -&gt; 2.0.0.4<br />
*  We are updating people at a rate of 20-30 per second (FF2 download rate was about 30/second)<br />
*  Mirrors are seeing about a 2-3x increase in bandwidth</p>
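<p>As a back-of-the-envelope check on those numbers, here is how long it takes to reach a million updates at the quoted rates.  This is only a sketch: it rounds &#8220;just under 1 million&#8221; up to 1,000,000 and assumes a constant rate, which real traffic certainly is not.</p>

```python
# Sketch only: assumes a flat update rate and rounds the user count up.
UPDATED_USERS = 1_000_000


def hours_to_update(users, per_second):
    """Hours needed to update this many users at a constant rate."""
    return users / per_second / 3600


slow = hours_to_update(UPDATED_USERS, 20)
fast = hours_to_update(UPDATED_USERS, 30)

print(round(slow, 1), round(fast, 1))
```

<p>That works out to somewhere between roughly 9 and 14 hours, which is in the right ballpark for a release that went out the previous afternoon.</p>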
<p>I expect to see increased load throughout the day, but so far so good!  Huge thanks to webdev for their help optimizing addons.mozilla.org &#8211; it&#8217;s been a huge win for this release.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2007/06/29/yes-sir-may-i-have-another-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>So I need Parallels to run VMWare?  What?</title>
		<link>http://blog.mozilla.com/justin/2006/07/11/so-i-need-parallels-to-run-vmware-what/</link>
		<comments>http://blog.mozilla.com/justin/2006/07/11/so-i-need-parallels-to-run-vmware-what/#comments</comments>
		<pubDate>Tue, 11 Jul 2006 15:10:58 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Infrastructure]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2006/07/11/so-i-need-parallels-to-run-vmware-what/</guid>
		<description><![CDATA[Here at Mozilla, we have been pretty happy VMWare customers (aside from some of the p2v migration hell).  We are moving quite a bit of our infrastructure to VMs and it seems to be working out well.  
Enter VMWare 3.0.
You might ask &#8220;why would you want to use the first rev of such [...]]]></description>
			<content:encoded><![CDATA[<p>Here at Mozilla, we have been pretty happy VMWare customers (aside from some of the p2v migration hell).  We are moving quite a bit of our infrastructure to VMs and it seems to be working out well.  </p>
<p>Enter VMWare 3.0.</p>
<p>You might ask &#8220;why would you want to use the first rev of such a new release&#8221;?  Good point &#8211; problem is, VMWare only supports one, that&#8217;s right <em>one</em> 4-port Ethernet card (Intel quad port).  No problem &#8211; I order our servers with the supported Intel PRO/1000 Quad Port Server Adapter.  This adapter has worked fine in other older boxes so I have no reason to think it won&#8217;t work.  Well, Intel has discontinued the &#8220;MT&#8221; version of the cards, and only sells the &#8220;GT&#8221; version of the cards.  After 5 days of tedious interaction with VMware web support, we find out why 2.5.3 won&#8217;t see the card &#8211; 2.5.3 only supports the out of date and impossible to get &#8220;MT&#8221; version of the card.</p>
<p>VMWare 2.5.3 only supports one quad port card that you can&#8217;t get, with no ETA on &#8220;GT&#8221; version support for 2.5.x &#8211; great &#8211; guess we&#8217;ll try 3.0.  So we get 3.0 installed and it recognizes the &#8220;GT&#8221; version of the card &#8211; yay.</p>
<p>Now the real fun starts &#8211; I go to the web admin console that I have used in the past to configure the host &#8211; wait &#8211; there is nothing here?!?  No network configuration, no VM config options, nothing!  It appears VMWare has moved all the configuration to a .Net, windows-only application called &#8220;Infrastructure Client&#8221;.  I use a Mac, much of my team runs Linux.  Why in the world would you take a web-based, platform independent configuration tool and move it to a .Net application?</p>
<p>So you ask why I need <a href="http://www.parallels.com/">Parallels</a> to run VMWare?  That&#8217;s why (I&#8217;m on a Mac, and need a Windows OS to configure VMWare).  And it sucks.  I think VMWare has a really great product and one that can help scale our infrastructure, but why focus the admin tools on just one OS?  Seems the whole company is built around virtualization and cross-platform support, but has taken a step backwards with this change in 3.0.  Hopefully VMWare will fix this issue by supporting other platforms in upcoming versions of Infrastructure Client.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2006/07/11/so-i-need-parallels-to-run-vmware-what/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
