<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Justin's Blog &#187; Colo</title>
	<atom:link href="http://blog.mozilla.com/justin/category/colo/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/justin</link>
	<description>Mozilla engineering operations...in brief</description>
	<lastBuildDate>Fri, 25 Jul 2008 03:15:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Build storage issues &#8211; resolved!</title>
		<link>http://blog.mozilla.com/justin/2008/06/16/build-storage-issues-resolved/</link>
		<comments>http://blog.mozilla.com/justin/2008/06/16/build-storage-issues-resolved/#comments</comments>
		<pubDate>Mon, 16 Jun 2008 13:52:44 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Colo]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Mozilla]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/?p=21</guid>
		<description><![CDATA[This is a very technical and detailed debrief.  For those who want the short version &#8211; it&#8217;s fixed    Other people, read on.
&#8211;
As many of you already know &#8211; we had some pretty serious issues over the past weeks with the storage system that supports the build/unit test environment.  We have [...]]]></description>
			<content:encoded><![CDATA[<p>This is a very technical and detailed debrief.  For those who want the short version &#8211; it&#8217;s fixed <img src='http://blog.mozilla.com/justin/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />   Other people, read on.</p>
<p>&#8211;</p>
<p>As many of you already know &#8211; we had some pretty serious issues over the past weeks with the storage system that supports the build/unit test environment.  We have resolved the issues and wanted to give everyone a rundown of what we found, what we have done to resolve it and what open tasks are left.</p>
<p>The issue manifested itself in a few ways.  We saw slow transfers, SCSI aborts, reservation failures and VM guest-level corruption.  This started as a very rare occurrence and over time became more and more frequent, to the point that we could not keep a small number of I/O-intensive VMs up for an hour and had trouble getting them off.  We started troubleshooting the issue a few weeks ago, and finally came to a total resolution early this week.  Here is a summary of the issues, how they came to be and how we resolved them:</p>
<p>*  http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&amp;Display=226424 (A filer may exhibit poor performance due to WAFL holding on to too many network buffers and not releasing them in a timely fashion.)<br />
To fix this, we had to upgrade to 7.2.4; that upgrade has been completed.</p>
<p>*  NetApp LUNs were created with the wrong LUN type.<br />
This was caused by an error in the LUN creation workflow that left the LUN type set to the default value (Solaris).  Three of the four LUNs were of type Solaris, so blocks were not written efficiently to disk (the 4k VMware blocks were written at an offset to the true disk geometry).  Reading from such a LUN would cause many read-aheads and at times overwhelm the filer due to the inefficient on-disk layout.  To remedy this, we migrated the data off, re-created all of the LUNs and migrated the data back.</p>
<p>* NetApp igroups set to the wrong type.<br />
Initially NetApp advised that the linux igroup type (the igroup is what maps the LUN to various hosts) was OK for use with VMware.  This was incorrect, and it caused improper SCSI reservations and iSCSI timeouts.  NetApp is updating their internal documentation to reflect this change.</p>
<p>* Network setup issues<br />
NetApp's initial setup guidance advised us to set up the network in a specific configuration (one link to each upstream switch with a virtual interface bonding them).  After further investigation, I found this is *not* the best practice and was in fact causing issues with dead HBA paths.  To correct this temporarily, we disabled one of the links, leaving single uplinks (still with redundant heads).</p>
<p>All of these issues created major performance degradation and block level access/corruption problems.  They have all been resolved at this point.  We still need to adjust the network interfaces to be more redundant.  </p>
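<p>To illustrate the LUN alignment issue described above, here is a minimal Python sketch (not NetApp or VMware tooling).  It assumes a 4 KiB backend block size and a hypothetical 512-byte starting offset for the misaligned case; the function name and the offset value are made up for illustration.  It counts how many backend blocks each 4k guest block touches.</p>
<pre>
```python
BLOCK_SIZE = 4096  # 4 KiB, matching the "4k" blocks mentioned in the post


def backend_blocks_touched(guest_block, start_offset, block_size=BLOCK_SIZE):
    """Count the backend blocks covered by one guest block of the same size.

    start_offset is where the guest's block 0 begins on the LUN, in bytes.
    An offset of 0 models an aligned layout; any offset that is not a
    multiple of block_size models a misaligned one (hypothetical values).
    """
    first_byte = start_offset + guest_block * block_size
    last_byte = first_byte + block_size - 1
    return last_byte // block_size - first_byte // block_size + 1


# Total backend blocks touched when reading 1000 guest blocks one at a time.
aligned = sum(backend_blocks_touched(n, 0) for n in range(1000))
misaligned = sum(backend_blocks_touched(n, 512) for n in range(1000))
print(aligned, misaligned)
```
</pre>
<p>Under these assumptions, the aligned layout touches one backend block per guest block while the offset layout touches two, which is the kind of extra backend work that the wrong LUN type introduced.</p>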
<p>Special thanks to the release engineering team, who have been *incredibly* patient with us as we worked through this.  I know how frustrating it was, and they kept a smile (well, kind of) through the situation &#8211; it really helped us keep pushing forward to a solution.  Thanks also to mrz for the amazing amount of work he put into this&#8230;he was very dedicated to finding a solution no matter what time it was.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2008/06/16/build-storage-issues-resolved/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Network outage report &#8211; 3/18/08, 8:01pm PDT &#8211; 9:25 pm PDT</title>
		<link>http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/</link>
		<comments>http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/#comments</comments>
		<pubDate>Thu, 20 Mar 2008 23:39:56 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Colo]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Mozilla]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/</guid>
		<description><![CDATA[We had a network outage at our San Jose datacenter tonight from 8:01 pm PDT until 9:25 pm PDT on March 18.  From initial investigation, it appears that one of the switches in a blade server chassis had a software issue, causing a network-wide broadcast storm.  Overall effect was that the switching fabric [...]]]></description>
			<content:encoded><![CDATA[<p>We had a network outage at our San Jose datacenter tonight from 8:01 pm PDT until 9:25 pm PDT on March 18.  From initial investigation, it appears that one of the switches in a blade server chassis had a software issue, causing a network-wide broadcast storm.  Overall effect was that the switching fabric for our San Jose datacenter was unusable.</p>
<p>To mitigate this issue going forward, we have made two changes.</p>
<ul>
<li> Modified the port-channels connecting the core switches to downstream switches to better handle a port-channel member failure.</li>
<li> Further tuned broadcast storm protection on every switch port to limit the amount of broadcast &amp; multicast traffic any one device is allowed to send.</li>
</ul>
<p>Furthermore, because we captured debug logs, we have a priority case open with the vendor to determine the cause of the issue.  This was in no way related to the scheduled downtime we were in; it just happened to coincide.  We apologize for any inconvenience this may have caused.  We&#8217;ll continue to follow up with the vendor to make sure this issue does not happen again.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2008/03/20/network-outage-report-31808-801pm-pdt-925-pm-pdt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What a view&#8230;for a server.</title>
		<link>http://blog.mozilla.com/justin/2006/06/23/what-a-view...for-a-server./</link>
		<comments>http://blog.mozilla.com/justin/2006/06/23/what-a-view...for-a-server./#comments</comments>
		<pubDate>Fri, 23 Jun 2006 16:17:30 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[Colo]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/justin/2006/06/23/what-a-view...for-a-server./</guid>
		<description><![CDATA[We are moving colos!  Exciting, but a lot of coordination, planning and work.  The new location is in a cage with 2x the space and right of refusal on more if needed.   Furthermore we are quadrupling our bandwidth with a new network architecture &#8211; thanks Matthew!  More to come, but [...]]]></description>
			<content:encoded><![CDATA[<p>We are moving colos!  Exciting, but a lot of coordination, planning and work.  The new location is in a cage with 2x the space and right of refusal on more if needed.   Furthermore, we are quadrupling our bandwidth with a new network architecture &#8211; thanks Matthew!  More to come, but here are some initial pictures of the cage (still being built).</p>
<p><a href="http://people.mozilla.org/~justin/new_colo/">Pictures</a></p>
<p>P.S.  I wanted to thank Meer/SVColo for all their services and help over the years.  They have been great to work with, and we&#8217;ll be sad to go.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/justin/2006/06/23/what-a-view...for-a-server./feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
