<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mrz&#039;s noise &#187; networking</title>
	<atom:link href="http://blog.mozilla.com/mrz/category/networking/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/mrz</link>
	<description>noise from a mozilla IT/Operations wrangler</description>
	<lastBuildDate>Tue, 09 Feb 2010 04:41:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Phoenix to San Jose, in 18ms</title>
		<link>http://blog.mozilla.com/mrz/2010/02/08/phoenix-to-san-jose-in-18ms/</link>
		<comments>http://blog.mozilla.com/mrz/2010/02/08/phoenix-to-san-jose-in-18ms/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 04:41:30 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[networking]]></category>
		<category><![CDATA[Phoenix]]></category>
		<category><![CDATA[sanjose]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=706</guid>
		<description><![CDATA[
[root@ip-ns01 ~]# mtr www.mozilla.com --report
ip-ns01.phx.mozilla.org           Snt: 10    Loss%  Last   Avg  Best  Wrst StDev
10.8.75.1                         [...]]]></description>
			<content:encoded><![CDATA[<pre>
[root@ip-ns01 ~]# mtr www.mozilla.com --report
ip-ns01.phx.mozilla.org           Snt: 10    Loss%  Last   Avg  Best  Wrst StDev
10.8.75.1                                     0.0%   0.4   0.4   0.4   0.4   0.0
v500.core1.phx.mozilla.net                    0.0%   1.1   1.3   1.0   3.1   0.6
xe-1-1-0.border1.phx.mozilla.net              0.0%   0.7   0.7   0.7   0.8   0.0
64.124.201.177                                0.0%   1.1   1.1   1.1   1.1   0.0
ge-0-3-0.mpr3.lax9.us.above.net               0.0%   9.4  13.0   9.4  44.3  11.0
xe-0-1-0.er1.lax9.us.above.net                0.0%   9.5  13.7   9.4  51.4  13.3
xe-0-1-0.mpr1.lax12.us.above.net              0.0%  94.3  21.3   9.3  94.3  27.8
xe2-3.cr01.lax01.mzima.net                    0.0%  10.0  14.0  10.0  23.0   4.6
xe1-0.cr01.lax02.mzima.net                    0.0%  16.9  15.7  10.2  22.8   4.8
te1-3.cr02.sjc02.us.mzima.net                 0.0%  18.1  22.5  18.1  30.0   4.9
ge1-mozilla.cust.sjc02.mzima.net              0.0%  18.4  18.6  18.4  19.1   0.2
v8.core2.sj.mozilla.com                       0.0%  18.4  19.7  18.2  30.8   3.9
mozcom.acelb.sj.mozilla.com                   0.0%  18.5  18.6  18.4  19.5   0.3
</pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2010/02/08/phoenix-to-san-jose-in-18ms/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fx 3.0.7 release &amp; this morning&#8217;s network performance issues</title>
		<link>http://blog.mozilla.com/mrz/2009/03/05/fx-307-release-this-mornings-network-performance-issues/</link>
		<comments>http://blog.mozilla.com/mrz/2009/03/05/fx-307-release-this-mornings-network-performance-issues/#comments</comments>
		<pubDate>Fri, 06 Mar 2009 01:17:47 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[networking]]></category>
		<category><![CDATA[release]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=413</guid>
		<description><![CDATA[In computer systems (as in others) there are often bottlenecks, and removing those often reveals new ones.  Today&#8217;s an example of just that.
During a normal release we have tools we can use to adjust the rate at which we offer updates.  We use this to reduce load on the back end systems or [...]]]></description>
			<content:encoded><![CDATA[<p>In computer systems (as in others) there are often <a href="http://en.wikipedia.org/wiki/Bottleneck_%28engineering%29">bottlenecks</a>, and removing those often reveals new ones.  Today&#8217;s an example of just that.</p>
<p>During a normal release we have tools we can use to adjust the rate at which we offer updates.  We use this to reduce load on the back end systems or to help reduce load on the download mirrors.</p>
<p>Our preference is to do a release completely unthrottled so users get timely updates.</p>
<p>During the Firefox 3.0.6 release we had a number of system problems that prevented us from releasing updates unthrottled.  These were all detailed in the <a href="https://wiki.mozilla.org/Releases/Firefox_3.0.6/Post_Mortem#IT">Post Mortem</a>.</p>
<p>To the Operations Team&#8217;s credit (and I&#8217;m <u>serious</u> here), most of those issues were removed prior to yesterday&#8217;s Firefox 3.0.7 release and by 9am this morning we were cranking along &#8211; no throttling.</p>
<p>Unfortunately the Mirror Network started showing pressure and instead of throttling back on the release, we opted to augment the Mirror Network with our own download servers in San Jose.</p>
<p>That pushed our aggregate bandwidth out of San Jose to nearly 3Gbps:</p>
<p><a href="http://blog.mozilla.com/mrz/files/2009/03/globalbw10am.png"><img class="alignnone size-full wp-image-414" title="Global Bandwidth over 2Gbps" src="http://blog.mozilla.com/mrz/files/2009/03/globalbw10am.png" alt="Global Bandwidth over 2Gbps" width="603" height="286" /></a></p>
<p>At around this time offsite monitors started alerting about a sharp increase in page load times to various Mozilla website properties.  It took a bit to track down, but the <a href="http://blog.mozilla.com/it/2009/02/17/mozilla-scheduled-downtime-02172009-7pm-11pm-pst-0300-0700-02182009-utc/">newly turned up</a> Level 3 peer was saturated:</p>
<p><a href="http://blog.mozilla.com/mrz/files/2009/03/l3-flat.png"><img class="alignnone size-full wp-image-415" title="Level3" src="http://blog.mozilla.com/mrz/files/2009/03/l3-flat.png" alt="Level3" width="603" height="286" /></a></p>
<p>Any outbound traffic whose best route was out through Level3 was impacted.  We fixed this temporarily by turning down Level3.</p>
<p><i>(I should note that our design requirement for upstream transit is at least two connections per provider so we can push 2Gbps.  Level 3 is no exception; however, the second connection has been offline because Derek was seeing a lot of packet loss across the optical connection, which coincidentally got resolved today.)</i></p>
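<p>A rough sketch of that capacity arithmetic, assuming each upstream connection is 1GE. The function names and the 1.2Gbps offered load are hypothetical and only there to illustrate why one link saturates where two would not; they are not measurements from this morning.</p>

```python
LINK_GBPS = 1.0  # assumption: each upstream connection is 1GE


def provider_capacity_gbps(links_up, link_gbps=LINK_GBPS):
    """Usable capacity toward one provider, given how many links are up."""
    return links_up * link_gbps


def is_saturated(offered_gbps, links_up, link_gbps=LINK_GBPS):
    """True when the offered load meets or exceeds what the links can carry."""
    return offered_gbps >= provider_capacity_gbps(links_up, link_gbps)


# Design target: two links per provider. Degraded case: one link offline.
print(provider_capacity_gbps(2), provider_capacity_gbps(1))

# Hypothetical offered load of 1.2Gbps toward a single provider.
print(is_saturated(1.2, links_up=1), is_saturated(1.2, links_up=2))
```

<p>With both links up the same hypothetical load fits; with one link down it does not, which is the situation described above.</p>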
<p>These problems are solvable and we&#8217;ve had plans to put tools in place to balance load during situations like this.  Unfortunately, today&#8217;s issues came up a lot quicker than we had planned.</p>
<p>A couple things we&#8217;ll be looking at before the next release:</p>
<ol>
<li>Evaluating <a href="http://www.internap.com/internet-services/internet-access/network-performance/fcp/">Internap&#8217;s FCP</a> to dynamically shift traffic based on cost and performance metrics. (And as luck would have it, this showed up this afternoon!)</li>
<li>Looking to see how we can better balance outbound traffic outside of using FCP.</li>
<li>Adding capacity to our Mirror Network (can <a href="http://www.mozilla.org/mirroring.html">you help</a>?).</li>
<li>Evaluating options around upgrading from several 1GE upstream connections to 10GE connections.</li>
</ol>
<p>This is a great problem to have, to be sure, and a far cry from the panic three  years ago of <i>&#8220;OMG we&#8217;re about to push 100Mbps!&#8221;</i>.  </p>
<p>I&#8217;m really interested in how others have gone about solving problems like this.  Leave me comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2009/03/05/fx-307-release-this-mornings-network-performance-issues/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>More traffic analysis with NetFlow</title>
		<link>http://blog.mozilla.com/mrz/2009/02/19/more-traffic-analysis-with-netflow/</link>
		<comments>http://blog.mozilla.com/mrz/2009/02/19/more-traffic-analysis-with-netflow/#comments</comments>
		<pubDate>Thu, 19 Feb 2009 23:52:25 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=311</guid>
		<description><![CDATA[
I&#8217;ve been working on a couple capacity planning projects and have been knee deep in bandwidth metrics.  That, combined with turning up Level3 yesterday got me looking more into where that bandwidth is coming from (and Reed asked).
The first chart shows a breakdown by protocol.  Shouldn&#8217;t be any surprise that 82% of Mozilla&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.mozilla.com/mrz/files/2009/02/bandwidth-by-port.png"><img class="alignleft size-medium wp-image-322" title="Bandwidth usage by port" src="http://blog.mozilla.com/mrz/files/2009/02/bandwidth-by-port-300x180.png" alt="Bandwidth usage by port" width="300" height="180" /></a></p>
<p>I&#8217;ve been working on a couple capacity planning projects and have been knee deep in bandwidth metrics.  That, combined with turning up <a href="http://blog.mozilla.com/mrz/2009/02/18/level3-post-bgp-turn-up/">Level3 yesterday</a> got me looking more into where that bandwidth is coming from (and Reed asked).</p>
<p>The first chart shows a breakdown by protocol.  Shouldn&#8217;t be any surprise that 82% of Mozilla&#8217;s traffic is web related (SSL being the larger, which of course makes sense since it is out to destroy me).  The &#8220;Other&#8221; category was filled with services under 1%.</p>
<p><a href="http://blog.mozilla.com/mrz/files/2009/02/bandwidth-by-site.png"><img class="alignright size-medium wp-image-323" title="Bandwidth usage by site" src="http://blog.mozilla.com/mrz/files/2009/02/bandwidth-by-site-300x192.png" alt="Bandwidth usage by site" width="300" height="192" /></a><br />
I took a look at the same bandwidth data and broke it down by source.  Most of the traffic is from <a href="https://addons.mozilla.org/"><code>addons.mozilla.org</code></a> (which isn&#8217;t surprising &#8211; it alone causes all my SSL scaling headaches).</p>
<p>The other sites aren&#8217;t too surprising to me after a couple rounds of load balancer testing.  I expected and do see <code>fxfeeds.mozilla.com</code> (<a href="http://www.mozilla.com/en-US/firefox/livebookmarks.html">Firefox Live Bookmarks</a>), <code>versioncheck.addons.mozilla.org</code> and <code>services.addons.mozilla.org</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2009/02/19/more-traffic-analysis-with-netflow/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Level3, post BGP turn up</title>
		<link>http://blog.mozilla.com/mrz/2009/02/18/level3-post-bgp-turn-up/</link>
		<comments>http://blog.mozilla.com/mrz/2009/02/18/level3-post-bgp-turn-up/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 23:11:09 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=301</guid>
		<description><![CDATA[Derek turned up BGP peering with Level3 this morning out of San Jose (this was supposed to be part of last night&#8217;s maintenance but he ran into turn up problems on Level3&#8217;s side).

We&#8217;ve grown (bandwidth-wise) to the extent that having two transit providers wasn&#8217;t sufficient.   Should either fail, I wouldn&#8217;t have felt comfortable [...]]]></description>
			<content:encoded><![CDATA[<p>Derek turned up BGP peering with Level3 this morning out of San Jose (this was supposed to be part of <a href="http://blog.mozilla.com/it/2009/02/17/mozilla-scheduled-downtime-02172009-7pm-11pm-pst-0300-0700-02182009-utc/">last night&#8217;s maintenance</a> but he ran into turn up problems on Level3&#8217;s side).<br />
<a href="http://blog.mozilla.com/mrz/files/2009/02/sj-outboundbw.png"><img class="alignleft size-medium wp-image-302" title="sj-outboundbw" src="http://blog.mozilla.com/mrz/files/2009/02/sj-outboundbw-300x142.png" alt="sj-outboundbw" width="300" height="142" /></a></p>
<p>We&#8217;ve grown (bandwidth-wise) to the extent that having two transit providers wasn&#8217;t sufficient.   Should either fail, I wouldn&#8217;t have felt comfortable pushing ~800Mbps out a single provider without any backup.</p>
<p>The trick will be making sure to hit bandwidth commits across all three&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2009/02/18/level3-post-bgp-turn-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Switch IOS Upgrade Post-Mortem</title>
		<link>http://blog.mozilla.com/mrz/2009/01/12/switch-ios-upgrade-post-mortem/</link>
		<comments>http://blog.mozilla.com/mrz/2009/01/12/switch-ios-upgrade-post-mortem/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 18:00:09 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=243</guid>
		<description><![CDATA[Last night Derek attempted to upgrade IOS on core1 &#38; core2 to pick up software support for Cisco’s ACE module.
We ran into several issues and postponed the upgrade on core2 until those issues are resolved.

The switch&#8217;s Compact Flash cards weren&#8217;t formatted in a format that rommon (the low level boot loader) could read and the switch [...]]]></description>
			<content:encoded><![CDATA[<p>Last night <a href="http://blog.mozilla.com/it/2009/01/11/mozilla-scheduled-downtime-01112009-8pm-11pm-pst-0500-0700-01122009-utc/">Derek attempted to upgrade</a> IOS on <code>core1</code> &amp; <code>core2</code> to pick up software support for <a href="http://www.cisco.com/en/US/products/ps6906/index.html">Cisco’s ACE module</a>.</p>
<p>We ran into several issues and postponed the upgrade on <code>core2</code> until those issues are resolved.</p>
<ol>
<li>The switch&#8217;s Compact Flash cards weren&#8217;t in a format that <code>rommon</code> (the low-level boot loader) could read, and the switch failed to load any OS when rebooted (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=473084">bug 473084</a>).<br />
<blockquote><p><i>&lt;rant&gt;</i><br />
Unfortunately IOS didn&#8217;t flag that as an error when reading/writing to it. It didn&#8217;t even flag an error when the boot variable was set to boot off of it which is stupid because IOS clearly knew as shown in the log when I had remote-hands re-seat the card:</p>
<p><code> Jan 11 22:44:23 core2 3927937: Jan 11 22:44:21.978 PDT: %PCMCIAFS-SP-5-DIBERR: PCMCIA disk 0 is formatted from a different router or PC. A format in this router is required before an image can be booted from this device</code><br />
<i>&lt;/rant&gt;</i></p></blockquote>
</li>
<li>Two of the VMware ESX storage arrays are single-homed, connected to one switch (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=473113">bug 473113</a>).  This is fallout from <a href="http://blog.mozilla.com/justin/2008/06/16/build-storage-issues-resolved/">previous NetApp performance issues</a> that was forgotten and never addressed, and it caused a number of build VMs to go offline (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=473112">bug 473112</a>).</li>
<li>A number of non-user facing, multi-homed hosts went offline.  All of the RHEL Linux servers have an active/standby network setup.  In several cases the standby interface didn&#8217;t work or wasn&#8217;t properly configured.  This was more of an annoyance to the IT Team than to anyone else but did cause outages for some backend services (most notably the VMware <a href="http://www.vmware.com/products/vi/vc/">VC</a> server and <code>mradm01</code>, one of the Nagios servers).</li>
</ol>
<p>We&#8217;ll be addressing those issues before scheduling the remaining upgrade to <code>core2</code>.  We&#8217;ll also be looking at implementing some routine (perhaps quarterly) test of the infrastructure in a controlled environment to ensure its high &#8220;availability-ness&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2009/01/12/switch-ios-upgrade-post-mortem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Uptime</title>
		<link>http://blog.mozilla.com/mrz/2009/01/11/uptime/</link>
		<comments>http://blog.mozilla.com/mrz/2009/01/11/uptime/#comments</comments>
		<pubDate>Sun, 11 Jan 2009 20:37:02 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=233</guid>
		<description><![CDATA[I&#8217;m fairly conservative when it comes to upgrading switches.  I generally only upgrade to pick up security fixes.  It&#8217;s rare that I&#8217;ll upgrade just to stay current.
Tonight&#8217;s one of those rare exceptions.  Derek is upgrading both core1 and core2 to pick up software support for Cisco&#8217;s ACE module.  We&#8217;re planning on using [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m fairly conservative when it comes to upgrading switches.  I generally only upgrade to pick up security fixes.  It&#8217;s rare that I&#8217;ll upgrade just to stay current.</p>
<p>Tonight&#8217;s one of those rare exceptions.  <a href="http://blog.mozilla.com/it/2009/01/11/mozilla-scheduled-downtime-01112009-8pm-11pm-pst-0500-0700-01122009-utc/">Derek is upgrading</a> both <code>core1</code> and <code>core2</code> to pick up software support for <a href="http://www.cisco.com/en/US/products/ps6906/index.html">Cisco&#8217;s ACE module</a>.  We&#8217;re planning on using this in conjunction with the Zeus ZXTM cluster <a href="http://blog.mozilla.com/mrz/2008/12/04/load-balancer-performance-issues-fxfeedsmozillaorg-versioncheck/">we set up</a>.</p>
<p>A little sad though.  It&#8217;s going to reset <code>uptime</code>:</p>
<blockquote><p><code>core1#sh ver | inc uptime<br />
core1 uptime is 2 years, 28 weeks, 2 days, 20 hours, 40 minutes</code></p>
<p><code>core2#sh ver | inc uptime<br />
core2 uptime is 2 years, 28 weeks, 2 days, 20 hours, 51 minutes</code></p></blockquote>
<p>Love &#8216;em or hate &#8216;em, Cisco gear is solid.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2009/01/11/uptime/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>border2 upgrade done, one more to go</title>
		<link>http://blog.mozilla.com/mrz/2008/11/19/border2-upgrade-done-one-more-to-go/</link>
		<comments>http://blog.mozilla.com/mrz/2008/11/19/border2-upgrade-done-one-more-to-go/#comments</comments>
		<pubDate>Wed, 19 Nov 2008 18:22:21 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=160</guid>
		<description><![CDATA[Last night I completed one of the router upgrades I mentioned the other day.  One real issue I ran into was that the built-in sup-bootflash: was too small to hold the IOS image I wanted, and I wasted some amount of time deleting/squeezing sup-bootflash: and remembering the boot system syntax to boot off disk0:.
Thought I&#8217;d share some [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">Last night I completed one of the router upgrades I <a href="http://blog.mozilla.com/mrz/2008/11/17/router-upgrades-san-jose/">mentioned</a> the other day.  One real issue I ran into was that the built-in <tt>sup-bootflash:</tt> was too small to hold the IOS image I wanted, and I wasted some amount of time deleting/squeezing <tt>sup-bootflash:</tt> and remembering the <tt>boot system</tt> syntax to boot off <tt>disk0:</tt>.</p>
<p style="text-align: left;">Thought I&#8217;d share some before-and-after notes.</p>
<p style="text-align: left;"><a href="http://blog.mozilla.com/mrz/files/2008/11/border2-cpu-sup720.png"><img class="alignnone size-full wp-image-161" title="border2 CPU" src="http://blog.mozilla.com/mrz/files/2008/11/border2-cpu-sup720.png" alt="" width="500" height="207" /></a></p>
<p style="text-align: left;"><a href="http://blog.mozilla.com/mrz/files/2008/11/border2-mem-sup720.png"><img class="alignnone size-full wp-image-162" title="border2, memory" src="http://blog.mozilla.com/mrz/files/2008/11/border2-mem-sup720.png" alt="" width="500" height="289" /></a></p>
<p style="text-align: left;">There&#8217;s a chunk of missing time when <tt>border2</tt> was offline and I was rebuilding the config but it sure was worth it.  The only CPU spike was shortly after all my BGP peers came back up and there was the necessary <tt><em>BGP Scanner</em></tt> run.</p>
<p style="text-align: left;">A couple more before-and-after snapshots:</p>
<p>FIB Usage (look more at the %Used for IPv4 routes):</p>
<pre style="text-align: left;">border2#show platform hardware capacity | beg L3
L3 Forwarding Resources
             FIB TCAM usage:                     Total        Used       %Used
                  72 bits (IPv4, MPLS, EoM)     245760      244699        100%
                 144 bits (IP mcast, IPv6)        8192        1498         18%

                     detail:      Protocol                    Used       %Used
                                  IPv4                      244699        100%
                                  MPLS                           0          0%
                                  EoM                            0          0%
                                  IPv6                        1495         18%</pre>
<pre style="text-align: left;">border2#show platform hardware capacity | beg L3
L3 Forwarding Resources
             FIB TCAM usage:                     Total        Used       %Used
                  72 bits (IPv4, MPLS, EoM)     802816      267860         33%
                 144 bits (IP mcast, IPv6)      122880        1496          1%

                     detail:      Protocol                    Used       %Used
                                  IPv4                      267860         33%
                                  MPLS                           0          0%
                                  EoM                            0          0%
                                  IPv6                        1493          1%</pre>
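<p>To make the before-and-after difference concrete, here is a small sketch that recomputes the IPv4 <tt>%Used</tt> column and the remaining headroom from the <tt>Total</tt> and <tt>Used</tt> figures in the two outputs above. The function names are hypothetical; only the numbers come from the router.</p>

```python
def tcam_percent_used(used, total):
    """Return FIB TCAM utilization as a percentage of total entries."""
    return 100.0 * used / total


def tcam_free_entries(used, total):
    """Return how many FIB TCAM entries are still available."""
    return total - used


# IPv4 figures copied from the two "show platform hardware capacity" outputs.
snapshots = {
    "before upgrade": (244699, 245760),
    "after upgrade": (267860, 802816),
}

for label, (used, total) in snapshots.items():
    percent = tcam_percent_used(used, total)
    free = tcam_free_entries(used, total)
    print(f"{label}: {percent:.1f}% used, {free} entries free")
```

<p>Rounded to whole numbers these match the 100% and 33% that IOS reports.</p>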
<p>Maximum Routes (changed this with <tt>mls cef maximum-routes ip 768</tt> and <tt>mls cef maximum-routes mpls 1</tt> &#8211; I remain hopeful IPv6 will take off):</p>
<pre style="text-align: left;">border2#show mls cef max
FIB TCAM maximum routes :
=======================
Current :-
-------
 IPv4                - 239k
 MPLS                - 1k (default)
 IPv6 + IP Multicast - 8k (default)

border2#show mls cef max
FIB TCAM maximum routes :
=======================
Current :-
-------
IPv4                - 768k
MPLS                - 1k
IPv6 + IP Multicast - 120k (default)</pre>
<p><tt>border1</tt> gets upgraded Thursday night.  Hopefully last night wasn&#8217;t disruptive for anyone.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2008/11/19/border2-upgrade-done-one-more-to-go/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Router upgrades, San Jose</title>
		<link>http://blog.mozilla.com/mrz/2008/11/17/router-upgrades-san-jose/</link>
		<comments>http://blog.mozilla.com/mrz/2008/11/17/router-upgrades-san-jose/#comments</comments>
		<pubDate>Mon, 17 Nov 2008 18:50:26 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=150</guid>
		<description><![CDATA[A couple months ago I mentioned how things have grown in the past two years at Mozilla.  Back then we barely pushed any traffic to the Internet and survived on fewer than a dozen app servers.
Things have changed.  I&#8217;ll highlight just a couple of them:

Active Firefox users grew from roughly 20 million users to over [...]]]></description>
			<content:encoded><![CDATA[<p>A couple months ago I <a href="http://blog.mozilla.com/mrz/2008/09/04/i-mozilla-need-a-network-engineer/">mentioned how things have grown</a> in the past two years at Mozilla.  Back then we barely pushed any traffic to the Internet and survived on fewer than a dozen app servers.</p>
<p>Things have changed.  I&#8217;ll highlight just a couple of them:</p>
<ul>
<li>Active Firefox users grew from roughly 20 million users to over 70 million</li>
<li>Mozilla&#8217;s outbound traffic has grown from ~150Mbps to well over 800Mbps (and over 1.5Gbps during release periods)</li>
<li>BGP routes on the Internet <a href="http://bgp.potaroo.net/">have grown from something around 200k to more than 250k</a></li>
</ul>
<p>That last bullet point brings us to today.</p>
<p>The two BGP-speaking routers in San Jose both have Sup32s (the &#8220;CPU&#8221; of the router) and they have a limit to the maximum number of routes they can hold in their FIB TCAM (&#8220;route lookup table&#8221;).  Routes that can&#8217;t fit in the FIB TCAM end up being forwarded in software at the cost of CPU.  The more traffic we push, the higher the CPU tends to run, and lately it&#8217;s been running close to the point of uncomfortable.</p>
<p>I&#8217;m routinely getting alert emails:</p>
<blockquote><p><tt>border1.sj.mozilla.com five minute load average 62% exceeds 60%<br />
border2.sj.mozilla.com five minute load average 83% exceeds 60%</tt></p></blockquote>
<p><a href="http://blog.mozilla.com/mrz/files/2008/11/border2-cpu-2yrs.png"><img class="size-medium wp-image-151 alignleft" title="CPU usage, 2 yrs" src="http://blog.mozilla.com/mrz/files/2008/11/border2-cpu-2yrs-300x124.png" alt="" width="300" height="124" /></a></p>
<p>And from trend graphs, it&#8217;s quite obvious.</p>
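<p>The alert text above boils down to a simple threshold check. A minimal sketch, assuming the monitor compares a five minute load average against a fixed 60% threshold; the function name is hypothetical and this is not the actual monitoring tool&#8217;s code.</p>

```python
THRESHOLD_PERCENT = 60  # threshold quoted in the alert emails


def cpu_alert(host, load_percent, threshold=THRESHOLD_PERCENT):
    """Return an alert string when the load exceeds the threshold, else None."""
    if load_percent > threshold:
        return f"{host} five minute load average {load_percent}% exceeds {threshold}%"
    return None


# Sample values taken from the two alert emails quoted above.
samples = {"border1.sj.mozilla.com": 62, "border2.sj.mozilla.com": 83}
for host, load in samples.items():
    message = cpu_alert(host, load)
    if message is not None:
        print(message)
```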
<p>I will be upgrading the Sup32s this week to Sup720-3BXLs.  I plan on doing one Tuesday and the other Thursday.  For the most part, this should be non-user impacting.  Most of the headache is going to be in the backend, moving router interfaces around, moving <a href="http://www.cacti.net/">cacti</a> graphs around and updating aggregate graphs.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2008/11/17/router-upgrades-san-jose/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cisco wireless problems, multicast failures</title>
		<link>http://blog.mozilla.com/mrz/2008/10/20/cisco-wireless-problems-multicast-failures/</link>
		<comments>http://blog.mozilla.com/mrz/2008/10/20/cisco-wireless-problems-multicast-failures/#comments</comments>
		<pubDate>Mon, 20 Oct 2008 23:07:50 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=120</guid>
		<description><![CDATA[This post is written entirely out of frustration.  For what seems like months I&#8217;ve been on-and-off troubleshooting wireless connectivity issues with Cisco.
I&#8217;ll give a little background first.

At Mozilla&#8217;s main campus I&#8217;m using a Cisco 3845 ISR with two NM-WLC Wireless LAN controllers and have a total of 9 APs covering two buildings.
I broadcast two SSIDs [...]]]></description>
			<content:encoded><![CDATA[<p>This post is written entirely out of frustration.  For what seems like months I&#8217;ve been on-and-off troubleshooting wireless connectivity issues with Cisco.</p>
<p>I&#8217;ll give a little background first.</p>
<p><a href="http://blog.mozilla.com/mrz/files/2008/10/cisco-wifi.png"><img class="alignleft size-medium wp-image-121" title="cisco-wifi" src="http://blog.mozilla.com/mrz/files/2008/10/cisco-wifi-300x131.png" alt="" width="300" height="131" /></a><br />
At Mozilla&#8217;s main campus I&#8217;m using a Cisco 3845 ISR with two NM-WLC Wireless LAN controllers and have a total of 9 APs covering two buildings.</p>
<p>I broadcast two SSIDs &#8211; a guest one and a WPA/WPA2 Enterprise one.  Both wireless networks are bridged through the ISR onto the appropriate wired network through a BVI.</p>
<h1><em><strong>Problem #1</strong></em></h1>
<p>My first issue was mostly around client authentication.  Mozilla has a heavy percentage of Mac users and most had some sort of issue authenticating.  This problem became worse when the MacBook Airs came out and with some of the new gen MacBook Pros.  None of the Airs could authenticate and a large number of the Pros started failing.  And not a single iPhone could authenticate.</p>
<p>Cisco&#8217;s default response was to:</p>
<ol>
<li>Update my wireless drivers on OSX</li>
<li>Update the firmware on the WLC</li>
</ol>
<p>#1 is impossible, #2 I did and no fix.  Finally after a month of pushing and two days of bringing in Aruba gear to prove to Cisco it wasn&#8217;t an OSX issue, Cisco found a solution.  The default EAP timeout was set to one second with a one second retry.  You had one second to type your password correctly and you had one chance to retry it.  Changing both of those to something more reasonable resolved most of the issues for Airs, Pros and iPhones.</p>
<p>(I don&#8217;t believe this was well documented &#8211; it&#8217;s not exposed through the webui WLC interface either and took TAC a long time to come up with this recommendation.  Look for <code>config advanced eap identity-request-timeout</code> &amp; <code>config advanced eap identity-request-retries</code>.)</p>
<h1><em><strong>Problem #2</strong></em></h1>
<p>The second problem is more involved and has been a problem since day one but hasn&#8217;t really been end-user affecting.  Most users will notice that wired users cannot see wireless users&#8217; iTunes libraries (and vice versa).</p>
<p>That&#8217;s just a symptom of the problem. Anything that relies on mDNS/Bonjour fails to work between wired and wireless users, including finding network-based Time Machine servers.</p>
<p>This manifested itself again when certain users couldn&#8217;t sync their <a href="http://www.culturedcode.com/things/">Things</a> content with their iPhone.  In troubleshooting, we (Justin) noticed that it used multicast to try to find devices to sync with.</p>
<p>I&#8217;ve narrowed down the problem to the following:</p>
<ol>
<li>multicast traffic is not forwarded intra-WLC or inter-WLC</li>
<li>multicast traffic is not bridged out the BVI</li>
</ol>
<p>From a wired host I ran:</p>
<blockquote><p><code>tcpdump -n ip multicast and ether host 00:17:f2:09:d8:ea</code></p></blockquote>
<p>and am unable to see any multicast data from my wireless host (it&#8217;s entirely possible that I don&#8217;t understand mDNS or how to use <code>tcpdump</code> well enough to troubleshoot this either).  As best as I can tell, the WLC is configured to process multicast:</p>
<pre style="padding-left: 30px;">(BS-WLC01) &gt;show network sum

RF-Network Name............................. mozilla
Web Mode.................................... Disable
Secure Web Mode............................. Enable
Secure Web Mode Cipher-Option High.......... Disable
Secure Shell (ssh).......................... Enable
Telnet...................................... Disable
Ethernet Multicast Mode..................... Enable   Mode: Mcast  239.0.1.2
Ethernet Broadcast Mode..................... Enable</pre>
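<p>As a rough cross-check on the <code>tcpdump</code> test above, a small listener can join the mDNS multicast group (224.0.0.251, UDP port 5353) on the wired host and report which senders it hears. This is only a sketch, assuming Python 3 is available on the wired host. The function names are made up for illustration, and it filters by source IP address rather than by MAC address as the <code>tcpdump</code> filter does.</p>

```python
import socket

MDNS_GROUP = "224.0.0.251"  # well-known IPv4 multicast group for mDNS
MDNS_PORT = 5353            # well-known UDP port for mDNS


def membership_request(group, interface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq value expected by IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(interface_ip)


def open_mdns_listener(interface_ip="0.0.0.0"):
    """Return a UDP socket joined to the mDNS group on the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Allow sharing port 5353 with a local mDNS responder, if one is running.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("", MDNS_PORT))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(MDNS_GROUP, interface_ip))
    return sock


def watch(wireless_ip=None, packets=10):
    """Print the source of mDNS packets, optionally from one sender only."""
    sock = open_mdns_listener()
    seen = 0
    while seen != packets:
        data, (src_ip, _port) = sock.recvfrom(9000)
        if wireless_ip is None or src_ip == wireless_ip:
            seen += 1
            print(f"{seen}: {len(data)} bytes of mDNS from {src_ip}")
```

<p>Calling <code>watch(wireless_ip="10.0.0.42")</code> on the wired host (the address is a placeholder for the wireless client) should print a line for each mDNS packet that makes it across. If nothing shows up while the wireless client is browsing for services, that would be consistent with the multicast traffic never leaving the WLC.</p>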
<p>Cisco appears to have no clue on this either.  The last response from TAC on this was:</p>
<blockquote><p>I checked our query and found no response as of this time. I researched and found no similar devices in combination related to the matter. Be assured that I will make necessary follow-up and will provide you an update as soon as I receive a reply.</p></blockquote>
<p>This worked without problems when I had that Aruba hardware for a couple of days, so I know this is not an OS X client issue &#8211; I was able to stream from my iTunes library on my MacBook (wireless, on Aruba) to my wired desktop.</p>
<p>Cisco, why is this so hard to get working?!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2008/10/20/cisco-wireless-problems-multicast-failures/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>I (Mozilla) need a Network Engineer</title>
		<link>http://blog.mozilla.com/mrz/2008/09/04/i-mozilla-need-a-network-engineer/</link>
		<comments>http://blog.mozilla.com/mrz/2008/09/04/i-mozilla-need-a-network-engineer/#comments</comments>
		<pubDate>Thu, 04 Sep 2008 18:49:53 +0000</pubDate>
		<dc:creator>mrz</dc:creator>
				<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[networking]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/mrz/?p=100</guid>
		<description><![CDATA[When I started at Mozilla two years ago the biggest challenge was handling a release and pushing very close to 100Mbps.  Right around that point the firewalls would fall over.  We had one data center and essentially one provider and I could count the number of app servers on my two hands.
That was two years [...]]]></description>
			<content:encoded><![CDATA[<p>When I started at Mozilla two years ago the biggest challenge was handling a release and pushing very close to 100Mbps.  Right around that point the firewalls would fall over.  We had one data center and essentially one provider and I could count the number of app servers on my two hands.</p>
<p>That was two years ago.  Today&#8217;s steady state is around 600Mbps and it&#8217;s not uncommon to push closer to 1.5Gbps during a release.  We have a growing global presence with four data centers and have enough redundancy built in that it wasn&#8217;t any problem to lose one provider this past weekend (well, it was a problem but it wasn&#8217;t user impacting).  </p>
<p>The environment has grown from a bunch of switches strung together with a mess of cables to an orderly mess of cables and a switching infrastructure that&#8217;s allowed us to be more nimble and do more complicated things more easily, often without ever having to physically visit the data center(s).  And there are something like 51 app servers.</p>
<p>The network has grown and I need help.  I&#8217;m looking for someone to join the team and continue to grow and support Mozilla&#8217;s network and systems infrastructure (job description is after the jump).</p>
<p>If you&#8217;re ready for the challenge and opportunity to serve a community of 200 million Firefox users, send an email to <i>careers at mozilla dot com</i>!</p>
<p><span id="more-100"></span></p>
<h2><strong>Network Engineer Job Description</strong></h2>
<p>As a member of our IT team, you will assume a pivotal role in creating the company&#8217;s core high-volume systems and network infrastructure and participate in key design decisions. You will be expected to come up to speed quickly to meet technical goals and challenges and share a leadership role in a hard-working and collaborative team. We have high expectations and are looking for a seasoned professional with experience in a wide range of areas. Your time will be split between pure networking and assisting with Mozilla&#8217;s growing Linux &amp; ESX infrastructure.</p>
<p>The network environment consists of Cisco and HP switches and routers, Citrix  Netscaler load balancers, and Cisco and Juniper firewalls.</p>
<h3>Requirements:</h3>
<p>You must be self-motivated, capable of managing your time well, and work  efficiently without close supervision. You place a high value on secure, highly  available, fault-tolerant systems. You are proactive in identifying and  resolving technical challenges, enthusiastically troubleshoot problems when  they occur, and thrive as a collaborative team player.  Key duties include:</p>
<ul>
<li> Provide support in the operation of Mozilla&#8217;s growing global network infrastructure.</li>
<li> Support Mozilla&#8217;s corporate network and remote offices.</li>
<li> Monitor system stability and performance.</li>
<li> Ensure 24&#215;7 operations.</li>
<li> Act as an externally-facing point of contact to facilitate handling of problem reports, and maintain relations with network peers and vendors.</li>
<li> Act as an internally-facing point of contact to escalate technical issues, and communicate network status.</li>
</ul>
<h3>Job Skill Requirements:</h3>
<ul>
<li> Bachelor’s degree in a technical discipline (or equivalent work in an IT-related field).</li>
<li> 3+ years of experience with enterprise/IT level network infrastructure and/or ISP network operations center/tier 1-2 support.</li>
<li> In-depth knowledge of TCP/IP fundamentals (including Layers 2-7 content switching).</li>
<li> Creative problem solving abilities.</li>
<li> Network routing protocol (OSPF/BGP) knowledge.</li>
<li> Network certifications such as CCNP/JNCIA/JNCIP (or equivalent training/experience) preferred but not required.</li>
<li> Strong knowledge of datacenter design and layout.</li>
<li> Ability to document and update processes.</li>
<li> Experience with standard network change management and configuration policies.</li>
<li> Experience with Unix/Linux administration is required.</li>
<li> Experience and flexibility regarding on-call responsibilities.</li>
<li> Understanding of web application tiers (app, database, caching).</li>
<li> Experience with scalability issues in both the network and application layers.</li>
</ul>
<h3>Additional Skills Strongly Desired:</h3>
<ul>
<li> Strong experience with OS deployment and automation</li>
<li> Scripting/tools ability is a plus (Perl, Python or PHP)</li>
<li> Familiarity with Cisco 6500, FWSM, Citrix Netscalers &amp; Load Balancers</li>
<li> Familiarity with iSCSI SANs</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/mrz/2008/09/04/i-mozilla-need-a-network-engineer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
