<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mozilla IT &#187; General Updates</title>
	<atom:link href="http://blog.mozilla.com/it/category/general-updates/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/it</link>
	<description>Mozilla IT &#38; Operations</description>
	<lastBuildDate>Thu, 26 Jan 2012 20:19:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Mozilla Scheduled Maintenance, Subversion (svn.mozilla.org) will be unavailable 01/28/2012 8pm-2am PST (0400 GMT)</title>
		<link>http://blog.mozilla.com/it/2012/01/26/mozilla-scheduled-maintenance-subversion-svn-mozilla-org-will-be-unavailable-01282012-8pm-2am-pdt/</link>
		<comments>http://blog.mozilla.com/it/2012/01/26/mozilla-scheduled-maintenance-subversion-svn-mozilla-org-will-be-unavailable-01282012-8pm-2am-pdt/#comments</comments>
		<pubDate>Thu, 26 Jan 2012 19:47:12 +0000</pubDate>
		<dc:creator>bhourigan</dc:creator>
				<category><![CDATA[General Updates]]></category>
		<category><![CDATA[Outages]]></category>
		<category><![CDATA[Scheduled Maintenance]]></category>
		<category><![CDATA[l10n]]></category>
		<category><![CDATA[phoenix]]></category>
		<category><![CDATA[subversion]]></category>
		<category><![CDATA[svn]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1684</guid>
		<description><![CDATA[We will have a scheduled maintenance window on Saturday, January 28th at 8pm-2am PST (0400 GMT). The following work will take place: Migrate from San Jose to Phoenix Implement fault tolerant infrastructure During the maintenance period subversion will be unavailable for both reading and writing. Because we&#8217;re switching to a newer version of subversion (and&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2012/01/26/mozilla-scheduled-maintenance-subversion-svn-mozilla-org-will-be-unavailable-01282012-8pm-2am-pdt/" title="Read the rest of &#8220;Mozilla Scheduled Maintenance, Subversion (svn.mozilla.org) will be unavailable 01/28/2012 8pm-2am PST (0400 GMT)&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>We will have a scheduled maintenance window on <strong>Saturday, January 28th at 8pm-2am PST (0400 GMT)</strong>. The following work will take place:</p>
<ul>
<li>Migrate from San Jose to Phoenix</li>
<li>Implement fault tolerant infrastructure</li>
</ul>
<p>During the maintenance period subversion will be unavailable for both reading and writing. Because we&#8217;re<br />
switching to a newer version of subversion (and changing the data store to fsfs) the data migration will require a time consuming svnadmin dump / svnadmin load. We anticipate this step alone will take about 4 hours.</p>
<p>&nbsp;</p>
<p><strong>Time</strong>: January 28th 8pm-2am PST<br />
<strong>Scheduled downtime</strong>: 6 hours<br />
<strong>Estimated actual downtime</strong>: 4.5 hours<br />
<strong>Impact</strong>: All subversion related services will be unavailable <em>including</em> viewvc</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2012/01/26/mozilla-scheduled-maintenance-subversion-svn-mozilla-org-will-be-unavailable-01282012-8pm-2am-pdt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>December 2011 in IT</title>
		<link>http://blog.mozilla.com/it/2012/01/24/december-2011-in-it/</link>
		<comments>http://blog.mozilla.com/it/2012/01/24/december-2011-in-it/#comments</comments>
		<pubDate>Tue, 24 Jan 2012 23:35:32 +0000</pubDate>
		<dc:creator>jakem</dc:creator>
				<category><![CDATA[General Updates]]></category>
		<category><![CDATA[addons.mozilla.org]]></category>
		<category><![CDATA[downtime]]></category>
		<category><![CDATA[outage]]></category>
		<category><![CDATA[webops]]></category>
		<category><![CDATA[week in IT]]></category>
		<category><![CDATA[zeus]]></category>
		<category><![CDATA[zimbra]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1620</guid>
		<description><![CDATA[As you may have noticed, we&#8217;ve missed a few of our weekly updates. We&#8217;ve had a rather rough go of it lately, and it&#8217;s about time for the highlight reel. Two major events stick out in hindsight. #1, No email for two days First up on the plate was a major Zimbra outage. This was&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2012/01/24/december-2011-in-it/" title="Read the rest of &#8220;December 2011 in IT&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p><em>As you may have noticed, we&#8217;ve missed a few of our weekly updates. We&#8217;ve had a rather rough go of it lately, and it&#8217;s about time for the highlight reel. Two major events stick out in hindsight.</em></p>
<h2>#1, No email for two days</h2>
<p>First up on the plate was a major Zimbra outage. This was sparked by a RAID failure in one of our HP Storage Blades, but rapidly escalated into a data-loss situation due to what can only be described as poor backup planning.</p>
<p><em>The short version</em>: backups were being made regularly, but were not being reliably shipped off of the server. It took a lot of effort from IT and patience from our users (that is: all of Mozilla&#8217;s paid staff&#8230; thank you!) to get back on track.</p>
<p>Remarkably, this had relatively minimal effect on development or release cadence- it didn&#8217;t delay the Rapid Release train or close the tree, and newsgroups were unaffected. It may have caused some features to develop more slowly than they otherwise would have, but the community and the company pulled together to get through the situation with more ease than anyone could have expected.</p>
<p>Fortunately, things on that front are much better now- our Zimbra infrastructure has improved, and the backup strategy has changed such that the same issue is largely negated. We&#8217;re better off now than we were 6 months ago, and we have plans to be significantly better yet in another 6 months.</p>
<h2>#2, addons.mozilla.org &amp; versioncheck.addons.mozilla.org</h2>
<p>In mid-December we began having major performance issues with sites behind our Phoenix Zeus (load balancer) cluster. This cluster supports a number of production sites, including <tt>addons.mozilla.org</tt> and <tt>versioncheck.addons.mozilla.org</tt>.</p>
<p>The issue was caused by an unexpected and extended increase of traffic from Firefox 3.6 users upgrading to Firefox 8 (and the Addons version checks).</p>
<p>This also highlighted 2 architectural design limitations that we were aware of but had not expected to be a problem for some time.</p>
<p style="padding-left: 30px;">This Zeus cluster sits behind a redundant pair of Juniper SRX firewalls. These protect the Zeus Linux hosts and provide a point to monitor for unwanted activity (IDS). Like any device, these firewalls are limited by the number of concurrent sessions and new sessions per second that they can handle. The additional traffic put us over the threshold, and they started to drop connections.</p>
<p style="padding-left: 30px;">Moving Zeus from behind the SRX solved one problem while exposing another bottleneck. This time, we learned that the traffic for <tt>versioncheck.addons.mozilla.org</tt> was overwhelming the 1GbE interfaces on the Zeus cluster and we had to quickly spin up 10GbE Zeus nodes.</p>
<p>These are really just general scaling issues that we were going to need to deal with sooner or later. Unfortunately, we hit a level of scale that didn&#8217;t fit the way we had planned (a lot of these improvements were things we had planned on doing in the early part of 2012).</p>
<p>Among the fixes were:</p>
<ul>
<li>TCP stack tuning</li>
<li>Upgrading to 10-gigabit Ethernet on the Zeus hosts</li>
<li>Routing / firewall changes (including removing the hardware firewall and switching to iptables)</li>
<li>Reducing our use of multicast VIPs, opting for multiple &#8220;normal&#8221; VIPs using DNS to send traffic to all load balancers</li>
<li>Segregating backend traffic onto separate load balancers in a different LB cluster</li>
<li>Sending traffic to other datacenters (notably: <tt>versioncheck.addons.mozilla.org</tt>, which is highly cache-able)</li>
</ul>
<p>There&#8217;s more work still to come on this. We are currently experimenting with <a href="http://www.scalearc.com/">ScaleArc iDB</a>, a database/SQL aware load balancer, which would theoretically give us database query caching and query distribution. We presently do such load balancing through our main Zeus clusters, but they don&#8217;t understand any database-specific protocols&#8230; it&#8217;s just simple TCP-based proxying. A protocol-aware load balancer should provide the same performance benefits for database queries that an HTTP-caching-LB does for web content.</p>
<h2>So, what&#8217;s the long term fix?</h2>
<p>From these events we have drastically altered our approach to how we handle certain parts of our infrastructure. We are very excited to be working on what we are loosely dubbing the &#8220;Hyper-Critical Infrastructure&#8221; cluster, which is a completely standalone VMware / NetApp installation designed to be as autonomous as we can reasonably make it. For starters this will house Zimbra &amp; LDAP. Sometime after that we will also likely migrate Mana &#8211; our internal documentation system based on Confluence &#8211; and <tt>intranet.mozilla.org</tt>. Other extremely critical apps are also fair game.</p>
<h3 style="padding-left: 30px;">Zimbra</h3>
<p style="padding-left: 30px;">Specific to Zimbra, we&#8217;re spending some time to make sure we make it as scalable and reliable as we need it to be. VMware High Availability and NetApp storage get us freedom from most kinds of hardware failures, and Zimbra internally supports sharding to multiple servers for scale. We know Zimbra can do the job (it&#8217;s used by organizations much, much bigger than Mozilla), and this architecture will get us there.</p>
<h3 style="padding-left: 30px;">Zeus Load Balancers</h3>
<p style="padding-left: 30px;">We&#8217;re continuing to replace the 1GbE Zeus cluster (HP BL460c) with 10GbE servers (HP DL360 G7) and are working on sharding our traffic into separate Zeus clusters where it makes sense to do so. We are also looking into separating database traffic away from Zeus altogether and onto a protocol-aware load balancer that can cache results.</p>
<p style="padding-left: 30px;">On a higher level, we&#8217;re pushing out services like Cedexis to augment our geo- and performance-based global load balancing. This replaces another external service (3crowd), the discontinued &#8220;Zeus GLB&#8221; app, and the newer &#8220;Zeus Multi-Site Manager&#8221; app. This gives us a unified, convenient, and high-performing way to &#8220;front&#8221; Zeus and distribute traffic efficiently between multiple Zeus clusters.</p>
<h2>It wasn&#8217;t all bad</h2>
<p>Of course a number of <strong>good</strong> things happened in December as well:</p>
<ul>
<li><a href="https://browserid.org/">BrowserID</a> went to production</li>
<li>Ramped up Tegra capacity for Native UI &amp; Android UI testing for Firefox Mobile</li>
<li>Our Inventory system got a nice overhaul and a migration to a new cluster</li>
<li>Lots of CDN work, including SSL CDN trials (Akamai, Highwinds) and Cedexis experimentation / implementation</li>
<li>Turn-up of a new 9-cabinet module in our PHX1 datacenter (fortunate, since it helped significantly with the load balancer issues)</li>
<li>A new ESX cluster in PHX1 (apart from the Hyper-Critical one above, which happened later)</li>
<li><strong>Dozens</strong> of web content pushes &#8211; these are so reliable now we don&#8217;t even announce them anymore</li>
<li>BrowserID &amp; LDAP integration work for the <a href="http://mozillians.org/">Mozilla Community Directory</a></li>
</ul>
<p>It can be easy to forget about the wins, because in general you&#8217;re winning whenever something isn&#8217;t broken!</p>
<p>We&#8217;re hoping to get back on track with more frequent updates&#8230; look for more recent info very soon!</p>
<p>Jake</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2012/01/24/december-2011-in-it/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>MXR Repo Changes and Additions</title>
		<link>http://blog.mozilla.com/it/2012/01/23/mxr-repo-changes-and-additions/</link>
		<comments>http://blog.mozilla.com/it/2012/01/23/mxr-repo-changes-and-additions/#comments</comments>
		<pubDate>Tue, 24 Jan 2012 05:45:52 +0000</pubDate>
		<dc:creator>jakem</dc:creator>
				<category><![CDATA[General Updates]]></category>
		<category><![CDATA[mxr]]></category>
		<category><![CDATA[webops]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1650</guid>
		<description><![CDATA[We&#8217;ve made a few additions and changes to the repos in MXR today, per Bugs 653424 and 675115. The following repos have been added and are processed daily: comm-aurora comm-beta comm-release mozilla-release l10n-mozilla-release Additionally the comm-2.0 repo was added as a one-time processing job, so it is indexed and searchable now as well. The following&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2012/01/23/mxr-repo-changes-and-additions/" title="Read the rest of &#8220;MXR Repo Changes and Additions&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve made a few additions and changes to the repos in MXR today, per Bugs <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=653424">653424</a> and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=675115">675115</a>.</p>
<p>The following repos have been added and are processed daily:</p>
<ul>
<li>comm-aurora</li>
<li>comm-beta</li>
<li>comm-release</li>
<li>mozilla-release</li>
<li>l10n-mozilla-release</li>
</ul>
<p>Additionally the comm-2.0 repo was added as a one-time processing job, so it is indexed and searchable now as well.</p>
<p>The following repos have been moved from daily processing to weekly processing. These seem to be receiving very infrequent updates (less than monthly), so I believe it will not significantly impact anyone.</p>
<ul>
<li>bugzilla2.20</li>
<li>bugzilla3.0</li>
<li>bugzilla3.2</li>
<li>firefox2</li>
<li>fuel</li>
<li>incubator-central</li>
<li>mozilla1.8</li>
<li>mozilla1.8.0</li>
<li>mozilla1.9.1</li>
<li>mozilla2.0</li>
<li>l10n-mozilla1.8</li>
<li>l10n-mozilla1.8.0</li>
<li>l10n-mozilla1.9.1</li>
<li>l10n-mozilla2.0</li>
<li>tamarin-central</li>
<li>webtools</li>
<li>webtools-central</li>
</ul>
<p>If you feel that any of these need more frequent MXR processing, please let me know about it by filing a bug with the <a href="https://bugzilla.mozilla.org/form:itrequest">IT Request form</a>. You can open a bug in the main Webtools::MXR component as well, but since IT is currently managing the MXR repos it will take longer to make its way to us if you go that route. <img src='http://blog.mozilla.com/it/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Jake</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2012/01/23/mxr-repo-changes-and-additions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MXR Improvements</title>
		<link>http://blog.mozilla.com/it/2011/11/15/mxr-improvements/</link>
		<comments>http://blog.mozilla.com/it/2011/11/15/mxr-improvements/#comments</comments>
		<pubDate>Tue, 15 Nov 2011 18:10:10 +0000</pubDate>
		<dc:creator>jakem</dc:creator>
				<category><![CDATA[General Updates]]></category>
		<category><![CDATA[mxr]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[webops]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1496</guid>
		<description><![CDATA[Over the last few weeks we&#8217;ve been making a number of small changes to the MXR web tool (https://mxr.mozilla.org/), and it seems about time to highlight some of the bigger changes. Background MXR is the Mozilla Cross-Reference system. It&#8217;s basically a convenient way to search a whole lot of Mozilla code for certain things. I&#8217;m&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2011/11/15/mxr-improvements/" title="Read the rest of &#8220;MXR Improvements&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>Over the last few weeks we&#8217;ve been making a number of small changes to the MXR web tool (<tt><a href="https://mxr.mozilla.org/">https://mxr.mozilla.org/</a></tt>), and it seems about time to highlight some of the bigger changes.</p>
<h3>Background</h3>
<p>MXR is the Mozilla Cross-Reference system. It&#8217;s basically a convenient way to search a whole lot of Mozilla code for certain things. I&#8217;m sure a proper Mozilla developer could tell you all about how awesome and terrible it is, and what it can and cannot do for you. As a sysadmin, I can tell you it&#8217;s a rather intensive, complicated set of scripts that have to handle a lot of different code. It pulls code written in many different languages, from multiple different revision control systems, and indexes them. Some code trees are processed more than once a day; the rest are done daily.</p>
<p>As you can imagine, this results in a lot of special cases. MXR breaks it down into basically 3 steps (conveniently split into 3 scripts) &#8211; update the tree, generate the cross-reference identifier database, and generate the searchable index.</p>
<ol>
<li>The first step (update-src.pl) is relatively straightforward. It consists basically of updating the code for whichever tree you pass it as an argument. It&#8217;s a lot of special cases, since most trees are handled slightly differently from one another, but all in all it&#8217;s about what you&#8217;d expect. It&#8217;s the equivalent of &#8220;hg pull&#8221; or &#8220;svn update&#8221;. The only trouble is some trees are not nearly that simple, and involve quite a bit more work to update properly. In fact some trees are actually nested source repos, which must be updated independently.</li>
<li>The second step (update-xref.pl) is where most of the time is spent. The &#8216;genxref&#8217; script is a giant mess of Perl regex&#8217;s to detect identifiers (function names, etc) in various programming languages. You give it a tree, and for every file in that tree it determines the file type and scans it appropriately.</li>
<li>The third step (update-search.pl) is once again relatively simple. Searching is done via &#8220;glimpse&#8221;, so this is basically running &#8220;glimpseindex&#8221; on whichever tree you call it with.</li>
</ol>
<p>The three steps are tied together with an overall calling shell script (update-full-onetree.sh). This is responsible for feeding each of these three scripts the appropriate arguments (the name of the tree), and reporting the overall output.</p>
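<p>To make the chain concrete, here is a minimal sketch of what such a wrapper might look like. It is not the actual update-full-onetree.sh: the <tt>run_step</tt> and <tt>update_one_tree</tt> helpers and the <tt>MXR_BIN</tt> variable are invented for illustration, and it assumes each step script takes only the tree name as its argument. Stopping at the first failed step is a choice made for this example, not something the real script is known to do.</p>

```shell
#!/bin/bash
# Hypothetical sketch of a per-tree wrapper; not the real update-full-onetree.sh.
# MXR_BIN (directory holding the three step scripts) is an assumed name.
MXR_BIN="${MXR_BIN:-.}"

# Run one step script for one tree and report how it went.
run_step() {
    local script="$1" tree="$2"
    echo "[$tree] starting $script"
    if "$MXR_BIN/$script" "$tree"; then
        echo "[$tree] finished $script"
    else
        echo "[$tree] FAILED $script" >&2
        return 1
    fi
}

# Run the three steps in order, stopping at the first failure so that a
# broken source update never feeds the two indexing steps.
update_one_tree() {
    local tree="$1"
    run_step update-src.pl    "$tree" &&
    run_step update-xref.pl   "$tree" &&
    run_step update-search.pl "$tree"
}

# Example call (tree name chosen only for illustration):
#   update_one_tree mozilla-release
```

<p>Because every line of output is prefixed with the tree name, the combined report stays readable even when several trees are processed at once.</p>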
<p>This calling script is in turn called by another calling script, which calls it for every tree. This is the final link in the chain, and this script goes in cron. There are actually 2 of these- one that cron calls every 4 hours, one that cron calls daily. They&#8217;re basically identical except for the list of trees they process.</p>
<p>Now that you know how it works, I can tell you about how it&#8217;s better now than it was a month ago. <img src='http://blog.mozilla.com/it/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<h3>Improvements</h3>
<p>First things first: the daily script was no longer reporting its output. It was working, but not telling us about it. This was due to a newly-introduced &#8220;rsync&#8221; job, which was running with the verbose flag. This caused the output to be far larger than it should be, which broke the reporting. I removed the verbose flag, and all is back to normal.</p>
<p>With that out of the way, I could see that the &#8220;4-hour&#8221; job was taking approximately 2 hours to run, and the daily job as much as 30 hours in some cases. That&#8217;ll typically make any sysadmin a bit squeamish, and indeed it is non-ideal. However, these jobs have no built-in parallelism, and this server has multiple CPU cores available&#8230; meaning, the overlapping wasn&#8217;t normally a big problem.</p>
<p>At this point I wanted to improve the running time directly. Step 2 above is by far the longest-running (and hardest on CPU time), so I started there with the very nice <a href="http://www.perl.org/about/whitepapers/perl-profiling.html">NYTProf</a> Perl profiler. This made it clear that most of the time was being spent in a few of the more complicated regex&#8217;s used in this script. I was able to make some very small improvements, but ultimately nothing appeared to be severely wrong- they weren&#8217;t doing anything terrible, and in fact appeared to have already been tuned by someone who is frankly better at it than I am. I did get some great suggestions on ways to improve the whole process, however, which may yet be implemented some day.</p>
<p>Having quickly scanned over this issue and found nothing of significance, I moved on to &#8220;update-search.pl&#8221;&#8230; the 3rd and final step. I skipped NYTProf here, because the usage is quite obviously tied up in &#8220;glimpseindex&#8221;. I was able to make a few small tweaks here, which I believe may ultimately result in faster searching. Specifically, I upgraded our indexes from the default &#8220;tiny&#8221; size to the mid-level &#8220;small&#8221; size. This roughly doubled our disk space used for indexes, but should have provided a nice boost to search performance. This also makes the indexing take longer, but it still ends up being much faster than the &#8220;step 2&#8221; above, so I consider it a good trade. Unfortunately as I mentioned, IT doesn&#8217;t have a lot of face-time with this app, so it&#8217;s hard to judge for myself just how much faster searching really is.</p>
<p>Finally, the biggest change- parallelization of MXR jobs. After some time looking around, I couldn&#8217;t see any reason why we could not process multiple trees simultaneously. So, I set out to do just that.</p>
<p>I wanted to start simple, just to get something in place to prove that it would work. Thus, I re-worked the 4-hour cron job into a simple loop over the list of 4-hour trees, and had the loop execute 4 tasks at once and background them, then &#8220;wait&#8221; for completion before continuing. This is extremely simple using nothing more than standard shell semantics:</p>
<pre>PARALLEL_JOBS=4
count=0
for i in $TREES; do
    echo -n "Starting $i at "; date
    nice -n 19 ./update-full-onetree.sh -cron $i &amp;
    let count+=1
    [[ $((count%PARALLEL_JOBS)) -eq 0 ]] &amp;&amp; wait
done
wait</pre>
<p>This works beautifully, for what it is. The problem is that it doesn&#8217;t always keep 4 jobs running. It starts 4 jobs, waits for all 4 to complete, then starts 4 more. This results in large gaps where we would ideally like to be running another tree, but instead just sit and wait. If the 4 jobs started at the same time all take about the same amount of time to run, it&#8217;s pretty good. But if one job takes an hour and the other 3 take only 5 minutes, you&#8217;ve got a lot of idle time. Effectively, in some cases it&#8217;s not much better than serial execution.</p>
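<p>The cost of that idle time is easy to estimate with a little arithmetic. The sketch below is purely illustrative: <tt>batch_time</tt> and <tt>pool_time</tt> are made-up helper names, the durations are invented (two 60-minute trees and six 5-minute trees), and it needs bash for arrays. It compares the batch-and-wait loop with an idealized scheduler that starts the next tree the moment any slot frees up.</p>

```shell
#!/bin/bash
# Wall-clock time when jobs start in fixed batches of $1 and each batch
# waits for its slowest member. Remaining arguments are durations in minutes.
batch_time() {
    local slots="$1"; shift
    local total=0 longest=0 n=0 d
    for d in "$@"; do
        [ "$d" -gt "$longest" ] && longest="$d"
        n=$((n + 1))
        if [ $((n % slots)) -eq 0 ]; then
            total=$((total + longest))
            longest=0
        fi
    done
    # Add the last, possibly incomplete, batch.
    echo $((total + longest))
}

# Wall-clock time when each job goes to whichever slot becomes free first.
pool_time() {
    local slots="$1"; shift
    local -a free_at
    local i d best finish=0
    for ((i = 0; i < slots; i++)); do free_at[i]=0; done
    for d in "$@"; do
        best=0
        for ((i = 1; i < slots; i++)); do
            [ "${free_at[i]}" -lt "${free_at[best]}" ] && best=$i
        done
        free_at[best]=$((free_at[best] + d))
        [ "${free_at[best]}" -gt "$finish" ] && finish=${free_at[best]}
    done
    echo "$finish"
}

# Eight trees, four slots: two slow trees (60 min) and six quick ones (5 min).
batch_time 4 60 5 5 5 60 5 5 5   # prints 120
pool_time  4 60 5 5 5 60 5 5 5   # prints 65
```

<p>With these invented numbers the batch scheme needs 120 minutes, because each batch is held up by its one slow tree, while keeping the slots full would need only 65.</p>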
<p>Still, this was enough to bring the running time on the 4-hour job down from 2 hours to about 45 minutes. A very nice win. I knew I would have to come back to this later.</p>
<p>You may notice this doesn&#8217;t actually change the amount of work being done, it just crams it all into a smaller amount of time. While that&#8217;s absolutely correct, I believe it still results in a performance win for searching- these update jobs are rather disk-intensive, and tend to obliterate the cache. By grouping them up, we effectively obliterate the cache for a shorter amount of time, meaning the *rest* of the time should benefit from somewhat less turbulent disk caching. Basically, longer periods of &#8220;good&#8221; and shorter periods of &#8220;bad&#8221;.</p>
<p>What I really wanted to do though was to actually do less work during an update cycle. One simple way to get this is to detect whether anything has changed since the last execution, and if nothing has, to skip as much as possible.</p>
<p>Clearly, we still need to do step 1, or we won&#8217;t actually know if anything has changed. Once that&#8217;s out of the way we can make a determination. Theoretically it should be possible to do this very efficiently&#8230; *if* the tree&#8217;s revision control system will tell you that. Unfortunately, this is a brick wall for a generic solution. I could write up specific checks for each individual tree, but I really wanted to have something more easily maintainable- I already knew it was a pain to do this, simply by working on the script for step 1. In the end, I settled for a simple &#8220;find&#8221; command, to look for any files modified more recently than the last time the tree was processed. It&#8217;s a lot more overhead, but &#8220;works every time&#8221;. The overhead is actually negligible compared to the runtime of step 2 and 3 anyway, so it actually ends up not mattering too much.</p>
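<p>A check along these lines can be sketched in a few lines of shell. This is an illustration rather than the code MXR runs: the <tt>.mxr-last-run</tt> stamp file and the <tt>tree_changed</tt> and <tt>mark_processed</tt> helpers are hypothetical names, and it assumes the stamp is touched once a tree has been fully processed.</p>

```shell
#!/bin/bash
# Hypothetical sketch: decide whether a tree needs re-indexing by comparing
# file modification times against a stamp file touched after the last run.
# The .mxr-last-run name is made up for this example.

# Succeeds (exit 0) if any regular file under $1 is newer than the stamp,
# or if no stamp exists yet (the tree has never been processed).
tree_changed() {
    local tree_dir="$1"
    local stamp="$tree_dir/.mxr-last-run"
    [ -f "$stamp" ] || return 0
    # Only the first match matters; "head -n 1" lets find stop early.
    [ -n "$(find "$tree_dir" -type f -newer "$stamp" | head -n 1)" ]
}

# Record that the tree was processed just now.
mark_processed() {
    touch "$1/.mxr-last-run"
}

# Example use inside a per-tree wrapper, after step 1 has run:
#   if tree_changed "$dir"; then
#       ... run steps 2 and 3 ...
#       mark_processed "$dir"
#   fi
```

<p>One caveat with this approach: <tt>-newer</tt> only matches files modified strictly after the stamp, so on filesystems with coarse timestamps a change landing in the same instant as the stamp could be missed until the file changes again.</p>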
<p>This results in a significant reduction in IOPS and CPU cycles consumed. It should also cause far less cache destruction, as most of the files are never actually read. Of course there are still some caching problems caused by this &#8220;find&#8221; command, but all in all it should be a vast reduction from actually reading each and every file.</p>
<p>With this in place, I was really starting to feel the inefficiency from the simple shell-based parallelization scheme. When a tree doesn&#8217;t get changed, its runtime gets very short- but this is a worst-case for that algorithm, so it devolves almost back into serial execution! Of course the overall runtime is still pretty good (30-45min, depending on which trees have or haven&#8217;t changed), but the efficiency is getting worse- there&#8217;s more dead time in between jobs. So I took the next step.</p>
<h3>GNU Parallel to the Rescue</h3>
<p>If you haven&#8217;t heard of <a href="http://www.gnu.org/s/parallel/">GNU Parallel</a>, I <em>highly</em> recommend it. There is an exceptionally good tutorial video on using parallel, <a href="http://www.youtube.com/watch?v=OpaiGYxkSuQ">here</a>. Suffice it to say, I rewrote the above loop to look like this:<br />
<code><br />
echo "$TREES" | parallel -j+0 'echo -n "{}: Starting " &amp;&amp; date &amp;&amp; nice -n 19 ./update-full-onetree.sh -cron {}; echo -n "{}: Ending " &amp;&amp; date'<br />
</code></p>
<p>I&#8217;ll admit it&#8217;s not nearly as pretty to look at, but it has one very nice improvement- it will make sure that there is always the right number of jobs executing. If a job finishes, it will start the next one right away.</p>
<p>With this in place, the 4-hour job is now reduced to a maximum of 30 minutes&#8230; and a minimum of <em>eight</em>. Compare that to the original time of about 2 hours, every time, and you can easily see this is a <em>massive</em> improvement.</p>
<p>I&#8217;m making the same set of changes to the &#8216;daily&#8217; job, and although I expect some great things, it unfortunately has one significant weakness: one of the trees alone takes about 17 hours to process, and gets pretty constant usage. So while this will largely eliminate the situation where this job can &#8216;overrun&#8217; itself, and will consolidate the vast majority of the trees to be completed very quickly, the overall job will still be bounded by this one tree.</p>
<h3>Future Improvements</h3>
<p>There are quite a few places where MXR could be improved further, but unfortunately much of it may be beyond my capabilities and I&#8217;ll need to seek some outside assistance.</p>
<ul>
<li>Update step 2 to be multi-threaded internally. This will speed up jobs like the 17-hour monster.</li>
<li>Update step 2 to intelligently skip files within a tree if they haven&#8217;t changed since last processed. This would cut down on the per-job runtime <em>significantly</em>. As you can imagine, most files don&#8217;t change every day- changes are concentrated into only a very small percentage of files. If we can quickly skip the others, we can save a whole lot of wasted effort. The trick will be to do this without relying on a revision-control system, since there&#8217;s so little uniformity in that area.</li>
<li>Update step 3 to be more intelligent about generating indexes. This will take some playing with &#8220;glimpseindex&#8221;, but the short version is I believe we are discarding the previous index every time and starting from scratch. Obviously this is non-ideal.</li>
<li>Move MXR to a better home. It should be quite possible for the web app to live on a separate machine from the backend processing, which should help with caching. For that matter the web frontend could likely be a normal load-balanced cluster, and gain improved query time through horizontal scaling. Even the processing could be split up across multiple machines (and GNU parallel in fact makes this incredibly straightforward).</li>
</ul>
<h3>New Trees Soon!</h3>
<p>Lastly, there are at least 2 trees I&#8217;m looking very closely at adding in the very near future. I don&#8217;t want to spoil the surprise, but both have been requested more than once, and have very good reasons for being added. Having dug into MXR quite a bit recently, I&#8217;m in a much better position now to do this than in the past. Look for an update on this soon. <img src='http://blog.mozilla.com/it/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2011/11/15/mxr-improvements/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>This week in IT: early holiday shopping</title>
		<link>http://blog.mozilla.com/it/2011/11/09/this-week-in-it-early-holiday-shopping/</link>
		<comments>http://blog.mozilla.com/it/2011/11/09/this-week-in-it-early-holiday-shopping/#comments</comments>
		<pubDate>Thu, 10 Nov 2011 03:25:40 +0000</pubDate>
		<dc:creator>cshields</dc:creator>
				<category><![CDATA[General Updates]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1468</guid>
		<description><![CDATA[Mozilla is growing, and we have to grow our infrastructure right along with it.  A lot of this just happens behind the scenes as we are buying and installing new gear every month.  For today&#8217;s post I thought I would give an insight into the broad range of servers we have in the ordering/shipping process&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2011/11/09/this-week-in-it-early-holiday-shopping/" title="Read the rest of &#8220;This week in IT: early holiday shopping&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>Mozilla is growing, and we have to grow our infrastructure right along with it.  A lot of this just happens behind the scenes as we are buying and installing new gear every month.  For today&#8217;s post I thought I would give an insight into the broad range of servers we have in the ordering/shipping process right now.</p>
<div id="attachment_1469" class="wp-caption aligncenter" style="width: 226px"><a href="http://blog.mozilla.com/it/files/2011/11/add_to_cart.jpg"><img class="size-full wp-image-1469" title="add_to_cart" src="http://blog.mozilla.com/it/files/2011/11/add_to_cart.jpg" alt="" width="216" height="47" /></a><p class="wp-caption-text">if only it were this easy...</p></div>
<ul>
<li>New mailbox store &#8211; Our last Zimbra hardware upgrade was back in April. We are at the point now where we need to shed some mailboxes off to a second server to help with load.  This is not so much due to growth in the number of mailboxes we host as to the fact that most of our users have multiple clients hitting Zimbra all of the time: desktop, laptop, cellphone, tablet&#8230; Multiply that by the number of mailboxes and we have to start spreading the love to an additional server before it becomes a problem.</li>
<li>New VMWare servers &#8211; More VM space for QA, Labs, and more!</li>
<li>New load balance test box &#8211; We have run into issues with our blade-hosted load balancer boxes.  We will swap a couple out in one of our data centers for some beefier boxes and 10gig ethernet connections.  Depending on our results here we may be reworking our existing load balancing clusters to fit this new model.  In addition, this will provide guidance for our new data center in the spring.</li>
<li>New gear for addons.mozilla.org &#8211; a whole rack of blades, in fact!  We are also awaiting the arrival of some Fusion IO flash accelerator cards for the AMO database nodes.  These will greatly increase I/O speed on those nodes, resulting in faster responses and more capacity.  We did the same for the Bugzilla databases recently with great success.</li>
<li>Elastic Search servers for Socorro &#8211; 8 1U servers (aka &#8220;pizza boxes&#8221;) are on their way for Socorro.</li>
<li>New servers and infrastructure for China &#8211; We are prepping infrastructure for a new data center in China that will provide plenty of room and power where our current data center is tight.  This allows us to source some sites locally there as well as cache others there for improved performance throughout the region.</li>
</ul>
<p>On a logistical note, we are getting a new block of space in our Phoenix data center that all of this gear will go in once it is ready (except for the China servers).  We are close &#8211; just a few more pieces to that puzzle and we will start racking this new gear.</p>
<p>That&#8217;s all for now, it is time to get back to obsessing over FedEx trackers and logistical details.  Next week, we <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=674083">invent teleportation</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2011/11/09/this-week-in-it-early-holiday-shopping/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Traffic Distribution</title>
		<link>http://blog.mozilla.com/it/2011/10/21/traffic-distribution/</link>
		<comments>http://blog.mozilla.com/it/2011/10/21/traffic-distribution/#comments</comments>
		<pubDate>Fri, 21 Oct 2011 17:28:33 +0000</pubDate>
		<dc:creator>jakem</dc:creator>
				<category><![CDATA[General Updates]]></category>
		<category><![CDATA[webops]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1364</guid>
		<description><![CDATA[We&#8217;ve recently been doing a lot of CDN work (see my last post on getpersonas.com), and out of that has come some interesting data as to our world-wide traffic distribution. Here&#8217;s a breakdown of the traffic for getpersonas.com, by region: &#160; The CDN traffic for http://mozilla.org/firefox is similar&#8230; although Asia ranks a bit higher&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2011/10/21/traffic-distribution/" title="Read the rest of &#8220;Traffic Distribution&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve recently been doing a lot of CDN work (see my <a href="http://blog.mozilla.com/it/2011/10/19/getpersonas-com-cdn-changes/">last post</a> on getpersonas.com), and out of that has come some interesting data as to our world-wide traffic distribution.</p>
<p>Here&#8217;s a breakdown of the traffic for getpersonas.com, by region:</p>
<p><a href="http://blog.mozilla.com/it/files/2011/10/personas-region.png"><img class="alignnone size-full wp-image-1369" title="personas-region" src="http://blog.mozilla.com/it/files/2011/10/personas-region.png" alt="" width="580" height="315" /></a></p>
<p>&nbsp;</p>
<p>The CDN traffic for http://mozilla.org/firefox is similar&#8230; although Asia ranks a bit higher (22%) and North America lower (30%).</p>
<p>This should tell you right away that if we just focus on North America and Europe (as many CDNs do), we&#8217;re going to cut out almost 1/3 of our visitors. Because of this, we&#8217;re spending a good bit of time researching CDN performance, trying to find platforms that will improve the experience for our users in &#8220;the other 30%&#8221;.</p>
<p>Let&#8217;s drill down into South America a bit. This is a region that doesn&#8217;t get a lot of love, at least from a CDN perspective. Quite often the nearest CDN node ends up being in Miami or southern California, and is at least 250ms away&#8230; multiply that over many page elements, and it&#8217;s quite easy to end up with a 10 second page load time. Other times there might be a nearby node, but because of poor ISP peering it&#8217;s not actually reachable from very many networks. There&#8217;s also a vibrant Mozilla community here, and we&#8217;d like to be able to do a better job for them.</p>
<p><a href="http://blog.mozilla.com/it/files/2011/10/getpersonas-south-america.png"><img class="alignnone size-full wp-image-1370" title="getpersonas-south-america" src="http://blog.mozilla.com/it/files/2011/10/getpersonas-south-america.png" alt="" width="601" height="315" /></a></p>
<p>&nbsp;</p>
<p>This lines up fairly well with the overall World Internet Stats: <a title="http://www.internetworldstats.com/stats10.htm" href="http://www.internetworldstats.com/stats10.htm">http://www.internetworldstats.com/stats10.htm</a>.</p>
<p>So if you&#8217;re in one of these regions, fear not! We do care about you, and things will get better soon. <img src='http://blog.mozilla.com/it/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2011/10/21/traffic-distribution/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Getpersonas.com CDN Changes</title>
		<link>http://blog.mozilla.com/it/2011/10/19/getpersonas-com-cdn-changes/</link>
		<comments>http://blog.mozilla.com/it/2011/10/19/getpersonas-com-cdn-changes/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 00:11:42 +0000</pubDate>
		<dc:creator>jakem</dc:creator>
				<category><![CDATA[General Updates]]></category>
		<category><![CDATA[webops]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1356</guid>
		<description><![CDATA[We have just made a change to a DNS record for getpersonas.com affecting its CDN usage, and thought we&#8217;d share some of our findings as to its worldwide traffic patterns. Prior to now, getpersonas-cdn.mozilla.net pointed to a CDN aggregator service at 3crowd. We use this for several things. For example, aus2.mozilla.org (part of the Firefox&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2011/10/19/getpersonas-com-cdn-changes/" title="Read the rest of &#8220;Getpersonas.com CDN Changes&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>We have just made a change to a DNS record for getpersonas.com affecting its CDN usage, and thought we&#8217;d share some of our findings as to its worldwide traffic patterns.</p>
<p>Prior to now, getpersonas-cdn.mozilla.net pointed to a CDN aggregator service at 3crowd. We use this for several things. For example, aus2.mozilla.org (part of the Firefox Automatic Updater Service) goes through 3crowd. It works very well for this type of thing, where what we&#8217;re really after is distributing the load across our own infrastructure, in order to gain redundancy.</p>
<p>One of its limitations though is that it is inherently disconnected from the actual CDNs being used. That is, it has no way to know if a particular CDN is really good in one country, but really bad in another. It can&#8217;t make that type of determination on-the-fly. By default it just assigns a weight to each CDN, and doles out the traffic in whatever ratios we specify.</p>
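<p>To make the limitation concrete, here is a minimal Python sketch of purely weight-based selection. It is an illustration only, not 3crowd&#8217;s actual implementation; the backend names, the weights and the <code>pick_backend</code> function are all hypothetical. Notice that nothing about the user&#8217;s location ever enters the decision.</p>

```python
import bisect
from itertools import accumulate

# Hypothetical backends and weights; equal weights give each one 1/3 of requests.
WEIGHTS = {"cdn_a": 1, "cdn_b": 1, "own_node": 1}


def pick_backend(weights, draw):
    """Map a uniform draw in [0, 1) to a backend using static weights only."""
    names = list(weights)
    # Running totals, e.g. [1, 2, 3] for three equal weights.
    bounds = list(accumulate(weights[name] for name in names))
    index = bisect.bisect_right(bounds, draw * bounds[-1])
    # Guard against float rounding pushing the index past the last backend.
    return names[min(index, len(names) - 1)]


if __name__ == "__main__":
    import random

    # Every request gets an independent draw, wherever the user happens to be.
    print(pick_backend(WEIGHTS, random.random()))
```

<p>Changing the ratios only changes the numbers in <code>WEIGHTS</code>; a user in Sydney and a user in San Jose still face exactly the same odds.</p>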
<p>For a normal website like getpersonas.com, with broad international appeal, this isn&#8217;t quite ideal. For folks in North America or Europe, it works just fine&#8230; generally CDNs have pretty good coverage there, and for the most part it won&#8217;t make a big difference which one you end up using. Outside of those regions though, CDN coverage is much more hit-and-miss. If you&#8217;re in Australia or Argentina, which CDN you get routed to might make a very big difference on your page load times.</p>
<p>For example, here&#8217;s a &#8220;before&#8221; graph of getpersonas.com load time from our Gomez test node in Sydney, Australia (on Telstra, to be precise):</p>
<p><a href="http://blog.mozilla.com/it/files/2011/10/getpersonas-australia-3crowd1.png"><img title="getpersonas-australia-3crowd" src="http://blog.mozilla.com/it/files/2011/10/getpersonas-australia-3crowd1.png" alt="" width="843" height="289" /></a></p>
<p>As you can see, it&#8217;s quite spiky. If you squint just right you can identify 3 tiers of performance. Not surprisingly, they line up with the 3 different backend services configured in 3crowd. The worst tier is from a local node we control ourselves in San Jose, CA. Clearly, that&#8217;s a pretty terrible choice from Australia&#8230; and yet it&#8217;s getting 1/3 of the overall traffic, simply because 3crowd doesn&#8217;t know any better.</p>
<p>This morning I changed this record away from 3crowd towards a new service we&#8217;re demoing, Cedexis. This is a similar service, but they maintain a database of CDN provider response times around the world&#8230; all the way down to the individual ISP level. Using this database, they can intelligently choose which CDN to use for every single request, theoretically using the fastest one from the user&#8217;s location.</p>
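<p>Conceptually, that kind of decision looks something like the sketch below. This is a simplified illustration, not Cedexis&#8217;s algorithm or API; the network names, the response times and the <code>fastest_cdn</code> function are made up for the example.</p>

```python
# Hypothetical measured response times in milliseconds, keyed by (country, ISP).
RESPONSE_MS = {
    ("BR", "isp_one"): {"cdn_a": 120, "cdn_b": 480},
    ("AU", "isp_two"): {"cdn_a": 350, "cdn_b": 90},
}


def fastest_cdn(table, country, isp, default="cdn_a"):
    """Return the CDN with the lowest measured response time for this network."""
    timings = table.get((country, isp))
    if not timings:
        # No measurements for this network: fall back to a fixed default.
        return default
    return min(timings, key=timings.get)


if __name__ == "__main__":
    print(fastest_cdn(RESPONSE_MS, "BR", "isp_one"))
    print(fastest_cdn(RESPONSE_MS, "AU", "isp_two"))
```

<p>The point of the sketch is the lookup key: the choice depends on where the request comes from, which a fixed weight cannot express.</p>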
<p>Unfortunately it&#8217;s too soon to have a useful &#8220;after&#8221; graph to share. Instead, I&#8217;d like to share some information as to just how effective the new service might end up being.</p>
<p>&#8220;Global Village&#8221; is (as far as our traffic mix is concerned) the 4th biggest ISP in Brazil. They account for approximately 8% of Brazil&#8217;s traffic to getpersonas.com. Brazil as a whole is 45% of our South American traffic, which in turn is around 9% of our worldwide traffic. That&#8217;s the level of detail we can drill down to. Here&#8217;s a graph of the &#8220;decisions&#8221; made by Cedexis for this one ISP in Brazil, over the last 24 hours. Remember, before today it would have been 50/50.</p>
<p><a href="http://blog.mozilla.com/it/files/2011/10/global-village-brazil.png"><img title="global-village-brazil" src="http://blog.mozilla.com/it/files/2011/10/global-village-brazil.png" alt="" width="363" height="214" /><br />
</a></p>
<p>This is 94% vs. 6%. This should give you some idea of just how important choosing the proper CDN could be&#8230; far more often than not, we were making the wrong decision for these users. Situations like this generally occur when an ISP has limited peering with upstream providers, and thus does not have a good route to some places. This particular ISP appears to have a good connection to CDNetworks, but a poor one to Edgecast. Obviously this type of thing is more prevalent in smaller ISPs, where they may not be able to get (or afford) more complete peering agreements. You generally won&#8217;t see this type of thing with very big ISPs.</p>
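<p>For a sense of scale, the shares quoted earlier for this ISP (8% of Brazil, 45% of South America, 9% of the world) multiply out to roughly a third of one percent of worldwide traffic. The small Python sketch below just does that multiplication; the function name is made up for the example.</p>

```python
# Shares quoted in the post, as whole percentages.
ISP_SHARE_OF_BRAZIL = 8             # "Global Village" share of Brazil's traffic
BRAZIL_SHARE_OF_SOUTH_AMERICA = 45  # Brazil's share of South American traffic
SOUTH_AMERICA_SHARE_OF_WORLD = 9    # South America's share of worldwide traffic


def worldwide_share_percent(*shares_percent):
    """Multiply nested percentage shares and return the result as a percentage."""
    result = 100.0
    for share in shares_percent:
        result *= share / 100.0
    return result


share = worldwide_share_percent(
    ISP_SHARE_OF_BRAZIL,
    BRAZIL_SHARE_OF_SOUTH_AMERICA,
    SOUTH_AMERICA_SHARE_OF_WORLD,
)

if __name__ == "__main__":
    print(f"{share:.3f}% of worldwide traffic")
```

<p>Each slice like this is tiny on its own, which is why it never shows up in a worldwide average.</p>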
<p>Now aggregate this around the world, and you can quickly see that even if the worldwide traffic mix still comes out to approximately 50/50 (and it does&#8230; within a few percentage points), making better decisions locally can result in a much better experience for a large group of users.</p>
<p>Over the next few days we should get a good idea of how much this is actually helping&#8230; I expect small gains worldwide, and possibly large gains in certain regions. Especially outside of North America, I expect getpersonas.com will be substantially improved, with Asia-Pacific and South America seeing the biggest gains. When we have some data, I&#8217;ll post a graph showing the worldwide before-and-after.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2011/10/19/getpersonas-com-cdn-changes/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>New LDAP Infrastructure</title>
		<link>http://blog.mozilla.com/it/2011/10/16/new-ldap-infrastructure/</link>
		<comments>http://blog.mozilla.com/it/2011/10/16/new-ldap-infrastructure/#comments</comments>
		<pubDate>Mon, 17 Oct 2011 04:53:21 +0000</pubDate>
		<dc:creator>jabba</dc:creator>
				<category><![CDATA[General Updates]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1340</guid>
		<description><![CDATA[This weekend, I rolled out a new LDAP infrastructure. Here are the details: The why: At Mozilla we depend heavily on our OpenLDAP-based[1] authentication system. As we&#8217;ve grown quite a bit over the past year or so, it has become apparent that our LDAP ecosystem wasn&#8217;t scaling accordingly. Until now, we&#8217;ve relied mostly on a&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2011/10/16/new-ldap-infrastructure/" title="Read the rest of &#8220;New LDAP Infrastructure&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>This weekend, I rolled out a new LDAP infrastructure. Here are the details:</p>
<p>The why:</p>
<p>At Mozilla we depend heavily on our OpenLDAP-based[1] authentication system. As we&#8217;ve grown quite a bit over the past year or so, it has become apparent that our LDAP ecosystem wasn&#8217;t scaling accordingly. Until now, we&#8217;ve relied mostly on a single master server with a few slaves all behind an aging load balancer that has given us trouble in the past. Most of the slaves were no longer in the pool as their configurations had drifted over the years, and a lack of documentation and consistency made it hard to add capacity, make changes or even ensure high availability in the event of a hardware failure. All the LDAP servers were actually servers primarily dedicated to other services, so extra load on those services would have the side effect of making the LDAP service unreliable. As we&#8217;ve added a few satellite offices and an extra datacenter that all rely on having some sort of authentication system in sync with our primary LDAP server in our San Jose data center, we decided to redo this setup with something a little more scalable and with better configuration management.</p>
<p>The what:</p>
<p>Over the past 6 months or so, I&#8217;ve done quite a bit of research learning how OpenLDAP works, how it behaves, how it scales and most of all, how it is set up and being used at Mozilla. Understanding the current setup was key to designing a better architecture. The first thing to do was to gather all the information and the configurations of all the existing LDAP servers, and merge those into a centralized configuration management tool. As we&#8217;ve been using puppet[2], I wrote a module to manage our OpenLDAP infrastructure. This turned out to be a more difficult task than anticipated, as it had to support managing SSL certificates, a master server, slave servers and intermediary servers (servers that act as a master to other slaves, but are slaves themselves, replicating from the main master), all of which have slightly different configuration directives but overall need a similar configuration that stays in sync. Furthermore, we have some applications that act as frontend management tools for LDAP. Some are used by our team to provision new user accounts, maintain access control groups, etc., but there is also our phonebook directory app, which allows users to edit their own LDAP entries through a web interface. All these things needed to be taken into account when re-architecting the infrastructure. All of the apps were hosted on a single server, which was also the master LDAP server among other things. If that single box failed, we&#8217;d be in a bad situation.</p>
<p>The how:</p>
<p>I designed a new infrastructure that splits the various components out a little. The phonebook app would live on a cluster of 6 machines (shared with other, similar apps &#8211; think: intranet). Our LDAP master would move to a dedicated server, not used as an authentication backend for other services, but rather have dedicated authentication slaves in each datacenter and office. As we have more capacity in our Phoenix data center, the master would live there along with the webservers serving the phonebook app. Starting out, there are two dedicated slaves replicating from the master in Phoenix, load balanced behind our Zeus load balancer[3] cluster to be used as an authentication backend for any services in our Phoenix datacenter. This setup makes it easy to add capacity as needed, and provides high availability in the event of a failure. There are also two dedicated slaves in our San Jose data center, but rather than have them both replicate from the master in Phoenix, it was decided to add an intermediary server in San Jose. The intermediary server would replicate from the master in Phoenix, and the San Jose slaves would replicate from it. Aside from reducing the cross-datacenter traffic, this provides better data consistency within a datacenter. A similar intermediary server was set up in our Mountain View office to provide a single point for all the other office LDAP servers to replicate from. Each individual Mozilla office has two LDAP slaves and instead of a load balancer, those use a virtual floating IP address that can move from one server to the other using keepalived. We balance our other services such as DNS in this fashion and it reduces the need for extra load balancing equipment in the lower traffic offices, where LDAP is primarily used for wireless connectivity. 
All servers would be set up completely using puppet, so that in the event of a hardware failure, or the need to add more slaves, adding a few lines to our puppet configurations can make this happen in a matter of minutes without much thought or effort involved. The other piece to the puzzle is our internal addressbook. We use OpenLDAP to provide addressbook lookups for mail clients. In the past, this was done using a single machine that was an LDAP slave. As we scale, we needed better redundancy there too. The addressbook would now live on two machines, also behind a load balancer. As an extra security measure, since this service is directly on the internet, it was decided to change the configuration to make the addressbook &#8220;slaves&#8221; be simple proxies that only allow the lookup of a few select attributes. A compromised addressbook server would not result in a compromised LDAP database. Win! Speaking of security, the entire LDAP infrastructure was moved to a more secure VLAN, completely inaccessible to or from the internet, with the addressbook being the only thing with any exposure to the internet. Also, all ACLs were audited and updated to provide the minimal access necessary, while at the same time using standardized templates with puppet, making it easy to add and remove ACLs as needed with proper version control and auditing in place.<br />
<a href="http://blog.mozilla.com/it/files/2011/10/ldap-diagram.jpg"><img class="alignnone size-large wp-image-1341" title="ldap-diagram" src="http://blog.mozilla.com/it/files/2011/10/ldap-diagram-1024x519.jpg" alt="" width="664" height="336" /></a></p>
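<p>One way to picture the replication chain is as a simple parent map, sketched below in Python. The host names are hypothetical, and the link from the Mountain View intermediary to the San Jose intermediary is an assumption, inferred from the postmortem&#8217;s remark that replication reaches the remote offices through two intermediary servers.</p>

```python
# Each server maps to the server it replicates from (hypothetical names).
REPLICATES_FROM = {
    "phx-slave1": "phx-master",
    "phx-slave2": "phx-master",
    "sjc-intermediary": "phx-master",
    "sjc-slave1": "sjc-intermediary",
    "sjc-slave2": "sjc-intermediary",
    # Assumption: the Mountain View intermediary feeds from San Jose.
    "mtv-intermediary": "sjc-intermediary",
    "sfo-slave1": "mtv-intermediary",
    "sfo-slave2": "mtv-intermediary",
}


def replication_path(server, parents=REPLICATES_FROM):
    """Return the chain of servers from the given one back to the master."""
    path = [server]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


if __name__ == "__main__":
    print(" from ".join(replication_path("sfo-slave1")))
```

<p>Under these assumptions, a change written to the master crosses three replication hops before an office slave sees it, while a Phoenix slave is a single hop away.</p>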
<p>The move:</p>
<p>First, I set up all the new hardware and set up the full configuration using puppet. After drawing diagrams and identifying all network flows needed for replication and authentication to work, I worked closely with our network operations team to make sure everything would work as expected. Then I set up our San Jose intermediary server to replicate from the old master in San Jose to ensure that the overall flow of replication would work as expected and began testing various LDAP queries. Meanwhile, I set up the phonebook application in Phoenix on a new cluster of SeaMicro servers and began testing it against the new master. All of this was unknown territory, as we were moving from using a local LDAP server to using a remote one, going from RHEL5 to RHEL6, and moving from single-hosted to multi-hosted. It was a whole new environment. I worked with the webdev team and our own tools developer to update our LDAP apps to work in the new environment. As we were already rolling out our new offices in San Francisco, Toronto and Paris, I set those up to replicate from the new intermediary server in Mountain View, so that half of the infrastructure had already been in production for a while, and was actually crucial to the testing phase. Once I was satisfied that everything would work properly, I identified what was left to do to get completely off the old infrastructure and move to the new. At this point everything was set up, and it was essentially just a matter of moving the master database to the new master server in Phoenix, changing some DNS names to point at new IPs and making sure that all the clients still worked as expected. All of this needed to happen with minimal downtime as we rely on LDAP being available 24/7 for so many things, including mail, wi-fi, svn, mercurial, our intranet wikis, shell servers, etc. I decided to tackle the project on a Saturday, when the fewest people would be affected. 
I was pretty confident that the move could be done in under two hours, since I had spent months preparing for it and ironing out the details. This was mostly the case, and there were only a few brief downtimes during the two-hour window where various services failed to authenticate. This was mostly when I had to re-sync the slaves to have them catch up to their new masters. I did run into a few problems though, and here is the postmortem on that:</p>
<p>The Postmortem:</p>
<p>The migration itself went as planned. I shut down the old master, copied the full database to the new server, set the intermediary slave in San Jose to replicate from the new master in Phoenix and reset synchronization on the slaves in Phoenix. At the same time I changed the DNS to point at the new load balanced IPs. I started around 7am on Saturday by ensuring that the shell servers could talk to the new LDAP servers, fixing some clients that had hardcoded the old master as their LDAP server, and adding nagios checks to all the new slaves and the new master. At 10am, the maintenance window started, so I shut down the master and did the move and DNS changes then. By 10:30 San Jose was completely using only the new servers. Then I changed the DNS in Phoenix to point at the new servers and made sure the replication was working properly, both locally and remotely all the way to the remote offices through two intermediary servers. Everything went pretty well. By the end of the maintenance window at noon, I was just double and triple checking that replication was working. Around 1pm, I remembered that although I had made sure that our password reset app was working on our new cluster in Phoenix, I hadn&#8217;t ever tested changing my password and ensuring that the change would replicate properly. I tested it and found that it didn&#8217;t work. I worked with Rob Tucker, who happened to be online at the time, to troubleshoot why it wasn&#8217;t working. It turned out that there was one piece missing from our new master: a password check module that we&#8217;ve had in use for a few years, but that was completely undocumented. Rob helped me get the 64-bit version of the module compiled and installed in the new master and we finally got a password change to go through&#8230; but still not with the webapp that is user-facing for this purpose. 
We discovered that the app had one portion where it had hardcoded &#8220;localhost&#8221; as its LDAP server, rather than honoring the configuration directive. After patching the app, we finally had a successful password change.  At approximately 4pm, I was wrapping up and ready to leave for the evening, when a user e-mailed me to inform me that the phonebook app wouldn&#8217;t allow him to change anything in his profile. I couldn&#8217;t reproduce the issue, but realized this was because as an LDAP administrator, I have full admin rights to the LDAP database, whereas a regular user has a different set of ACLs that apply. This is when I realized that in all my testing, I had neglected to test a normal user account. I hacked on the permissions a bit more on Saturday night to get that working and had the ACLs fixed by midnight (I took a break from 5pm &#8211; 11pm). On Sunday morning, I went through and verified that the little surprises I discovered the day before were documented and added to the puppet manifest. I checked the temporary patch to the password reset app into svn and tested it again to make sure there were no more glitches. I also fixed the nagios alerts for the Mountain View slaves, which were misconfigured and wouldn&#8217;t have alerted us to a problem if there had been one. I&#8217;m glad I came back to double check that. <img src='http://blog.mozilla.com/it/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>What I learned:</p>
<p>Details are important. Now that we have the new infrastructure, my next priority is setting up a stage infrastructure that can be used for the phonebook app, the password reset app and other tools, and in general as a place to test and stage changes to our LDAP infrastructure.</p>
<p>Testing is important. I should have tested the password reset app. I should have tested the phonebook with a normal user account.</p>
<p>People can be single points of failure. Before I started working on this, there was only one person with the full set of knowledge of our LDAP infrastructure. He left earlier this year to pursue other things and although he had documented most common issues and troubleshooting steps for our infrastructure, there was a huge amount of information that we didn&#8217;t have (the password check module for instance), and I had to learn a lot from scratch. I feel like I&#8217;ve learned a tremendous amount about how OpenLDAP works in the past 6 months and I enjoy working on it, but it is now extremely important that I don&#8217;t become a single point of functionality for our LDAP environment. I&#8217;ve worked on documenting all the bits and pieces I have about our infrastructure, gave a tech talk to our team about it and will continue involving other members of my team and documenting all the changes. I hope also that this blog post provides some insight into how it is all set up, to give an idea of what is involved with the setup, explain why when you change your password, it takes ten minutes before the wireless controller in San Francisco notices the change, etc.</p>
<p>The next steps:</p>
<p>Overall, I think our LDAP infrastructure is now infinitely better than it was. It is set up to scale now and it is easy to make documented version controlled changes. I&#8217;ve put in a lot of time over the past 6 months planning this out, learning LDAP, and it seems that I&#8217;ve been eating, sleeping and breathing OpenLDAP for a while now. However, I still don&#8217;t consider myself an expert on the subject and am continually learning. I think there are still a lot of improvements that can be made to the infrastructure and what we have now is pretty much a better scaled version of what we had. The basic configuration directives are the same though. There is likely some more tuning that can be done for better performance, more reliable replication and stability. These are all things I want to pursue, but with production services that are so integral to our infrastructure, it is crucial to take things one step at a time.</p>
<p>Special thanks to Rob Tucker, Fred Wenzel, Corey Shields, Phong Tran, Dumitru Gherman, Pete Fritchman, Michael Coates, Guillaume Destuynder, Adam Newman and the rest of the IT team for helping make this happen.</p>
<p>&#8211; Jabba</p>
<p>&nbsp;</p>
<p>[1] <a title="The OpenLDAP Project" href="http://www.openldap.org/">http://www.openldap.org/</a></p>
<p>[2] <a title="Puppetlabs" href="http://puppetlabs.com/">http://puppetlabs.com/</a></p>
<p>[3] <a title="Zeus Technology" href="http://www.zeus.com/">http://www.zeus.com/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2011/10/16/new-ldap-infrastructure/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>New Etherpad!</title>
		<link>http://blog.mozilla.com/it/2011/10/07/new-etherpad/</link>
		<comments>http://blog.mozilla.com/it/2011/10/07/new-etherpad/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 02:54:13 +0000</pubDate>
		<dc:creator>jakem</dc:creator>
				<category><![CDATA[General Updates]]></category>
		<category><![CDATA[webops]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1333</guid>
		<description><![CDATA[About 3 hours ago we updated the Mozilla Etherpad installation to a more current version. This has been in the works for almost a full year, and has finally come to fruition. https://etherpad.mozilla.org/ Here&#8217;s a short list of the cool features we&#8217;re getting with this upgrade: More ports work. You can still connect to the&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2011/10/07/new-etherpad/" title="Read the rest of &#8220;New Etherpad!&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>About 3 hours ago we updated the Mozilla Etherpad installation to a more current version. This has been in the works for almost a full year, and has finally come to fruition.</p>
<p><a title="https://etherpad.mozilla.org/" href="https://etherpad.mozilla.org/" target="_blank">https://etherpad.mozilla.org/</a></p>
<p>Here&#8217;s a short list of the cool features we&#8217;re getting with this upgrade:</p>
<ul>
<li>More ports work. You can still connect to the old http://etherpad.mozilla.org:9000/ link, but the :9000 isn&#8217;t necessary anymore. The standard port 80 works just as well.</li>
<li>SSL Support! However you access the site, you&#8217;ll get redirected to <strong>https</strong>://etherpad.mozilla.org/. This is very cool, especially for the &#8220;SSL Everywhere&#8221; folks. <img src='http://blog.mozilla.com/it/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
<li>&#8220;Team Site&#8221; functionality. This is huge, and easily the biggest new feature&#8230; people have been asking for something like this for quite a while. Now Mozilla is a pretty open organization, but the reality is there are still some things that can&#8217;t be publicly discussed right away. A good example is domain name registrations&#8230; people have a habit of swiping them out from under us if we discuss a new domain name before it&#8217;s registered.</li>
<li>Team Site Pads can be public or private, and can even have their own password, just for that one pad. Let&#8217;s say you&#8217;ve got a team pad, and you need to let someone not on your team access it&#8230; but only that one pad, not any others. Simply make it a public pad, but set a password on it. Your team members can still access it, and now anyone you give the shared password to can also!</li>
<li>Team Site Pads can be deleted! This is a common request due to accidental information leaks (passwords, etc). Sadly this doesn&#8217;t extend to purely public sites, but it&#8217;s still a nice step forward.</li>
</ul>
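<p>To make the port and SSL changes above concrete, here is a minimal sketch of how the old-style links map onto the new canonical address. It is only an illustration of the redirect behaviour described in the list, not the actual server configuration; the function name <code>canonical_url</code> and the pad name in the usage note are made up for the example.</p>

```python
from urllib.parse import urlsplit, urlunsplit

# Host taken from the post; everything else here is an illustrative assumption.
CANONICAL_HOST = "etherpad.mozilla.org"


def canonical_url(url: str) -> str:
    """Return the https form of an Etherpad URL, dropping any explicit port."""
    parts = urlsplit(url)
    if parts.hostname != CANONICAL_HOST:
        # Not an Etherpad link, so leave it untouched.
        return url
    # Force https and discard the port (for example the old :9000),
    # keeping the path, query string and fragment as they were.
    path = parts.path or "/"
    return urlunsplit(("https", CANONICAL_HOST, path, parts.query, parts.fragment))
```

<p>For example, <code>canonical_url("http://etherpad.mozilla.org:9000/mypad")</code> gives <code>https://etherpad.mozilla.org/mypad</code>, where <code>mypad</code> is a hypothetical pad name.</p>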
<p>Within a couple hours of migrating to this (and on a Friday at 5pm), and despite a bug in the confirmation email preventing it from &#8220;just working&#8221;, we had 8 different team sites created for various groups&#8230; from apps to UX, jetpack to infra. I suspect we&#8217;ll see some cross-functional and community team sites eventually as well.</p>
<p>Sadly, there are some bugs still to be worked out, especially in the area of SSL certificates. I&#8217;ve created a wiki page, mostly dealing with features and bugs associated with the upgrade: <a title="https://wiki.mozilla.org/Etherpad" href="https://wiki.mozilla.org/Etherpad" target="_blank">https://wiki.mozilla.org/Etherpad</a>. Feel free to add to it!</p>
<p>On a side note: there&#8217;s been talk recently about Etherpad Lite, and it&#8217;s definitely something we&#8217;re considering. We didn&#8217;t go with it this time because 1) most of this work was already done by the time we knew about it (this has been in the works a long time), and 2) Etherpad Lite lacks some of the functionality we&#8217;re getting here&#8230; specifically the Team Sites. It&#8217;s on their TODO list though, so I wouldn&#8217;t be surprised if we&#8217;re on Lite in the future.</p>
<p>Let us know how the new system works for you! We&#8217;d love to get some feedback on it.</p>
<p>- Jake</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2011/10/07/new-etherpad/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Working with IT: Bug submissions</title>
		<link>http://blog.mozilla.com/it/2011/09/26/working-with-it-bug-submissions/</link>
		<comments>http://blog.mozilla.com/it/2011/09/26/working-with-it-bug-submissions/#comments</comments>
		<pubDate>Mon, 26 Sep 2011 19:17:38 +0000</pubDate>
		<dc:creator>cshields</dc:creator>
				<category><![CDATA[General Updates]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/it/?p=1320</guid>
		<description><![CDATA[During a recent Mozilla all-hands event Laura Thomson held a short presentation, titled &#8220;Working with IT&#8221;.  Laura was the right person to give it and the feedback that we have gathered is that we need to help people understand how to work with IT, and help you all understand how our infrastructure works.  Expect more&#8230; <a class="more-link" href="http://blog.mozilla.com/it/2011/09/26/working-with-it-bug-submissions/" title="Read the rest of &#8220;Working with IT: Bug submissions&#8221;">Read more</a>]]></description>
			<content:encoded><![CDATA[<p>During a recent Mozilla all-hands event, Laura Thomson gave a short presentation titled &#8220;Working with IT&#8221;.  Laura was the right person to give it, and the feedback we have gathered is that we need to help people understand both how to work with IT and how our infrastructure works.  Expect more brownbags and posts around this topic.</p>
<p><strong>So, let&#8217;s start by talking about bugs.</strong></p>
<p>Whether or not Bugzilla is the right tool to track and manage IT projects and requests is up for debate. The benefit of using Bugzilla is that it integrates with the rest of the project, since it is used for everything else at Mozilla.  That said, let&#8217;s talk about how IT works with bugs and how you can help us when you file bugs:</p>
<p><strong>Please do not assume tribal knowledge in bugs.</strong></p>
<p>In the past 30 days, the IT Systems and Ops teams have grown by 5 new sysadmins.  This is great, and they are all ramping up quickly. While we are throwing out numbers, we have seen 119 new bugs added to the Server Operations component in the past 7 days.  We want our new guys to help out on these bugs as quickly as they can.  When submitting a bug, please assume as little tribal knowledge as possible on the other side.  For instance, asking for a setting change on a production site without telling us which site you work on delays the bug while someone either asks you for clarification or has to ask the team what you mean.  These are minor delays of course, but when they happen multiple times a day, they add up to a lot of wasted time.  If you have a doc that gives background on the request you are making, please link to it.  If you know which system you are asking for a change on, please make note of it.</p>
<p><strong>Where does my bug go?</strong></p>
<p>The IT team is growing quickly, as is the need to sort our bugs into components lest we spin our wheels all working from one component.  Here is the layout of our components for bugs coming from you as it stands today (note the change in Web Operations):</p>
<ul>
<li><em>Server Operations: Web Operations</em> &#8211; this is where all web-related bugs should go.  This is new, and is modified from the old &#8220;web content push&#8221; component to encompass web server problems, new web projects, and any general request regarding the serving of our websites.</li>
<li><em>Server Operations: Desktop Issues</em> &#8211; this is where the desktop team currently works. Laptop issues, software license requests, and help with the office environment should all go here.</li>
<li><em>Server Operations: RelEng</em> &#8211; Any issues regarding the release engineering build systems (aka &#8220;the build network&#8221;) should go here.</li>
<li><em>Server Operations: Netops</em> &#8211; Network requests and issues should be filed here.</li>
<li><em>Server Operations: Labs</em> &#8211; Mozilla Labs IT requests go in here.</li>
<li><em>Server Operations: ACL Request</em> &#8211; Firewall requests for Netops.</li>
<li><em>Server Operations</em> &#8211; Everything else that does not fall into one of the above.</li>
</ul>
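<p>If it helps to see the layout above as a lookup, here is a rough sketch. The component names come straight from the list; the short topic keys (<code>web</code>, <code>desktop</code> and so on) and the function name are invented for this example and are not anything Bugzilla itself understands.</p>

```python
# Component names are taken from the list above.
# The topic keys are hypothetical shorthand chosen for this sketch.
COMPONENTS = {
    "web": "Server Operations: Web Operations",
    "desktop": "Server Operations: Desktop Issues",
    "releng": "Server Operations: RelEng",
    "network": "Server Operations: Netops",
    "labs": "Server Operations: Labs",
    "firewall": "Server Operations: ACL Request",
}

# Catch-all for everything that does not match one of the above.
DEFAULT_COMPONENT = "Server Operations"


def component_for(topic: str) -> str:
    """Pick the component for a topic, falling back to the catch-all."""
    return COMPONENTS.get(topic.strip().lower(), DEFAULT_COMPONENT)
```

<p>So a firewall change maps to <code>component_for("firewall")</code>, and anything unrecognized lands in plain Server Operations.</p>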
<p><strong>Priority and escalation</strong></p>
<p>The default priority for our bugs is &#8220;normal&#8221;.  We will get to these as soon as we can, and by the nature of your request we assume that you want them done as soon as possible.  If your request does not fall under that assumption and belongs in the &#8220;nice to have someday&#8221; category, mark it as an enhancement. Anything higher than normal demands attention soon.  Our SLA for addressing bugs higher than normal is as follows:</p>
<ul>
<li>Major &#8211; 24 hours</li>
<li>Critical &#8211; 8 hours</li>
<li>Blocker &#8211; immediately</li>
</ul>
<p>These timers run around the clock, and if a bug sits unaddressed beyond those times, our oncall is paged. Blocker IT bugs will page oncall immediately.  We cannot guarantee that the request will be resolved within this time (e.g., if you file a critical bug for a new cluster of servers, it will take us time to procure them first), but we will have admins aware of it and working on it.  In addition, we have our own internal prioritization of issues that come in.  If a critical bug in a dev site comes in, it may have to wait for work that we are doing on a production site.</p>
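<p>As a rough illustration of the escalation rule, the sketch below checks whether an unaddressed bug has outlived its SLA window. The three windows are the ones listed above; the function name, the <code>addressed</code> flag, and the idea of passing timestamps in by hand are assumptions made for the example, not a description of our actual paging setup.</p>

```python
from datetime import datetime, timedelta

# SLA windows from the list above. Blocker pages immediately, so its window is zero.
SLA = {
    "blocker": timedelta(0),
    "critical": timedelta(hours=8),
    "major": timedelta(hours=24),
}


def should_page_oncall(severity: str, filed_at: datetime, now: datetime,
                       addressed: bool = False) -> bool:
    """Return True if an unaddressed bug has reached the end of its SLA window."""
    window = SLA.get(severity.lower())
    if window is None or addressed:
        # Normal bugs and enhancements have no timer, and a bug that
        # someone has already picked up does not page.
        return False
    # The timers run around the clock, so plain elapsed time is enough.
    return now - filed_at >= window
```

<p>With this, a critical bug filed nine hours ago and still untouched would page oncall, while a normal bug of any age would not.</p>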
<p><strong>That was a lot to read&#8230;</strong></p>
<p>And if you are still with me, thanks for taking the time to understand how we work in Bugzilla. By getting bugs filed more efficiently we can spend less of our time refining the bugs and more time fixing them.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/it/2011/09/26/working-with-it-bug-submissions/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

