<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog of Data &#187; Firefox</title>
	<atom:link href="http://blog.mozilla.com/data/tag/firefox/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/data</link>
	<description>Mozilla metrics team technical articles</description>
	<lastBuildDate>Thu, 01 Sep 2011 21:30:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Collecting and analyzing log data via Flume and Hive</title>
		<link>http://blog.mozilla.com/data/2010/08/15/collecting-and-analyzing-log-data-via-flume-and-hive/</link>
		<comments>http://blog.mozilla.com/data/2010/08/15/collecting-and-analyzing-log-data-via-flume-and-hive/#comments</comments>
		<pubDate>Sun, 15 Aug 2010 23:30:26 +0000</pubDate>
		<dc:creator>aphadke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Firefox]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hive]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=207</guid>
		<description><![CDATA[Exponential growth, one of the few problems every organization loves, is usually alleviated by scaling out using clustered computing (Hadoop), CDN, EC2 and myriad of other solutions. While a lot of cycles are spent in making sure each scaled out machine contains requisite libraries, latest code deployments, matching configs, and the whole nine yards, very [...]]]></description>
			<content:encoded><![CDATA[<p>Exponential growth, one of the few problems every organization loves, is usually alleviated by scaling out using clustered computing (Hadoop), CDNs, EC2 and a myriad of other solutions. While a lot of cycles are spent making sure each scaled-out machine contains the requisite libraries, latest code deployments, matching configs, and the whole nine yards, very little time is spent collecting the log files and data from these machines and analyzing them.</p>
<p>A few reasons why log collection usually ends up at the tail of the priority list:</p>
<ol>
<li>Nagios alerts usually do a good job of monitoring for critical situations. The scripts make sure the app&#8217;s always online by grep&#8217;ing for &#8220;ERROR&#8221;, &#8220;WARN&#8221; and other magic terms in the logs, but what about errors that occur often yet don&#8217;t bring down the app completely?</li>
<li>Web analytics give us all the information we need. Yes, on a macroscopic level, but it&#8217;s really hard for analytics software to provide fine granularity, such as how many hits we received from a given country for a given page in a given time period.</li>
<li>Ganglia graphs help us find out which machines are under heavy load. Absolutely, but figuring out what triggered the load in the first place is not always easy.</li>
</ol>
<p>Chukwa, Scribe and Flume are headed in the right direction, but the final piece of the puzzle, analyzing the data, remained unsolved until a few weeks back, when we at Mozilla started integrating Flume with Hive.</p>
<div id="attachment_209" class="wp-caption aligncenter" style="width: 298px"><img class="size-full wp-image-209" title="Merge everything" src="http://blog.mozilla.com/data/files/2010/08/288px-MUTCD_W4-3.svg_.png" alt="" width="288" height="288" /><p class="wp-caption-text">Merge everything - Image courtesy Wikipedia.org</p></div>
<p><a title="Flume" href="http://archive.cloudera.com/cdh/3/flume/UserGuide.html">Flume</a> is open-source distributed log collection software that can be installed on multiple machines to monitor log files, with the data slurped to a single HDFS location. The out-of-the-box solution only solved the collection part of our problem; we still needed a way to query the data and thereby make intelligent decisions based on the results.</p>
<p>The team&#8217;s first foray was to add a <a href="https://issues.cloudera.org/browse/FLUME-29">gzip patch</a> that compressed the log data before transferring the files to HDFS. Once the data was transferred, we needed a way to query it. Our current production Hadoop cluster consists of a modest 20 machines and has excellent monitoring in the form of Nagios and Ganglia, but the question of what we might be missing always hung over our heads. We drew up a list of basic things to take care of while integrating <a href="https://issues.cloudera.org/browse/FLUME-77">Flume with Hive</a>:</p>
<ol>
<li>How do we handle fail-overs when the Hive metastore service, a possible single point of failure for Hive, goes down?</li>
<li>How do we query data by day and hour?</li>
<li>Can separate tables be used for different log locations?</li>
<li>Can we split a single log line into its respective columns?</li>
</ol>
<p><strong>1. Handling fail-overs:</strong> We are currently running the Hive metastore in remote mode using MySQL. More information on metastore setup can be found at http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin. Flume node-agents reliably handle fail-overs by maintaining checksums at regular intervals and making sure data isn&#8217;t inserted twice. We extended the same principle by adding marker points, i.e. a file containing the HQL query and the location of the data is written to HDFS after every successful Flume roll-over. Before writing any log data to HDFS, Flume agents look at a common location for pending Hive writes, attempt to move the data into Hive, and delete the marker file only if the move succeeds. When two or more Flume agents attempt to move the same files to a Hive partition, one of them will encounter an innocuous HDFS &#8220;file not found&#8221; error and proceed as usual.</p>
<p><strong>2. Appending to sub-partitions:</strong> Flume supports rollover, where data is written to disk every &#8216;x&#8217; milliseconds. This is particularly useful because data becomes available inside HDFS at regular intervals and can be queried at hour or minute granularity. While all the data could be written to a single partition, partitioning data inside Hive is a huge performance benefit, as a query only sifts through a specific range rather than the whole data set. This was achieved by having two partitions for a table: by date and by hour. The equivalent Hive query, as assembled in the Java code, looks something like:</p>
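<p>The marker-file idea can be sketched roughly as follows. This is only an illustration, not the patch itself: it uses a local directory as a stand-in for the common HDFS location, and the JSON marker layout, the <em>.marker</em> suffix and the function names <em>write_marker</em>, <em>process_pending</em> and <em>run_hql</em> are all hypothetical.</p>
<pre>
```python
import json
import os
import tempfile


def write_marker(marker_dir, hql, data_path):
    """Record a pending Hive load as a small marker file (hypothetical JSON layout)."""
    fd, path = tempfile.mkstemp(suffix=".marker", dir=marker_dir)
    with os.fdopen(fd, "w") as handle:
        json.dump({"hql": hql, "data_path": data_path}, handle)
    return path


def process_pending(marker_dir, run_hql):
    """Replay every pending marker; delete a marker only after its load succeeds."""
    completed = 0
    for name in sorted(os.listdir(marker_dir)):
        if not name.endswith(".marker"):
            continue
        path = os.path.join(marker_dir, name)
        try:
            with open(path) as handle:
                marker = json.load(handle)
        except FileNotFoundError:
            # Another agent finished this marker first: harmless, move on.
            continue
        try:
            run_hql(marker["hql"])
        except Exception:
            # Hive (or its metastore) is unavailable: keep the marker for a retry.
            continue
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # A concurrent agent already cleaned it up.
        completed += 1
    return completed
```
</pre>
<p>Because a marker is removed only after its query succeeds, a metastore outage leaves the marker in place and the next call simply retries it.</p>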
<p style="text-align: center;"><em>"LOAD DATA INPATH '" + dstPath + "' INTO TABLE " + hiveTableName + " PARTITION (ds='" + dateFormatDay.format(cal.getTime()) +<br />
"', ts='" + dateFormatHourMinute.format(cal.getTime()) + "')"</em></p>
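<p>To make the string concatenation above easier to read, here is a small sketch that builds the same kind of statement. The partition value formats (year-month-day for <em>ds</em>, hour-minute for <em>ts</em>), the sample path and the table name <em>hadoop_logs</em> are assumptions chosen for illustration; the post does not state the actual formats.</p>
<pre>
```python
from datetime import datetime


def build_load_statement(dst_path, hive_table, when):
    """Build the HQL that moves one rolled-over file into its day/hour partition."""
    day = when.strftime("%Y-%m-%d")  # assumed format for the ds partition
    slot = when.strftime("%H-%M")    # assumed format for the ts partition
    return (
        "LOAD DATA INPATH '{path}' INTO TABLE {table} "
        "PARTITION (ds='{ds}', ts='{ts}')"
    ).format(path=dst_path, table=hive_table, ds=day, ts=slot)


statement = build_load_statement(
    "/flume/logs/part-0001.gz", "hadoop_logs", datetime(2010, 8, 15, 12, 30)
)
print(statement)
```
</pre>
<p>A query that filters on <em>ds</em> and <em>ts</em> then only has to read the matching partition directories instead of the whole table.</p>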
<p style="text-align: left;"><strong>3. Using separate tables for different log locations:</strong> We wanted to use separate tables for Hadoop and HBase log locations. Our initial approach was to add a config setting in flume-site.xml, but halfway down that road we realized that the config file is the wrong place: the setting would need to exist on each node-agent, and mapping different folders to tables would be a logistical nightmare.</p>
<p style="text-align: left;">A new sink named <em>hiveCollectorSink(hdfs_path, prefix, table_name)</em> was added to the existing family (<a href="http://archive.cloudera.com/cdh/3/flume/UserGuide.html#_output_bucketing">http://archive.cloudera.com/cdh/3/flume/UserGuide.html#_output_bucketing</a>). This allowed us to add Hive tables on the fly for each log folder location, thereby giving a separate placeholder for Hadoop and HBase logs.</p>
<p><strong>4. Splitting a single log line into respective columns (a.k.a. regex):</strong> The Log4J format is a standard log file convention used by quite a few applications, including Hadoop and HBase. A sample line looks something like this:</p>
<p style="text-align: center;"><em>2010-08-15 12:36:59,850 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-1857716372571578738_336272</em></p>
<p style="text-align: left;">Given the above structure, we decided to split the line into five columns:<br />
date, time, message type, class name and message. The table definition is given below:</p>
<p style="text-align: left;">CREATE TABLE cluster_logs (<br />
line_date STRING,<br />
line_time STRING,<br />
message_type STRING,<br />
classname STRING,<br />
message STRING<br />
)<br />
PARTITIONED BY (ds STRING, ts STRING, hn STRING)</p>
<p>ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'<br />
WITH SERDEPROPERTIES (<br />
"input.regex" =<br />
"^(?&gt;(\\d{4}(?&gt;-\\d{2}){2})\\s((?&gt;\\d{2}[:,]){3}\\d{3})\\s([A-Z]+)\\s([^:]+):\\s)?(.*)"<br />
)<br />
STORED AS TEXTFILE;</p>
<p><strong>NOTE: </strong>The &#8220;hn&#8221; (hostname) partition was added so we could query the data by individual hostname, letting us see which host produces the biggest chunk of ERROR and WARN messages.</p>
<p style="text-align: left;">The above framework has allowed us to reliably collect logs from our entire cluster in a single location and then query the data through a SQL-like interface.</p>
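<p>To see how the regular expression maps a line onto the five columns, the pattern can be tried outside Hive. The sketch below is an approximation: it rewrites the atomic groups as ordinary non-capturing groups so that it runs on older Python versions, and the helper name <em>split_log_line</em> is made up for this example.</p>
<pre>
```python
import re

# Same pattern as input.regex, with the atomic groups written as plain
# non-capturing groups (?:...) so it also runs on Python versions before 3.11.
LOG_LINE = re.compile(
    r"^(?:(\d{4}(?:-\d{2}){2})\s((?:\d{2}[:,]){3}\d{3})\s([A-Z]+)\s([^:]+):\s)?(.*)"
)
COLUMNS = ("line_date", "line_time", "message_type", "classname", "message")


def split_log_line(line):
    """Return a dict mapping each table column to the matching piece of the line."""
    match = LOG_LINE.match(line)
    return dict(zip(COLUMNS, match.groups()))


sample = (
    "2010-08-15 12:36:59,850 INFO "
    "org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: "
    "Verification succeeded for blk_-1857716372571578738_336272"
)
row = split_log_line(sample)
# A stack-trace continuation has no date/level prefix, so only "message" is set.
continuation = split_log_line("\tat java.lang.Thread.run(Thread.java:619)")
```
</pre>
<p>Because the whole prefix is optional, lines that do not start with a date (such as stack-trace continuations) still match, with everything landing in the message column.</p>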
<p style="text-align: left;"><strong>Future Steps:</strong></p>
<ul>
<li>The Flume + Hive patch is still a work in progress and will be committed to the trunk in a couple of weeks.</li>
<li><a href="http://code.google.com/p/socorro/">Socorro</a> (Mozilla&#8217;s crash reporting system) <a href="http://www.laurathomson.com/2010/08/the-future-of-crash-reporting/">1.9 and above</a> will be using processors in a distributed mode, and we plan to insert the processors&#8217; log data into Hive, helping us better understand throughput, the average time to process each crash report and other metrics. Watch this space for more related posts.</li>
</ul>
<p>The developers of Flume + Hive usually hang out on IRC (irc.freenode.net) in the following channels: #flume, #hive, #hbase<br />
Feel free to post questions, thoughts or suggestions and I will reply to them below.</p>
<p>-Anurag Phadke (email: first letter of my firstname followed by last name at &#8211; the &#8211; rate  mozilla dot com)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2010/08/15/collecting-and-analyzing-log-data-via-flume-and-hive/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Firefox Downloads on Release Day</title>
		<link>http://blog.mozilla.com/data/2010/02/18/firefoxdownloadsonreleaseday/</link>
		<comments>http://blog.mozilla.com/data/2010/02/18/firefoxdownloadsonreleaseday/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 06:54:38 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Firefox]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=160</guid>
		<description><![CDATA[Update: Ken Kovash studied the Firefox 3.6 downloads and found a wonderful reason for them! See here for more details. I was asked this evening if the nightly report was correct in showing that we had a 128% increase in Firefox downloads today.  The answer is a resounding yes that figure is correct, but I [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Update: </strong>Ken Kovash studied the Firefox 3.6 downloads and found a wonderful reason for them! <a href="http://blog.mozilla.com/metrics/2010/02/19/why-do-firefox-downloads-spike-on-release-days/">See here</a> for more details.</p>
<p>I was asked this evening if the nightly report was correct in showing that we had a 128% increase in Firefox downloads today. The answer is a resounding yes, that figure is correct, but I figured it wouldn&#8217;t hurt to add a bit more detail.</p>
<p style="text-align: center;">
<div id="attachment_159" class="wp-caption aligncenter" style="width: 501px"><a href="http://blog.mozilla.com/data/files/2010/02/FirefoxDownloadsOnReleaseDay.jpg"><img class="size-large wp-image-159 " title="Firefox Downloads On Release Day" src="http://blog.mozilla.com/data/files/2010/02/FirefoxDownloadsOnReleaseDay-1024x677.jpg" alt="Histograms demonstrating the sharp flood of upgrade patch downloads on release day" width="491" height="325" /></a><p class="wp-caption-text">Histograms demonstrating the sharp flood of upgrade patch downloads on release day</p></div>
<p>Wednesday, February 17th, 2010 was a release day for both Firefox 3.0.18 and 3.5.8.  Starting around noon Pacific time yesterday, hundreds of millions of Firefox users would eventually see a prompt notifying them that there was an upgrade available to install.</p>
<p>If the user was running the latest security release for their version of Firefox, the upgrade would consist of a small patch file that would quickly bring them up to the new security release.</p>
<p>If the user was not on the latest security release, they would instead download a special file that was the full size of the Firefox application.  The upgrade process would then automatically install the upgrade.</p>
<p>The question at hand is, &#8220;why do we see an increase in people visiting the Mozilla website and downloading a new installer?&#8221;  In the past, we&#8217;ve had concerns about whether people were having trouble with the automatic update system and were being forced to download the application and install it manually.  From the data I reviewed today, I think it is safe to say that is not a common scenario.  Instead, I&#8217;m happy to report that what I see is a lot of people who get a reminder that they should upgrade Firefox and they decide that it is about time for them to go ahead and download the latest and greatest Firefox 3.6 instead.</p>
<p>The chart above (generated courtesy of Tableau Software) shows that manual downloads of the 3.0 and 3.5 versions remained relatively flat, but manual downloads of Firefox 3.6 climbed by almost 3x over the previous day&#8217;s peak traffic time.<br />
I would also like to point out that an infrastructure that can handle an increase of over 5 million requests per hour within a three-hour window isn&#8217;t too shabby.</p>
<p>While my Pentaho Data Integration ETL processes do slow down considerably when processing this huge influx of data, they keep up, managing to process these 10 or so million requests per hour in 30 to 40 minutes. It could actually be much quicker if I moved this processing to a separate server: the primary reason it is slow is that once it runs past its normal 5-minute run time, it has to compete with other ETL processes that are scheduled to run later in the hour.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2010/02/18/firefoxdownloadsonreleaseday/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Tracking down the number of Firefox Addon users with hadoop</title>
		<link>http://blog.mozilla.com/data/2009/08/10/tracking-down-the-number-of-firefox-addon-users-with-hadoop/</link>
		<comments>http://blog.mozilla.com/data/2009/08/10/tracking-down-the-number-of-firefox-addon-users-with-hadoop/#comments</comments>
		<pubDate>Mon, 10 Aug 2009 09:05:25 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AMO]]></category>
		<category><![CDATA[Firefox]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=108</guid>
		<description><![CDATA[I was presented with the challenge of answering the question – how many Firefox users have one add-on or more installed on their Firefox. Currently, addons.mozilla.org (AMO) has statistics on the download counts of add-ons but the actual usage of add-ons has been unanswered. The Add-ons manager inside of Firefox will check each add-on for [...]]]></description>
			<content:encoded><![CDATA[<div align="center"><img src="http://blog.mozilla.com/data/files/2009/08/hadoop-logo.jpg" alt="hadoop logo" /><img src="http://blog.mozilla.com/data/files/2009/08/firefox-64.png" alt="firefox logo" /></div>
<p>I was presented with the challenge of answering the question: how many Firefox users have one or more add-ons installed? Currently, addons.mozilla.org (<b>AMO</b>) has statistics on the download counts of add-ons, but the question of actual add-on usage has gone unanswered.</p>
<p>The Add-ons Manager inside Firefox checks each add-on for updates at AMO. This happens once every 24 hours when the user runs Firefox. Updates are handled over HTTP at either addons.mozilla.org for Firefox 1.0/1.5/2.0 or versioncheck.addons.mozilla.org for Firefox 3.0/3.5. The Add-ons Manager pings the servers with information about each add-on, and if an update exists the server responds with it. Since the update ping is handled over HTTP, it is recorded in a log file. If you have never seen a web server&#8217;s HTTP log files, they are simply flat text files where each line contains information about one request made to the server. Below is an example line from the AMO log file, with the fields labeled.</p>
<pre>
IP                  HOSTNAME                             TIMESTAMP          REQUEST
 255.255.255.255 versioncheck.addons.mozilla.org - [22/Jun/2009:02:00:00 -0700] "GET
/update/VersionCheck.php?reqVersion=1&#038;id={B13721C7-F507-4982-B2E5-502A71474FED}&#038;
version=2.2.0.102&#038;maxAppVersion=3.*&#038;status=userEnabled&#038;
appID={ec8030f7-c20a-464f-9b0e-13a3a9e97384}&#038;appVersion=3.0.11&#038;appOS=WINNT&#038;
appABI=x86-msvc&#038;locale=en-US HTTP/1.1" 200 520 "-"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11)
Gecko/2009060215 Firefox/3.0.11(.NET CLR 3.5.30729)"
</pre>
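<p>As a rough illustration of how the fields of such a line can be pulled apart, the sketch below parses a shortened version of the sample request. It is not the code used in the job; the regular expression, the function name <em>parse_ping</em> and the returned dictionary keys are assumptions made for this example.</p>
<pre>
```python
import re
from datetime import datetime
from urllib.parse import parse_qs, urlencode, urlsplit

# IP, hostname, a literal "-", the bracketed timestamp, then the quoted request.
PING = re.compile(r'^\s*(\S+) (\S+) - \[([^\]]+)\] "GET (\S+) HTTP/[\d.]+"')


def parse_ping(line):
    """Return the fields of a VersionCheck.php request, or None for other lines."""
    match = PING.match(line)
    if match is None:
        return None
    ip, _host, stamp, target = match.groups()
    url = urlsplit(target)
    if not url.path.endswith("/VersionCheck.php"):
        return None
    params = parse_qs(url.query)
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return {
        "ip": ip,
        "epoch": int(when.timestamp()),  # seconds since 1970, easy to compare
        "guid": params.get("id", [None])[0],
        "app_id": params.get("appID", [None])[0],
    }


# Rebuild a shortened version of the sample request from its parameters.
query = urlencode(
    {
        "reqVersion": "1",
        "id": "{B13721C7-F507-4982-B2E5-502A71474FED}",
        "appID": "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
        "appVersion": "3.0.11",
    },
    safe="{}",
)
line = (
    "255.255.255.255 versioncheck.addons.mozilla.org - "
    '[22/Jun/2009:02:00:00 -0700] "GET /update/VersionCheck.php?'
    + query
    + ' HTTP/1.1" 200 520'
)
print(parse_ping(line))
```
</pre>
<p>Requests for ordinary AMO pages share the same log files, which is why the sketch returns None for anything that is not a VersionCheck.php request.</p>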
<p>We chose the log files from 2009/06/22 because Firefox pings AMO multiple times after a Firefox version update, and this date was 11 days after the Firefox 3.0.11 release and 3 days after the Firefox 3.5 RC2 release, so most users should already have been up to date by then. The whole day&#8217;s worth of log data for both hostnames totaled around <b>28GB compressed</b>. The log files were large because they also contained requests for AMO&#8217;s website. </p>
<p>There is no unique identifier that tells us which update pings came from which user, so we had to identify a single user&#8217;s pings by the IP address and timestamp of the update ping. The IP address is the most distinguishing field, but because of routers (NAT) and proxies, many computers can sit behind one IP address. To add another degree of separation, we decided to group the update pings by timestamp. A browser&#8217;s update pings happen within a few seconds of each other, so pings from an IP address within a certain time window are considered one user, and pings from the same IP address outside that window are considered a different user. For example, say there are two Firefox users behind a router. User 1 might open his browser in the morning and ping AMO at 10AM, and User 2 might open his browser in the afternoon and ping AMO at 2PM. Even though the pings come from the same IP address, the pings at 10AM are grouped together and counted separately from the pings at 2PM, which form their own group.</p>
<h2>Setup/Config</h2>
<p>Upon hearing the description of the problem, I thought it would be an ideal candidate for a MapReduce job: I would be able to map the IP address as the key and all the other data in the log file entry as a HashMap for the value. I talked with my manager, found out that this was a technology the metrics team was interested in exploring, and was given 4 Mac minis to test my implementation on. I quickly began setting up my Hadoop cluster on Ubuntu 9.04 Desktop, which soon became Ubuntu 9.04 Server to conserve memory and drop the unnecessary GUI (Ubuntu 9.04 Desktop takes up about 256MB of RAM on a clean install, while Server takes up about 90MB). I had been a user of Hadoop in the past but had never set up my own Hadoop cluster before, so I turned to the <a href="http://hadoop.apache.org/common/docs/current/cluster_setup.html">hadoop website</a> and this <a href="http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29">blog post</a> to help me out. At the time, these tutorials did not exactly cover the Hadoop version I was using, 0.20 (I believe they are up to date now), but I was able to work around that. The two main differences between Hadoop 0.19 and 0.20 are the configuration files and parts of the Hadoop Java API. In version 0.20, &lt;hadoop root&gt;/conf/hadoop-site.xml was split into three files, &lt;hadoop root&gt;/conf/core-site.xml, &lt;hadoop root&gt;/conf/hdfs-site.xml, and &lt;hadoop root&gt;/conf/mapred-site.xml, and the API was refactored slightly, with some classes deprecated.</p>
<p>My Mac minis were given the hostnames hadoop-node1 through hadoop-node4. Hadoop-node1 was my master node, running the NameNode and the JobTracker, while hadoop-node2 through hadoop-node4 were my slaves, each running a TaskTracker and a DataNode.</p>
<div align="center">
<img src="http://blog.mozilla.com/data/files/2009/08/1.jpg" alt="mac mini hadoop cluster" />
</div>
<pre>
 # &lt;hadoop root&gt;/conf/masters
 hadoop-node1.mv.mozilla.com

 # &lt;hadoop root&gt;/conf/slaves
 hadoop-node2.mv.mozilla.com
 hadoop-node3.mv.mozilla.com
 hadoop-node4.mv.mozilla.com

 # The java processes
 hadoop@hadoop-node1:/usr/local/hadoop/conf$ jps
 15778 Jps
 30059 NameNode
 30187 SecondaryNameNode
 30291 JobTracker

 hadoop@hadoop-node2:/usr/local/hadoop$ jps
 16950 TaskTracker
 16838 DataNode
 20186 Jps
</pre>
<p>After getting the cluster set up, I tested it with the Hadoop wordcount example and validated the results.</p>
<h2>MapReduce</h2>
<p>I then began writing a MapReduce job with the Hadoop Java API. My first thought was to write my own RecordReader, which is responsible for reading from an input split and handing key/value pairs to the mapper. I decided instead to go with the default LineRecordReader, which uses the file offset as the key and the line as the value, because it seemed easier and more natural to dissect the log file line inside the Mapper&#8217;s map function.</p>
<h3>Map</h3>
<p>In the map function, each line went through a regexp that broke out each piece of the log file line. If the line contained Firefox&#8217;s appID and VersionCheck.php, I would map the IP as the key and construct an AddonsWritable (a child of MapWritable with an overridden toString() for output purposes) that contained the epoch time (converted from the date timestamp because it is much easier to compare), a MapWritable of add-on guids, and a count of the number of add-ons.</p>
<pre>
public static class IPAddressMapper extends Mapper&lt;LongWritable, Text, Text, AddonsWritable&gt;{
     /* member vars for mapper which include vars for regexp and storing data */
     private AddonsWritable logInfo = new AddonsWritable();

     public void map(LongWritable key, Text logLine, Context context) throws IOException, InterruptedException {
         if(logLine.matchesRegexp() &#038;&#038; isFirefox() &#038;&#038; hasVersionCheckphp()) {
             logInfo.put(EPOCH, epoch); // store the epoch_ts
             logInfo.put(GUID, guid); // store the guid
             logInfo.put(TOTAL, ONE); // store the count
             context.write(ipAddress, logInfo); // map out the ipAddress as Key and logInfo as value
         }
     }
 }
</pre>
<h3>Reduce</h3>
<p>After the map phase, the Hadoop framework hashes the keys and hands them to the reduce function. Inside the Reducer&#8217;s reduce function you are given the key, which is the IP address, and an Iterable of the AddonsWritables that share that key/IP. I needed to group together the values whose update pings fell in a certain time window, and unfortunately the Iterable does not guarantee order. So I put the AddonsWritables in a PriorityQueue with a custom comparator that ordered values by the timestamp field in the AddonsWritable. Some IPs had thousands of pings, so if an IP had more than 2,000 pings I threw it out. Once all the values were in the PriorityQueue, I iterated over it, popping off each value and comparing its timestamp to the previously seen one. If abs(current timestamp - prev timestamp) &lt;= 10 secs, I considered the pings to be from the same user and added them to a MapWritable of guids inside the MapWritable that contained all the other information. Once I saw a timestamp where abs(current timestamp - prev timestamp) &gt; 10 secs, I wrote/collected the previous values and started a new MapWritable for the next window. This continued until there were no more values in the PriorityQueue, at which point I wrote/collected the final values.</p>
<pre>
public static class IPAddressEpochTimeReducer extends Reducer&lt;Text,AddonsWritable,Text,AddonsWritable&gt; {
     private PriorityQueue&lt;AddonsWritable&gt; pq = new PriorityQueue&lt;AddonsWritable&gt;();
     public void reduce(Text key, Iterable&lt;AddonsWritable&gt; values, Context context) throws IOException, InterruptedException {
         for(AddonsWritable val: values) {
             pq.add(new AddonsWritable(val));
             if(pq.size() > 2000) {
                 pq.clear();
                 return ;
             }
         }

         while(!pq.isEmpty()) {
             AddonsWritable val = pq.remove();
             currentEpoch = val.get(EPOCH);  // timestamp of the ping just popped
             if(lastEpoch != -1 &#038;&#038; Math.abs(lastEpoch - currentEpoch) > SIXTY_SECONDS) {
                 writeOut();  // Write out all the information for the current collection of versioncheck pings
                 resetVars();  // Reset all the currently used vars for the next collection of versioncheck pings
             }
             addGuid(output, val.get(GUID));
             sum += val.get(TOTAL);
             lastEpoch = currentEpoch;
         }
         /* There is one more remaining.  Write it out */
         writeOut();
     }
 }
</pre>
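<p>The windowing logic described above can be tried in isolation with a few lines of Python. This is a simplified stand-in for the Java reducer rather than a port of it: pings are plain (epoch seconds, guid) tuples, a heap plays the role of the PriorityQueue, and the name <em>group_pings</em> and the return shape are assumptions of this sketch.</p>
<pre>
```python
import heapq

MAX_PINGS_PER_IP = 2000  # cutoff described in the text


def group_pings(pings, window_seconds=10):
    """Split one IP's (epoch_seconds, guid) pings into per-user groups.

    Pings are popped in timestamp order; a gap larger than window_seconds
    between consecutive pings starts a new user. Returns a list of
    (first_epoch, sorted_guids) tuples, or [] when the IP is discarded.
    """
    heap = []
    for ping in pings:
        heapq.heappush(heap, ping)
        if len(heap) > MAX_PINGS_PER_IP:
            return []  # too many pings for this IP: drop it entirely
    users = []
    guids = set()
    first_epoch = last_epoch = None
    while heap:
        epoch, guid = heapq.heappop(heap)
        if last_epoch is not None and epoch - last_epoch > window_seconds:
            users.append((first_epoch, sorted(guids)))
            guids = set()
            first_epoch = None
        if first_epoch is None:
            first_epoch = epoch
        guids.add(guid)
        last_epoch = epoch
    if guids:
        users.append((first_epoch, sorted(guids)))
    return users


# Two users behind one router: one pings around 10AM, the other around 2PM.
morning_and_afternoon = [(36000, "a"), (50400, "c"), (36004, "b"), (50401, "a")]
print(group_pings(morning_and_afternoon))
```
</pre>
<p>Note that each ping is compared with the one immediately before it, so a long chain of pings spaced a few seconds apart stays in one group even if the whole chain spans more than the window.</p>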
<h2>Runtime Stats</h2>
<p>Hadoop provides a web interface that displays runtime statistics for a job. Below are the stats for the job described above.</p>
<ul>
<li>Submitted At:  17-Jul-2009 09:55:12</li>
<li>Launched At: 17-Jul-2009 09:55:12 (0sec)</li>
<li>Finished At: 17-Jul-2009 14:05:52 (4hrs, 10mins, 39sec)</li>
<li>Average time taken by Map tasks: 2mins, 45sec</li>
<li>Average time taken by Shuffle: 2hrs, 33mins, 9sec</li>
<li>Average time taken by Reduce tasks: 1hrs, 4mins, 10sec</li>
</ul>
<table border="2" cellpadding="5" cellspacing="2">
<tr>
<td>Kind</td>
<td>Total Tasks(successful+failed+killed)</td>
<td>Successful tasks</td>
<td>Failed tasks</td>
<td>Killed tasks</td>
<td>Start Time</td>
<td>Finish Time</td>
</tr>
<tr>
<td>Setup</td>
<td>
        1</td>
<td>
        1</td>
<td>
        0</td>
<td>
        0</td>
<td>17-Jul-2009 09:55:22</td>
<td>17-Jul-2009 09:55:24 (1sec)</td>
</tr>
<tr>
<td>Map</td>
<td>
        364</td>
<td>
        362</td>
<td>
        0</td>
<td>
        2</td>
<td>17-Jul-2009 09:55:25</td>
<td>17-Jul-2009 12:45:56 (2hrs, 50mins, 31sec)</td>
</tr>
<tr>
<td>Reduce</td>
<td>
        5</td>
<td>
        5</td>
<td>
        0</td>
<td>
        0</td>
<td>17-Jul-2009 10:13:40</td>
<td>17-Jul-2009 14:05:55 (3hrs, 52mins, 15sec)</td>
</tr>
<tr>
<td>Cleanup</td>
<td>
        1</td>
<td>
        1</td>
<td>
        0</td>
<td>
        0</td>
<td>17-Jul-2009 14:05:57</td>
<td>17-Jul-2009 14:06:03 (5sec)</td>
</tr>
</table>
<h2>Output/Results </h2>
<p>The output files contained lines like the one below.</p>
<pre>
 IP              EPOCH_TS   ADDON_COUNT                           LIST_OF_GUIDS
 255.255.255.255 1245665519000	2 {CAFEEFAC-0016-0000-0013-ABCDEFFEDCBA} {CAFEEFAC-0016-0000-0000-ABCDEFFEDCBA}
</pre>
<p>I created a Python script to gather statistics from the output. With the 10 second window, I found a total of:</p>
<ul>
<li>244,727,644 add-on update pings</li>
<li>117,557,228 users</li>
<li>average of 2.14 add-ons per user</li>
<li>variance of 5.68</li>
</ul>
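<p>The kind of post-processing involved can be sketched as below. This is not the original script: it assumes whitespace-separated output lines with the add-on count in the third column, and it computes the population variance, which may differ from the formula actually used.</p>
<pre>
```python
def summarize(lines):
    """Return (users, mean, variance) of the add-on counts in reducer output lines.

    Each line is expected to hold IP, epoch timestamp, add-on count, then guids,
    separated by whitespace. The variance is the population variance.
    """
    counts = [int(line.split()[2]) for line in lines if line.strip()]
    users = len(counts)
    if users == 0:
        return 0, 0.0, 0.0
    mean = sum(counts) / users
    variance = sum((count - mean) ** 2 for count in counts) / users
    return users, mean, variance


# Made-up sample lines in the same layout as the reducer output.
sample_lines = [
    "255.255.255.255 1245665519000\t2 {guid-a} {guid-b}",
    "255.255.255.254 1245665600000 1 {guid-a}",
    "255.255.255.253 1245665700000 3 {guid-a} {guid-b} {guid-c}",
]
users, mean, variance = summarize(sample_lines)
```
</pre>
<p>Each output line stands for one inferred user, so the number of lines is the user count and the third column feeds the average and variance.</p>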
<p>Since our Active Daily User (<b>ADU</b>) count for that day was 98,000,000, a figure of more than 117 million add-on users didn&#8217;t make much sense, so we decided to repeat the process with a 60 second window instead of the previous 10 second window.</p>
<p>With the 60 second window, I found:</p>
<ul>
<li>94,656,833 users </li>
<li>average of 2.63 add-ons per user</li>
<li>variance of 12.34</li>
</ul>
<p>These numbers still seemed fairly large, so I decided to reduce on IP address alone, which gives us a floor, since there is at least one user behind every IP. To reduce by IP, I changed the reducer to ignore the timestamp and simply combine all values sharing the same IP.</p>
<p>The output of this was:</p>
<ul>
<li>32,848,771 IP/USERS</li>
<li>average of 5.04 add-ons per user</li>
<li>variance of 779.91</li>
</ul>
<p>To compare this data with Firefox&#8217;s ADU, I ran a similar MapReduce job on our Firefox ADU data, which counted:</p>
<ul>
<li>61,460,501 IPs</li>
<li>average of 1.60 Users per IP</li>
<li>variance of 86.57</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/08/10/tracking-down-the-number-of-firefox-addon-users-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

