<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog of Data &#187; MapReduce</title>
	<atom:link href="http://blog.mozilla.com/data/tag/mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/data</link>
	<description>Mozilla metrics team technical articles</description>
	<lastBuildDate>Thu, 01 Sep 2011 21:30:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Riak and Cassandra and HBase, oh my!</title>
		<link>http://blog.mozilla.com/data/2010/05/18/riak-and-cassandra-and-hbase-oh-my/</link>
		<comments>http://blog.mozilla.com/data/2010/05/18/riak-and-cassandra-and-hbase-oh-my/#comments</comments>
		<pubDate>Tue, 18 May 2010 13:07:32 +0000</pubDate>
		<dc:creator>deinspanjer</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Riak]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=184</guid>
		<description><![CDATA[We are marching along in our integration of HBase with the Socorro Crash Stats project, but I wanted to take a minute away from that to talk about a separate project the Metrics team has also been involved with. Mozilla Labs Test Pilot is a project to experiment and analyze data from real world Firefox [...]]]></description>
			<content:encoded><![CDATA[<p>We are marching along in our <a href="https://wiki.mozilla.org/Socorro:Overview" target="_blank">integration of HBase</a> with the <a href="http://code.google.com/p/socorro/" target="_blank">Socorro</a> <a href="http://crash-stats.mozilla.com/" target="_blank">Crash Stats</a> project, but I wanted to take a minute away from that to talk about a separate project the Metrics team has also been involved with.</p>
<p><a href="https://testpilot.mozillalabs.com/">Mozilla Labs Test Pilot</a> is a project to experiment and analyze data from real world Firefox users to discover quantifiable ways to improve our user experience.  I was very interested and excited about the project because of the care they take to protect the user&#8217;s privacy.  They have a very user focused <a href="https://testpilot.mozillalabs.com/privacy.html" target="_blank">privacy policy</a> that is easy to read, which always makes me happy.  Every step of the way they make sure the user is aware and comfortable with the data they are sending by making it easy for the user to see their data before they submit it and providing the user the choice to submit it or not when the data is ready. The data is always very general in nature, not containing any sensitive information like URLs and it is not associated with any personally identifying information at any time.</p>
<p>In the pre 1.0 releases of Test Pilot, the data that is submitted from the add-on is received by a simple script that transforms the data into a flat file that is stored on an NFS server.</p>
<p>We are planning on making a huge drive to ramp up the volume of users and the number of experiments, and that means that this simple storage mechanism will not survive.  Here are some of the most important requirements we&#8217;ve hashed out in our planning:</p>
<ul>
<li>Expected minimum users: 1 million.  Design to accommodate 10 million by the end of the year and have a plan for scaling out to tens of millions. (This is the 1x 10x 100x rule of estimation of which I am a fan)</li>
<li>Expected amount of data stored per experiment: 1.2 TB</li>
<li>Expected peak traffic: approximately 75 GB per hour for two 8 hour periods following the conclusion of an experiment window.  This two day period will result in collection of approximately 90% of the total data.</li>
<li>Remain highly available under load</li>
<li>Provide necessary validation and security constraints to prevent bad data from polluting the experiment or damaging the application</li>
<li>Provide a flexible and easy-to-use way for data analysts to explore the data.  While all of these guys are great with statistics and thinking about data, not all of them have a programming background, so higher-level APIs are a plus.</li>
<li>Do it fast.</li>
</ul>
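<p>As a rough sanity check on the traffic figures above, the peak rate can be derived from the storage estimate.  The sketch below assumes decimal units (1 TB = 1000 GB), and the class and method names are made up for illustration.  It works out to 67.5 GB per hour, which is in the same ballpark as the &#8220;approximately 75 GB per hour&#8221; requirement.</p>

```java
// Back-of-the-envelope check of the traffic figures in the requirements list.
// Assumes decimal units (1 TB = 1000 GB); names are illustrative only.
final class PeakTrafficEstimate {
    /** GB per hour needed to absorb a share of an experiment's data within the peak hours. */
    static double peakGbPerHour(double totalTb, double peakShare, int peakHours) {
        return totalTb * 1000.0 * peakShare / peakHours;
    }

    public static void main(String[] args) {
        // 1.2 TB per experiment, 90% of it arriving in two 8-hour periods
        double rate = peakGbPerHour(1.2, 0.90, 2 * 8);
        System.out.printf("Peak ingest rate: %.1f GB/hour%n", rate);
    }
}
```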
<p>I am a technology nut.  I love to research technologies to keep abreast of the state-of-the-art and also potential tools.  While I&#8217;ve always been a SQL aficionado, I am also a big fan of the &#8220;NoSQL&#8221; technologies because I feel there is a great role that they serve.</p>
<p>When I looked at the characteristics of this project, I felt that a key-value or column-store solution was the best fit, so I started digging through my research bookmarks and doing some technology cost/benefit analysis.</p>
<p>Eventually, our team came down to three primary contenders:</p>
<ul>
<li><a href="http://hadoop.apache.org/hbase/" target="_blank">HBase</a></li>
<li><a href="http://cassandra.apache.org/" target="_blank">Cassandra</a></li>
<li><a href="http://riak.basho.com/" target="_blank">Riak</a></li>
</ul>
<p>We recently had a meeting wherein we hashed out a lot of the pros and cons of each of these solutions.  I wanted to share that discussion with everyone, not because I was looking forward to being set-upon by the two contenders that I didn&#8217;t feel were the best fit, but rather for two reasons:</p>
<ol>
<li>Crowd-sourcing &#8212; I believe that laying out the thoughts and assumptions in the open is the best way to ensure that we receive the broadest set of feedback from the experts in each of the varying technologies.  I further believe that it is better to be aware of the over-looked features and warnings raised by these experts and consider what can be done to mitigate them rather than hiding from them.</li>
<li>Sharing of knowledge &#8212; Even if it turns out that we didn&#8217;t get all the answers right or that we didn&#8217;t come up with the ideal solution, I believe that we asked a lot of good questions here and I believe that listing these questions might help some other team who has to make a similar decision.</li>
</ol>
<p>So let&#8217;s get down to the discussion points:</p>
<ul>
<li>Scalability &#8212; Deliver a solution that can handle the expected starting load and that can easily scale out as that load goes up.</li>
<li>Elasticity &#8212; Because the peak traffic periods are relatively short and the non-peak hours are almost idle, it is important to consider ways to ensure the allocated hardware is not sitting idle, and that you aren&#8217;t starved for resources during the peak traffic periods.</li>
<li>Reliability &#8212; Stability and high availability are important.  They aren&#8217;t as critical as they might be in certain other projects, but if we were down for several hours during the peak traffic period, the client layer would need to be able to retain the data and resubmit it at a later date.</li>
<li>Storage &#8212; Need enough room to store active experiments and also recent experiments that are being analyzed.  It is expected that data will become stale over time and can be archived off of the active cluster.</li>
<li>Analysis &#8212; What do we have to put together to provide a friendly system to the analysts?</li>
<li>Cost &#8212; Actual cost of the additional hardware needed to deploy the initial solution and to scale through at least the end of the year.</li>
<li>Manpower &#8212; How much time and effort will it take us to deliver the first critical stage of the project and the subsequent stages?  Also consider ongoing maintenance and ownership of the code.</li>
<li>Security &#8212; Because we will be accepting data from an outside, untrusted source, we need to consider what steps are necessary to ensure the health of the system and the privacy of users.</li>
<li>Extensibility &#8212; delivering a platform that can readily evolve to meet the future needs of the project and hopefully other projects as well.</li>
<li>Disaster Recovery / Migration &#8212; If the original system fails to meet the requirements after going live, what options do we have to recover from that situation?  If we decide to switch to another technology, how do we move the data?</li>
</ul>
<p>Now we iterate those points again, but this time we have the points made by the team regarding each of the three solutions being considered:</p>
<ul>
<li>Elasticity&#8211; Machines can be added as load increases.  Machines can be turned off and reconfigured to remove them. There is the ever-present risk of a bug resulting in a lack of replication or corruption causing data loss.  In all three solutions, re-balancing the existing data incurs an additional load penalty as data is shifted around the cluster.  We need to consider how much time and manual administration is required, how much can be automated, how risky rebalancing is, and how long until we begin to see the benefit of the additional nodes.
<ul>
<li>HBase<br />
In HBase, the data is split into &#8220;regions&#8221;.  The backing data files for regions are stored in HDFS and hence replicated out to multiple nodes in the cluster.  Every RegionServer owns a set of regions.  Normally, the RegionServer will own regions that exist on the local HDFS DataNode.<br />
If you add a new node, HDFS will begin considering that node for the purposes of replication. When a region file is split, HBase will determine which machines should be the owners of the newly split files.  Eventually, the new node will store a reasonable portion of the new and newly split data.<br />
Re-balancing the data involves both re-balancing HDFS and then ensuring that HBase reassesses the ownership of regions.</li>
<li>Cassandra<br />
In Cassandra, nodes claim ranges of data.  By default, when a new machine is added, it will receive half of the largest range of data.  There are configuration options during node start-up to change that behavior.  There are certain configuration requirements to ensure safe and easy balancing, and there is a rebalance command that can perform the work throughout all the data ranges.  There is also a monitoring tool that allows you to track the progress of the re-balancing.</li>
<li>Riak<br />
In Riak, the data is divided into partitions that are distributed among the nodes.  When a node is added, the distribution of partition ownership is changed and both old and new data will immediately begin migrating over to the new node.</li>
</ul>
</li>
<li>Cost &#8212; Regardless of solution, we should be able to use commodity server hardware with Linux OS.
<ul>
<li>HBase &#8212; Because of the heavy peak traffic periods, it is very likely that we would need a dedicated cluster. Otherwise, other projects such as Socorro might be negatively impacted.  Also, a scheduled maintenance window would affect both projects instead of just one.<br />
HBase is memory-hungry.  Our current nodes are dual quad core hyper-threaded boxes with 4TB of disk and 24 GB of memory.  It is unlikely that we would want to go less than that.  We would need at least two highly available master nodes, and by the end of the year we&#8217;d likely need 12 machines for a single cluster solution.</li>
<li>Cassandra &#8212; Much lighter on the memory requirements, especially if you don&#8217;t need to keep a lot of data in cache.  We would likely want to double the amount of CPU on the four nodes currently allocated to the Test Pilot project. We&#8217;d also want to order 8 more machines.  To perform analysis with Cassandra, we&#8217;ll have to leverage our Hadoop cluster.</li>
<li>Riak &#8212; Also much lighter on memory requirements.  The existing four nodes (quad core 8 GB) allocated for the project should be enough to kick it off, and we&#8217;d expect to add at least two more equivalent machines to that cluster.  We&#8217;d also set up a second cluster of 6 to 8 less powerful machines for the analysis cluster.  Because of the elasticity of Riak, we could temporarily re-purpose N-3 of those machines to the write cluster to accommodate expected peak traffic windows.</li>
</ul>
</li>
<li>Manpower
<ul>
<li>HBase &#8212; Need a front-end layer to accept experiment submissions from the client.  The fewer changes required for the client, the better.  Thrift or a roll-our-own Java application are the two most likely options.  The application needs to be heavily tested for capacity and stability.  Likely two weeks for development and two weeks for testing.  Estimate is dependent on the amount of security code, sanity checks, and cluster communication fail-over that has to be implemented.  Additional maintenance burden of supporting a separate service.<br />
Schema design needs to be reflected in the front-end code to allow data to be parsed out and stored in the proper column families.</li>
<li>Cassandra &#8212; Mostly the same as HBase. Thrift or Java application hand developed and tested.  Schema design to accommodate storage by the front-end.</li>
<li>Riak &#8212; Built in REST server.  Already heavily tested and production ready.  Minimal schema design and no specific hooking in of the schema to the REST server should be needed.</li>
</ul>
</li>
<li>Security &#8212; We can&#8217;t expect to hide any sort of handshake protocol or authentication token.  If we wanted to require an authentication token, extensive changes would have to be made to the client add-on which would delay the project.  SSL doesn&#8217;t seem to gain us much because we aren&#8217;t transmitting potentially sensitive data, and it has overhead penalties.  Our firewall and proxy/load-balancer layer is our most important line of defense.  It should reject URL hacks, unusual payload sizes, and potentially be able to blacklist repeated submissions from the same IP.  Ideally, if the payload inspection could communicate IP addresses or payload signatures to blacklist, we&#8217;d be pretty well equipped to prevent degradation of the cluster health.
<ul>
<li>HBase/Cassandra &#8212; We would need the custom built front-end layer to be responsible for inspecting the payload to look for invalid/incomplete data and reject it.  This adds to the requirements and implementation time of the custom front-end layer.</li>
<li>Riak &#8212; We can use Webmachine pre-commit hooks to allow inclusion of business logic to perform payload inspection.</li>
</ul>
</li>
<li>Extensibility &#8212; When changes are made to the data stored, all three solutions will potentially require modification of the payload inspection routines and potentially the analysis entry-point to reflect the schema changes.
<ul>
<li>HBase &#8212; Schema changes involving adding or altering column families require disabling the table. This means a maintenance window.  Creation of new tables can be performed on the fly.</li>
<li>Cassandra &#8212; Schema changes require a rolling restart of the nodes.</li>
<li>Riak &#8212; New buckets and schema changes are completely dynamic.</li>
</ul>
</li>
<li>Data Migration &#8212; All three solutions make it pretty easy to replicate, export, or MapReduce data out of the system.</li>
<li>Disaster Recovery  &#8212; In all three solutions, it would be best for the client add-on to have enough intelligence to be able to back-off if the cluster load is too high, and to retry submission later if it fails.
<ul>
<li>HBase &#8212; Custom front-end could incorporate fail-over code to locally spool submissions until cluster is back online.  A second cluster would be the most viable DR option.</li>
<li>Cassandra &#8212; Same as HBase</li>
<li>Riak &#8212; Could temporarily reassign the entire reporting cluster to handle incoming submissions. Because there is no custom front-end, if we were unable to make the Riak cluster available for client connections, we would have no buffer in place on the server side to spool submissions.</li>
</ul>
</li>
<li>Reliability &#8212; Small periods of downtime should not be a major issue, especially if the client add-on has retry capability and/or if the front-end layer can spool.
<ul>
<li>HBase &#8212; Until subsequent versions provide better High Availability options, the Hadoop NameNode and HBase Master are still a single point of failure.  Certain types of administration and upgrades require restart of the entire cluster with a maintenance window required to modify the NameNode or HBase Master.  Rolling restarts are an option for many types of maintenance, but some HBase experts discourage them.</li>
<li>Cassandra &#8212; No single point of failure. Most configuration changes can be handled via rolling restarts.</li>
<li>Riak &#8212; Same as Cassandra.</li>
</ul>
</li>
<li>Analysis
<ul>
<li>HBase &#8212; Can provide a HIVE based interface (possibly with JDBC connectivity).  Can provide a simplified MapReduce framework to allow analysts to submit certain types of common, simple jobs.</li>
<li>Cassandra &#8212; Uses Hadoop, answer same as HBase.</li>
<li>Riak &#8212; Map Reduce jobs can be written in JavaScript and submitted through the REST API.  A light-weight web interface can be created to allow submission of those jobs.</li>
</ul>
</li>
</ul>
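<p>To make the elasticity point about Cassandra concrete, here is a toy model of the default behavior described above, where a new machine receives half of the largest range.  This is only a sketch under simplifying assumptions: it is not Cassandra code, the key space is a plain range of numbers, and every class and method name is hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "a new node receives half of the largest range"; not Cassandra code.
final class RangeSplitSketch {
    /** A contiguous slice [start, end) of the key space owned by one node. */
    static final class Range {
        final String owner;
        final long start;
        final long end;
        Range(String owner, long start, long end) {
            this.owner = owner;
            this.start = start;
            this.end = end;
        }
        long size() { return end - start; }
    }

    /** Returns a new list in which newNode owns the upper half of the largest range. */
    static List<Range> addNode(List<Range> ranges, String newNode) {
        Range largest = ranges.get(0);
        for (Range r : ranges) {
            if (r.size() > largest.size()) largest = r;
        }
        long mid = largest.start + largest.size() / 2;
        List<Range> result = new ArrayList<>();
        for (Range r : ranges) {
            if (r == largest) {
                result.add(new Range(r.owner, r.start, mid)); // old owner keeps the lower half
                result.add(new Range(newNode, mid, r.end));   // new node takes the upper half
            } else {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Range> ring = new ArrayList<>();
        ring.add(new Range("node1", 0, 50));
        ring.add(new Range("node2", 50, 100));
        ring.add(new Range("node3", 100, 300));
        for (Range r : addNode(ring, "node4")) {
            System.out.println(r.owner + " owns [" + r.start + ", " + r.end + ")");
        }
    }
}
```

<p>In this model only the data in the split range has to move, which is why the time until the new node relieves the rest of the cluster depends on where the largest range happens to be.</p>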
<p>Based on the evaluation of these discussion points, and also on the availability of some Basho experts to deliver a nearly turn-key solution, we have decided to go with Riak for the implementation of the Test Pilot back-end.  While it feels a little odd to be using a technology that is similar in many ways to HBase which we are investing heavily in, I think it is the best choice for us and I actually see several areas that we could potentially consider using Riak for other projects.</p>
<p>If you have any questions, concerns, or clarifications, please feel free to submit them as comments and I will respond or update the post where applicable.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2010/05/18/riak-and-cassandra-and-hbase-oh-my/feed/</wfw:commentRss>
		<slash:comments>38</slash:comments>
		</item>
		<item>
		<title>Tracking down the number of Firefox Addon users with hadoop</title>
		<link>http://blog.mozilla.com/data/2009/08/10/tracking-down-the-number-of-firefox-addon-users-with-hadoop/</link>
		<comments>http://blog.mozilla.com/data/2009/08/10/tracking-down-the-number-of-firefox-addon-users-with-hadoop/#comments</comments>
		<pubDate>Mon, 10 Aug 2009 09:05:25 +0000</pubDate>
		<dc:creator>skrueger</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AMO]]></category>
		<category><![CDATA[Firefox]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/data/?p=108</guid>
		<description><![CDATA[I was presented with the challenge of answering the question – how many Firefox users have one add-on or more installed on their Firefox. Currently, addons.mozilla.org (AMO) has statistics on the download counts of add-ons but the actual usage of add-ons has been unanswered. The Add-ons manager inside of Firefox will check each add-on for [...]]]></description>
			<content:encoded><![CDATA[<div align="center"><img src="http://blog.mozilla.com/data/files/2009/08/hadoop-logo.jpg" alt="hadoop logo" /><img src="http://blog.mozilla.com/data/files/2009/08/firefox-64.png" alt="firefox logo" /></div>
<p>I was presented with the challenge of answering the question – how many Firefox users have one add-on or more installed on their Firefox.  Currently, addons.mozilla.org (<b>AMO</b>) has statistics on the download counts of add-ons but the actual usage of add-ons has been unanswered.</p>
<p>The Add-ons Manager inside Firefox checks AMO for updates to each installed add-on.  This happens once in every 24-hour period in which the user runs Firefox.  Updates are handled over HTTP at addons.mozilla.org for Firefox 1.0/1.5/2.0 and at versioncheck.addons.mozilla.org for Firefox 3.0/3.5.  The Add-ons Manager pings the server with information about each add-on, and if an update exists the server responds with one.  Because the update ping is an HTTP request, it is recorded in the server&#8217;s log file.  If you have never seen a web server&#8217;s HTTP log file, it is simply a flat text file in which each line describes one request made to the server.  Below is an example line from the AMO log file with a label for each field.</p>
<pre>
IP                  HOSTNAME                             TIMESTAMP          REQUEST
 255.255.255.255 versioncheck.addons.mozilla.org - [22/Jun/2009:02:00:00 -0700] "GET
/update/VersionCheck.php?reqVersion=1&#038;id={B13721C7-F507-4982-B2E5-502A71474FED}&#038;
version=2.2.0.102&#038;maxAppVersion=3.*&#038;status=userEnabled&#038;
appID={ec8030f7-c20a-464f-9b0e-13a3a9e97384}&#038;appVersion=3.0.11&#038;appOS=WINNT&#038;
appABI=x86-msvc&#038;locale=en-US HTTP/1.1" 200 520 "-"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11)
Gecko/2009060215 Firefox/3.0.11(.NET CLR 3.5.30729)"
</pre>
<p>We chose the log files from 2009/06/22 because Firefox pings AMO multiple times after a Firefox version update; this date was 11 days after the Firefox 3.0.11 release and 3 days after the Firefox 3.5 RC2 release, so most users should already have been up to date by then.  The whole day&#8217;s worth of log data for both hostnames totals around <b>28GB compressed</b>.  The log files were large because they also contained requests for AMO&#8217;s website. </p>
<p>There is no unique identifier to determine which update pings came from which user, so we had to rely on the IP address and timestamp of each update ping to identify pings from a single user.  The IP address is the closest thing to a unique identifier, but because of routers (NAT) and proxies many computers can sit behind one IP address.  To add another degree of separation we decided to group the update pings by the timestamp of the ping.  A single user&#8217;s update pings happen within a few seconds of each other, so pings from one IP address within a certain time window are considered one user, and pings from that IP address outside of the window are considered a different user.  For example, say that there are two Firefox users behind a router.  User 1 might open up his browser in the morning and ping AMO at 10AM, and User 2 might open up his browser in the afternoon and ping AMO at 2PM.  Even though the pings are from the same IP address, the pings at 10AM would all be grouped together and counted separately from the pings at 2PM, which would also be grouped together.</p>
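<p>The grouping rule can be sketched in a few lines of plain Java.  This is an illustration of the heuristic only, not the job&#8217;s code: the class name, method name, and sample timestamps are all made up, and times are given as seconds since midnight to keep the example short.</p>

```java
import java.util.Arrays;

// Sketch of the time-window heuristic: pings from one IP that are close together
// count as one user. Names and the sample timestamps are illustrative.
final class PingWindowSketch {
    /** Counts groups of pings in which consecutive pings are at most windowSecs apart. */
    static int countUsers(long[] epochSecs, long windowSecs) {
        if (epochSecs.length == 0) return 0;
        long[] sorted = epochSecs.clone();
        Arrays.sort(sorted); // the input order is not assumed to be chronological
        int users = 1;
        for (int i = 1; i < sorted.length; i++) {
            // A gap larger than the window starts a new user
            if (sorted[i] - sorted[i - 1] > windowSecs) users++;
        }
        return users;
    }

    public static void main(String[] args) {
        long tenAm = 10 * 3600, twoPm = 14 * 3600; // seconds since midnight
        long[] pings = { tenAm, tenAm + 2, twoPm + 1, tenAm + 3, twoPm };
        System.out.println("Users behind this IP: " + countUsers(pings, 10));
    }
}
```

<p>With a 10 second window the three morning pings and the two afternoon pings in the sample fall into two groups, matching the two-user example above.</p>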
<h2>Setup/Config</h2>
<p>Upon hearing the description of the problem I thought that this would be an ideal candidate for a MapReduce job.  I would be able to map the IP address as the key and all the other data in the log file entry as a HashMap for the value.  I talked with my manager and found out that this was a technology the metrics team was interested in exploring, and I was given 4 Mac minis to test my implementation on.  I quickly began setting up my Hadoop cluster on Ubuntu 9.04 Desktop, which soon became Ubuntu 9.04 Server to conserve memory and avoid an unnecessary GUI (Ubuntu 9.04 Desktop takes up about 256MB of RAM on a clean install while Server takes up about 90MB).  I have been a user of Hadoop in the past but I had never set up my own Hadoop cluster before, so I turned to the <a href="http://hadoop.apache.org/common/docs/current/cluster_setup.html">hadoop website</a> and this <a href="http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29">blog post</a> to help me out.  At the time these tutorials did not exactly cover the Hadoop version I was using, 0.20 (I believe they are up to date now), but I was able to work around that.  The two main differences between Hadoop 0.19 and 0.20 are the configuration files and parts of the Hadoop Java API.  In version 0.20, &lt;hadoop root&gt;/conf/hadoop-site.xml was split into three files, &lt;hadoop root&gt;/conf/core-site.xml, &lt;hadoop root&gt;/conf/hdfs-site.xml, and &lt;hadoop root&gt;/conf/mapred-site.xml, and the API was refactored slightly and some classes were deprecated.</p>
<p>My Mac minis were given hostnames of hadoop-node1-4.  Hadoop-node1 was my master node that ran the NameNode and the JobTracker while hadoop-node2-4 were my slaves that ran the TaskTracker and the DataNode.</p>
<div align="center">
<img src="http://blog.mozilla.com/data/files/2009/08/1.jpg" alt="mac mini hadoop cluster" />
</div>
<pre>
 # &lt;hadoop root&gt;/conf/masters
 hadoop-node1.mv.mozilla.com

 # &lt;hadoop root&gt;/conf/slaves
 hadoop-node2.mv.mozilla.com
 hadoop-node3.mv.mozilla.com
 hadoop-node4.mv.mozilla.com

 # The java processes
 hadoop@hadoop-node1:/usr/local/hadoop/conf$ jps
 15778 Jps
 30059 NameNode
 30187 SecondaryNameNode
 30291 JobTracker

 hadoop@hadoop-node2:/usr/local/hadoop$ jps
 16950 TaskTracker
 16838 DataNode
 20186 Jps
</pre>
<p>After getting the cluster set up I tested it with the Hadoop wordcount example and validated the results.</p>
<h2>MapReduce</h2>
<p>I then began writing a MapReduce job with the Hadoop Java API.  My first thought was to write my own RecordReader, the class responsible for reading an input split and handing key/value pairs to the mapper.  I decided instead to go with the default LineRecordReader, which uses the file offset as the key and the line as the value, because it seemed easier and more natural to dissect the log file line inside the Mapper&#8217;s map function.</p>
<h3>Map</h3>
<p>In the map function each line went through a regexp that broke each piece out of the log file line.  If the line contained Firefox&#8217;s appid, and VersionCheck.php I would map the IP as the Key and construct an AddonsWritable (which is a child of MapWritable with an overridden toString() for output purposes)  that contained the epoch time (converted from the date timestamp because it would be much easier to compare with), a MapWritable of add-on guids, and a count of the number of add-ons.  </p>
<pre>
public static class IPAddressMapper extends Mapper<LongWritable, Text, Text, AddonsWritable>{
     /* member vars for mapper which include vars for regexp and storing data */
     private AddonsWritable logInfo = new AddonsWritable();

     public void map(LongWritable key, Text logLine, Context context) throws IOException, InterruptedException {
         if(logLine.matchesRegexp() &#038;&#038; isFirefox() &#038;&#038; hasVersionCheckphp()) {
             logInfo.put(EPOCH, epoch); // store the epoch_ts
             logInfo.put(GUID, guid); // store the guid
             logInfo.put(TOTAL, ONE); // store the count
             context.write(ipAddress, logInfo); // map out the ipAddress as Key and logInfo as value
         }
     }
 }
</pre>
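<p>The mapper above leaves out the regexp step.  A minimal sketch of how one log line could be dissected is shown below; the patterns are assumptions based on the sample log line earlier in this post, not the regexp the job actually used, and the class and method names are hypothetical.  It pulls out the IP, converts the timestamp to epoch seconds, and extracts the add-on guid, returning null for lines that are not Firefox update pings.</p>

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of dissecting one access-log line; the patterns are assumptions based on
// the sample line shown earlier, not the regexp used in the original job.
final class LogLineSketch {
    static final String FIREFOX_APP_ID = "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}";
    // Groups: 1 = IP, 2 = hostname, 3 = timestamp, 4 = request path and query string
    static final Pattern LINE = Pattern.compile(
        "^\\s*(\\S+) (\\S+) - \\[([^\\]]+)\\] \"GET (\\S+) HTTP/[0-9.]+\"");
    static final Pattern ADDON_ID = Pattern.compile("[?&]id=([^&]+)");
    static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /** Returns {ip, epochSeconds, addonGuid}, or null if the line is not a Firefox update ping. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) return null;
        String request = m.group(4);
        if (!request.contains("VersionCheck.php")
                || !request.contains("appID=" + FIREFOX_APP_ID)) {
            return null;
        }
        Matcher id = ADDON_ID.matcher(request);
        if (!id.find()) return null;
        // Epoch seconds are easier to compare than the textual timestamp
        long epoch = OffsetDateTime.parse(m.group(3), STAMP).toEpochSecond();
        return new String[] { m.group(1), Long.toString(epoch), id.group(1) };
    }
}
```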
<h3>Reduce</h3>
<p>After the map phase, the Hadoop framework hashes the keys and hands them to the reduce function.  Inside the Reducer&#8217;s reduce function you are given the key, which is the IP address, and an Iterable of the AddonsWritables that share that key/IP.  I needed to group together the values whose update pings fell in a certain time window, and unfortunately the Iterable does not guarantee order.  So I put the AddonsWritables in a PriorityQueue with a custom comparator that ordered values by the timestamp field in the AddonsWritable.  Some IPs had thousands of pings, so if I counted more than 2,000 pings for an IP I threw it out.  Once all the values were in the PriorityQueue, I popped off each value and compared it to the previously seen timestamp.  If abs(current timestamp &#8211; prev timestamp) &lt;= 10 secs I considered the pings to be from the same user, and added them to a MapWritable of guids inside of the MapWritable that contained all the other information.  Once I saw a current timestamp where abs(current timestamp &#8211; prev timestamp) &gt; 10 secs I wrote/collected the previous values and started a new MapWritable for the next window.  This continued until there were no more values in the PriorityQueue, at which point I wrote/collected the final values.  The listing below shows the 60 second window that was used in a later run.</p>
<pre>
public static class IPAddressEpochTimeReducer extends Reducer<Text,AddonsWritable,Text,AddonsWritable> {
     private PriorityQueue<AddonsWritable> pq = new PriorityQueue<AddonsWritable>();
     public void reduce(Text key, Iterable<AddonsWritable> values, Context context) throws IOException, InterruptedException {
         for(AddonsWritable val: values) {
             pq.add(new AddonsWritable(val));
             if(pq.size() > 2000) {
                 pq.clear();
                 return ;
             }
         }

         while(!pq.isEmpty()) {
             AddonsWritable val = pq.remove();
             // Timestamp of this ping; assumes the epoch was stored as a LongWritable
             long currentEpoch = ((LongWritable) val.get(EPOCH)).get();
             if(lastEpoch != -1 &#038;&#038; Math.abs(lastEpoch - currentEpoch) > SIXTY_SECONDS) {
                 writeOut();  // Write out all the information for the current collection of versioncheck pings
                 resetVars();  // Reset all the currently used vars for the next collection of versioncheck pings
             }
             addGuid(output, val.get(GUID));
             sum += val.get(TOTAL);
             lastEpoch = currentEpoch;
         }
         /* There is one more remaining.  Write it out */
         writeOut();
         resetVars();  // Start the next IP address from a clean state
     }
 }
</pre>
<h2>Runtime Stats</h2>
<p>Hadoop provides a web interface that reports runtime statistics for a job.  Below are the stats for the job described above.</p>
<ul>
<li>Submitted At:  17-Jul-2009 09:55:12</li>
<li>Launched At: 17-Jul-2009 09:55:12 (0sec)</li>
<li>Finished At: 17-Jul-2009 14:05:52 (4hrs, 10mins, 39sec)</li>
<li>Average time taken by Map tasks: 2mins, 45sec</li>
<li>
Average time taken by Shuffle: 2hrs, 33mins, 9sec</li>
<li>Average time taken by Reduce tasks: 1hrs, 4mins, 10sec</li>
</ul>
<table border="2" cellpadding="5" cellspacing="2">
<tr>
<td>Kind</td>
<td>Total Tasks(successful+failed+killed)</td>
<td>Successful tasks</td>
<td>Failed tasks</td>
<td>Killed tasks</td>
<td>Start Time</td>
<td>Finish Time</td>
</tr>
<tr>
<td>Setup</td>
<td>
        1</td>
<td>
        1</td>
<td>
        0</td>
<td>
        0</td>
<td>17-Jul-2009 09:55:22</td>
<td>17-Jul-2009 09:55:24 (1sec)</td>
</tr>
<tr>
<td>Map</td>
<td>
        364</td>
<td>
        362</td>
<td>
        0</td>
<td>
        2</td>
<td>17-Jul-2009 09:55:25</td>
<td>17-Jul-2009 12:45:56 (2hrs, 50mins, 31sec)</td>
</tr>
<tr>
<td>Reduce</td>
<td>
        5</td>
<td>
        5</td>
<td>
        0</td>
<td>
        0</td>
<td>17-Jul-2009 10:13:40</td>
<td>17-Jul-2009 14:05:55 (3hrs, 52mins, 15sec)</td>
</tr>
<tr>
<td>Cleanup</td>
<td>
        1</td>
<td>
        1</td>
<td>
        0</td>
<td>
        0</td>
<td>17-Jul-2009 14:05:57</td>
<td>17-Jul-2009 14:06:03 (5sec)</td>
</tr>
</table>
<h2>Output/Results </h2>
<p>The output files ended up having lines like the one below.</p>
<pre>
 IP              EPOCH_TS   ADDON_COUNT                           LIST_OF_GUIDS
 255.255.255.255 1245665519000	2 {CAFEEFAC-0016-0000-0013-ABCDEFFEDCBA} {CAFEEFAC-0016-0000-0000-ABCDEFFEDCBA}
</pre>
<p>I created a Python script to gather statistics from the output.  With the 10 second window I found that there were a total of </p>
<ul>
<li>244,727,644 add-on update pings</li>
<li>117,557,228 users</li>
<li>average of 2.14 add-ons per user</li>
<li>variance of 5.68</li>
</ul>
<p>Since our Active Daily User (<b>ADU</b>) count for that day was 98,000,000 users, a figure of more than 117 million add-on users didn&#8217;t make much sense, so we decided to repeat the process with a 60 second window instead of the previous 10 second window.  </p>
<p>With the 60 second window I found that there were </p>
<ul>
<li>94,656,833 users </li>
<li>average of 2.63 add-ons per user</li>
<li>variance of 12.34</li>
</ul>
<p>These numbers still seemed fairly large, so I decided to reduce on IP address alone, which gives us a lower bound because there is at least 1 user behind each IP.  To reduce by IP I changed the reducer to ignore the timestamp and simply reduce all values sharing the same IP.</p>
<p>The output of this was </p>
<ul>
<li>32,848,771 IP/USERS</li>
<li>average of 5.04 add-ons per user</li>
<li>variance of 779.91</li>
</ul>
<p>To compare this data with Firefox&#8217;s ADU I ran a similar mapreduce job on our Firefox ADU data that counted </p>
<ul>
<li>61,460,501 IPs</li>
<li>average of 1.60 Users per IP</li>
<li>
variance of 86.57</li>
</ul>
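<p>The averages and variances reported above were gathered by a Python script that is not shown in this post.  For readers who want to reproduce that kind of summary, here is a minimal sketch in Java, the language of the job itself.  It assumes one add-on count per user as input and computes the population variance, which is a guess, since the post does not say which variance formula was used; the names are hypothetical.</p>

```java
// Sketch of the summary statistics; population variance is an assumption, since
// the post does not say which variance formula the Python script used.
final class AddonStatsSketch {
    /** Returns {mean, populationVariance} of add-on counts, one entry per user. */
    static double[] meanAndVariance(int[] addonsPerUser) {
        double sum = 0, sumSq = 0;
        for (int c : addonsPerUser) {
            sum += c;
            sumSq += (double) c * c;
        }
        double n = addonsPerUser.length; // an empty input yields NaN for both values
        double mean = sum / n;
        // Var(X) = E[X^2] - (E[X])^2
        return new double[] { mean, sumSq / n - mean * mean };
    }

    public static void main(String[] args) {
        double[] stats = meanAndVariance(new int[] { 1, 2, 3 });
        System.out.printf("mean=%.2f variance=%.2f%n", stats[0], stats[1]);
    }
}
```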
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/data/2009/08/10/tracking-down-the-number-of-firefox-addon-users-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

