Places stats

The Places team has realized that we don’t have a good idea of how people use bookmarks and history in Firefox. We don’t know if most people have lots of bookmarks, a few, or somewhere in between; lots of history or a little; large Places databases or small. The better we know how people use bookmarks and history, the better we can tweak performance.

So we started a project to collect statistics volunteered from people’s Places data. To be clear, we’re not talking about harvesting people’s individual bookmarks and history. What we’re interested in is, well, 1) getting volunteers, and 2) mostly numbers: the number of bookmarks people have, the number of pages in their histories, and so on. This raises reasonable privacy concerns, but I’m not going to address them here; the project site takes them on in detail.

What I will focus on here is how we built the data collection site and what we found from our initial round of volunteers.

Implementation

https://places-stats.mozilla.com/stats is a simple Ruby script that adds a data point on POST and dumps a big XHTML table on GET. We’re using MySQL on the backend (natch) and ActiveRecord as an ORM bridging Ruby and MySQL.

On POST, /stats expects a string of JSON. From the front page of the site, people can access a script, copy it to their clipboards, paste it into their JavaScript consoles, and get back, in a pop-up window, JSON that encodes their Places stats. Then they press a button to POST it. For the most part, the property names in the JSON correspond directly to the columns of the main table in the stats database. (I’ll use the terms { "property name": "property value" }.) The columns themselves are the dimensions we’re measuring for each data point: number of bookmarks, file size of the Places database, etc. The /stats script parses the JSON and makes sure that valid property names and values were specified, and from there it’s a simple insert into the database.
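The parse-and-validate step might look roughly like the sketch below. This is not the actual /stats code: the `ALLOWED_DIMENSIONS` list, the `parse_data_point` name, and the rule that every value must be a non-negative integer are all assumptions made for illustration. The dimension names are borrowed from the graphs in this post.

```ruby
require "json"

# Hypothetical allowlist of property names. The real list would mirror the
# columns of the main table in the stats database.
ALLOWED_DIMENSIONS = %w[
  places_file_size
  moz_bookmarks_cnt
  moz_historyvisits_cnt
  moz_places_cnt
].freeze

# Parses a POST body and returns a hash of dimension name => value,
# or raises ArgumentError describing the first problem found.
def parse_data_point(body)
  data = begin
    JSON.parse(body)
  rescue JSON::ParserError => e
    raise ArgumentError, "invalid JSON: #{e.message}"
  end
  raise ArgumentError, "expected a JSON object" unless data.is_a?(Hash)

  # Reject property names that do not correspond to a known dimension.
  unknown = data.keys - ALLOWED_DIMENSIONS
  raise ArgumentError, "unknown properties: #{unknown.join(', ')}" unless unknown.empty?

  # Assumed rule: every dimension is a count or a size, so it must be a
  # non-negative integer.
  data.each do |name, value|
    unless value.is_a?(Integer) && value >= 0
      raise ArgumentError, "#{name} must be a non-negative integer"
    end
  end
  data
end
```

Once a hash passes these checks, its keys can be handed to the ORM as column names. Checking names against an allowlist first keeps arbitrary client-supplied names from reaching the database layer.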

On GET, /stats pulls in a table template using ERB, a Ruby-based templating system, and populates it appropriately. /stats can also output JSON on GET. It checks the Accept request header, and if the header contains application/json, it outputs a JSON object { "avg": AVG, "stddev": STDDEV, "max": MAX, "min": MIN }, where AVG, STDDEV, MAX, and MIN are themselves objects whose property names are the same as those when POSTing JSON and whose property values are the average, standard deviation, maximum, and minimum values collected so far for each dimension. In other words, you get back the first four rows of this table in JSON form. Right now David is using the JSON to inform his database generation script. We could get fancier and enable more parameterized output, like, “get me the number of bookmarks for all data points.” We could also do different output types like CSV; that one at least is on the to-do list. I’m not sure yet what we will need for this project specifically, but we are aware that other people may find the data or even our methodology useful.
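Here is a minimal sketch of how the JSON summary could be assembled from a list of data points. The `summarize` and `wants_json?` names are made up for illustration, and using the population form of the standard deviation is an assumption; the real script may well compute these values in SQL instead.

```ruby
require "json"

# Builds { "avg" => {...}, "stddev" => {...}, "max" => {...}, "min" => {...} }
# from an array of data points (hashes of dimension name => number).
def summarize(points, dimensions)
  summary = { "avg" => {}, "stddev" => {}, "max" => {}, "min" => {} }
  dimensions.each do |dim|
    values = points.map { |point| point.fetch(dim) }
    next if values.empty?

    mean = values.sum.to_f / values.size
    # Population variance (divide by n), an assumption for this sketch.
    variance = values.sum { |v| (v - mean)**2 } / values.size

    summary["avg"][dim] = mean
    summary["stddev"][dim] = Math.sqrt(variance)
    summary["max"][dim] = values.max
    summary["min"][dim] = values.min
  end
  summary
end

# True when the Accept request header mentions application/json.
def wants_json?(accept_header)
  accept_header.to_s.include?("application/json")
end
```

A caller would then do something like `JSON.generate(summarize(points, dimensions))` when `wants_json?` is true, and fall back to rendering the table template otherwise.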

Initial round

We asked people at Mozilla to be our guinea pigs, and we got 127 unique responses in return. This data set is small, and there’s not much to draw from it, but uh, below are graphs illustrating some of the more interesting dimensions. Some people had many bookmarks, history visits, and so on, but overall the data skews the other direction, toward the small end.

There are two graphs for each dimension. The first simply shows all the data points, along with the mean and standard deviation, which is fairly big for all our data. The second graph is a histogram. Each y-axis tick label is the upper bound on the bucket it sits next to. The bucket’s lower bound is the next label below it. So for example the top bucket of the places_file_size graph contains about 2 data points ranging in file size from over 99 MB to 104 MB. The bottom bucket of that graph contains about 22 data points ranging in file size from over 0 MB to 5 MB.
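The bucketing described above can be sketched as follows. This assumes equal-width buckets spanning the smallest to the largest value, which is a guess at how the graphs were produced; `histogram` and `bucket_count` are illustrative names. Each bucket is reported by its upper bound, and the lowest bucket also takes the minimum value itself so no data point is dropped.

```ruby
# Returns an array of [upper_bound, count] pairs, lowest bucket first.
# A value v lands in the bucket whose bounds satisfy lower < v <= upper,
# except that the minimum value is placed in the lowest bucket.
def histogram(values, bucket_count)
  min, max = values.minmax
  width = (max - min).to_f / bucket_count
  counts = Array.new(bucket_count, 0)

  values.each do |v|
    # When every value is identical the width is zero, so use bucket 0.
    index = width.zero? ? 0 : ((v - min) / width).ceil - 1
    # Clamp to guard the minimum value (index -1) and float rounding at the top.
    counts[index.clamp(0, bucket_count - 1)] += 1
  end

  counts.each_with_index.map { |count, i| [min + (i + 1) * width, count] }
end
```

For example, `histogram([0, 1, 5, 6, 10, 20], 4)` uses a bucket width of 5, so the value 5 counts toward the bucket labeled 5 and the value 6 toward the bucket labeled 10.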

[Graphs, two per dimension (all data points, then histogram): places_file_size, moz_bookmarks_cnt, moz_historyvisits_cnt, moz_places_cnt]

I had expected that some of these distributions would be approximately normal, but this measly data doesn’t quite bear that out, and I’m not very bright anyway. Maybe when we gather more data. But after thinking about it, I wonder whether the distributions will not generally be normal after all. To take one example, everybody starts out with just a handful of bookmarks after installing Firefox. (Same with history visits.) Some people will add 20 bookmarks a day, some people 20 bookmarks a month, others 20 bookmarks a year. It takes work to add bookmarks, and it takes more work to add more bookmarks. Some people will climb that hill (nerds), some won’t (pep-peps). But it seems reasonable to guess that the farther up the hill you go, the fewer people you’ll see. If that’s accurate it might suggest something with a shape like an exponential distribution. The question is, how far does the average person climb?
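One way to make the hill-climbing guess concrete, purely as an illustration: suppose (and this is an assumption, not something the data shows) that a person who has added n bookmarks goes on to add another with the same probability q, no matter what n is. Then the number N of bookmarks a person adds follows a geometric distribution, the discrete cousin of the exponential:

```latex
P(N \ge n) = q^{n} = e^{-\lambda n},
\qquad \lambda = -\ln q > 0,
\qquad \mathbb{E}[N] = \sum_{n \ge 1} P(N \ge n) = \frac{q}{1 - q}
```

Under that assumption, the fraction of people still climbing shrinks by the same factor q at every step, and "how far does the average person climb?" comes out as q / (1 - q). Whether a single q fits real users is exactly what more data would tell us.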

Some ideas for future analysis:

  • We could use clustering methods to come up with some exemplar Places database shapes, or in other words, archetypal Places users.
  • There are correlation questions we can ask, like, “What makes a large Places database?” “What determines the size of frecency_first_bucket_visit_cnt?” “Are some dimensions mutually dependent?” (Maybe these aren’t good examples; we know more or less the answers to them now, and I’m not sure they’re useful. But we might find something unexpected or just unthought of.)
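As a starting point for the correlation questions, here is a sketch of the Pearson correlation coefficient between two dimensions. The `pearson` name is illustrative, and Pearson is only one choice: it measures linear association, and with data as skewed as ours a rank-based measure might turn out to be more appropriate.

```ruby
# Pearson correlation coefficient between two equal-length samples.
# Returns a Float in [-1, 1], or nil when either sample has no variation.
def pearson(xs, ys)
  unless xs.size == ys.size && xs.size > 1
    raise ArgumentError, "need two equal-length samples with at least 2 points"
  end

  mean_x = xs.sum.to_f / xs.size
  mean_y = ys.sum.to_f / ys.size

  # Sums of products of deviations; the 1/n factors cancel in the ratio.
  cov = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  return nil if var_x.zero? || var_y.zero?

  cov / Math.sqrt(var_x * var_y)
end
```

Feeding it, say, each data point’s places_file_size and moz_historyvisits_cnt would give a first rough answer to “What makes a large Places database?”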

Comments (5)

  1. Andyed wrote:

    Submitted data successfully from FF 3.0.7 !

    Thursday, March 26, 2009 at 4:39 am
  2. O wrote:

    Just a quick note that the script code is blocked out by AdBlock. This might block many more inexperienced users.

    Tuesday, March 31, 2009 at 4:02 pm
  3. adw wrote:

    Thanks guys. I renamed the script to something hopefully more innocuous and added a more visible note.

    Tuesday, March 31, 2009 at 4:17 pm
  4. ski wrote:

    Is your graph “moz_historyvisits_cnt distribution” missing a y-axis entry for 0 visits? Looking at the stats results, 0 is the minimum, and it happens a couple of times. This might be necessary for other graphs as well?

    Sunday, April 5, 2009 at 1:16 pm
  5. adw wrote:

    The distribution graphs are histograms. The bottom bucket in the moz_historyvisits_cnt distribution graph contains 22 data points whose numbers of history visits range from 0 to 9579. So the data points that have 0 visits are lumped into the bottom bucket. The paragraph right above the graphs explains it.

    Sunday, April 5, 2009 at 3:44 pm
