13
Dec 11

Comparing the Bias in Telemetry Data vs The Typical Firefox User

Telemetry is a feature in Firefox that captures performance metrics such as start up time and DNS latency, among others. The number of metrics captured is on the order of a couple hundred. The data is sent back to the Mozilla Bagheera servers, where it is analyzed by the engineers.

The Telemetry feature asks the Nightly/Aurora (pre-release) users if they would like to submit their anonymized performance data. This resulted in a response rate (number of people who opted in divided by the number of people who were asked) of less than 3%. This led to two concerns: the small number of responses (which changed when Telemetry became part of the Firefox release) and, more importantly, representativeness: are the performance measurements collected from the 3% representative of those of the people who chose not to opt in?

Measuring the bias is not easy unless we have measurements about the users who did not opt in. Firefox sends the following pieces of information to the Mozilla servers: operating system, Firefox version, extension identifiers and the time for the session to be restored. This is sent by all Firefox installations unless the distribution or user has the feature turned off (this is called the services AMO ping). The Telemetry data contains the same pieces of information.

What this implies is that we have start up times for i) the users who opted in to Telemetry and ii) everyone. We can now answer the question “Are the startup times for the people who opted into Telemetry representative of the typical Firefox user?”

Note: ‘everyone’ is almost everyone. Very few have this feature turned off.

Data Collection

We collected start up times for Firefox 7, 8 and 9 for November 2011 from the log files of services.addons.mozilla.org (SAMO). We also took the same information for the same period from the Telemetry data contained in HBase (some code examples can be found at the end of the article).

Objective

Are start up times different by Firefox version and/or source, where source can be SAMO or Telemetry?

Displays

Figure 1 is a boxplot of the log of start up time for Telemetry (tele) vs. SAMO (samo) by Firefox version. At first glance it appears the start up times from Telemetry are less than those of SAMO. But the length of the bars makes it difficult to stand by this conclusion.

Figure 1: Boxplot of Log SessionRestored for Telemetry/SAMO by FF Version

Figure 2 is the difference in the deciles of the log of start up time. In other words, approximately speaking, these are the deciles of the ratio of Telemetry start up time to SAMO start up time. The medians hover in the 0.8 region, though the bars are very wide and do not support the quick conclusion that Telemetry start up time is smaller.

Figure 2: Difference of Deciles of Logs
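
A minimal sketch in R of the quantity behind Figure 2, using simulated log-normal start up times instead of the real data. The vectors `tele` and `samo` and their parameters are assumptions made purely for illustration. The difference of the log deciles equals the log of the ratio of the deciles, so exponentiating gives a ratio that can be read against 1.

```r
set.seed(1)
# Simulated start up times in milliseconds (illustrative values only)
tele <- rlnorm(5000, meanlog = 8.5, sdlog = 1)
samo <- rlnorm(5000, meanlog = 8.6, sdlog = 1)

probs <- seq(0.1, 0.9, by = 0.1)

# Difference of the deciles of the logs
log_diff <- quantile(log(tele), probs) - quantile(log(samo), probs)

# Ratio of the deciles on the original scale
decile_ratio <- quantile(tele, probs) / quantile(samo, probs)

# The two columns "ratio" and "decile_ratio" agree up to quantile interpolation
print(round(cbind(log_diff, ratio = exp(log_diff), decile_ratio), 3))
```

A ratio below 1 at a given decile means the simulated Telemetry start up time is smaller than the simulated SAMO one at that decile.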

In Figure 4, we have the mean of medians of 1000 samples: red circles are for Telemetry and black for SAMO. The ends of the line segments correspond to the sample 95% confidence interval (based on the sample of sample medians). The CI for the SAMO data lies entirely within that of the Telemetry data. This makes one believe that the two groups are not different.

Figure 4: Mean of the medians (circles) with their 95% confidence intervals. Red is Telemetry, Black is SAMO
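
The summary plotted in this figure can be sketched in R as follows. This is an illustration on simulated data, not the original analysis: the `population` vector, the sample size of 2,000 and the use of empirical 2.5% and 97.5% quantiles for the interval are all assumptions.

```r
set.seed(2)
# Stand-in population of log start up times (simulated, not the real data)
population <- rnorm(200000, mean = 8.6, sd = 1)

# Median of each of 1000 random samples of 2,000 rows
n_samples <- 1000
medians <- replicate(n_samples, median(sample(population, 2000)))

# Centre (the circle) and 95% interval (the line segment) of the sample medians
centre   <- mean(medians)
interval <- quantile(medians, c(0.025, 0.975))
print(c(lower = interval[[1]], centre = centre, upper = interval[[2]]))
```

Running this once per source would give the two intervals to compare.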

Analysis of Variance

For a more numerical approach, we can estimate the analysis of variance components. The model is

log(startup time) ~ version + src

(we ignore interaction). Since the data is on the order of billions of rows, I instead take 1000 samples of approximately 20,000 rows each (a sampling rate of 0.001%), compute the ANOVA results for each, and then average the summary tables of the lm function in R. In other words, we base our conclusions on the average of the 1000 samples of ~20,000 rows each. (I should point out that the residuals, as per a quick visual check, were roughly Gaussian and other diagnostics came out clean.)

The average ANOVA does not support a version effect or a source effect (at the 1% level). In other words, the log of start up time is affected neither by the version nor by the source (Telemetry/SAMO).

               Estimate Std. Error     t value   Pr(>|t|)
(Intercept)  8.62635472 0.01171420 736.4390937 0.00000000
vers8       -0.05995627 0.01928947  -3.1089666 0.02922402
vers9       -0.03382135 0.10466330  -0.3247165 0.48286903
vers10      -0.03862282 0.29308642  -0.1418623 0.48228122
srctele     -0.02290538 0.03946150  -0.5811779 0.45300964

This is good news! Insofar as start up time is concerned, Telemetry is representative of SAMO.
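
The averaging step can be sketched in plain R on simulated samples. The helper `make_sample`, the 200 samples of 2,000 rows and the simulated data (which contain no true version or source effect) are assumptions for illustration; the real analysis ran on RHIPE over 1000 samples.

```r
set.seed(3)
# One simulated sample: log start up time with no true version or source effect
make_sample <- function(n = 2000) {
  data.frame(
    logtime = rnorm(n, mean = 8.6, sd = 1),
    vers    = factor(sample(c("7", "8", "9", "10"), n, replace = TRUE),
                     levels = c("7", "8", "9", "10")),
    src     = factor(sample(c("samo", "tele"), n, replace = TRUE))
  )
}

# Fit the additive model on each sample and keep its coefficient table
coef_tables <- lapply(1:200, function(i) {
  coef(summary(lm(logtime ~ vers + src, data = make_sample())))
})

# Element-wise average of the coefficient tables
avg_table <- Reduce(`+`, coef_tables) / length(coef_tables)
print(round(avg_table, 4))
```

The result has the same layout as the table above: one row per coefficient, with averaged estimates, standard errors, t values and p-values.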

A Different Approach and Some Checks

By now, the reader should note that we have answered our question (see last line of previous section). Two questions remain:

1. Are the samples representative? We are sampling on 3 dimensions: startup time, src and version. Consider the 1000 quantiles of startup time, the 2 levels of src and 4 levels of version. All in all, we have 1000x2x4 or 8000 cells. Sampling from the population might result in several empty cells, so much so that the joint distribution of the sample might be very different from that of the population. To confirm that the cell distribution of the samples reflects the cell distribution of the population, we computed Chi Square tests comparing the sample cell counts with those of the parent. All 1000 samples passed!

2. Why use samples? We can do a log linear regression on the 8000 cell counts (i.e. all the 1.9 billion data points). This of course loses a lot of power: we are binning the data and all monotonic transformations are equivalent. The model equivalent (using R’s formula language) of the ANOVA described above is

log(cell count) ~ src+ver+binned_startup:(src+ver)

If the effects of binned_startup:src and binned_startup:ver are not significant, this corresponds to our conclusion in the previous section. And nicely enough, it does! Output of summary(aov(glm(…))) is

summary(aov(glmout <- glm(n~ver+src+sesscut:(ver+src)
                          , family=poisson
                          , data=cells3.parent)))
              Df     Sum Sq    Mean Sq   F value Pr(>F)
ver            3 4.6465e+14 1.5488e+14 1131.8666 <2e-16 ***
src            1 3.2705e+14 3.2705e+14 2390.0704 <2e-16 ***
ver:sesscut 3952 5.4969e+13 1.3909e+10    0.1016      1
src:sesscut  988 2.0009e+13 2.0252e+10    0.1480      1
Residuals   2967 4.0600e+14 1.3684e+11
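
The Chi Square check from the first question can be sketched in R like this. The grid is cut down to 10 x 2 x 4 cells and the parent counts are simulated, both assumptions made so that the example runs quickly; the post itself used 1000 x 2 x 4 cells.

```r
set.seed(4)
# Simulated parent cell counts for a small 10 x 2 x 4 grid (illustrative only)
n_cells <- 10 * 2 * 4
parent_counts <- rpois(n_cells, lambda = 50000)
parent_prop   <- parent_counts / sum(parent_counts)

# Draw one sample of 20,000 rows according to the parent distribution
sample_counts <- as.vector(rmultinom(1, size = 20000, prob = parent_prop))

# Goodness-of-fit test: do the sample cell counts follow the parent proportions?
fit <- chisq.test(sample_counts, p = parent_prop)
print(fit$p.value)
```

A small p-value would flag a sample whose cell distribution departs from the parent's.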

Some R Code and Data Sizes:

1. The data for SAMO was obtained from Hive, sent to a text file and then imported to blocked R data frames using RHIPE. All subsequent analysis was done using RHIPE.

2. The data for Telemetry was obtained from HBase using Pig (RHIPE can read HBase, but I couldn’t install it on this particular cluster). The text data was then imported as blocked R data frames and placed in the same directory as the imported SAMO data.

3. Data sizes were in the few hundreds of gigabytes. All computations were done using RHIPE (R not on the nodes) on a 350TB/33 node Hadoop cluster.

4. I include some sample code to give a flavor of RHIPE.

Importing text data as Data Frames

# Map: parse \001-delimited text lines into one data frame per block
map         <- expression({
  ln        <- strsplit(unlist(map.values),"\001")
  a         <- do.call("rbind",ln)
  addonping <- data.frame(ds=a[,1]
                         ,vers=a[,3]
                         ,sesssionrestored=as.numeric(a[,6])
                         ,src=rep("samo",length(a[,6]))
                         ,stringsAsFactors=FALSE)
  # emit the block under a random key
  rhcollect(runif(1),addonping)
})
# Text in, sequence files out
z <- rhmr(map=map
          ,ifolder="/user/sguha/somequants"
          ,ofolder="/user/sguha/teledf/samo"
          ,zips="/user/sguha/Rfolder.tar.gz"
          ,inout=c("text","seq")
          ,mapred=list(mapred.reduce.tasks=120
             ,rhipe_map_buff_size=5000))
rhstatus(rhex(z,async=TRUE),mon.sec=4)

Creating Random Samples

map         <- expression({
  y         <- do.call('rbind', map.values)
  # per-row inclusion probability: about 20,000 rows out of 1,923,725,302
  p         <- 20000/1923725302
  # 1000 independent samples; the key is the sample number
  for(i in 1:1000){
    zz      <- runif(nrow(y)) < p
    mu      <- y[zz,,drop=FALSE]
    if(nrow(mu)>0)
      rhcollect(i,mu)
  }
})
# Reduce: bind together all pieces that belong to the same sample
reduce      <- expression(
    pre={ x <- NULL}
    ,reduce = {
      x     <- rbind(x,do.call('rbind',reduce.values))
    }
    ,post={ rhcollect(reduce.key,x) }
    )
z <- rhmr(map=map,reduce=reduce
          ,ifolder="/user/sguha/teledfsubs/p*"
          ,ofolder="/user/sguha/televers/dfsample"
          ,inout=c('seq','seq')
          ,orderby='integer'
          ,partition=list(lims=1,type='integer')
          ,zips="/user/sguha/Rfolder.tar.gz"
          ,mapred=list(mapred.reduce.tasks=72
             ,rhipe_map_buff_size=20))
rhstatus(rhex(z,async=TRUE),mon.sec=5)

Run Models Across Samples

map        <- expression({
  # clamping limits, read from the environment variable "mcuts"
  cuts     <- unserialize(charToRaw(Sys.getenv("mcuts")))
  lapply(map.values, function(y){
    # clamp session restore times to [cuts[1], cuts[2]]; NAs pass through
    y$tval <- sapply(y$sesssionrestored
                     ,function(r) {
                       if(is.na(r)) return( r)
                       max(min(r,cuts[2]),cuts[1])
                     })
    # fit the additive model on this sample and emit its summary
    mdl    <- lm(log(tval)~vers+src,data=y)
    rhcollect(NULL, summary(mdl))
  })})
# Map-only job (no reducers)
z <- rhmr(map=map
          ,ifolder="/user/sguha/televers/dfsample/p*"
          ,ofolder="/user/sguha/televers2"
          ,zips="/user/sguha/Rfolder.tar.gz"
          ,inout=c("seq","seq")
          ,mapred=list(mapred.reduce.tasks=0))
rhstatus(rhex(z,async=TRUE),mon.sec=4)

Computing Cell Counts For A Log Linear Model

# 1000 weighted quantiles of tms$x with weights tms$n
# (wtd.quantile as provided by, e.g., the Hmisc package)
cuts2                <- wtd.quantile(tms$x,tms$n,
                                     p=seq(0,1,length=1000))
cuts2[1]             <- cuts[1]
cuts2[length(cuts2)] <- cuts[2]
map.count <- expression({
  cuts       <- unserialize(charToRaw(Sys.getenv("mcuts")))
  z          <- do.call(rbind,map.values)
  # clamp to the outer cut points
  z$tval     <- sapply(z$sesssionrestored,function(r)
                  max(min(r,cuts[length(cuts)]),cuts[1]))
  # bin each value by the quantile interval it falls in
  z$sessCuts <-
    factor(findInterval(z$tval,
                        cuts),ordered=TRUE)
  # emit the row count of every (version, bin, source) cell
  f          <- split(z,list(z$vers,z$sessCuts,z$src),drop=FALSE)
  for(i in seq_along(f)){
    y <-strsplit(names(f)[[i]],"\\.")[[1]]
    rhcollect(y,nrow(f[[i]])) }
})
z <-
  rhmr(map=map.count,reduce=rhoptions()$templates$scalarsummer
       ,combiner=TRUE
       ,ifolder="/user/sguha/teledfsubs/p*"
       ,ofolder="/user/sguha/telecells"
       ,zips="/user/sguha/Rfolder.tar.gz"
       ,inout=c("seq","seq") ,mapred=
       list(mapred.task.timeout=0
            ,rhipe_map_buff_size=40
            # pass the cut points to the mappers as a serialized string
            ,mcuts=rawToChar(serialize(cuts2, NULL,
                                ascii=TRUE))))

8
Sep 11

Understanding DNT Adoption within Firefox

UPDATED 2011-09-08 11:55am PST: changed the description of how we store and retain IP address to be more accurate

On March 23rd, Mozilla launched its newest and most awesome browser: Firefox 4. Along with a plethora of features, including faster performance, better security and the whole nine yards, Firefox 4 included a cutting edge privacy feature called Do Not Track (DNT). For the uninitiated, DNT simply tells sites “I don’t want to be tracked” via an HTTP header visible to all advertisers and publishers.

Mozilla’s new Privacy Blog has several posts on the feature, including a new one today releasing a Do Not Track Field Guide for developers. Based on our current numbers, for several weeks now just under 5% of our users have had DNT turned on within Firefox.

The Mozilla team is all about experimenting. We love innovating new technologies that do good and benefit the community as a whole. As Firefox 4 kept breaking records (the peak was 5,500 downloads/minute; source: http://blog.mozilla.com/blog/2011/03/25/the-first-48-hours-of-mozilla-firefox-4/), we felt that it would be important to understand whether people were enabling DNT. Every Metrics guy lives and dies by the data. In late 2010, the metrics team gave a small talk on how we collect log data (click here for the video ppt). While that project has gone through multiple iterations over time, the basic premise is still the same:

  • Grab logs from multiple data-centers.
  • Split out anonymized and non-anonymized data into two separate files
  • Store both sets of files in HDFS
  • Create relevant partitions inside HIVE
  • Query the data.
  • Drool over the stats.

(Non-anonymized data such as IP address has a 6-month retention policy and is deleted on expiration)

We decided to follow the same approach for calculating DNT stats. Once every day, each Firefox instance pings the AUS servers with respect to its DNT status. The ping request looks something like this:
“DNT:-” User has NOT set DNT
“DNT:1” User HAS set DNT and does *not* wish to be tracked.

Armed with these data points, a simple HIVE query gives us DNT stats for a given day:

   SELECT ds, dnt_type, count(distinct ip_address)
   FROM web_logs
   WHERE (request_url LIKE '%Firefox/4.0%'
          OR request_url LIKE '%Firefox/5.0%'
          OR request_url LIKE '%Firefox/6.0%')
     AND dnt_type != 'DNT:1, 1'
     AND ds = '$dateTime'
   GROUP BY ds, dnt_type
   ORDER BY ds desc;

The above script is run on a nightly basis and the result is then plotted over a time graph, as included with this post.
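
As a rough illustration of the step from query output to a plotted percentage, here is a short R sketch. The data frame `dnt` mimics the shape of the query result (ds, dnt_type, distinct IP count); the dates and counts are invented for the example and are not real measurements.

```r
# Hypothetical rows shaped like the query output (invented numbers)
dnt <- data.frame(
  ds       = c("2011-09-01", "2011-09-01", "2011-09-02", "2011-09-02"),
  dnt_type = c("DNT:-", "DNT:1", "DNT:-", "DNT:1"),
  n        = c(9500, 500, 9400, 600),
  stringsAsFactors = FALSE
)

# Distinct IPs per day, overall and with DNT set
on     <- dnt$dnt_type == "DNT:1"
totals <- tapply(dnt$n, dnt$ds, sum)
dnt_on <- tapply(dnt$n[on], dnt$ds[on], sum)

# Share of distinct IPs with DNT set, per day, as a percentage
share <- dnt_on / totals[names(dnt_on)]
print(round(100 * share, 2))
```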

One BIG caveat:

The DNT numbers are being undercounted, primarily because we use the hashed IP address as a proxy for counting a unique user. This means that while there can be multiple users behind a given NAT with DNT set, the counter is incremented only once. This may account for why our numbers are a bit lower than those being reported by other groups, including the recent study of 100 million Firefox users conducted by Krux Digital.
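
The effect of counting IP addresses instead of users can be seen in a small R simulation. Everything here is assumed for illustration: 10,000 users, 7,000 shared IP addresses and a 5% DNT rate.

```r
set.seed(5)
# Simulated users: 10,000 users spread over 7,000 IP addresses, 5% with DNT set
n_users <- 10000
users <- data.frame(
  ip  = sample(7000, n_users, replace = TRUE),
  dnt = runif(n_users) < 0.05
)

# What we would like to count: users with DNT set
true_count <- sum(users$dnt)

# What the query counts: distinct IP addresses with DNT set
ip_count <- length(unique(users$ip[users$dnt]))

print(c(users_with_dnt = true_count, distinct_ips_with_dnt = ip_count))
```

The count of distinct IP addresses can never exceed the count of users, and it falls short whenever two DNT users share an address.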

Possible Fix, NOT:

While it is possible to uniquely identify each instance of the browser, doing so would require that we start tracking users, thereby defeating the very purpose for which DNT was created in the first place.

 

Feel free to leave us a comment or email: (aphadke at_the_rate mozilla dot com – Anurag Phadke) for more information.

25
Aug 11

Do 90% of People Not Use CTRL+F?

According to an article in The Atlantic floating around the internet, 90% of users don’t know how to use CTRL+F or Command+F to search a webpage. We were surprised at that percentage. Fortunately, Mozilla has TestPilot studies with open data, and we can see if Firefox users behave similarly. One relevant 7-day TestPilot study of about 69,000 Windows users focused on Firefox’s user interface. Along with seeing how users interacted with the navigation bar, their bookmarks, etc., the study looked at how often people used keyboard shortcuts.

What we found is that about 81% of TestPilot users didn’t use CTRL+F during the course of the study. While 81% is lower than the 90% in the article, TestPilot users are usually more technologically experienced than the general population, since they are largely Firefox Beta users. When we look at TestPilot users who consider themselves beginners, the percentage goes up to 85%. Therefore, our 81% figure does not belie the Atlantic piece.

In addition, those who use CTRL+F on average use keyboard shortcuts twice as much as those who don’t, even when we ignore those people who don’t use any keyboard shortcuts at all. This implies that people who use CTRL+F are more comfortable with keyboard shortcuts in general. The only keyboard shortcut the users who use CTRL+F lag behind in is Full Screen, or F11.

Feel free to take a look at the data yourself and let us know about any interesting trends you discover!


15
Aug 11

Text mining users’ definitions of browsing privacy

One issue that’s been on everyone’s mind lately is privacy.  Privacy is extremely important to us at Mozilla, but it isn’t exactly clear how Firefox users define privacy.  For example, what do Firefox users consider to be essential privacy issues?  What features of a browsing experience lead users to consider a browser to be more or less private?

We asked users to give us their definitions of privacy, specifically privacy while browsing, in order to answer these questions. The assumption was that users will have different definitions, but that there will be enough similarities between groups of responses that we could identify “themes” amongst the responses. By text mining user responses to an open-ended survey question asking for definitions of browsing privacy, we were able to identify themes directly from the users’ mouths:

  1. Regarding privacy issues, people know that tracking and browser history are different issues, validating the need for browser features that address these issues independently (“private browsing” and “do not track”)
  2. People’s definitions of personal information vary, but we can group people according to the different ways they refer to personal information (this leads to a natural follow-up question: what makes some information more personal than others?)
  3. Previous focus group research, contracted by Mozilla, showed that users are aware that spam indicates a security risk, but what didn’t come out of the focus group research was that users also consider spam to be an invasion of their privacy (a follow-up question: what do users define as “spam”? Do they consider targeted ads to be spam?)
  4. There are users who don’t distinguish privacy and security from each other

Some previous research on browsing and privacy

We knew from our own focus group research that users are concerned about viruses, theft of their personal information and passwords, that a website might misuse their information, that someone may track their online “footprint”, or that their browser history is visible to others. Users view things like targeted ads, spam, browser crashes, popups, and windows imploring them to install updates as security risks.

But it’s difficult to broadly generalize findings from focus groups.  One group may or may not have the same concerns as the general population.  The quality of the discussion moderator, or some unique combination of participants,  the moderator, and/or the setting can also influence the findings you get from focus groups.

One way of validating the representativeness of focus group research is to use surveys.  But while surveys may increase the representativeness of your findings, they are not as flexible as focus groups.  You have to give survey respondents their answer options up front.  Therefore, by providing the options that a respondent can endorse, you are limiting their voice.

A typical way to approach this problem in surveys is to use open-ended survey questions. In the pre-data mining days, we would have to manually code each of these survey responses: a first pass of all responses to get an idea of respondent “themes” or “topics” and a second pass to code each response according to those themes. This approach is costly in terms of time and effort, plus it also suffers from the problem of reproducibility; unless themes are extremely obvious, different coders might not classify a response as part of the same theme. But with modern text mining methods, we can simulate this coding process much more quickly and reproducibly.

Text mining open-ended survey questions

Because text mining is growing in popularity primarily due to its computational feasibility, it’s important to review the methods in some detail. Text mining, as with any machine learning-based approach, isn’t magic. There are a number of caveats to make about the text mining approach used. First, the clustering algorithm I chose to use requires an arbitrary and a priori decision regarding the number of clusters. I looked at 4 to 8 clusters and decided that 6 provided the best trade-off between themes expressed and redundancy. Second, there is a random component to clustering, meaning that one clustering of the same set of data may not produce the exact same results as another clustering. Theoretically, there shouldn’t be tremendous differences between the themes expressed in one clustering over another, but it’s important to keep these details in mind.

The general idea of text mining is to assume that you can represent documents as “bags of words”, that bags of words can be represented or coded quantitatively, and that the quantitative representation of text can be projected into a multi-dimensional space. For example, I can represent survey respondents in two dimensions, where each point is a respondent’s answer.  Points that are tightly clustered together mean that these responses are theoretically very similar with respect to lexical content (e.g., commonality of words).
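
To make the bag-of-words idea and the two clustering caveats concrete, here is a small sketch in base R. The six responses are made up, and k-means with 3 clusters stands in for whichever clustering algorithm and cluster count the analysis actually used; both choices are assumptions for illustration.

```r
# A few made-up survey responses (illustrative only)
responses <- c(
  "no one can track the sites i visit",
  "sites cannot track my browsing",
  "my personal information is not shared",
  "keeping personal information safe from others",
  "nobody can look at my browser history",
  "others cannot look at my history"
)

# Bag of words: one row per response, one column per word, entries are counts
tokens <- strsplit(tolower(responses), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(vapply(tokens,
                   function(w) as.numeric(table(factor(w, levels = vocab))),
                   numeric(length(vocab))))
colnames(dtm) <- vocab

# k-means needs the number of clusters up front and starts from random centres,
# so the seed is fixed to make this run reproducible
set.seed(6)
fit <- kmeans(dtm, centers = 3, nstart = 20)
print(fit$cluster)
```

Changing the seed or the number of centres can change the assignment, which is the random component and the a priori choice described above.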

I also calculated a score that identifies the relative frequency of each word in a cluster, which is reflected in the size of the word on each cluster’s graph. In essence, the larger the word, the more it “defines” the cluster (i.e. its location and shape in the space).

Higher resolution .pdf files of these graphs can be found here and here.

Cluster summaries

  • “Privacy and Personal information”: Clusters 1, 4, and 5 are dominated by, unsurprisingly, concerns about information. What’s interesting are the lower-level associations between the clusters and the words. The largest, densest cluster (cluster 4) deals mostly with access to personal information whereas cluster 1 addresses personal information as it relates to identity issues (such as when banking). Cluster 5 is subtly different from both 1 and 4. The extra emphasis on “share” could imply that users have different expectations of privacy with personal information that they explicitly choose to leak onto the web as opposed to personal information that they aren’t aware they are expressing. One area of further investigation would be to seek out user definitions of personal information; what makes some information more “personal” than others?
  • “Privacy and Tracking”: Cluster 6 clearly shows that people regard being tracked as a privacy issue. The lower-scored words indicate what kind of tracked information concerns them (e.g., keystrokes, cookies, site visits), but in general the notion of “tracking” is paramount to respondents in this cluster. Compare this with cluster 2, which is more strongly defined by the words “look” and “history.” This is obviously a reference to the role that browsing history has in defining privacy. It’s interesting that these clusters are so distinct from each other, because it implies that users are aware there is a difference between their browser history and other behaviors they exhibit that could be tracked. It’s also interesting that users who consider browser history a privacy issue also consider advertising and ads (presumably a reference to targeted ads) as privacy issues as well. We can use this information to extend the focus group research on targeted ads; in addition to a security risk, some users also view targeted ads as an invasion of privacy. One interesting question naturally arises: do users differentiate between spam and targeted advertisements?
  • “Privacy and Security”: The most weakly defined group is cluster 3, which can be interpreted in many ways. The least controversial inference could be that these users simply don’t have a strong definition of privacy aside from a notion that privacy is related to identity and security. This validates a notion from our focus group research that some users really don’t differentiate between privacy and security.

Final thoughts

User privacy and browser security are very important to us at Mozilla, and developing a product that improves on both requires a deep and evolving understanding of what those words mean to people of all communities - our entire user population. In this post, we’ve shown how text mining can enhance our understanding of pre-existing focus group research and generate novel directions for further research. Moreover, we’ve also shown how it can provide insight into users’ perception by looking at the differences in the language they use to define a concept. In the next post, I’ll be using the same text mining approach to evaluate user definitions of security while browsing the web.

 


3
Aug 11

Test Pilot New Tab Study Results

[Cross-posted at Mozilla User Research]

The new tab page in Firefox is intentionally left blank, while some browsers present rich information on a newly opened tab.

The decision to leave new tab pages in Firefox blank was driven, in part, by a suspicion that too much information in the new tab may distract users from getting to the destination intended for the new tab. To test whether this suspicion is true and to learn more about user behavior after opening a new tab, Test Pilot recently released the New Tab Study and will soon release a multivariate test on the new tab page. Test Pilot is a platform collecting structured user feedback through Firefox. It currently has about 3 million users and all the studies are opt in. You can help us better understand how people use their web browser and the Internet in order to build better products by participating in studies. The Test Pilot add-on is available here.

The study ran for 5 days and in all, we collected 256,282 valid submissions. Results of the study show that on average each user daily:
  • opens 11 new blank tabs
  • loads 7 pages
  • visits 2 unique domains
  • visits 2 pages in a new tab before they leave or close it

Below are details on how a user loads a page in a new tab, their intentions when opening a new tab, and time spent on new tabs.

How do users load a page in new tabs?

We detected 11 different methods to load a Web page in a blank tab page. Actions in the URL bar include pressing ENTER on the keyboard, clicking the go button on the right side of the bar, clicking a Web page suggestion in the dropdown menu and pressing the ENTER key on a dropdown suggestion. Similarly, 4 actions can be performed in the search bar. Users can also load a previously saved page from the bookmark bar in the toolbar or from Bookmark/History in the menu bar.

Note:

  • The URL bar is most used when navigating to new websites.
  • The Search bar is also popular. Users rarely use search bar dropdown to look for old search terms.
  • The Bookmark toolbar is used more often than the bookmark menu button.
  • The History Menu button is seldom used.

We can also classify all methods for loading web pages into either keyboard-based or mouse-based category. Generally speaking, users have a slight preference for mouse usage.

 

Why do users open new tabs?

1.    Are they looking for a specific URL?

13.95% of new tabs (13,941,404) are opened while the text in the clipboard starts with “http” or “www”, which are very likely to be URL strings. The number is surprisingly high, although it may be caused by previous actions rather than by pasting for loading a specific URL.

2.    Users browse a limited set of domains, and only a small proportion of domains attract most visits

If we represent each user as a single point in a plot where the x-axis is the number of pageloads and the y-axis is the number of unique domains visited, we get the following graph. The dashed line (diagonal) is what would happen if users always visited a different domain for each page load. When users are not so active, with pageloads below or around a few hundred, the number of unique domains grows linearly. However, once users browse more, the number of distinct domains tends to stabilize and saturate.

Globally, we check the visit frequencies of all domains and find that only 17.38% of domains (461,133 unique domains in total) take 80% of the total page loads (8,291,541 pageloads in total). This is consistent with the famous “20-80” law of long tail phenomena.
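
The share of domains that accounts for 80% of page loads can be computed with a few lines of R. The function name `share_of_domains` and the page load counts are made up for illustration.

```r
# Fraction of domains needed to cover a given share of page loads
share_of_domains <- function(loads, target = 0.8) {
  sorted  <- sort(loads, decreasing = TRUE)
  covered <- cumsum(sorted) / sum(sorted)
  # number of top domains needed to reach the target, over all domains
  which(covered >= target)[1] / length(sorted)
}

# Made-up page loads per domain with a long tail (illustrative only)
loads <- c(500, 200, 150, 50, 30, 25, 20, 15, 5, 5)
print(share_of_domains(loads))  # 0.3: the top 3 of 10 domains cover 80% of loads
```

Applying the same function to the page loads of a single user gives that user's proportion of “main domains”.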

On the individual level, we are interested in whether a single user’s browsing follows the 20-80 law. For each individual, the domains taking 80% of the total page visits are defined as “main domains”. A user conforms to the 20-80 law if the ratio of the number of their main domains to the number of distinct domains is around 20%. According to the following figure, active users browse more web pages every day, but the number of primary sites they go to decreases proportionally. It suggests that when users visit more sites, they prefer to go to the same sites more frequently. The result supports the existence of a speed-dial new tab page to some extent.

 

Time Spent on New Tabs

According to the study results, on average, users open 2 pages in a new tab before they leave or close it. They load the first web page 6 seconds (median) after they open a new tab, and stay on the tab for 1 minute (median) once they start browsing. The distributions of these two timings have broad tails, so the actual mean values are much higher than the medians: users load the first web page 45 seconds (mean) after they open a new tab, and stay on the tab for 7 minutes (mean) once they start browsing. Outliers and the expected noise can shift the mean value a lot.
Meanwhile, how users open a new tab distinguishes two groups: mouse-based users and keyboard-based users. The tabs invoked by “Plus Button” and “Double Click on TabBar” represent the group of mouse-based users, and the tabs invoked by “Command+T” represent the group of keyboard-based users. It turns out that keyboard-based users act slightly faster than mouse-based ones, and they stay on the same new tab a bit longer.
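
The gap between median and mean for broad-tailed timings is easy to reproduce in R. The log-normal distribution and its parameters below are assumptions chosen only to produce a long right tail; they are not fitted to the study data.

```r
set.seed(7)
# Simulated times to first page load, in seconds, with a long right tail
times <- rlnorm(10000, meanlog = log(6), sdlog = 2)

# A handful of very long waits pulls the mean far above the median
print(c(median = median(times), mean = mean(times)))
```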

This study is a preliminary study for the redesign requirements of the new tab page in Firefox. We detected user behavior patterns in how they use new tabs, including how they load a new page, the breadth of domains visited, and the timing of different actions. In the following New Tab Multivariate Test, we will compare several designs of the new tab page, and more research questions will be answered, including whether too much information in the new tab may distract users from the original target.


27
Apr 11

Join Mozilla Metrics

The Mozilla Metrics team is expanding to meet the growing data-related opportunities and challenges faced by Mozilla and the web as a whole. In addition to open positions for a visualization expert and a metrics software engineer, we are also looking for a data analyst to focus on user experience. The UX data analyst will gather structured user insights and then leverage these insights to inspire and inform the design of our products.

Please reach out to us if you (or someone you know) have a passion for data and building a better internet.

The Mozilla community itself is also growing – so if data isn’t your thing, be sure to check out the other career listings as well!


21
Apr 11

Investigating Users’ Willingness to Recommend Firefox

Market research has shown that the Mozilla mission is a powerful attractor for Firefox users.  Furthermore, additional research has shown that recommendation is a strong method to promote the adoption of Firefox.

These observations lead to the following question: how does one’s willingness to recommend Firefox relate to their knowledge that Firefox is made by Mozilla, a mission-driven non-profit?  As an initial hypothesis, we posited that one’s willingness to recommend Firefox would be positively related to their knowledge of Firefox as a product of a mission-driven non-profit.

Using the beta survey interface, we asked Firefox 3.6 users the following questions:

  1. Did you know that by using Firefox, you are supporting a mission-driven non-profit organization?
  2. How likely are you to recommend Firefox to a friend or colleague? (0-10)

The response scale for question 2 was then used to calculate a Net Promoter Score (NPS), which is a marketing metric to gauge users’ willingness to recommend a product or service. A person who responds at 6 or lower is considered a “detractor,” whereas one who says “9” or “10” is considered a promoter. All else are considered “neutral.”

We calculate an NPS by subtracting the proportion of detractors from the proportion of promoters.  Thus, scores range from -1 (all detractors) to 1 (all promoters), and higher is better. By using this metric, we are able to investigate the relationship between the knowledge that Firefox is made by a mission-driven non-profit and one’s willingness to recommend Firefox to others.
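As a minimal sketch of this calculation (the function name and the sample ratings are made up for illustration and are not taken from the survey data):

```python
def net_promoter_score(ratings):
    """Share of promoters (9-10) minus share of detractors (0-6)."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    # Ratings of 7 or 8 are neutral: they count in the denominator only.
    return (promoters - detractors) / len(ratings)


# Hypothetical answers to "How likely are you to recommend Firefox?" (0-10)
sample_ratings = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
print(net_promoter_score(sample_ratings))
```

In this made-up sample there are five promoters and three detractors out of ten respondents.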

As a survey experiment, we also reversed the presentation of the questions, meaning that for some of the time, we asked respondents to give their willingness to recommend before they indicated whether they knew Firefox was made by a mission-driven non-profit.  We did this in order to determine if simply informing users of this fact was enough to induce a “knowledge” effect.

Figure 1a shows that over every level of response, there are more users who say they did not know that Firefox was produced by a mission-driven non-profit than those who say they did.  In particular, the number of “neutrals” (which can be interpreted as the 5s, since it is the midpoint) is greater in the “without knowledge” group than in the “with knowledge” group.  These data lend some credibility to the idea that knowing Firefox is made by a mission-driven non-profit relates to willingness to recommend, and that more people are unaware of this fact than are aware of it.

Figure 1b shows the initial results of the question ordering.  Asking users to indicate knowledge first appears to reduce the number of 8s, 9s, and 10s.  A potential explanation of this effect could be that users can tell that we are trying to induce positive feelings towards Firefox by asking them this information first.  We can interpret these data to indicate that simply telling users that Firefox is made by a mission-driven non-profit is not enough to boost their willingness to promote.

Figure 2 demonstrates the relationship of willingness to recommend by knowledge group.  This effect is quite pronounced.  The NPS of these groups (“Yes, I did know” versus “No, I didn’t know”) are different from each other, where those with knowledge are much more likely to say that they are willing to recommend Firefox to others.

These results support our initial hypothesis: one’s willingness to recommend Firefox is positively related to one’s knowledge that Firefox is made by a mission-driven non-profit.   Note that this is not a causal relationship; from this data, we cannot say that knowledge directly boosts one’s willingness to recommend Firefox.  No statistical tests of inference have been performed.  However, this survey study strongly indicates that this relationship bears further investigation.


13
Apr 11

Using the New Days Last Ping Metric to Look at Firefox 4 Downloads

As with many software companies, we are keenly interested in gauging active installations and understanding how people use our products. However, as a non-profit organization with a strong interest in promoting privacy, we also recognize there’s a fine line between this activity and tracking people in unwelcome ways. Mozilla has developed a new mechanism that further enhances privacy, while still meeting its objective to create usage metrics.

1. Measuring Usage Activity

1.1 Our Current Approach: Blocklist Cookies

Mozilla has a mechanism for maintaining installed add-ons called a “blocklist” [https://wiki.mozilla.org/Blocklisting]. This involves a scheduled request to retrieve an updated blocklist file from Mozilla servers. The request is currently performed not more than once per 24 hours by several Gecko-powered applications maintained by Mozilla (e.g., Firefox Desktop and Mobile). Because the request only happens once per day, we can study the pattern and volume of requests to understand how many active installations of a product there are on a particular day. While requests do not collect or use any personal user information, they are currently using a cookie to study how many unique active installations we see during a given time period.

1.2 From FF4 Onwards: Days Since Last Ping

It’s been our experience that using cookies for tracking unique usage is somewhat unreliable, and cookies raise privacy issues for our users. Cookies can be cleared by the user and sometimes even corrupted via proxies. For these reasons, the Mozilla Metrics team filed bug 616835 [1] to implement a new method for tracking unique installations without any need for a cookie or any other form of identifier. This not only gives us the ability to get better usage metrics, but also strengthens user privacy by removing the old cookie entirely.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=616835

Design and Implementation of Days Since Last Ping

Each time a request is made for the blocklist data, the request includes a new parameter that indicates how many days it has been since the last request. There is very little possibility to derive a fingerprint from this new parameter since it is a low number of bits, it changes on every request, and users will not maintain outlier values unless they consistently have a pattern of extremely occasional usage (i.e. months between usage of the application). If Firefox is left open unattended for 2 weeks, the days last ping value will be 1 for every day even though the user might never have been at their computer.

Computing Active Installations

For each day in the desired time period, we add all the requests with a value indicating that either this is the first ever request to blocklist (a new installation), or the last time the application made the request was before the time period we are analyzing. This means that on the first day, we count all requests with a valid parameter (i.e. between 1 and max_valid). On the second day of the time period, we count all requests with a parameter between 2 and max_valid. After iterating through each day of the time period using this algorithm, we sum all the counts together and we have the number of unique active installations in that time period.

Example

Consider the date range 04 March – 07 March. We proceed as follows:

1. For March 4th, add the ‘new’ count and the number of days last ping (dlp) ==n, n>=1.
2. For March 5th, add the ‘new’ count and number of dlp==n, n>=2. We ignore dlp==1 because those installations would have made the blocklist check with some value of dlp on March 4th.
3. Similarly for March 6th and March 7th, add ‘new’ and counts of dlp==n, n>=3 and n>=4 respectively.
4. Add all the counts in (1)-(3) to get number of unique active installations in the above period.
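
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-day data layout (a ‘new’ count plus a mapping from dlp value to request count), the function name, the default for max_valid and all the numbers are hypothetical and do not reflect the actual blocklist log format.

```python
def unique_active_installations(daily_counts, max_valid=365):
    """Sum, per day, the requests not already counted earlier in the period.

    daily_counts: one dict per day, in date order, shaped like
        {"new": <first-ever requests>, "dlp": {<days since last ping>: <count>}}
    max_valid: largest dlp value treated as valid (the value here is an assumption).
    """
    total = 0
    for day_index, counts in enumerate(daily_counts, start=1):
        total += counts.get("new", 0)
        # On day k of the period only dlp >= k is new to the period: a smaller
        # value means the installation already pinged on an earlier day of it.
        total += sum(
            n for dlp, n in counts.get("dlp", {}).items()
            if day_index <= dlp <= max_valid
        )
    return total


# Made-up counts for 04 March - 07 March
example_period = [
    {"new": 5, "dlp": {1: 80, 2: 10, 5: 5}},       # 4th: count dlp >= 1
    {"new": 3, "dlp": {1: 85, 2: 6, 3: 4}},        # 5th: count dlp >= 2
    {"new": 2, "dlp": {1: 90, 3: 1, 4: 2}},        # 6th: count dlp >= 3
    {"new": 4, "dlp": {1: 88, 2: 5, 4: 1, 9: 2}},  # 7th: count dlp >= 4
]
print(unique_active_installations(example_period))
```

With these made-up numbers the four days contribute 100, 13, 5 and 7 installations respectively.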

With this metric we can also compute the number of new installations being added on a daily basis. Also, we would like to confirm that we have a high proportion of profiles with days last ping equal to 1. This is not the same as a ‘daily user’ but it is encouraging to see a lot of installations using Firefox for two consecutive days.

Limitations

What we can’t compute is the number of installations using Firefox for exactly ‘k’ days in a week, or retention patterns.

2. Visualizing the  Behavior of the Days Last Ping Metric

Firefox 4 was released on 22nd March though beta and release candidates were available before that. This is a great opportunity to visualize the dynamics of a metric for a new product.

2.1 New Installations

We are eager to see what proportion of our daily ‘blocklist pings’ come from new installations. Figure 1 displays this proportion. The heartening observation is the positive slope of the red smoother curve. The peaks are not day-of-week effects but correspond to release dates. It is difficult to comment on day-of-week effects here because of the significant events that occurred in the time period. Nevertheless, the new profile percentage appears to be around 3.5-4.5% on a daily basis, with a slight increasing trend towards the end of the period.

2.2 ‘Daily’ Usage

Figure 2 displays the proportion of blocklist updates that come from installations with days last ping equal to 1, which means the installation was active both today and the previous day. The proportion varies from 72% to 86%, with a mean hovering around 82%. The red smoother indicates not much change. The 1st, 3rd and 5th weeks have a similar pattern: a low at the beginning of the week (a value of dlp equal to 1 on Monday means that the profile used FF on Sunday), a peak towards the middle of the week, and a dip as we approach the weekend. The 4th week was the week of the release, so understandably it looks very different. In both Figures 1 and 2, week 2 also looks different, probably because of the RC release. I would like to say there is no weekly effect, and indeed the shape is the same (except for the two exceptional weeks), but the highs and lows are different.

2.3 ‘Recent’ Usage

Together with counts of days last ping less than 7, we capture more than 90% of the users. Figure 3 shows the proportion of blocklist pings that come from installations that last contacted us 2 to 7 days earlier. The peaks and troughs are aligned opposite to those in the dlp==1 display (Figure 2). There does not seem to be an increasing trend, with the mean around 12-14%, peaking on Thursdays. Why is that? Because the bulk come from dlp=3. On average (across days of the week), 95.3% of the blocklist pings have dlp<=3 and 98.9% have dlp<=7. Figure 4 displays the mean cumulative percentage of the different values of days last ping (from 2 to 7) by day of week. The key observation is that the curve doesn’t change much, meaning there is little interaction involved here.

2.4 ‘Infrequent’ Usage

Finally, in Figure 5, we get to see the dynamics of the days last ping distribution for a new product. In Figure 5, we plot the density of days last ping greater than 14 for all the days. Each row is 7 days so we can fix a day of week by moving along columns.

Firstly, we see the distribution shifting out to the tails. This is expected, as there is more time available for installations to be used after a long period of inactivity. Also, the maximum proportion (see Figure 6) decreases steadily over time from 0.1% to about 0.03% on 3rd April, dramatically so a few days after the FF4 launch. However, in both Figures 5 and 6, we see the peak starting to rise again. This means that more installations that had been inactive for over 14 days are now using FF4. Whether this panel stabilizes is something we can see over the next few months.

2.5 Daily Actives vs. Weekly Actives

Surprisingly, if we compare unique actives over a week with the mean daily actives in that week, the numbers are relatively stable. For the 6 weeks, the ratios are 0.692, 0.593, 0.666, 0.994, 0.685 and 0.672.
In future, we can look at rolling 7-day periods and other window lengths (e.g. 14 days, monthly, etc.) and week-on-week growth for unique actives.
These are reassuring results, and we are all very eager to monitor the progress as FF4’s adoption increases.

Credits:

Thanks to Daniel Einspanjer for the introduction to the cookie usage and the background on the metrics ping.


29
Mar 11

Firefox 4: Across the World

After a marathon Firefox 4 download (15.85 MM, see http://blog.mozilla.com/blog/2011/03/25/the-first-48-hours-of-mozilla-firefox-4/), we prepared a location intensity graph. Figure 1 displays the cities by frequency of download (independent of how many downloads, though the two are strongly related). Blue is infrequent to very few (bottom 10%), bright yellow is many (top 10%).

Figure 1. Location intensity of cities. Blue is rare, bright yellow is high.

Though the above picture is heartwarming, it is at the same time disappointing. Swaths of the world are pitch black. The location information is constrained by the accuracy of the geo-location service, but that is also related to internet penetration. With Internet access rapidly becoming (in the not too distant future) a public utility, it is sad that the ‘have nots’ have one more item on the list.

Based on reader input I’ve attached a large (2248 × 1268, 1.1MB) PNG file. Click here to download.


25
Mar 11

Browsing Sessions III: Do Users Overestimate How Long They Browse?

In our last post, we found that the number of installed extensions was a good discriminant of heavier users. In this short follow-up, we’ll delve into the survey data associated with the Beta Interface study.  Here is a snapshot of some of the research we’ve been conducting.

Users overshoot their estimated browsing time

The graph above demonstrates that users tend to simply overestimate how long they use Firefox. Those that typically use the browser less have a more accurate assessment of how long they are browsing. But for users who state a longer browsing time per day, the actual browser usage is lower than their own estimate.

First, a note about the methodology behind this graphic. We estimate the average daily browsing time by aggregating the session lengths of Test Pilot users over the course of the study. Previously we have defined a browser session as a continuous period of user activity in the browser, where successive events are separated by no more than 30 minutes. We subset on the users that state they only use Firefox, to avoid the problem of a different primary browser. 
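
To make the session definition concrete, here is a minimal sketch of how event timestamps could be grouped into sessions and summed. The function names and sample timestamps are hypothetical, and measuring a session from its first to its last event is an assumption of this sketch, not a description of the Test Pilot pipeline.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # gap threshold from the definition above


def split_sessions(event_times, gap=SESSION_GAP):
    """Group timestamps into sessions; a gap longer than `gap` starts a new one."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


def total_session_time(event_times, gap=SESSION_GAP):
    """Sum of session lengths, each measured from first to last event."""
    lengths = (s[-1] - s[0] for s in split_sessions(event_times, gap))
    return sum(lengths, timedelta())


# Hypothetical events for one user on one day
start = datetime(2011, 3, 1, 9, 0)
events = [start + timedelta(minutes=m) for m in (0, 10, 35, 120, 140)]
print(len(split_sessions(events)), total_session_time(events))
```

Here the 85-minute pause between the third and fourth events splits the day into two sessions of 35 and 20 minutes. Dividing a user's total by the number of days observed would give an average daily browsing time.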

We thought of a few possible explanations as to why, for heavier users, the estimated time is lower than the stated time. Those users might, for instance, be online and using their computers quite a bit during the day, but have integrated their online workflow with their offline ones. Software engineers are a good example of this – we might expect a programmer to be working on a computer all day, leaving the browser open, and using it every once in a while.  So there may be the perception of constant browser usage.  This certainly rings true from the experience of the Metrics team – we’re on our computers almost all day, with Firefox open, despite working.  This is, however, only speculative at this point, since we don’t have data about when users are on their computers.

There are still some obvious methodological issues with this approach: a user might, for instance, use Firefox on a work computer (with Test Pilot installed) and a different one for home use, which could account for the difference. As such, we hope to include a survey question asking “How much time a day do you spend on this computer?” in the next version of the study.  At that point, we can update this research.