visualization


15
Aug 11

Text mining users’ definitions of browsing privacy

One issue that’s been on everyone’s mind lately is privacy.  Privacy is extremely important to us at Mozilla, but it isn’t exactly clear how Firefox users define privacy.  For example, what do Firefox users consider to be essential privacy issues?  What features of a browsing experience lead users to consider a browser to be more or less private?

To answer these questions, we asked users to give us their definitions of privacy, specifically privacy while browsing.  The assumption was that users would have different definitions, but that there would be enough similarities between groups of responses that we could identify “themes” amongst the responses. By text mining user responses to an open-ended survey question asking for definitions of browsing privacy, we were able to identify themes directly from the users’ mouths:

  1. Regarding  privacy issues, people know that tracking and browser history are  different issues, validating the need for browser features that address  these issues independently (“private browsing” and “do not track”)
  2. People’s definitions of personal information vary, but we can group people according to the different ways they refer to personal information (this leads to a natural follow-up question: what makes some information more personal than others?)
  3. Previous focus group research, contracted by Mozilla, showed that users are aware that spam indicates a security risk, but what didn’t come out of the focus group research was that users also consider spam to be an invasion of their privacy (a follow-up question: what do users define as “spam”?  Do they consider targeted ads to be spam?)
  4. There are users who don’t distinguish privacy and security from each other

Some previous research on browsing and privacy

We  knew from our own focus group research that users are concerned about viruses, theft of their personal information and passwords, that a  website might misuse their information, that someone may track their  online “footprint”, or that their browser history is visible to others.   Users view things like targeted ads, spam, browser crashes, popups, and  windows imploring them to install updates as security risks.

But it’s difficult to broadly generalize findings from focus groups.  One group may or may not have the same concerns as the general population.  The quality of the discussion moderator, or some unique combination of participants,  the moderator, and/or the setting can also influence the findings you get from focus groups.

One way of validating the representativeness of focus group research is to use surveys.  But while surveys may increase the representativeness of your findings, they are not as flexible as focus groups.  You have to give survey respondents their answer options up front.  Therefore, by providing the options that a respondent can endorse, you are limiting their voice.

A typical  way to approach this problem in surveys is to use open-ended survey questions.  In the pre-data mining days, we would have to manually code  each of these survey responses: a first pass of all responses to get an idea of respondent “themes” or “topics” and a second pass to code each  response according to those themes.  This approach is costly in terms of time and effort, plus it also suffers from the problem of reproducibility; unless themes are extremely obvious, different coders might not classify a response as part of the same theme.  But with modern text mining methods, we can simulate this coding process much more quickly and reproducibly.

Text mining open-ended survey questions

Because text mining is growing in popularity, primarily due to its computational feasibility, it’s important to review the methods in some detail.  Text mining, as with any machine learning-based approach, isn’t magic.  There are a number of caveats to make about the text mining approach used. First, the clustering algorithm I chose to use requires an arbitrary and a priori decision regarding the number of clusters.  I looked at 4 to 8 clusters and decided that 6 provided the best trade-off between themes expressed and redundancy.  Second, there is a random component to clustering, meaning that one clustering of a set of data may not produce exactly the same results as another clustering of the same data. Theoretically, there shouldn’t be tremendous differences between the themes expressed in one clustering over another, but it’s important to keep these details in mind.
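To make the two caveats concrete, here is a minimal sketch. The post does not name the clustering algorithm that was used, so plain k-means stands in here only because it shares both properties: the number of clusters `k` has to be chosen up front, and the random starting centroids mean different seeds can give different groupings. The points, the function names `kmeans` and `inertia`, and the seeds are all made up for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, seed, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    # Random starting centroids: this is the source of run-to-run variation.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(point, centroids[c]))
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def inertia(points, labels):
    """Total squared distance from each point to its cluster mean (lower is tighter)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = tuple(sum(dim) / len(members) for dim in zip(*members))
        total += sum(sq_dist(p, center) for p in members)
    return total

# Toy 2-D points invented for illustration; not the survey data.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
          (9.0, 0.2), (9.1, 0.0), (8.8, 0.4)]

# Try several values of k and several seeds, then compare tightness by eye.
for k in (2, 3, 4):
    for seed in (1, 2, 3):
        labels = kmeans(points, k, seed)
        print(k, seed, round(inertia(points, labels), 3))
```

Fixing the seed makes a single run repeatable, but it does not remove the underlying sensitivity; comparing a few seeds for each candidate `k` is one cheap way to see how stable the themes are.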

The general idea of text mining is to assume that you can represent documents as “bags of words”, that bags of words can be represented or coded quantitatively, and that the quantitative representation of text can be projected into a multi-dimensional space. For example, I can represent survey respondents in two dimensions, where each point is a respondent’s answer.  Points that are tightly clustered together mean that these responses are theoretically very similar with respect to lexical content (e.g., commonality of words).
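As a rough illustration of the bag-of-words idea, the sketch below turns a few responses into word-count vectors over a shared vocabulary and compares them with cosine similarity. The responses are invented, the helper names are hypothetical, and cosine similarity is only one common choice of closeness measure; the post does not say which measure or weighting was actually used.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase a response, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def to_vector(bag, vocabulary):
    """Code a bag of words as counts over a fixed, ordered vocabulary."""
    return [bag[word] for word in vocabulary]

def cosine_similarity(u, v):
    """1.0 for identical word profiles, 0.0 when no words are shared."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up responses for illustration only; not actual survey answers.
responses = [
    "nobody can track the sites I visit",
    "sites cannot track what I do online",
    "my personal information stays safe",
]

bags = [bag_of_words(r) for r in responses]
vocabulary = sorted(set().union(*bags))          # one dimension per distinct word
vectors = [to_vector(b, vocabulary) for b in bags]

print(cosine_similarity(vectors[0], vectors[1]))  # shares "track", "sites", "i"
print(cosine_similarity(vectors[0], vectors[2]))  # shares no words
```

Each response becomes a point in a space with one dimension per vocabulary word; a two-dimensional picture like the one described above would come from projecting those points down, which this sketch does not attempt.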

I  also calculated a score that identifies the relative frequency of each word in a cluster, which is reflected in the size of the word on each  cluster’s graph.  In essence, the larger the word, the more it “defines”  the cluster (i.e. its location and shape in the space).

Higher resolution .pdf files of these graphs can be found here and here.
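The exact word score behind the graphs is not spelled out in the post. One simple reading of “relative frequency of each word in a cluster” is a word’s share of all the words in that cluster’s responses, sketched below with invented clusters and responses; the function name `word_scores` and the cluster names are hypothetical.

```python
from collections import Counter

def word_scores(cluster_docs):
    """For each cluster, score every word by its share of that cluster's tokens."""
    scores = {}
    for cluster, docs in cluster_docs.items():
        counts = Counter(word for doc in docs for word in doc.lower().split())
        total = sum(counts.values())
        scores[cluster] = {word: n / total for word, n in counts.items()}
    return scores

# Invented clusters and responses, for illustration only.
cluster_docs = {
    "tracking": ["no one can track me", "sites should not track my clicks"],
    "history": ["nobody can look at my history", "others cannot see my history"],
}

scores = word_scores(cluster_docs)
for cluster, words in scores.items():
    # Highest-scoring words first; ties broken alphabetically.
    top = sorted(words.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print(cluster, top)
```

In a plot, the score would drive the font size of each word, so the most frequent words in a cluster are the ones that visually “define” it.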

Cluster summaries

  • “Privacy and Personal information”: Clusters 1, 4, and 5 are dominated by, unsurprisingly, concerns about information.  What’s interesting are the lower-level associations between the clusters and the words.  The largest, densest cluster (cluster 4) deals mostly with access to personal information, whereas cluster 1 addresses personal information as it relates to identity issues (such as when banking).  Cluster 5 is subtly different from both 1 and 4.  The extra emphasis on “share” could imply that users have different expectations of privacy for personal information that they explicitly choose to leak onto the web as opposed to personal information that they aren’t aware they are expressing.  One area of further investigation would be to seek out user definitions of personal information: what makes some information more “personal” than others?
  • “Privacy and Tracking”: Cluster 6 clearly shows that people regard being tracked as a privacy issue.  The lower-scored words indicate what kind of tracked information concerns them (e.g., keystrokes, cookies, site visits), but in general the notion of “tracking” is paramount to respondents in this cluster.  Compare this with cluster 2, which is more strongly defined by the words “look” and “history.”  This is obviously a reference to the role that browsing history has in defining privacy.  It’s interesting that these clusters are so distinct from each other, because it implies that users are aware there is a difference between their browser history and other behaviors they exhibit that could be tracked.  It’s also interesting that users who consider browser history a privacy issue also consider advertising and ads (presumably a reference to targeted ads) to be privacy issues.  We can use this information to extend the focus group research on targeted ads; in addition to a security risk, some users also view targeted ads as an invasion of privacy.  One interesting question naturally arises: do users differentiate between spam and targeted advertisements?
  • “Privacy and Security”: The most weakly defined group is cluster 3, which can be interpreted in many ways.  The least controversial inference could be that these users simply don’t have a strong definition of privacy aside from a notion that privacy is related to identity and security.  This supports a notion from our focus group research that some users really don’t differentiate between privacy and security.

Final thoughts

User  privacy and browser security are very important to us at Mozilla, and  developing a product that improves on both requires a deep and evolving  understanding of what those words mean to people of all communities - our entire user population.    In this post, we’ve shown how text mining can enhance our understanding  of pre-existing focus group research and generate novel directions for  further research. Moreover, we’ve also shown how it can provide insight into  users’ perception by looking at the differences in the language they use  to define a concept.  In the next post, I’ll be using the same text  mining approach to evaluate user definitions of security while browsing  the web.

 


7
Dec 10

Mozilla Open Data Competition – 10 Days Left!

[Note: cross-posted on Mozilla Labs]

Hello Data Hackers!

We just wanted to remind everyone that the submission deadline for the first Mozilla Open Data Visualization Competition is just 10 days away! Submit your entries by December 17th for a chance at a $300 Amazon gift card and a set of all 4 Edward Tufte books!

We’ve already received some great entries, and our panel of expert judges (Kevin Fox and Jinghua Zhang, Mozilla Labs; Hamilton Ulmer, Chris Jung and Blake Cutler, Mozilla Metrics) along with our partner judges (David Smith, Revolution Analytics; Andrew Vande Moere, Information Aesthetics) look forward to seeing the rest of the submissions!

Remember to visit the Official Competition Page for all the information you need, including how to download the data and enter the competition.

Good Luck!


17
Nov 10

Mozilla Open Data Visualization Contest – Data is Now Available!

[Note: cross-posted on Mozilla Labs]

Two weeks ago the Mozilla Metrics Team, together with Mozilla Labs and the growing Mozilla Research Initiative, announced our first Open Data Visualization Competition. Today, we are excited to release the data sets for this competition!

These data sets come from Mozilla’s own open data program, Test Pilot. Test Pilot is a user research platform that collects structured user data through Firefox. Currently, over 1 million Firefox users from all over the world participate in Test Pilot studies, which aim to explore how people use their web browser and the Internet in general.

For this challenge, data samples from two recent Test Pilot studies have been made available – check out the data sample pages below for a thorough description:

In addition, please note that the submission deadline has been extended from Dec. 5 to Dec. 17th to encourage more participation. Be sure to visit the official Competition page for more general info on dates, judges, and prizes.

Good luck and start hacking!


4
Nov 10

Mozilla Open Data Visualization Competition – How Do People Use Firefox

The Mozilla Metrics Team, together with Mozilla Labs and the growing Mozilla Research Initiative, is excited to announce our first Open Data Visualization Competition.

Using data from Mozilla’s own open data program, Test Pilot, we’d like to explore creative visual answers to the question: “How do people use Firefox?” We are looking for compelling visualizations that tell detailed, meaningful and yet easy-to-interpret stories about interesting user activities.

Read on for details on the data (accessible Nov. 17), prizes, and judges, including special judge, David Smith from Revolution Analytics. Also, don’t forget to check out the official Competition page and follow @moztestpilot on Twitter for news and updates.

The Data

This competition is based on Mozilla’s own open data program, Test Pilot. Test Pilot is a user research platform that collects structured user data through Firefox. All data is gathered through pre-defined Test Pilot studies which aim to explore how people use their web browser and the Internet.

Currently, over 1 million Firefox users from all over the world participate in Test Pilot studies. The goal for this platform is to encourage everyone from all skill levels to improve the Web experience by conducting and participating in these studies. Test Pilot study results are made available under open licenses, with the data being anonymized before release. (For more information about the Test Pilot data policy, please check Privacy Policy.)

For this challenge, we will use data from two recent Test Pilot studies:

Partners and Judges

We are honored to have David Smith from Revolution Analytics partnering with us and serving as a special judge. Members from Mozilla Labs and Mozilla Metrics will form the rest of the judging panel.

Prizes

To recognize the awesome work from participants, some great prizes are at stake:

  • Grand prize will be a $300 Amazon gift card
  • Two “Best in Class” teams will receive a set of Edward Tufte’s books

We’ll also present all submissions on the Test Pilot website and Mozilla Metrics blog in a special post to highlight your work.

Join the competition!

You can choose any tools you like for your analysis and visualization, including but not limited to: R, Matlab, Protovis, Processing or IBM Many Eyes. You can participate solo or team up with other people. You are also welcome to enter as many times as you like. If you are interested in joining the competition, please note the following important dates:

  • Nov. 17th: Check the official Competition page on Nov. 17th to download the data
  • Dec. 5th: Go here to submit your results and enter the competition before Dec. 5th.
  • Dec. 14th: Winning visualizations will be announced on Dec. 14th.

To facilitate the free exchange of ideas, all visualizations and other contributions you make to this challenge must be contributed under the Creative Commons Attribution-ShareAlike 3.0 license.


14
Sep 10

Visualizing Firefox 4 Beta Usage

Last July, we presented initial analysis from our first comprehensive user interface (UI) study through an interactive, web-based heatmap. Many findings aligned with our expectations, but there were a few surprises. For example, only 12% of users clicked on the (tab bar) New Tab button and over 55% of users performed searches through the Location Bar.

Since then, we’ve introduced a major UI revision in Firefox 4 beta; tabs were moved on top, the Reload and Stop buttons were combined, and, more generally, the UI was simplified and streamlined.

Today, we’re assessing the impact of these changes through the first update to our heatmap. By intuitively presenting usability data on top of the Firefox UI, we aim to reduce speculation about Firefox usage and help designers make better informed product decisions, faster.

Here are just a few questions we can now answer:

1) Will users prefer tabs on top, or switch back to tabs on bottom?
2) Will Windows users smoothly transition to the combined Firefox menu button?
3) With fewer options to open new tabs, will use of the (tab bar) New Tab button rise?
4) Do advanced users interact with more of Firefox’s features, and, if so, which?

Heatmap Updates

Before I address these questions, let me briefly describe our heatmap updates. In addition to adjusting to the new UI, we:

1) Increased our sample size from just under 10,000 to well over 230,000 users (being integrated into the beta has its perks!).

2) Instrumented the combined Firefox menu button, doubling the number of UI elements tracked.

3) Added bar charts that compare stats for beginner, intermediate, and advanced users. To view these charts, hover over any UI element.

Analysis

Now, back to the questions.

1) Tabs on top: Initial analysis indicates that this move was well received by our users; 92% of users kept tabs on top for the majority of their browsing sessions.

2) Combined menu button: Collapsing the menu bar into a single button was one of the most significant changes to the Firefox 4 UI. How did Windows 7 and Vista users react to the move?

The majority appear comfortable with this change, but 29.7% did switch back to the traditional menu bar. This relatively large number is not surprising because early betas included an incomplete version of the Firefox button. To access basic menu items, including Downloads, New Tab, and Start Private Browsing, users needed to revert to the traditional menu.

3) New tab button: In our first iteration of the Test Pilot UI study, we were surprised to see that only 12% of people used the (tab bar) New Tab button. But in this study, the New Tab button was the fourth most commonly used element, with over 88% of users clicking the button at least once.

This large shift can be partially accounted for by both the lack of a New Tab menu item in the Firefox button and the removal of the New Tab button from the Customize Toolbars pane. In early betas, the New Tab button in the tab bar was the only way to open a blank tab besides the keyboard shortcut “Ctrl + T”.

4) Tech level differences: The heatmap affirms many of the usage differences we expected to see across varying skill levels. Beginner users are more likely to click on UI elements for common tasks that have popular keyboard shortcuts (e.g. Back, Forward, Bookmark Star). They are also more likely to use the Go button in the Location and Search Bars.

Advanced users, on the other hand, are more likely to use the Location Bar and search more frequently. They also are heavier users of the Tab Scroll arrows, the List All Tabs button, the RSS icon, and the Site Identity button — perhaps reflecting a better understanding of online security.
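Figures such as “88% of users clicking the button at least once” boil down to counting distinct users per UI element. Below is a minimal sketch of that calculation; the event log format, the user IDs, and the element names are hypothetical and do not reflect the actual Test Pilot data schema.

```python
from collections import defaultdict

def share_of_users(events, all_users):
    """Percent of users who interacted with each UI element at least once."""
    users_by_element = defaultdict(set)
    for user_id, element in events:
        # A set keeps repeat clicks by the same user from being double counted.
        users_by_element[element].add(user_id)
    return {
        element: 100.0 * len(users) / len(all_users)
        for element, users in users_by_element.items()
    }

# Hypothetical (user_id, element) click events; not real study data.
events = [
    ("u1", "new_tab_button"), ("u1", "new_tab_button"),
    ("u2", "new_tab_button"), ("u2", "back_button"),
    ("u3", "back_button"), ("u4", "location_bar"),
]
all_users = {"u1", "u2", "u3", "u4"}

shares = share_of_users(events, all_users)
for element, pct in sorted(shares.items()):
    print(f"{element}: {pct:.1f}% of users")
```

Running the same function on the subset of events from beginner, intermediate, or advanced users (with the matching subset of `all_users`) would give the kind of per-skill-level comparison the bar charts show.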

Conclusion

We’ve covered some basic findings, but are only beginning to dive into the data. For those that want to look at the raw numbers, we’ll post data samples soon (check back here).

As always, Mozilla does not make design decisions based on Test Pilot data alone. Despite increasing our sample size more than twenty-fold, our beta population is still not representative.

Additionally, quantitative data tracks how users interact with the new UI, but not how they feel about specific design decisions. Aakash Desai has headed up some terrific work collecting qualitative data at input.mozilla.com. So far, users’ response to the new user interface has been exceedingly positive!

Notice anything unusual in the data or have another take on our analysis? Please leave your thoughts in the comments!