I’m always amazed when feed software preserves fidelity.

Below is a screenshot from Google Reader. I’ve been using it for a while now, in combination with Firefox 2’s web feed reader integration. For some reason, other browsers don’t have that feature.

Anyway, I was browsing through my feeds and I noticed the busted post on the Google AJAX Search API Blog. At first, I sighed, assuming I was looking at Yet Another Artificially Intelligent Ampersand Correction Algorithm deployed by the (closed-source) Reader server. But then I looked in Firefox’s feed preview, and saw the same ampersands going wild. Identical interpretations of Blogspot-generated roaches. Excellent.

There was one difference, though. Google Reader displayed a border around the markup sample, but Firefox 2 didn’t. I know the Google Reader guys aren’t just passing the HTML through as received from the server, because they don’t make rookie mistakes like that, so it seems our whitelists have diverged a bit. In fact, they are probably deploying a sanitizing CSS parser. You know, we should probably agree on a minimal whitelist. We’ll use the JSON-esque approach of allowing clients to consume CSS rules that are not on the standard whitelist. That way, a small set of CSS rules will be guaranteed to work in conformant applications.
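To make the whitelist idea concrete, here is a rough sketch of filtering an inline style attribute against a small set of permitted properties. Everything in it is an assumption for illustration: the property names in ALLOWED_PROPERTIES, the function name filter_inline_style, and the naive split on semicolons, which a real sanitizing CSS parser would replace with proper tokenization. It is not how Google Reader or Firefox actually do it.

```python
# Hypothetical minimal whitelist; the property names are illustrative only.
ALLOWED_PROPERTIES = {"border", "color", "font-weight", "text-align"}


def filter_inline_style(style: str) -> str:
    """Keep only whitelisted declarations from an inline style attribute."""
    kept = []
    # Naive split: a real CSS parser would tokenize strings and comments.
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not sep or not value or name not in ALLOWED_PROPERTIES:
            continue
        # Be conservative: reject anything that looks like a function call
        # (url(), expression()) or contains a backslash escape.
        if "(" in value or "\\" in value:
            continue
        kept.append(f"{name}: {value}")
    return "; ".join(kept)


print(filter_inline_style("border: 1px solid #ccc; position: fixed"))
```

The filter errs on the side of dropping things: even a harmless rgb() value is rejected, which is the price of not writing a full value parser.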

Come to think of it, we might want to standardize similar policies for restricted HTML parsing. There’s even a W3C mailing list working on this stuff. Turns out mail clients have the same issues that feed readers do. And Google Reader is just one example of a website that has this problem. Why can’t browsers borrow this policy from email clients and feed readers, and allow site authors to activate it? That way, sites wouldn’t get burned by faulty markup sanitization. Let’s go over the common proposals:

The “jail” tag

This looks simple at first.

<jail><script>...</script></jail>

The idea is that the browser would ignore script elements inside the jail element, and site authors would surround user-populated areas of the page with this element. Except, it would be pretty easy for an attacker to close the jail element and proceed:

<jail><!-- begin attack --></jail><script>...</script></jail>

Oops. Well, nothing a little ugly markup couldn’t fix:

<jail-asd987dsf897asdf89asdf>
  <!-- begin attack --></jail><script>...</script>
</jail-asd987dsf897asdf89asdf>

That changes HTML tokenization, which is annoying, but it’s not a deal breaker. There are more problems, though.
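On the server side, generating the unguessable tag name is the easy part. The sketch below assumes a hypothetical helper called wrap_untrusted and a "jail-" prefix followed by random hex; the jail element itself is only a proposal, so no browser would honor this markup today.

```python
import secrets


def wrap_untrusted(fragment: str) -> str:
    """Wrap untrusted markup in a jail element with an unguessable name."""
    # token_hex draws from the OS CSPRNG; 16 bytes gives 32 hex characters.
    tag = "jail-" + secrets.token_hex(16)
    # Vanishingly unlikely, but refuse if the fragment already has the name.
    if tag in fragment.lower():
        raise ValueError("fragment contains the jail tag name")
    return f"<{tag}>{fragment}</{tag}>"


# The attacker's </jail> no longer matches the opening tag.
print(wrap_untrusted("<!-- begin attack --></jail><script>...</script>"))
```

A fresh name per response matters here: if the name were fixed, an attacker could read it from one page and close the jail on the next.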

<jail-asd987dsf897asdf89asdf>
  <!-- begin attack --></jail><textarea>
</jail-asd987dsf897asdf89asdf>

This is a DoS attack that Ian Hickson pointed out. An unclosed textarea element will turn the rest of the page into the contents of a form. Oops. We could keep going, and change the way we consume tokens inside textarea if there’s a jail element on the stack. It’s actually not that complicated. (Ian also pointed out that the parameter aConservativeConsume is a security vulnerability that many browsers share. It’s a bit hard to exploit, though, and fixing it would hurt web compatibility.)

OK, this proposal still seems doable, but it turns out that the jail tag introduces a DoS attack of its own. If a site doesn’t enclose user input in a jail element, and makes a mistake in its sanitization routines, an attacker can inject an opening jail tag and switch off the site’s own scripts for the rest of the page. Doh. DoS attacks are less serious than the exploits which insert plugins and scripts. But is it worth bending over backwards to implement this element if it introduces new attacks? I’m not sure. Yesterday, I thought it was worth it. Today, not so much.

Fix The Servers!

A lot of people say this. Jesse is the latest. These days, there are some good libraries. There’s even one for PHP. But input validation is only half the battle. To stay safe, sites will need a template language that operates on the DOM tree, like Genshi. By combining these libraries with a whitelist of allowed HTML and CSS, sites can prevent many exploits.
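For a feel of what whitelist-based filtering involves, here is a minimal sketch built on Python’s standard html.parser module. It is not HTML Purifier or Genshi, and the whitelist contents (ALLOWED_TAGS, ALLOWED_ATTRS) and the names WhitelistSanitizer and sanitize are assumptions made up for this example. A production filter would need a far more careful policy and a browser-grade parser.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; a real policy would be larger and carefully reviewed.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "pre", "code", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SKIP_CONTENT = {"script", "style"}  # drop these elements and their text


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # whitelisted elements currently open
        self.skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops event handlers, style, and everything else
            if name == "href" and not value.lower().startswith(("http://", "https://")):
                continue  # drops javascript: and other schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray close tags such as </jail> are ignored
        # Close any elements left open inside this one, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(fragment: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    parser.close()
    # Balance anything the input left open, e.g. an unclosed <b>.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">hi <script>alert(1)</script><b>there</p>'))
```

The output is rebuilt from parser events instead of being patched with string replacement, so unclosed or stray tags in the input cannot leak into the surrounding page.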

On the downside, these libraries are a lot harder to learn than simple string concatenation. In fact, you have to understand HTML or XML a little, which is a barrier that will eliminate a large number of users. Besides, they don’t prevent someone from cutting a corner somewhere and bypassing the template system. Some systems even encourage rogue programmers to extend their applications! It would be nice to have a little defense-in-depth here, without covering our templates in jail tags.

Policy Headers

Most of the problems with the jail element stem from attempting to encode security contexts in the markup. So, how about adding a header that describes the needs of the site author? That’s the approach taken by Gervase Markham’s Content Restrictions proposal. It looks like a slam-dunk at first. Elite hackers are producing slides about it, and advocating it at conferences. It turns out the security context it provides is a little too coarse-grained. Site authors like scripts, and they like to put them everywhere. The Content Restrictions proposal requires filtering all event handlers in its restricted script modes. Bummer. Of course, authors can still add event handlers at load time, and some even advocate exorcising the demons of inline event calls. Would web authors actually use this? Maybe so. It will suck less in Firefox 3, because we have getElementsByClassName. We could also add an API that would let authors set a bunch of event handlers in one native call, if performance is still a problem.

Another downside of the proposal is that it requires settling on a whitelist of permitted HTML elements and attributes, including filtering inline CSS or banning it altogether. Sound familiar? Maybe we should define an HTML/CSS subset suitable for use across trust boundaries. In the meantime, fix your servers, and let me know what you think.

7 Responses to “Interoperability and XSS Mitigation”

  1. Edward Z. Yang Says:

    Hi, here are my two cents:

    As a slight clarification: the PHP tool you mentioned, HTML Purifier, always generates valid (X)HTML fragments that can easily be loaded into the DOM, and also uses DOM to parse the DOM when it is available (the only reason it isn’t used exclusively is due to spotty PHP4 support).

    As for the policy header proposal, it has a lot going for it, but the primary trouble with it is that it doesn’t “fix” the problem: all the older browsers floating out there won’t support it, and thus it can’t be depended on. Secure it at the server!

  2. Michael Says:

    What about digitally signing the javascript blocks? Do you see any problem with that?

    Of course, the site would need to use SSL or another specific mechanism.

  3. Ian Hickson Says:

    Actually as far as I can tell no browsers are vulnerable to the re-parse bug anymore. (Tested relatively recent versions of Safari, Opera, IE, and Firefox on Windows.) So I’m not sure fixing it would be a Web-compat problem… :-)

  4. Ezra Cooper Says:

    Rob–Another similar proposal is
    BEEP by Trevor Jim et al., presented at WWW 07. FWIW.

  5. RSnake Says:

    Hey, Rob - I’m actually the person who came up with the original concept for content restrictions, and later gave the concept to Rafael who gave it to Gerv (they will verify this). I’m told you are working on this now, so please feel free to drop me a line and I can discuss the origins of it (off this thread) and some more details. I told Mike Shaver that I’d get back to him on this, but if you are taking this over, perhaps you’d be better to speak with.

  6. Arshan Dabirsiaghi Says:

    Rob,

    This is great stuff - you might want to check out the AntiSamy project. We’ve already set up the framework and a Java implementation for safely validating rich input/content restrictions according to a policy file dictating what elements, attributes, size, etc. can occur. AntiSamy is definitely separated from the pack in that it also validates CSS.

    Side note - I’m not in love with the “sandbox header” idea, but I’m always parroting the same line my boss gave me. There’s 4 big browsers, hundreds of thousands of companies writing web applications, and millions of developers throughout the world. Where’s the easiest place to get stuff done?

    I like a modification to the jail idea.

    As long as you’re using a secure PRNG, you’re in business:

    attack_that_fails();

    The “rules” attribute could point to previously defined set of rules, either globally known or defined in the page like CSS rules.

  7. Arshan Dabirsiaghi Says:

    Wow your comment consumer totally ate my message. I was displaying a “jail” tag that had 2 attributes, a ruleset identifier and a “secret”. The ability to customize the ruleset will be an important piece of functionality that must be present for that idea to work.