Firefox 3: UTF-8 support in location bar
May 23rd, 2008 by Gen Kanai
There have been a number of posts recently looking at new features of Firefox 3, including the new smart location bar (a.k.a. Awesomebar), the new bookmarks functionality, color profile support, the site identification button, and the 3 new themes, to name just a few.
I’d like to take a look at one of the new changes for Firefox 3 - support for UTF-8 multi-byte uris. To give credit where it is due, this functionality is already available in Internet Explorer 7, in Safari 3, and in Opera 9. However, this functionality is slightly different in these browsers (which I will explain further below.)
Those of us who mainly use the Roman-language us-ascii web may not notice one of the big changes for Firefox 3: UTF-8 multi-byte support in the location bar. This is a very large usability win because non-Roman ascii language uris were previously shown in encoded form and were unreadable in Firefox 2. In Firefox 3, they are now human readable.
As an extreme example, here is the Japanese wikipedia page for the place in Japan that has the longest name, 愛知県海部郡飛島村大字飛島新田字竹之郷ヨタレ南ノ割。
For those of you who study Japanese, you would pronounce it like this: 「あいちけんあまぐんとびしまむらおおあざとびしましんでんあざたけのごうよたれみなみのわり。」
In Firefox 2 where the location bar would not display the Japanese multi-byte characters, the encoded uri is 254 (!!!) characters.
In Firefox 3, where the location bar supports UTF-8, the uri is 54 characters (and is readable within an average laptop browser window.)
http://ja.wikipedia.org/wiki/愛知県海部郡飛島村大字飛島新田字竹之郷ヨタレ南ノ割
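To make the arithmetic behind those two numbers concrete, here is a minimal sketch in Python. It is not what Firefox does internally; it only assumes that the page title is percent-encoded as UTF-8 bytes, using the standard library's urllib.parse.quote. The names BASE, PLACE, readable and encoded are mine, for illustration only.

```python
from urllib.parse import quote

# The fixed, ascii-only part of the uri and the Japanese page title.
BASE = "http://ja.wikipedia.org/wiki/"
PLACE = "愛知県海部郡飛島村大字飛島新田字竹之郷ヨタレ南ノ割"

# What Firefox 3 shows in the location bar: the title as-is.
readable = BASE + PLACE

# What Firefox 2 shows: every UTF-8 byte of the title written as %XX.
# quote() uses UTF-8 by default.
encoded = BASE + quote(PLACE)

print(len(readable))  # 54
print(len(encoded))   # 254
```

Each of the 25 Japanese characters takes 3 bytes in UTF-8, and each byte becomes 3 ascii characters (%XX), so the title alone grows from 25 to 225 characters.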
Human readability and a shorter uri together make this quite an important feature, especially for non-Roman ascii language parts of the web (which I think are the parts of the web that may be growing the fastest recently.)
Two other examples to show the extremes of multi-byte uris in ascii text:
The Welsh town of Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is 58 characters in length.
In Wikipedia Japanese, it becomes a 389 character encoded uri in Firefox 2.
It is a mere 69 characters in a browser that displays the multi-byte characters in the uri without encoding them.
http://ja.wikipedia.org/wiki/ランヴァイル・プルグウィンギル・ゴゲリフウィルンドロブル・ランティシリオゴゴゴホ
Here is a Japanese wikipedia page that has information about a portion of the US-Japan Status of Forces Agreement. It is a 704 character encoded uri in Firefox 2.
It is 104 characters using Japanese in the uri:
These are extreme examples to show what happens when a multi-byte uri becomes encoded.
Here is an enlarged image of Firefox 2 showing a uri from the Japanese volunteer-translated Mozilla Developer Center documentation on Vine Linux.
You can see that the uri after “MDC:” is unreadable encoded text.
In Firefox 3 it looks like this:
It’s a tad blurry but I hope you can see that the uri says “MDC:日本語版” which means ‘Japanese-language version.’
Here are 3 screenshots of Firefox 2 in Vista, Mac OS, and Vine Linux, as well as 3 shots of Firefox 3 in Vista, Mac OS, and Ubuntu to show you the differences.
Firefox 2 on Vista (non-human readable because of encoded uri)
Firefox 2 on Mac OS (non-human readable because of encoded uri)
Firefox 2 on Vine Linux (non-human readable because of encoded uri)
Firefox 3 on Vista (human readable with decoded uri)
Firefox 3 on Mac OS (human readable with decoded uri)
Firefox 3 on Ubuntu 8.04 (human readable with decoded uri)
Dynamis helped me make the screenshots in Japanese just as an example (as that’s the non-Roman ascii language that we are most comfortable with) but if you have examples from your non-Roman ascii language, please feel free to post Firefox 3 screenshots to the web and leave uris in the comments so people can see how this might work in another non-Roman ascii multi-byte character set.
With respect to how browsers handle this functionality differently, Firefox 3, Opera 9 and Safari 3 all automatically decode uris in the location bar so that they are human-readable. IE7 has support for UTF-8 multi-byte uris but will not automatically decode them in the location bar.
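As a rough illustration of what "automatically decode" means here, the sketch below turns an encoded uri into a display form using Python's urllib.parse.unquote. This is an assumption-laden toy, not any browser's actual logic: display_form is a hypothetical helper of mine, and real browsers are more careful (for example, they would likely keep characters such as spaces or slashes encoded).

```python
from urllib.parse import unquote

def display_form(uri: str) -> str:
    """Return a human-readable form of a percent-encoded uri.

    Hypothetical helper: decode the %XX bytes as UTF-8, and if they
    are not valid UTF-8, leave the uri exactly as it was given.
    """
    try:
        return unquote(uri, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return uri

# Valid UTF-8 bytes are shown as the character they stand for.
print(display_form("http://sv.wikipedia.org/wiki/Ume%C3%A5"))

# A byte sequence that is not valid UTF-8 is left encoded.
print(display_form("http://www.example.com/%95"))
```

The fallback branch is a design choice for this sketch: showing the raw encoded text seems safer than showing garbage when the bytes cannot be interpreted as UTF-8.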
There are no specifications anywhere for this browser behavior as far as I know (please correct me if I am wrong.)
Finally, note that pages that are not UTF-8 encoded will not be decoded properly in Firefox 3 if the uri is multi-byte.
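Here is a small sketch of why that happens, again in Python and again only an illustration. It assumes a uri whose title was percent-encoded from Shift_JIS bytes rather than UTF-8 bytes (Shift_JIS is my choice of example encoding; the variable names are mine too). Reading those bytes as UTF-8 does not give back the original text.

```python
from urllib.parse import quote, unquote

TITLE = "日本語"

# Percent-encode the title from its Shift_JIS bytes instead of UTF-8.
sjis_encoded = quote(TITLE, encoding="shift_jis")

# Decoding as UTF-8 (the default) cannot recover the title; invalid
# bytes are replaced with U+FFFD, the replacement character.
as_utf8 = unquote(sjis_encoded)

# Decoding with the encoding that was actually used does recover it.
as_sjis = unquote(sjis_encoded, encoding="shift_jis")

print(sjis_encoded)
print(as_utf8 == TITLE)
print(as_sjis == TITLE)
```

The encoded uri itself carries no label saying which character encoding its bytes came from, so a decoder that assumes UTF-8 has no way to get this case right on its own.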
It is a small feature, but for those of us who spend time in the multi-byte Internets, it is a very, very important feature for both readability and usability.
Thank you to dynamis and jdaggett for the review and help.

May 23rd, 2008 at 8:13 pm
Thanks for telling, I didn’t even know! That issue is exactly why we couldn’t use Russian letters in URLs when we implemented a Wiki a few years ago - when we tried users hated it. So we had to go back to transliteration of Russian letters (then the URL was at least readable). Looks like times are getting better for Wikis.
May 23rd, 2008 at 8:24 pm
Let me correct one thing only: you say Roman languages and non-Roman languages. That’s wrong. You should say Latin-based writing scripts and non-Latin-based writing scripts. And even that is a bit wrong. I am French, and French uses a Latin writing script, but http://fr.wikipedia.org/wiki/Electricité with a trailing acute e is encoded in FF2… Same thing for http://sv.wikipedia.org/wiki/Umeå with a trailing a-circle. In URLs, FF2 encodes anything beyond the us-ascii boundaries, really.
May 23rd, 2008 at 9:50 pm
You can even do a multi-word search in the AwesomeBar with Japanese terms in the uri: “wiki 日本”
May 24th, 2008 at 6:38 am
Daniel, thank you for the clarification. I’ve updated the article accordingly.
Edward, yes, this is a gigantic, enormous win for location bar usability. We will be focusing on this new i18n-enabled functionality whenever we talk about the AwesomeBar.
May 24th, 2008 at 8:44 pm
How can I disable this? I spent some time playing in about:config and I can’t do it.
Well to reply to my email, thank you. If I find the answer before you reply, I’ll post it.
June 16th, 2008 at 10:33 am
I thought it was a bad idea to show internationalized domain names (IDN) so that homoglyph attacks could be made more obvious. Or does this pertain solely to the path after the domain name (or to the corresponding part of a URN)?
June 16th, 2008 at 10:31 pm
@Damian - It’s a bad idea to create a crappy experience for non-latin text, too.
The perceived incremental threat to rendering UTF-8 in the location bar assumes that people are otherwise good parsers of URLs, which is consistently shown to be false. Attackers already paper “paypal.com” all over subdomains and path segments in a URL, because they know unsophisticated users won’t know the difference. On the other hand, the downside of refusing to render IDN/UTF-8 properly is that a very large portion of the net gets a second-class experience.
This isn’t an argument that “things suck so who cares if we make them worse,” it’s a reminder that, on the one hand, benefit has to be weighed against cost, and on the other, that url tinkering is not the way to keep people safe. We keep people safe by removing complex URL semantics as a thing we expect users to understand: by actively blocking known badness, and by providing stronger (verified, revocable) identity information the rest of the time.
June 25th, 2008 at 2:53 am
Is an “uri” the same thing as an “url”?
June 25th, 2008 at 7:48 am
Hi, thanks for asking. Urls are a subset of uris (uniform resource identifiers), and uri is technically the more accurate term. Bernie Zimmerman has a good overview:
http://www.bernzilla.com/item.php?id=100
July 8th, 2008 at 3:20 am
The only problem I have experienced is that the decode/encode is incorrect for resubmitting links. Example as follows.
I submit this:
http://www.google.com/search?q=%E2%80%A2
Firefox decodes %E2%80%A2 into a visible BULLET character.
When I edit the url, Firefox encodes it incorrectly as this:
http://www.google.com/search?q=%95
%95 is not a BULLET character in UTF-8. In Windows-1252, this is a BULLET, but in UTF-8, this is a MESSAGE WAITING character.
Firefox is misrepresenting my data-sending wishes!
Thoughts?
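The two forms quoted in this comment can be reproduced with a short Python sketch. This says nothing about why Firefox picked one or the other; it only shows, under the assumption that the character is percent-encoded from either its UTF-8 or its Windows-1252 bytes, where %E2%80%A2 and %95 each come from. Strictly speaking, a lone byte 0x95 is not valid UTF-8 at all; MESSAGE WAITING is the Unicode code point U+0095.

```python
from urllib.parse import quote, unquote

BULLET = "\u2022"  # the BULLET character

# Percent-encoding from UTF-8 bytes gives the three-byte form.
utf8_form = quote(BULLET, encoding="utf-8")

# Percent-encoding from Windows-1252 bytes gives the one-byte form.
cp1252_form = quote(BULLET, encoding="cp1252")

print(utf8_form)    # %E2%80%A2
print(cp1252_form)  # %95

# Read back as Windows-1252, %95 is a bullet again; read back as
# UTF-8, it is an invalid byte and becomes U+FFFD.
print(unquote(cp1252_form, encoding="cp1252") == BULLET)
print(unquote(cp1252_form, encoding="utf-8") == "\ufffd")
```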
October 5th, 2008 at 1:43 am
I would like to disable this feature as well to keep my URIs valid and copy-n-pastable. Please let me know how to do that. Thanks!