Mozilla DTD files, caveat emptor
If you’ve had the opportunity to localize Mozilla, then you have become very familiar with DTDs and the complexities that localizers face when translating a program like Firefox. I thought I would use a few blog posts to describe some of these challenges, leading up to the next generation of localization at Mozilla — L20n. [1]
Difficulty #1: Declensions of nouns, pronouns, and adjectives, and platform-specific word usage
Do you mind if I rewind us to high school Latin class? I am sure you remember reciting the declensions of the various nouns, pronouns, and adjectives. In Latin, nouns fall into five declensions, and their endings change with case, number, and gender (i.e. singular/plural and masculine/feminine/neuter). It turns out that Latin is not the only language that does this. In fact, Mozilla ships Firefox and Thunderbird in many languages whose declensions are just as complex, if not much more so.
Here is how it specifically relates to Mozilla’s DTDs.
So you want to be a localizer? In the en-US source, we declare each localizable string in a DTD file with the markup declaration “!ENTITY”. If you see “!ENTITY” in the code, then you know there is something that needs to be translated. Below, you can see an entity called brandShortName with the string “Firefox”.
<!ENTITY brandShortName "Firefox">
Every time brandShortName appears in the code, the string “Firefox” (or the translated string provided by the localizer) will be presented in the user interface to someone using Firefox. But what if a language, like Latin, has several declined forms of the word Firefox that are used in different grammatical structures?
In Polish, for instance, Firefox could be written as Firefox, Firefoksa, Firefoksowi, Firefoksem, etc., depending on how our brand name is used in context. But there really is no way to provide multiple words for Firefox in the setup you see above. brandShortName always expands to the one value in its string, and that string does not allow localizers to enter multiple possibilities. The localizer gets one string, effectively one chance, to make the translation work.
Now, let’s say that each operating system uses a different label to perform a specific function. Let’s use Polish again as an example. Below, you can see different entities for Mac (hidemac.label), Windows (hidewin.label), and Linux (hidelin.label).
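To make the one-string limit concrete, here is a minimal Python sketch of how entity references get expanded. It is only an illustration under my own assumptions: the expand function and the about.label entity are hypothetical, and this is not how Mozilla's actual DTD parser is written.

```python
import re

# Hypothetical, simplified stand-in for DTD entity expansion.
# Matches references of the form &name; (names may contain dots).
ENTITY_REF = re.compile(r"&([A-Za-z][\w.]*);")

def expand(text, entities):
    """Replace every &name; reference with the single value stored for name."""
    return ENTITY_REF.sub(
        lambda match: expand(entities[match.group(1)], entities), text
    )

entities = {
    "brandShortName": "Firefox",                # exactly one value per entity
    "about.label": "About &brandShortName;",    # hypothetical label
}

print(expand("&about.label;", entities))
```

Whatever sentence the reference sits in, it receives the same string, because the table holds one value per name and nothing else.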
<!ENTITY brandShortName "Firefox">
<!ENTITY hidemac.label "Hide &brandShortName;">
<!ENTITY hidewin.label "Hide – &brandShortName;">
<!ENTITY hidelin.label "Hide: &brandShortName;">
Can you see above how the Mac label will read “Hide Firefox”, the Windows label will read “Hide – Firefox” and the Linux label will read “Hide: Firefox”? Seems to work just fine in English. But, in Polish, the word for Hide is ukryj, which is 2nd person, singular and requires Firefox to be spelled Firefoksa if you want to be grammatically correct. Here is how we have to localize in Polish:
<!ENTITY hidemac.label "Ukryj program &brandShortName;">
<!ENTITY hidewin.label "Ukryj program – &brandShortName;">
<!ENTITY hidelin.label "Ukryj program: &brandShortName;">
Localizers have to create a new phrase, “Hide program: Firefox” instead of what seems more natural: “Hide Firefox”. No one would say “Ukryj program Firefox”. It sounds robotic and weird or even monsterish. Polish speakers say “Ukryj Firefoksa”. But, remember brandShortName can only be “Firefox”. You can imagine that this happens all the time in Polish.
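For contrast, here is a rough Python sketch of what a localizer would need: several forms of the brand name, with each string choosing the one it wants. Everything in it is hypothetical. The {brand:case} placeholder syntax, the render function, and the grammatical case labels attached to the Polish forms are my own assumptions for illustration, not an existing Mozilla feature and not L20n syntax.

```python
import re

# The Polish forms mentioned above, keyed by grammatical case.
# The case labels are an assumption made for this sketch.
BRAND_FORMS_PL = {
    "nominative": "Firefox",
    "accusative": "Firefoksa",
    "dative": "Firefoksowi",
    "instrumental": "Firefoksem",
}

# Hypothetical placeholder syntax: {brand:<case>}
PLACEHOLDER = re.compile(r"\{brand:(\w+)\}")

def render(template, forms):
    """Fill each {brand:<case>} placeholder with the requested form."""
    return PLACEHOLDER.sub(lambda match: forms[match.group(1)], template)

print(render("Ukryj {brand:accusative}", BRAND_FORMS_PL))
print(render("Hide {brand:nominative}", {"nominative": "Firefox"}))
```

A language like English only ever fills in one form, so it pays nothing for the extra flexibility, while Polish can say “Ukryj Firefoksa” without hard-coding the brand.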
Now, multiply that by 60+ localizations whose grammatical structures are different from English, across three platforms, and you get a sense of how difficult this gets. DTDs have other limits that I’ll blog about with more examples. Before we’re finished, I’ll get to the better way to do this that Axel and others have thought long and hard about.
[1] L20n is a term I would hear Axel mention in the past and I’ve spent some time learning what this actually means. I am hoping guys like Axel and Gandalf will comment on these blog posts to add to the conversation.
Thanks, Seth, for your concern about language diversity. I am very glad Mozilla is focusing on these questions, and I am waiting with much interest for the next installment of this new series.
As for the example you mention, though quite revealing, it may not be the worst issue for localizers. I think that one previous *good solution* (“let’s make an entity out of this word so that it is quicker for translators who won’t have to write it again and again”) has generated a problematic *side-effect* (“hey, the entity is useless for some languages”).
So I suppose that when Polish translators see
they just skip the entity and drop
(Sorry, Polish translators, of course you are welcome to correct me here, as it is very likely I am not right, since I unfortunately cannot speak your language.)
ouch I should have thought of that
missing parts in my message above:
when Polish translators see
they just ignore the entity and write
oh my! no way to drop entity tags in the comments :°]
I’ll try again: the Polish translators can just write “Ukryj Firefoksa” and other variable forms each time it is necessary instead of being annoyed with the entity. Sad but simple workaround.
Wow.
Looks like declensions are a big problem, comparable to the Eastern languages’ measure words in bug 473706.
Goofy, you are right that translators could just avoid it by not using entities for the product name. That is how just about everybody I know of solves this. As far as I know, the reason it is done in Mozilla is to allow for rebranding. Of course, you don’t have to make things easier for people doing rebranding, but I guess Mozilla has to, since they don’t allow unauthorised use of the Firefox brand (excuse me if I have the wording or nuances wrong).
So the interesting thing is that localisers in other projects struggle less with this since the need for rebrandability is not so strong elsewhere.
In addition to branding and the aforementioned differences in Eastern languages, localization also gets tricky with dates, one example being bug #463273, where a code rewrite made localizing more difficult.
As far as I understand, it’s a choice between simplicity for languages that don’t have declensions and making good localizations possible for those that do. It just can’t be comfortable for both at the same time.
Anxious to see what the solution you mention will be in your next post. To me it seems that, for use cases where one string is put into another, there could be a base form of the string where the localizer can omit the ending and then add the correct ending in the resulting string. But that requires a base case that isn’t used separately.
Merike, I don’t believe that the choice has to be between simplicity and correctness. KDE’s Transcript is entirely hidden for languages that don’t need to use it, but provides the power to those who need it. I would say that simplicity is an absolute requirement for any alternative solution. Things are already hard. Making it harder won’t benefit our cause.
@ Goofy, F Wolff:
Rebranding is one reason you may want to do this, but it is really a side effect of something much bigger: thinking about Mozilla as a platform. If you hard-code “Firefoksa” everywhere, even Thunderbird and SeaMonkey can no longer share the same l10n code. Making sure we use &brandShortName; all the time is in fact encouraging developers to experiment with Mozilla code and create new applications (browsers or not) with less l10n effort.
Another major headache when translating things is plural forms.
In English, the singular is used for integer 1, the plural for all other integers.
In French, the singular is used for integers 0 and 1, and the plural for all other integers.
In Polish, there are three forms: one for 1, one for numbers ending in 2-4 (except 12-14), and one for everything else, such as 5-21.
Arabic has six plural forms.
See http://doc.trolltech.com/qq/qq19-plurals.html for a great summary. The pootle page http://translate.sourceforge.net/wiki/l10n/pluralforms has a list of equations, and it in turn uses GNU’s gettext function to achieve this, http://www.gnu.org/software/automake/manual/gettext/Plural-forms.html
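The rules above can be sketched in a few lines of Python. The conditions follow the gettext-style plural formulas for English, French, and Polish; the function names and the sample word lists ("file", "fichier", "plik") are my own choices for illustration.

```python
def plural_index_en(n):
    # English: singular for 1, plural for everything else.
    return 0 if n == 1 else 1

def plural_index_fr(n):
    # French: singular for 0 and 1, plural for everything else.
    return 0 if n <= 1 else 1

def plural_index_pl(n):
    # Polish: one form for 1, one for numbers ending in 2-4
    # (except 12-14), and one for all remaining numbers.
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2

PLURAL_RULES = {"en": plural_index_en, "fr": plural_index_fr, "pl": plural_index_pl}

# Sample word forms, ordered by plural index.
FORMS = {
    "en": ["file", "files"],
    "fr": ["fichier", "fichiers"],
    "pl": ["plik", "pliki", "plików"],
}

def format_count(n, locale):
    """Pick the word form that matches n under the locale's plural rule."""
    return f"{n} {FORMS[locale][PLURAL_RULES[locale](n)]}"

for n in (0, 1, 2, 5, 12, 22):
    print(format_count(n, "en"), "|", format_count(n, "fr"), "|", format_count(n, "pl"))
```

Note how 22 goes back to the "2-4" form in Polish while 12 does not, which is exactly the kind of detail a two-form singular/plural switch in the code cannot express.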
F Wolff: “So the interesting thing is that localisers in other projects struggle less with this since the need for rebrandability is not so strong elsewhere.”
In Bugzilla, the term for principal database object ‘Bug’ is customizable. Imagine Firefox has user-configurable terms for ‘bookmark’ and ‘tab’ and all UI texts must be changed accordingly and adapt to nouns of any gender.
It’s been a while, but I think Sid Meier’s 4Xs from Colonization to Alpha Centauri have a very interesting l10n system design. I’m no expert, so try abandonia.com for a copy of Colonization and take a look at the language file; it might (hopefully) provide a different perspective and ideas for improvements.
The summary for me is that when we try to be fancy we end up having to be fancier to work around the problem we’ve created.
I’d be curious to see some examples beyond the Firefox one, such as what to do with the types of data in variables that are more closely aligned to what programs output: file names, website titles, etc.
I’m reviewing some LOCALIZATION NOTES and found some of these:
Blah + (a, b or c), where a, b and c are each, on average, a 3-word string. It seemed like an amazing solution to a coder, yet it would simply have worked better to have 3 redundant strings:
Blah a
Blah b
Blah c
And let localisers and their tools manage the differences.
One thing I would like to echo, which Friedel mentioned, is that if l20n makes things harder for the many languages that do not have this problem or don’t need to manage it yet, then I’m afraid we fail.
@ Dwayne: the point of l20n is not to make things harder. We want to make things better. Good point though.
Dwayne: ergonomic UI is always concise. It is web pages and error messages that make things complex. You asked about a real-life example; consider this one:
http://mxr.mozilla.org/mozilla/source/webtools/bugzilla/template/en/default/global/user-error.html.tmpl#111
This code is designed to emit more than 60 similar error messages and is still correct in English. Do you want to persuade developers that having 60 separate sentences is better ‘from l12y standpoint’?
I’m already engaged in the uphill battle of bug 407752.
barf. I always thought that mailnews code was bad, and now you made me look at bugzilla. how evil.