-
Updating Localization Notes
Tomer, from the Hebrew localization team, highlighted an interesting problem the other day when he emailed the l10n-drivers about an issue that has been bothering him and many other localizers. Sometimes, developers will change an entity in our locales/en-US directory, but forget to update the localization note above it to reflect the new entity. As Tomer explains,
“This causes the comment to become irrelevant to the text it references. Additionally, if someone then fixes the localization note, localizers won’t be notified on this change, and the comment does not get changed in our translations…As some of us are actually reading such comments before translating, it is important to get it 100% accurate.”
Here is an example that Tomer provides.
<!-- LOCALIZATION NOTE (bookmarksSidebarGtkCmd.commandkey): This command
- key should not contain the letters A-F, since these are reserved
- shortcut keys on Linux. -->
<!ENTITY bookmarksGtkCmd.commandkey "o">
You can see that example in our code on MXR here: http://mxr.mozilla.org/mozilla1.9.2/source/browser/locales/en-US/chrome/browser/browser.dtd#110
For those readers who may not see what is happening here, notice that the <!-- LOCALIZATION NOTE --> references “bookmarksSidebarGtkCmd.commandkey”, but the !ENTITY name is actually “bookmarksGtkCmd.commandkey”.
That mismatch in the entity names makes the localization note untrackable by localization tools, because they cannot tell which comment belongs to bookmarksGtkCmd.commandkey. Furthermore, localizers who rely on these notes have to make an educated guess about which entity the comment refers to. If the note gets updated in the future, it’s likely that localizers will miss it.
Tomer suggested writing a script to look for these mismatches. At the very least, I am hoping this post will remind developers to keep notes and entities in sync. A quick request from the l10n community: please update localization notes whenever entities change.
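Tomer’s script idea can be sketched in a few lines of Python. This is only a rough illustration, not the script he had in mind: the function name find_stale_notes and both regular expressions are my own assumptions, and the sketch assumes notes follow the “LOCALIZATION NOTE (entity.name)” pattern shown above, possibly with several comma-separated names.

```python
import re

# Assumed note format: <!-- LOCALIZATION NOTE (entity.name): ... -->
NOTE_RE = re.compile(r"<!--\s*LOCALIZATION NOTE\s*\(([^)]+)\)", re.IGNORECASE)
# Captures the name in each <!ENTITY name "value"> declaration.
ENTITY_RE = re.compile(r"<!ENTITY\s+([^\s>]+)")


def find_stale_notes(dtd_text):
    """Return (name, line_number) for every note that names an undeclared entity."""
    declared = set(ENTITY_RE.findall(dtd_text))
    stale = []
    for match in NOTE_RE.finditer(dtd_text):
        line = dtd_text.count("\n", 0, match.start()) + 1
        # Assumption: a note may list several comma-separated entity names.
        for name in (part.strip() for part in match.group(1).split(",")):
            if name and name not in declared:
                stale.append((name, line))
    return stale


SAMPLE = """<!-- LOCALIZATION NOTE (bookmarksSidebarGtkCmd.commandkey): This command
- key should not contain the letters A-F, since these are reserved
- shortcut keys on Linux. -->
<!ENTITY bookmarksGtkCmd.commandkey "o">
"""

print(find_stale_notes(SAMPLE))
# prints [('bookmarksSidebarGtkCmd.commandkey', 1)]
```

A check like this only catches notes whose entity no longer exists in the same file. It cannot tell whether the wording of a note is still accurate, so it complements a careful review rather than replacing it.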
-
Another thing about <!ENTITY> and then some on localization
Challenge 3: Some languages with multiple forms of a word
In some languages, words can have multiple forms depending on context. What if the word for tab could be written as tab, tabs, tab(x), or [prefix]-tab, where each form might be used depending on what a developer hopes to communicate?
Here is an example:
<!ENTITY tabsOpen1 "You have %1 tab open">
<!ENTITY tabsOpen2 "You have %1 tabs open">
In English, the UI tells the end user that he or she has either one tab open or more than one tab open. Using Polish again, we can see that there are multiple forms depending on the count: one kartę; two, three or four karty; zero or five or more kart. In fact, the Polish grammar rule is more complex than my explanation, and I am sure I am missing some of the rules, but you get the point.
See below:
<!ENTITY tabsOpen1 "Masz %1 kartę otwartą">
<!ENTITY tabsOpen2 "Masz %1 karty otwarte">
<!ENTITY tabsOpen3 "Masz %1 kart otwartych">
See the problem? The third form can never be used, because the code only provides for 1 tab or [x] tabs. So, Polish localizers are forced to create an artificial form like “otwórz kart: %1”. Once again, this is not really a pattern of natural spoken Polish. It reads more like a representation of a database record than a sentence.
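To make the gap concrete, here is a rough Python sketch of the selection logic the code would need in order to use all three Polish strings. The function names are hypothetical, and the rule is the commonly documented Polish plural rule, which is a bit stricter than my summary above (for example, 12 takes kart while 22 takes karty). Nothing like this exists in the DTD setup; that is exactly the problem.

```python
def polish_plural_index(n):
    """Pick which of the three Polish forms fits the count n."""
    if n == 1:
        return 0  # kartę
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return 1  # karty
    return 2  # kart


# The three strings from the entities above; "%1" is the count placeholder.
FORMS = [
    "Masz %1 kartę otwartą",
    "Masz %1 karty otwarte",
    "Masz %1 kart otwartych",
]


def tabs_open_message(n):
    """Build the message for n open tabs using the matching form."""
    return FORMS[polish_plural_index(n)].replace("%1", str(n))


for count in (1, 3, 5, 12, 22):
    print(tabs_open_message(count))
```

With only tabsOpen1 and tabsOpen2 available, the calling code effectively hard-codes the English rule (one versus many), so the third string has nowhere to plug in.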
Challenge 4: Localization in the broader sense
Sometimes in our UI, colors, icons, spacing allocated to certain words like “Firefox”, and more are hard-coded, limiting the ability for a localizer to change them to make more sense or work well in their localizations. If those elements are not hard-coded, they can still be hard to change. In those cases, a localizer can file a bug asking a developer to provide more options.
For instance, let’s say a developer uses the colors red and green to indicate success or failure when a user submits a password. These colors might not mean anything in certain localizations. A bug is then filed and a developer works to extend the options available to that localizer so it is more meaningful. But this can be laborious, and is definitely not scalable. Moreover, this new exception forces all other localizations to translate a new entity, even though it may not have the same level of importance (if any at all) in their home language.
Other issues to think about include languages that use right-to-left writing or languages that present their characters vertically rather than horizontally. The examples are numerous and we can go through all of them, but I think you get my point. Feel free to add your examples to the comments section of this post.
Next time, I’ll present a small piece of what could be the next generation of l10n. You might think of it as Localization 2.0 or L20n.
-
DTD limitations with the gender of translated words
Yesterday, I wrote about the complexity that many localizers face when translating Firefox.
Here is another example.
Difficulty #2: Gender of words
Remember from yesterday that we can have a DTD file in the en-US version of Firefox like
<!ENTITY brandShortName "Firefox">
where every time the variable “brandShortName” appears in the code, Firefox displays the translated string that is shown in quotes.
What if a language used a different gender for the same word, causing that word to slightly change given different contexts in the user interface? Let’s look at this example, using Polish again as our language of translation:
<!ENTITY willCheck "&brandShortName; will check links">
In Polish, the localizer has to translate the entity into the following:
<!ENTITY willCheck "&brandShortName; będzie sprawdzał(a) odnośniki">
where sprawdzał is the masculine version of the word and sprawdzała is the feminine version of the word. See how that can be problematic? This isn’t how Poles speak or write naturally, using a parenthetical “a” to account for all possible genders in one sentence. In context, the proper gender version should be used. But, the localizer has to acknowledge both endings with the (a). Alternatively, the localizer could pretend that the word is simply masculine, which can obviously be sensitive depending on what word is being written, who is reading it, and what alternate meaning that word might take on with the wrong ending. Polish localizers made this change and it’s serviceable.
As we expand into new areas, where languages can be extremely different from English, we’ll need to think about a better way to do this. In the next post, I’ll give one more example and then say some more about localization. I’ll conclude all this with a possible solution.
(By the way, I don’t speak Polish. I just happen to work next to a Polish guy every day. Thanks, Gandalf, for providing me some examples to round out these posts.)
-
Mozilla DTD files, caveat emptor
If you’ve had the opportunity to localize Mozilla, then you have become very familiar with DTDs and the complexities that localizers face when translating a program like Firefox. I thought I would use a few blog posts to describe some of these challenges, leading up to the next generation of localization at Mozilla — L20n. [1]
Difficulty # 1: Declensions of nouns, pronouns, and adjectives and platform-specific word usage
Do you mind if I rewind us to high school Latin class, where I am sure you remember reciting the declensions of the various forms of nouns, pronouns and adjectives? In Latin, the five declensions have different number and gender endings (i.e. singular/plural and masculine/feminine/neuter). It turns out that Latin is not the only language that does this. In fact, Mozilla ships Firefox and Thunderbird in many languages that have similar, if not much higher, complexities with declensions.
Here is how it specifically relates to Mozilla’s DTDs. Take the following example:
So you want to be a localizer? In the en-US source, we mark the strings in our DTDs with the markup declaration “!ENTITY”. If you see “!ENTITY” in the code, then you know there is something that needs to be translated. Below, you can see that there is a variable called brandShortName with the string “Firefox”.
<!ENTITY brandShortName "Firefox">
Every time brandShortName appears in the code, the string “Firefox” (or the translated string provided by the localizer) will be presented in the user interface to someone using Firefox. But, what if a language has several different declensions of the word Firefox, like Latin, that could be used in different grammatical structures?
In Polish, for instance, Firefox could be written as Firefox, Firefoksa, Firefoksowi, Firefoksem, etc. depending on how our brand name is used in context. But, there really is no way to provide multiple words for Firefox in the setup you see above. brandShortName will always have the value in the string, but that string does not allow localizers to enter multiple possibilities. The localizer gets one string, effectively one chance, to make the translation work.
Now, let’s say that each operating system uses a different label to perform a specific function. Let’s use Polish again as an example. In the example below, you can see different entities for Mac (hidemac.label), Windows (hidewin.label), and Linux (hidelin.label).
<!ENTITY brandShortName "Firefox">
<!ENTITY hidemac.label "Hide &brandShortName;">
<!ENTITY hidewin.label "Hide – &brandShortName;">
<!ENTITY hidelin.label "Hide: &brandShortName;">
Can you see above how the Mac label will read “Hide Firefox”, the Windows label will read “Hide – Firefox” and the Linux label will read “Hide: Firefox”? Seems to work just fine in English. But, in Polish, the word for Hide is ukryj, which is 2nd person singular and requires Firefox to be spelled Firefoksa if you want to be grammatically correct. Here is how we have to localize in Polish:
<!ENTITY hidemac.label "Ukryj program &brandShortName;">
<!ENTITY hidewin.label "Ukryj program – &brandShortName;">
<!ENTITY hidelin.label "Ukryj program: &brandShortName;">
Localizers have to create a new phrase, “Hide program: Firefox”, instead of what seems more natural: “Hide Firefox”. No one would say “Ukryj program Firefox”. It sounds robotic and weird or even monsterish. Polish speakers say “Ukryj Firefoksa”. But, remember brandShortName can only be “Firefox”. You can imagine that this happens all the time in Polish.
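If it helps to see why the localizer is stuck, here is a minimal Python sketch of what entity expansion amounts to: plain text substitution. It is an illustration under my own assumptions, not how Firefox’s parser is implemented; the expand function and the dictionary are hypothetical stand-ins for the DTD machinery.

```python
import re

# Matches an entity reference such as &brandShortName;
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")


def expand(value, entities):
    """Replace each &name; reference with that entity's value.

    Note: this sketch does not guard against circular references.
    """
    def substitute(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave unknown references untouched
        return expand(entities[name], entities)

    return ENTITY_REF.sub(substitute, value)


# The Polish entities from above, held in a plain dictionary.
polish = {
    "brandShortName": "Firefox",
    "hidemac.label": "Ukryj program &brandShortName;",
    "hidewin.label": "Ukryj program – &brandShortName;",
    "hidelin.label": "Ukryj program: &brandShortName;",
}

for key in ("hidemac.label", "hidewin.label", "hidelin.label"):
    print(expand(polish[key], polish))
# prints:
# Ukryj program Firefox
# Ukryj program – Firefox
# Ukryj program: Firefox
```

Because the substitution knows nothing about grammar, “Ukryj &brandShortName;” would always come out as “Ukryj Firefox”, never “Ukryj Firefoksa”. The entity carries exactly one string, so the only workaround is to rephrase the sentence around it.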
Now, multiply that by 60+ localizations whose grammatical structures are different from English, across three platforms, and you get a sense of how difficult this gets. DTDs have other limits that I’ll blog about with more examples. Before we’re finished, I’ll get to the better way to do this that Axel and others have thought long and hard about.
[1] L20n is a term I would hear Axel mention in the past and I’ve spent some time learning what this actually means. I am hoping guys like Axel and Gandalf will comment on these blog posts to add to the conversation.



















