-
An experiment to integrate Silme with Narro
Many of you know Romi Hardiyanto as our Indonesian localizer who has helped grow Firefox’s market share in Indonesia to 50% since he started localizing in 2007. Romi is also a dedicated Mozilla contributor who recently hosted a terrific add-ons workshop at the Information System Department Park, ITS Campus in Sukolilo, Surabaya, Indonesia. (But, I know you’ve read Gen’s post about that.)
Recently, Romi responded to a Google Summer of Code idea I had posted about helping to enhance Mozilla’s dashboard. The l10n-drivers knew that this project was a bit of an imperative, so we decided to take on development within our team before we had any guarantee from GSoC that our proposal would be accepted. (Some blog posts about the dashboard vision and progress are coming from me and Axel.) Given the ambiguity about the resources Mozilla would commit to the idea, the GSoC proposal was rejected.
But, from the ashes came an idea to do a similar summer of code style project within Mozilla. What if we could redirect Romi to do another experimental project that would have some benefit to the localization community? Could Romi contribute to Silme by working on an implementation? In the past, we’ve supported some of our tool authors with funding and development resources. It turns out that Narro, another tool used by many of our localization teams, seemed like a good fit for the experiment. Voila, a new proposal took shape.
I am pleased to announce that Romi will be working to integrate Silme, a library of localization scripts created by Gandalf, into Narro. With Silme integration, we should be able to get exports of translated strings from Narro that are file-type independent (because Silme does that nicely) and can be used by the localizers and l10n-drivers to smooth out any commit bugs when it comes time to push changes back to the l10n code repositories.
Why is this important?
I’ve blogged in the past about the uniqueness of Mozilla’s DTD and property file types. Our file structure and file types can conflict with the output sent to us by people who choose to localize with tools. With Silme integration, we’ll have something that maps a bit more nicely to DTD and property files, with fewer conflicts. You can read more about Silme on Gandalf’s blog, including this wiki page that describes what features we hope to add in the 0.7 release.
The early challenge for Romi’s project is going to be embedding a Python interpreter into Narro’s PHP code base. He has researched PECL a bit and will blog soon about his findings. If you have any ideas on how to do this, Romi would love to hear them. We also have some stretch goals to hit if Silme gets integrated into Narro, and Romi will continue to blog about his progress, and those goals, over the next couple of months. Please welcome Romi when his first post to Planet appears and provide any advice you might have.
-
Another thing about <!ENTITY> and then some on localization
Challenge 3: Some languages with multiple forms of a word
In some languages, words can have multiple forms depending on context. What if the word for tab could be written as tab, tabs, tab(x), or [prefix]-tab, where each form might be used depending on what a developer hopes to communicate?
Here is an example:
<!ENTITY tabsOpen1 "You have %1 tab open">
<!ENTITY tabsOpen2 "You have %1 tabs open">
In English, the UI is telling the end-user that he or she has either one tab open or more than one tab open. Using Polish again, we can see that there are multiple forms depending on the number: one kartę; two, three or four karty; and zero or five or more kart. In fact, the Polish grammar rule is more complex than my explanation and I am sure I am missing some of the rules, but you get the point.
See below:
<!ENTITY tabsOpen1 "Masz %1 kartę otwartą">
<!ENTITY tabsOpen2 "Masz %1 karty otwarte">
<!ENTITY tabsOpen3 "Masz %1 kart otwartych">
See the problem? Option three can never be used because the code only provides for 1 tab or [x] tabs. So, Polish localizers are forced to create an artificial form like “otwórz kart: %1”. Once again, this is not really a pattern of natural spoken Polish. It reads more like a raw database field than a sentence.
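To make the three-way choice concrete, here is a minimal Python sketch of how code could pick among the three Polish strings if it were allowed to. The selection rule is the one commonly used for Polish in gettext-style plural expressions; the function names and the message table are hypothetical and are not part of Firefox or of any DTD mechanism.

```python
# Hypothetical helpers for illustration only; DTD entities offer nothing like this.

def polish_plural_index(n: int) -> int:
    """Return 0, 1 or 2 for the three Polish plural categories."""
    if n == 1:
        return 0  # exactly one: "kartę"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1  # 2-4, 22-24, ... but not 12-14: "karty"
    return 2      # 0, 5-21, 25-31, ...: "kart"


# The three strings from the example above, with %d standing in for %1.
TABS_OPEN = (
    "Masz %d kartę otwartą",
    "Masz %d karty otwarte",
    "Masz %d kart otwartych",
)


def tabs_open(n: int) -> str:
    """Format the message using the plural form that matches n."""
    return TABS_OPEN[polish_plural_index(n)] % n


for count in (1, 3, 5, 12, 22):
    print(tabs_open(count))
```

Note that 12 takes the third form while 22 takes the second, which is exactly the kind of detail a fixed pair of singular and plural entities cannot express.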
Challenge 4: Localization in the broader sense
Sometimes in our UI, colors, icons, spacing allocated to certain words like “Firefox”, and more are hard-coded, limiting the ability for a localizer to change them to make more sense or work well in their localizations. If those elements are not hard-coded, they can still be hard to change. In those cases, a localizer can file a bug asking a developer to provide more options.
For instance, let’s say a developer uses the colors red and green to indicate success or failure when a user submits a password. These colors might not mean anything in certain localizations. A bug is then filed and a developer works to extend the options available to that localizer so it is more meaningful. But this can be laborious, and is definitely not scalable. Moreover, this new exception forces all other localizations to translate a new entity, even though it may not have the same level of importance (if any at all) in their home language.
Other issues to think about include languages that use right-to-left writing or languages that present their characters vertically rather than horizontally. The examples are numerous and we can go through all of them, but I think you get my point. Feel free to add your examples to the comments section of this post.
Next time, I’ll present a small piece of what could be the next generation of l10n. You might think of it as Localization 2.0 or L20n.
-
DTD limitations with the gender of translated words
Yesterday, I wrote about the complexity that many localizers face when translating Firefox.
Here is another example.
Difficulty #2: Gender of words
Remember from yesterday that we can have a DTD file in the en-US version of Firefox like
<!ENTITY brandShortName "Firefox">
where every time the variable “brandShortName” appears in the code, Firefox displays the translated string that is shown in quotes.
What if a language used a different gender for the same word, causing that word to slightly change given different contexts in the user interface? Let’s look at this example, using Polish again as our language of translation:
<!ENTITY willCheck "&brandShortName; will check links">
In Polish, the localizer has to translate the entity into the following:
<!ENTITY willCheck "&brandShortName; będzie sprawdzał(a) odnośniki">
where sprawdzał is the masculine version of the word and sprawdzała is the feminine version of the word. See how that can be problematic? This isn’t how Poles speak or write naturally, using a parenthetical “a” to account for all possible genders in one sentence. In context, the proper gender version should be used. But, the localizer has to acknowledge both endings with the (a). Alternatively, the localizer could pretend that the word is simply masculine, which can obviously be a sensitive choice depending on what word is being written, who is reading it, and what alternate meaning that word might take on with the wrong ending. Polish localizers made this change and it’s serviceable.
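As a rough illustration of what “the proper gender version” would require, here is a small Python sketch in which the brand carries a grammatical gender and the message picks the matching verb ending. Everything here is an assumption made for the example: the data layout, the function name, and the second, made-up brand are hypothetical, and nothing like this exists in the DTD format.

```python
# Hypothetical data model: a DTD entity cannot store a gender next to its value.
BRAND = {"name": "Firefox", "gender": "masculine"}  # gender assumed, as the Polish team did

# One template per gender, using the two verb forms from the example above.
WILL_CHECK = {
    "masculine": "{brand} będzie sprawdzał odnośniki",
    "feminine": "{brand} będzie sprawdzała odnośniki",
}


def will_check(brand: dict) -> str:
    """Pick the template that agrees with the brand's grammatical gender."""
    template = WILL_CHECK[brand["gender"]]
    return template.format(brand=brand["name"])


print(will_check(BRAND))
# A made-up brand name that is a feminine noun in Polish ("browser").
print(will_check({"name": "Przeglądarka", "gender": "feminine"}))
```

The point of the sketch is only that the choice depends on information about the brand name, which a single flat string has no place to hold.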
As we expand into new areas, where languages can be extremely different from English, we’ll need to think about a better way to do this. In the next post, one more example and then some more on localization. I’ll conclude all this with a possible solution.
(By the way, I don’t speak Polish. I just happen to work next to a Polish guy every day. Thanks, Gandalf, for providing me some examples to round out these posts.)
-
Mozilla DTD files, caveat emptor
If you’ve had the opportunity to localize Mozilla, then you have become very familiar with DTDs and the complexities that localizers face when translating a program like Firefox. I thought I would use a few blog posts to describe some of these challenges, leading up to the next generation of localization at Mozilla — L20n. [1]
Difficulty #1: Declensions of nouns, pronouns, and adjectives and platform-specific word usage
Do you mind if I rewind us to high school Latin class, where I am sure you remember reciting the declensions of the various forms of nouns, pronouns and adjectives? In Latin, the five declensions have different number and gender endings (i.e. singular/plural and masculine/feminine/neuter). It turns out that Latin is not the only language that does this. In fact, Mozilla ships Firefox and Thunderbird in many languages that have similar, if not much higher, complexities with declensions.
Here is how it specifically relates to Mozilla’s DTDs. Take the following example:
So you want to be a localizer? In the en-US source, our DTD files declare each string with the markup declaration “!ENTITY”. If you see “!ENTITY” in the code, then you know there is something that needs to be translated. Below, you can see that there is a variable called brandShortName with the string “Firefox”.
<!ENTITY brandShortName "Firefox">
Every time brandShortName appears in the code, the string “Firefox” (or the translated string provided by the localizer) will be presented in the user interface to someone using Firefox. But, what if a language has several different declensions of the word Firefox, like Latin, that could be used in different grammatical structures?
In Polish, for instance, Firefox could be written as Firefox, Firefoksa, Firefoksowi, Firefoksem, etc. depending on how our brand name is used in context. But, there really is no way to provide multiple words for Firefox in the setup you see above. brandShortName will always have the value in the string, but that string does not allow localizers to enter multiple possibilities. The localizer gets one string, effectively one chance, to make the translation work.
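To see the “one string, one chance” behaviour in action, here is a minimal Python sketch that lets a real XML parser expand the entity. It assumes the declarations sit in an inline DOCTYPE rather than in a separate .dtd file as they do in the Mozilla tree, and the about.label and quit.label entities and the window and item elements are made up for the example.

```python
import xml.etree.ElementTree as ET

# Inline DTD for illustration; Mozilla keeps these declarations in separate files.
DOCUMENT = """<!DOCTYPE window [
  <!ENTITY brandShortName "Firefox">
  <!ENTITY about.label "About &brandShortName;">
  <!ENTITY quit.label "Quit &brandShortName;">
]>
<window>
  <item label="&about.label;"/>
  <item label="&quit.label;"/>
</window>"""

# The parser replaces every entity reference with the entity's single value.
root = ET.fromstring(DOCUMENT)
labels = [item.get("label") for item in root.iter("item")]
print(labels)
```

Every reference to brandShortName receives the same text, whatever the grammar of the surrounding sentence calls for. There is no hook where a second form could be supplied.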
Now, let’s say that each operating system uses a different label to perform a specific function. Let’s use Polish again as an example. In the example below, you can see different entities for Mac (hidemac.label), Windows (hidewin.label), and Linux (hidelin.label).
<!ENTITY brandShortName "Firefox">
<!ENTITY hidemac.label "Hide &brandShortName;">
<!ENTITY hidewin.label "Hide – &brandShortName;">
<!ENTITY hidelin.label "Hide: &brandShortName;">
Can you see above how the Mac label will read “Hide Firefox”, the Windows label will read “Hide – Firefox” and the Linux label will read “Hide: Firefox”? Seems to work just fine in English. But, in Polish, the word for Hide is ukryj, which is 2nd person, singular and requires Firefox to be spelled Firefoksa if you want to be grammatically correct. Here is how we have to localize in Polish:
<!ENTITY hidemac.label "Ukryj program &brandShortName;">
<!ENTITY hidewin.label "Ukryj program – &brandShortName;">
<!ENTITY hidelin.label "Ukryj program: &brandShortName;">
Localizers have to create a new phrase, “Hide program: Firefox” instead of what seems more natural: “Hide Firefox”. No one would say “Ukryj program Firefox”. It sounds robotic and weird or even monsterish. Polish speakers say “Ukryj Firefoksa”. But, remember brandShortName can only be “Firefox”. You can imagine that this happens all the time in Polish.
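The gap between the workaround and natural Polish can be sketched in a few lines of Python. This is purely illustrative: the table of brand forms, its key names, and both functions are hypothetical, and DTD entities have no way to hold more than the first value.

```python
# Hypothetical table of brand forms; a DTD entity can only hold the "base" value.
BRAND_FORMS = {
    "base": "Firefox",      # what brandShortName contains today
    "object": "Firefoksa",  # the form the text above says "Ukryj" requires
}


def hide_label_workaround(forms: dict) -> str:
    """What localizers ship today: insert 'program' so the base form fits."""
    return f"Ukryj program {forms['base']}"


def hide_label_natural(forms: dict) -> str:
    """What they would write if a declined form were available."""
    return f"Ukryj {forms['object']}"


print(hide_label_workaround(BRAND_FORMS))
print(hide_label_natural(BRAND_FORMS))
```

The second function is trivial once the extra form exists. The difficulty lies entirely in the file format, which gives the localizer nowhere to put it.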
Now, multiply that by 60+ localizations whose grammatical structures are different from English, across three platforms, and you get a sense of how difficult this gets. DTDs have other limits that I’ll blog about with more examples. Before we’re finished, I’ll get to the better way to do this that Axel and others have thought long and hard about.
[1] L20n is a term I would hear Axel mention in the past and I’ve spent some time learning what this actually means. I am hoping guys like Axel and Gandalf will comment on these blog posts to add to the conversation.



















