Another thing about <!ENTITY> and then some on localization

January 30th, 2009 by seth bindernagel

Challenge 3: Some languages with multiple forms of a word

In some languages, words can have multiple forms depending on context.  What if the word for tab could be written as tab, tabs, tab(x), or [prefix]-tab, where each form might be used depending on what a developer hopes to communicate?

Here is an example:

<!ENTITY tabsOpen1 "You have %1 tab open">
<!ENTITY tabsOpen2 "You have %1 tabs open">

In English, the UI is relaying to the end-user that he or she may have one tab open or more than one tab open.  Using Polish again, we can see that there are multiple forms depending on the number:  one kartę; two, three or four karty; or zero or five or more kart. In fact, the Polish grammar rule is much more complex than my explanation and I am sure I am missing some of the rules, but you get the point.

See below:

<!ENTITY tabsOpen1 "Masz %1 kartę otwartą">
<!ENTITY tabsOpen2 "Masz %1 karty otwarte">
<!ENTITY tabsOpen3 "Masz %1 kart otwartych">

See the problem?  Option three cannot and will never be used because the code only provides for 1 tab or [x] tabs.  So, Polish localizers are forced to create an artificial form like “otwórz kart: %1”.  Once again, this is not really a pattern of natural spoken Polish.  It reads more like a dump of a database field.
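To make the gap concrete, here is a minimal sketch in Python of what the code would need to do to pick among the three Polish strings above. It assumes the commonly documented integer plural rule for Polish (one / few / many). The names `polish_plural_form`, `MESSAGES`, and `tabs_open_message` are hypothetical and exist only for this illustration; nothing here is how Firefox or the DTD format actually works.

```python
# Illustrative sketch only: these names are hypothetical, not a Mozilla API.

def polish_plural_form(n: int) -> str:
    """Return 'one', 'few', or 'many' for a non-negative integer n."""
    if n == 1:
        return "one"
    # 2-4, 22-24, 32-34, ... take the "few" form, but 12-14 do not.
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    # 0, 5-21, 25-31, ... take the "many" form.
    return "many"


# The three strings from the entities above, keyed by plural form.
MESSAGES = {
    "one": "Masz %1 kartę otwartą",
    "few": "Masz %1 karty otwarte",
    "many": "Masz %1 kart otwartych",
}


def tabs_open_message(n: int) -> str:
    """Pick the string for n and substitute the %1 placeholder."""
    return MESSAGES[polish_plural_form(n)].replace("%1", str(n))


if __name__ == "__main__":
    for count in (1, 2, 5, 12, 22):
        print(tabs_open_message(count))
```

The point of the sketch is that the choice between strings depends on a per-language rule, not on a simple "one versus more than one" test, so a format that only offers two slots cannot express it.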

Challenge 4:  Localization in the broader sense

Sometimes elements of our UI, such as colors, icons, and the spacing allocated to certain words like “Firefox”, are hard-coded, which limits a localizer’s ability to change them so that they make more sense or work well in their localization.  Even when those elements are not hard-coded, they can still be hard to change.  In those cases, a localizer can file a bug asking a developer to provide more options.

For instance, let’s say a developer uses the colors red and green to indicate success or failure when a user submits a password.  These colors might not mean anything in certain localizations.  A bug is then filed and a developer works to extend the options available to that localizer so it is more meaningful.  But this can be laborious, and is definitely not scalable.  Moreover, this new exception forces all other localizations to translate a new entity, even though it may not have the same level of importance (if any at all) in their home language.

Other issues to think about include languages that use right-to-left writing or languages that present their characters vertically rather than horizontally.  The examples are numerous and we can go through all of them, but I think you get my point.  Feel free to add your examples to the comments section of this post.

Next time, I’ll present a small piece of what could be the next generation of l10n.  You might think of it as Localization 2.0 or L20n.


  1. Another big problem with localization is that English is a very concise language with short words, while other languages (for example German or Italian) have complex structures and long words. “Add-on” in Italian is “Componente aggiuntivo”: can you see the problem of using it as a menu or button label? ;-)

    Most of the time we have to accept compromises to fit our localization in the existing UI. Unfortunately this is a problem that can’t be solved by L20n, but can be minimized with a good UI design.

  2. Challenge 3: I would reword this; you are discussing plural forms. Unless by %1 you mean something other than a pure number, say the words ‘one’, ‘two’ (but that would raise a myriad of other problems).

    Plurals are either very simple, as in English, French and Asian languages, or very complex, as in Polish, Slovenian (I think) and Arabic (6 forms). Mostly they’re well documented and formulated. There seem to only be about 10 forms, with Eastern European languages having the most variations.

    Nice thing about plurals is that computers can help here and the problem is solved. Although we seem to keep solving it. Gettext has had it since 2000 I think. KDE has managed plurals for a very long time. Qt has introduced plural handling in I think v4. Mozilla’s own .properties have it solved also; I don’t think DTDs have it solved though. Most of these solutions require an editing tool to hide the complexity from the user.

    The folks developing the CLDR and Unicode have started formulating this data quite well, also handling soft cases like ‘no files’ and ‘many files’.

    Challenge 4: This is an area where few people venture.

    In terms of sizes: this is a problem of the UI toolkit, in our case XUL. It shouldn’t be a localisation issue. The toolkit should be able to adjust to the requirements of the language. Changing entries that say “35em” is prone to errors.

    Some other examples that can affect localisation

    * Dates and Time: different calendars and different times. Swahili time starts with zero hours being six in the morning, very much like time in the Bible. We can localise dates but many people have different date systems, such as the Ethiopians, who had their year 2000 bug in 2007.
    * Sounds: what noise does an animal make in your language?
    * Pictures: The GNOME icon in Thailand is problematic since showing a foot is offensive.
    * Jokes: Our about:robots page is completely untranslatable in most languages, unless you’re Hungarian with a rich scifi tradition.
    * Yes/No: some languages change the terms depending on the question. Irish being one of them.
    * Tone of address: while Firefox could be quite chatty in Afrikaans, it’s a machine so it won’t have a close personal relationship with you. So the translation of ‘you’ becomes ‘U’, not the informal ‘jy’. Not a technical problem but problematic for new translators. In other languages there are more levels, e.g. Nepali has 4.

    Ah the world of localisation.

  3. @ Dwayne: Thank you for this commentary. You know, I might leave my blog post as is. You are definitely pointing out that the plural form is the definitive example of what I am referring to.

    But what if multiple forms of a word exist that should be used in different contexts? Frankly, I have no examples here, so you are probably right. I am just trying to be as general as possible. Your note provides the best clarification.

    Also, your comments about general localization are great. The world of localization has many nuances that are great challenges to be solved one day.

  4. @ Flod. Thanks for your comment. I hope we can start to think about ways to solve these issues with l20n.

  5. Actually, the problem of plurals is far from solved. There’s a common hack for the most common case, which is one composed string with one non-negative integer number.

    Anything beyond that, in particular, “n items of foo” doesn’t have a good solution.

    Calendar has some great examples in stuff like “2nd Friday in the month” etc. Those are just horrible in a host of languages, and on top of getting a system that supports such cases in the backend, finding the right UI and UE for localizers to actually fill that out is yet another challenge.
