Teaching CakePHP to be Multilingual (part 3)

This is the last part of a three part series. (Part 1) (Part 2)

The basic premise of our strategy for dynamic localization was to replace actual strings in the database with ints, which were foreign keys into a `translations` table that held the actual strings. The `translations` table looks like:

+------------------+------------------+------+-----+---------+-------+
| Field            | Type             | Null | Key | Default | Extra | 
+------------------+------------------+------+-----+---------+-------+
| id               | int(11) unsigned |      | PRI | 0       |       | 
| locale           | varchar(10)      |      | PRI |         |       | 
| localized_string | text             | YES  |     | NULL    |       | 
| created          | datetime         | YES  |     | NULL    |       | 
| modified         | datetime         | YES  |     | NULL    |       | 
+------------------+------------------+------+-----+---------+-------+

When combined, our id and locale create the primary key which will look up a textual value in the localized_string field. This method has the advantage of utilizing foreign keys (one weakness of the PEAR::Translation2 method). The price we pay is complexity – if we want to look at any information in the database, there will be joins, and plenty of them.

Striding forward with our plan, we hit a roadblock with the way CakePHP handles associations. Specifically, CakePHP doesn’t allow us to associate models on arbitrary columns. For example, a hasMany relationship allows you to choose the “foreignKey” for one table, but the other is assumed to be the primary key. This means we can’t just setup relationships in the model and forget about them. Using CakePHP’s relationship model still seemed like the right thing to do, even if it wouldn’t be entirely automatic.

Firstly, to identify which fields in each model would be translated, we added another array to each model called $translated_fields.

Next, we had to build a relationship between whichever model we were using and the Translation model. Since we had to use custom SQL to retrieve the translations, we had to build this relationship on the fly for every object that had translations. Using app_model’s beforeFind() made this, relatively, easy. The code we used to build the relationship is fairly long, but it’s straightforward and well commented. The more notable part is the freakishly large and complex queries that come out of it. For example, this is an actual query that is run to retrieve information for an addon (forgive the strange line breaks – I’m trying to fit it in the wordpress column):

SELECT 
    `Addon`.`id`, 
    `Addon`.`guid`, 
    IFNULL(`tr_name`.localized_string, `fb_name`.localized_string) AS `name`, `Addon`.`defaultlocale`, 
    `Addon`.`addontype_id`, 
    `Addon`.`status`, 
    `Addon`.`icontype`, 
    IFNULL(`tr_homepage`.localized_string, `fb_homepage`.localized_string) AS `homepage`, 
    IFNULL(`tr_description`.localized_string, `fb_description`.localized_string) AS `description`, 
    IFNULL(`tr_summary`.localized_string, `fb_summary`.localized_string) AS `summary`, 
    `Addon`.`averagerating`, 
    `Addon`.`weeklydownloads`, 
    `Addon`.`totaldownloads`, 
    IFNULL(`tr_developercomments`.localized_string, `fb_developercomments`.localized_string) AS `developercomments`, 
    `Addon`.`inactive`, 
    `Addon`.`trusted`, 
    `Addon`.`viewsource`, 
    `Addon`.`prerelease`,
    `Addon`.`adminreview`,
    `Addon`.`sitespecific`,
    `Addon`.`externalsoftware`,
    IFNULL(`tr_eula`.localized_string, `fb_eula`.localized_string) AS `eula`,
    IFNULL(`tr_privacypolicy`.localized_string, `fb_privacypolicy`.localized_string) AS `privacypolicy`,
    `Addon`.`created`,
    `Addon`.`modified`,
    IF(!ISNULL(`tr_description`.localized_string), `tr_description`.locale, `fb_description`.locale) 
                AS `description_locale`,
    IF(!ISNULL(`tr_developercomments`.localized_string), `tr_developercomments`.locale, `fb_developercomments`.locale) 
                AS `developercomments_locale`,
    IF(!ISNULL(`tr_eula`.localized_string), `tr_eula`.locale, `fb_eula`.locale) AS `eula_locale`,
    IF(!ISNULL(`tr_homepage`.localized_string), `tr_homepage`.locale, `fb_homepage`.locale) AS `homepage_locale`,
    IF(!ISNULL(`tr_name`.localized_string), `tr_name`.locale, `fb_name`.locale) AS `name_locale`,
    IF(!ISNULL(`tr_privacypolicy`.localized_string), `tr_privacypolicy`.locale, `fb_privacypolicy`.locale) 
                AS `privacypolicy_locale`,
    IF(!ISNULL(`tr_summary`.localized_string), `tr_summary`.locale, `fb_summary`.locale) AS `summary_locale`
FROM `addons` AS `Addon` 
LEFT JOIN translations AS `tr_description` 
    ON ( `Addon`.`description` = `tr_description`.id AND `tr_description`.locale='en-US' )
LEFT JOIN translations AS `fb_description` 
    ON (`Addon`.`description` = `fb_description`.id AND `fb_description`.locale=`Addon`.`defaultlocale`)
LEFT JOIN translations AS `tr_developercomments` 
    ON (`Addon`.`developercomments` = `tr_developercomments`.id AND `tr_developercomments`.locale='en-US')
LEFT JOIN translations AS `fb_developercomments` 
    ON (`Addon`.`developercomments` = `fb_developercomments`.id AND `fb_developercomments`.locale=`Addon`.`defaultlocale`)
LEFT JOIN translations AS `tr_eula` 
    ON (`Addon`.`eula` = `tr_eula`.id AND `tr_eula`.locale='en-US')
LEFT JOIN translations AS `fb_eula` 
    ON (`Addon`.`eula` = `fb_eula`.id AND `fb_eula`.locale=`Addon`.`defaultlocale`)
LEFT JOIN translations AS `tr_homepage` 
    ON (`Addon`.`homepage` = `tr_homepage`.id AND `tr_homepage`.locale='en-US')
LEFT JOIN translations AS `fb_homepage` 
    ON (`Addon`.`homepage` = `fb_homepage`.id AND `fb_homepage`.locale=`Addon`.`defaultlocale`)
LEFT JOIN translations AS `tr_name` 
    ON (`Addon`.`name` = `tr_name`.id AND `tr_name`.locale='en-US')
LEFT JOIN translations AS `fb_name` 
    ON (`Addon`.`name` = `fb_name`.id AND `fb_name`.locale=`Addon`.`defaultlocale`)
LEFT JOIN translations AS `tr_privacypolicy` 
    ON (`Addon`.`privacypolicy` = `tr_privacypolicy`.id AND `tr_privacypolicy`.locale='en-US')
LEFT JOIN translations AS `fb_privacypolicy` 
    ON (`Addon`.`privacypolicy` = `fb_privacypolicy`.id AND `fb_privacypolicy`.locale=`Addon`.`defaultlocale`)
LEFT JOIN translations AS `tr_summary` 
    ON (`Addon`.`summary` = `tr_summary`.id AND `tr_summary`.locale='en-US')
LEFT JOIN translations AS `fb_summary` 
    ON (`Addon`.`summary` = `fb_summary`.id AND `fb_summary`.locale=`Addon`.`defaultlocale`)
LEFT JOIN `addontypes` AS `Addontype` 
    ON `Addon`.`addontype_id` = `Addontype`.`id` WHERE (`Addon`.`id` = 9)
    AND (`Addon`.`inactive` = 0)
    AND (`Addon`.`addontype_id` IN ('1', '2') )
LIMIT 1

That query went through several revisions, all of which had their pros and cons. Despite it’s intimidating length, the structure has a purpose. A few notes:

  • You can see which columns have translations in the SELECT pretty easily by looking for the IFNULL() function
  • The left outer joins allow us to get nulls back where the string doesn’t exist. Our initial query used inner joins, and when a string didn’t exist in the database, we’d get nothing back.
  • The union merges two nearly identical queries – one for the localized content, and one for the English fall back content.
  • The outer “SELECT AS” statement exists so the whole query plays nicely with cake. CakePHP’s dbo implementation assumes the model name will be the index into the array where it will store data. If that name isn’t set, it will automatically pluralize the name, which eventually causes a fatal error.

In our previous revisions of the query, we had to use a complimentary afterFind() in the Translation model to organize the strings into a structure that was more usable by the views. It automatically looked if a string in the language was null, and if so, replaced it with the English version. The query above makes this unnecessary, as that workload is passed off to MySql. With either method, the end result is an array filled with dynamically translated strings, and their locales (in case we had to fall back, the view needs to know the string is in a different language).

Our method has several shortcomings, some specific to our code, and others that are a problem with most (all?) localization efforts. We’ll probably address these in the future, but in the mean time, some trouble spots are:

  • Recursion. At this point when we do a query, beforeFind() only runs on the first model. Someone already filed cakephp’s Ticket 1183 on the issue. We’ve had to do a couple extra queries on some complex pages in order to get all the strings we need.
  • We currently only support one fallback language. It would be nice to have multiple language fallback. For example, from zh-TW -> zh-CN -> en-US. This would be easiest to implement by just looking for another language (when appropriate) with each query, but I’d like to come up with a more efficient way (in terms of query length and complexity).
  • Declinations, capitalization differences, etc. – You’ll find these in any localization project, but it’s worth mentioning.

Since it’s been a while since the previous posts, a quick summary is probably in order. At this point, we’ve successfully combined localization for static and dynamic content, using systems that are fast and robust. Things are more complex because of it, but that’s par for the course when it comes to localization. What we’ve done here is an exciting step forward for enabling people to share their firefox extensions with people around the world.

Looking back over these 3 posts, I see most of the posts have been high level ideas, and less about the actual integration into the CakePHP code. If there is sufficient interest, I can make another post describing:

  • How to detect/handle the language in the URL
  • How to initialize/remap languages
  • More on integrating the dynamic parts of the localization strategy into CakePHP (probably focusing on the view).
  • Anything else you’re curious about

If you’re less interested in reading my ramblings, and more interested in code, let me point out the SVN repository one last time.

Categories: AMO

4 responses

  1. Anders wrote on :

    In your example sql you left join on addontypes, while you on the same time in the where clause specify that Addons.addontype_id must be either 1 or 2. Are you really unsure if the addontypes table contains rows with the ids 1 and 2? Judging for the table names I would assume that the content of addontypes was fairly static, so why burden the database with left join instead of a inner join?

    Do you really use to ability to use the same text for different fields? That is, e.g. having the same id in the addons.description and addons.eula columns? That would seem to me to be the only reason for having the addons.description and addons.eula columns at all (and if you really didn’t want the database to be able to use an index — since the optimizer can’t know if the id’s used will be consecutive or scattered).

    Otherwise you could instead have had:
    create table translations2 (
    id int(11) unsigned,
    field varchar(20),
    locale varchar(10),
    value text NULL,
    created datetime NULL,
    modified datetime NULL,
    primary key (id, locale, field)
    );
    For performance you would probably have an covering index on (id, field, locale, value) to make sure you could get to the data in one go.

    For having more than one fallback locale, you might have a table translationfallback:
    create table translationfallback (
    locale varchar(10),
    fallback varchar(10),
    sort int unsigned NOT NULL,
    primary key (locale, fallback)
    );
    The table should always have a row with locale=fallback and sort=0 for each locale to make it the most preferred, and a row with fallback=’default’ and sort=1000 for each locale to make sql easier.

    As for the sql itself, you might want to eliminate some of the join:
    SELECT a.id, a.guid,
    SUBSTR(MIN(IF(t.field=’name’,LPAD(tfb.sort,10,’0′)+t.value)),11) AS name,
    a.defaultlocale, a.addontype_id, a.status, a.icontype,
    SUBSTR(MIN(IF(t.field=’homepage’,LPAD(tfb.sort,10,’0′)+t.value)),11) AS homepage,
    SUBSTR(MIN(IF(t.field=’description’,LPAD(tfb.sort,10,’0′)+t.value)),11) AS description,
    SUBSTR(MIN(IF(t.field=’summary’,LPAD(tfb.sort,10,’0′)+t.value)),11) AS summary,
    a.averagerating, a.weeklydownloads, a.totaldownloads,
    SUBSTR(MIN(IF(t.field=’developercomments’,LPAD(tfb.sort,10,’0′)+t.value)),11) AS developercomments,
    a.inactive, a.trusted, a.viewsource, a.prerelease, a.adminreview, a.sitespecific, a.externalsoftware,
    SUBSTR(MIN(IF(t.field=’eula’,LPAD(tfb.sort,10,’0′)+t.value)),11) AS eula,
    SUBSTR(MIN(IF(t.field=’privacypolicy’,LPAD(tfb.sort,10,’0′)+t.value)),11) AS privacypolicy,
    a.created, a.modified,
    SUBSTR(MIN(IF(t.field=’description’,LPAD(tfb.sort,10,’0′)+t.locale)),11) AS description _locale,
    SUBSTR(MIN(IF(t.field=’developercomments’,LPAD(tfb.sort,10,’0′)+t.locale)),11) AS developercomments,
    SUBSTR(MIN(IF(t.field=’eula’,LPAD(tfb.sort,10,’0′)+t.locale)),11) AS eula,
    SUBSTR(MIN(IF(t.field=’homepage’,LPAD(tfb.sort,10,’0′)+t.locale)),11) AS homepage,
    SUBSTR(MIN(IF(t.field=’name’,LPAD(tfb.sort,10,’0′)+t.locale)),11) AS name,
    SUBSTR(MIN(IF(t.field=’privacypolicy’,LPAD(tfb.sort,10,’0′)+t.locale)),11) AS privacypolicy,
    SUBSTR(MIN(IF(t.field=’summary’,LPAD(tfb.sort,10,’0′)+t.locale)),11) AS summary
    FROM addons AS a
    INNER JOIN addontypes AS aot ON a.addontype_id=aot.id
    LEFT JOIN (translationfallback AS tfb
    INNER JOIN translations AS t ON a.description=t.id AND t.locale=if(tfb.fallback=’default’, a.defaultlocale, tfb.fallback)
    ) ON tfb.locale=’en-US’
    WHERE (a.id = 9) AND (a.inactive = 0) AND (a.addontype_id IN (1, 2))
    GROUP BY a.id, a.guid, a.defaultlocale, a.addontype_id, a.status,
    a.icontype, a.averagerating, a.weeklydownloads, a.totaldownloads, a.inactive,
    a.trusted, a.viewsource, a.prerelease, a.adminreview, a.sitespecific,
    a.externalsoftware, a.created, a.modified

    But, really, it seems kind of silly to use the old “database-in-a-database” pattern where you have a table of entities and a table with its attributes, instead of just two tables:
    create table moz_addons (
    id int unsigned,
    guid varchar(16),
    defaultlocale varchar(10) not null,
    addontype_id int unsigned not null,
    status int unsigned not null,
    icontype int unsigned not null,
    averagerating int unsigned not null,
    weeklydownloads int unsigned not null,
    totaldownloads int unsigned not null,
    inactive int unsigned not null,
    trusted int unsigned not null,
    viewsource int unsigned not null,
    prerelease int unsigned not null,
    adminreview int unsigned not null,
    sitespecific int unsigned not null,
    externalsoftware int unsigned not null,
    created datetime not null,
    modified datetime not null,
    primary key (id),
    foreign key (addontype_id) references moz_addontype (id),
    );

    create table addontranslations (
    id int unsigned,
    locale varchar(10),
    name text null,
    homepage text null,
    description text null,
    summary text null,
    developercomments text null,
    eula text null,
    privacypolicy text null,
    primary key (id, locale),
    foreign key (id) references addons (id),
    );

  2. Frédéric Wenzel wrote on :

    The LEFT JOIN on the addon type is, (I think) created by cake automagically.

  3. Ugo PARSI wrote on :

    Thanks a lot for your post, as it was really difficult to adopt a good i18n-strategy because of the lack of many documentations.

    I’d be glad to read your last post concerning the detection in URLs.

    Keep-up the good work 🙂

  4. Rostislav Palivoda wrote on :

    http://www.palivoda.eu/2008/04/i18n-in-cakephp-12-database-content-translation-part-2/