Ten Tips for Website Localization

This post has some general tips that I’d recommend to anyone wanting to write a multilingual web application. The majority of my code these days is PHP, but I think these tips are applicable to most web programming languages. In no particular order:

UTF-8 is your friend. Use it.

The big step from ASCII to Unicode was the potential to use multiple bytes to represent a single character. ASCII itself defines only 128 characters (0 to 127), and the 8-bit encodings built on top of it gave each character a number between 0 and 255; that’s all the programmer could use. If another character needed to be shown, the numbers were reused under a different encoding and a different font was loaded. If people didn’t have the same encodings or fonts, they got errors or undefined results.

Enter Unicode and UTF-8. Unicode assigns characters from all over the world numbers (code points) between 0 and 1,114,111, and UTF-8 encodes each of them in one to four bytes. This is fantastic news, because you can store text in many different languages without having to worry about specific encodings. This also means you can support language fallback for sections of your web page. If you have only 80% of your page translated for a specific language, you can fall back to an alternative language for the rest, all while using the same encoding. (An aside: be sure to use lang=”” attributes in your HTML tags if you’re changing the language mid-stream.)
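To make the “multiple bytes per character” idea concrete, here is a minimal sketch that reports how many bytes UTF-8 uses for each character in a string. It assumes the mbstring extension is loaded, PHP 7.4 or later (for mb_str_split()), and a source file saved as UTF-8. utf8_byte_lengths() is a name I made up for this example, not part of PHP.

```php
<?php
// Hypothetical helper: maps each character of $text to its UTF-8 byte length.
function utf8_byte_lengths(string $text): array
{
    $lengths = [];
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // strlen() counts bytes, so this is the encoded size of one character.
        $lengths[$char] = strlen($char);
    }
    return $lengths;
}

// "a" is ASCII, "\u{E9}" is é, "\u{20AC}" is the euro sign, "\u{30DB}" is the katakana ホ.
print_r(utf8_byte_lengths("a\u{E9}\u{20AC}\u{30DB}"));
```

The ASCII letter takes one byte, é takes two, and the euro sign and ホ take three each, yet all of them live happily in the same string.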

Don’t concatenate strings

Code like this makes me sad, and will make your localizers cry (or quit):

  <?php
    $item = "toast";

    // Example one - This is bad: the sentence is split into two fragments
    echo _("Sometimes I eat")." {$item} "._("and sometimes I don't.");

    // Example two - This is better: one complete sentence with a placeholder
    printf(_("Sometimes I eat %s and sometimes I don't."), $item);
  ?>

In the first example, chances are good the localizer will get the list of strings to translate and the two separate calls to _() will look like two different sentences with no context around them. Firstly, the phrases by themselves make no sense, and secondly, a localizer needs to be able to look at an entire sentence (and sometimes more) to understand how to translate it most effectively.

The second example uses the printf() standard %s to let the localizer know you’ll be substituting a string into the middle of the sentence. This is the current best practice for creating sentences with variables. Depending on what the string is, they may still be upset, but that’s out of scope for this tip (here’s a hint though).
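One thing that can soften the blow for localizers: PHP’s printf() family also accepts numbered placeholders such as %1$s and %2$s, so a translation can reorder the substituted values to fit its grammar. A small sketch follows. The strings and the format_message() helper are made up for illustration, and in real code the template would come from _().

```php
<?php
// Hypothetical helper: fills a (translated) template with values.
function format_message(string $template, string ...$values): string
{
    return vsprintf($template, $values);
}

// Single quotes matter here: inside double quotes PHP would try to expand $s.
$original = '%1$s added %2$s to the cart.';
// A translation is free to use the values in a different order.
$reordered = '%2$s was added to the cart by %1$s.';

echo format_message($original, 'Wil', 'toast'), "\n";
echo format_message($reordered, 'Wil', 'toast'), "\n";
```

This does not solve the grammar problems that come with substituting words into sentences, but it does remove the assumption that every language puts the values in the same order as English.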

Don’t use machine translation

In recent years great progress has been made towards programmatically translating documents from language to language. That said, it is far from being an acceptable replacement for a fluent translator. The edge cases and “what ifs” on the technical/logical side of the translation are enough for me to say that, but once you start talking about potentially offensive translations (that’s the next tip), a human translator is a definite requirement. Just look at an example of an automated German-to-English translation. It’s readable but far from polished, and not something you want as the first impression of your site.

Be culturally sensitive

If you’re not very familiar with your target culture, ask for an opinion from someone who is (or hire a localizer who is). Seemingly innocent words, phrases, and images could be misunderstood by another culture. If you use terminology that is only understood in your region or culture, the best case you can hope for is that a visitor to your site won’t understand it and will ignore it; even then, it reduces your credibility and the overall enjoyment of visiting your site.

Use multi-byte functions

This may be a little PHP-specific, but it’s good to be aware of in any language. PHP has string functions and multibyte string functions. The plain string functions work on bytes, while the latter support characters that take up more than one byte (i.e. most UTF-8 characters outside of ASCII). This is essential when manipulating strings with letters outside of the basic Latin alphabet. If you’re not using PHP, at the least, verify that your programming language manipulates multi-byte strings correctly.
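Here is a quick sketch of the difference, assuming the mbstring extension is loaded and the file is saved as UTF-8. The variable names are only for illustration.

```php
<?php
$word = "\u{30DB}\u{30FC}\u{30E0}"; // ホーム: three characters, nine bytes in UTF-8

$byteCount = strlen($word);                   // counts bytes
$charCount = mb_strlen($word, 'UTF-8');       // counts characters
$firstTwo  = mb_substr($word, 0, 2, 'UTF-8'); // first two characters
$broken    = substr($word, 0, 2);             // first two bytes: cuts a character apart

printf("%d bytes, %d characters\n", $byteCount, $charCount);
printf("mb_substr() gives: %s\n", $firstTwo);
printf("substr() result is valid UTF-8: %s\n", mb_check_encoding($broken, 'UTF-8') ? 'yes' : 'no');
```

The substr() call is the dangerous one: it returns part of a character, which is no longer valid UTF-8 and will usually show up as garbage in the browser.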

Separate your views from your logic

I’m a fan of MVC separation, but there are plenty of other architectural patterns. Depending on what process and software you use for localization you may be giving template files to localizers. If that’s the case, the simpler the better - you don’t need a bunch of complex code around the strings waiting to be translated. Even if you’re using a method that doesn’t require giving template files to localizers, updating strings is easier, and whoever does maintenance on your software in the future will thank you.
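As a rough sketch of what I mean (not a full MVC setup), the logic below only produces data, and the view is the only place that contains user-facing strings. The function names are hypothetical, and $translate stands in for _() so the example runs without a gettext catalog.

```php
<?php
// Logic: decides what to show. No markup and no user-facing text here.
function cart_view_data(array $items): array
{
    return ['count' => count($items), 'is_empty' => $items === []];
}

// View: the only place a localizer (or their tools) needs to look at.
function render_cart(array $data, callable $translate): string
{
    if ($data['is_empty']) {
        $text = $translate('Your cart is empty.');
    } else {
        $text = sprintf($translate('Items in your cart: %d'), $data['count']);
    }
    return '<p>' . htmlspecialchars($text) . '</p>';
}

// 'strval' returns the string unchanged, like _() without a translation.
echo render_cart(cart_view_data(['toast', 'jam']), 'strval'), "\n";
```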

Use (meaningful) placeholder text

This one might be a little controversial and is gettext-specific. The documented and recommended way to use gettext is to pass an English string to the gettext() function. This serves two purposes: it lets the localizer see the complete English string when they are translating, and it lets gettext fall back to English if a translation isn’t available.

I’m suggesting using a substitute string in place of the English string. For example, instead of _("Error: Your cart is full!") I would use _("error_cart_full"). English translations are done in the .po file, just like every other locale. By following this rule, it’s possible to change the English text, without affecting the other translations. Using the documented method means that even adding a comma means changing every locale’s .po file and then recompiling them all. If you’ve got localizers watching for changes on their files (through a shared repository) this means they have to check and verify any changes - it’s a hassle and it’s time consuming for everyone involved.

The first purpose I mentioned, seeing English strings, can be duplicated by running msgattrib --set-fuzzy $file1 | msgmerge -NUs $file2 where $file1 is the updated en-US .po file, and $file2 is the outdated .po file from another locale. This will merge the English strings into the other locale, but will mark them as fuzzy, so gettext will ignore them until they are translated.

The second purpose can be addressed just by making sure the strings you’re trying to use are available. If you need to use a new English string on the site and the localizer is unavailable, you can temporarily move the fallback logic into your code:

  <?php
    // This is a temporary fix!
    if (_("string_to_translate") === "string_to_translate") {
      // No translation was found (gettext returned the key): print the English string
    } else {
      // Print the translated string
    }
  ?>
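If that check ends up in more than one place, it could be wrapped in a small helper so there is only one thing to remove later. This is just a sketch: translate_with_fallback() and the $englishFallbacks array are hypothetical, and $translate stands in for _() so the example runs without a catalog.

```php
<?php
// Hypothetical helper: returns the translation, or a temporary English text
// when the translation function hands the key back untranslated.
function translate_with_fallback(string $key, array $englishFallbacks, callable $translate): string
{
    $translated = $translate($key);
    if ($translated !== $key) {
        return $translated; // a real translation exists
    }
    // Last resort is the key itself, which is what gettext would show anyway.
    return $englishFallbacks[$key] ?? $key;
}

$fallbacks = ['error_cart_full' => 'Error: Your cart is full!'];
// Behaves like _() when no translation is available: returns its input.
$noCatalog = function (string $key): string {
    return $key;
};

echo translate_with_fallback('error_cart_full', $fallbacks, $noCatalog), "\n";
```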

While we’re on the subject of .po files, useful comments should be added to the file wherever appropriate to help provide context and hints for localizers.

Be aware of word length

Words in different languages have different lengths: words in Asian languages generally have fewer characters than their English equivalents, and German words generally have more. When designing the layout for your site, bear this in mind. Don’t hard-code widths on elements holding text; the words should be able to flow and expand as necessary. This can be tough with today’s complex sites, but CSS will go a long way to help. Also, when accepting user input, don’t put unneeded, arbitrary length restrictions on the input.

Don’t use graphics as text

This is just a good idea in general, but it makes even more sense when localizing pages. Creating images is time consuming and has more potential for error. Using an appropriate encoding and employing CSS should get close to the same effect (with an extra point for accessibility). If you need to use an image, be prepared to accept localized strings and make the image yourself - localizers may not have the time, skills, or software they need to create the images.

Be aware of how changing the locale can affect strings

Setting the LC_ALL locale category doesn’t just change the formatting of strings - it also changes currency formatting, time/date formatting, how things are sorted, and what symbols represent numbers/letters/etc. Some examples:

  <?php
    setlocale(LC_ALL, 'fr_FR');
    $num = 1.5;
    var_dump($num); // Prints 1.5
    echo $num; // Prints 1,5 (before PHP 8.0; newer versions always print 1.5)
  ?>
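When a number has to be handed to another system rather than shown to a person, one option is to format it explicitly instead of relying on the string conversion. A minimal sketch, where float_for_machine() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: always uses "." as the decimal separator
// and no thousands separator, whatever the current locale is.
function float_for_machine(float $value, int $decimals = 2): string
{
    return number_format($value, $decimals, '.', '');
}

// setlocale() returns false if fr_FR is not installed; the lines below
// produce the same output either way.
setlocale(LC_ALL, 'fr_FR');

echo float_for_machine(1.5), "\n"; // 1.50
echo sprintf('%.1F', 1.5), "\n";   // 1.5 (%F is the non-locale-aware float conversion)
```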

Internally, the decimal is represented by a period, and all the PHP functions will recognize that (e.g. /[0-9.]+/ matches, whereas /[0-9,]+/ does not). However, if you need to print the number as a string to pass it to another library or page (into a MySQL query, passing to JavaScript, etc.), a locale such as fr_FR can turn the period into a comma. Another example:

  <?php
    // Does not match: without the /u modifier, \w is tested against single bytes
    preg_match('/\w/', 'ホーム');
  ?>

Using regular expressions on UTF-8 data can be risky. By default, the \w escape and the [[:alpha:]] character class only match single-byte values (i.e. characters with values from 0 to 255) with the preg functions. The PCRE documentation says:

“This remains true even when PCRE includes Unicode property support, because to do otherwise would slow down PCRE in many common cases. If you really want to test for a wider sense of, say, “digit”, you must use Unicode property tests such as \p{Nd}.”

If we need to match UTF-8 strings with regular expressions in PHP, we can use:

  <?php
    mb_regex_encoding('UTF-8');
    mb_ereg('\w+', 'ホーム', $match);
    print_r($match); // Prints: Array ( [0] => ホーム  )
  ?>

By setting the internal regular expression encoding to UTF-8 and using the mb_ereg() function, we can match multibyte characters with regular expressions. Be aware, though, that this carries the performance cost the PCRE documentation mentions.
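Another route, closer to what the PCRE documentation suggests, is to stay with the preg functions and use Unicode property tests together with the /u (UTF-8) pattern modifier. A minimal sketch, assuming PHP’s PCRE was built with Unicode support (the variable names are only for illustration):

```php
<?php
$home   = "\u{30DB}\u{30FC}\u{30E0}"; // ホーム
$arabic = "\u{0663}\u{0664}";         // the Arabic-Indic digits three and four

// \p{L} matches any Unicode letter; /u makes PCRE treat the subject as UTF-8.
$hasLetters = preg_match('/\p{L}+/u', $home, $letters);

// \p{Nd} matches any Unicode decimal digit, not only 0-9.
$allDigits = preg_match('/^\p{Nd}+$/u', $arabic);

var_dump($hasLetters === 1 && $letters[0] === $home); // bool(true)
var_dump($allDigits === 1);                           // bool(true)
```

I have not measured how this compares to mb_ereg() in speed, so treat it as an alternative to try rather than a recommendation.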

9 Comments

"With the creation of UTF-8, characters from all over the world are assigned numbers between 0 and 65,535."

Actually, UTF-8 can hold not just the BMP, but all unicode characters from 0x00 to 0x10FFFF, or numbers between 0 and 1,114,111.

UCS-2 is the encoding limited to the first 65535 chars.
-- Karellen, 26 Jul 2007

Everything would be even better if we had the new L20n framework ready, I guess... We should really get something moving there again... ;-)
-- Robert Kaiser, 26 Jul 2007

Good entry. Me like.

(I think the comments count on your blog is off by one.)
-- Barry, 26 Jul 2007

Nicely written, Wil.

Regarding the fallback code: I still believe this is a very ugly solution as it requires code changes (and necessarily patch reviews etc) twice, once for putting it in and once for removing it. I have to admit though that there may not be a better solution out there, at least not while gettext doesn't come with a fallback procedure of its own (beyond its current "display the msgid instead").

(Another short note: either you use sprintf() with echo or you just use printf() with no "echo". printf prints on its own.)
-- Fred, 27 Jul 2007

Thanks Fred and Karellen. I fixed the typos.
-- Wil Clouser, 27 Jul 2007

In section "Don't concatenate strings" you make a good point, but you should know that the given better option is hugely flawed too. Plenty of languages have trouble with this scheme. The problems come about as %s is (re)used in other strings. This is a problem in several languages when different %s require different (e.g.) endings or prepositions.

In Finnish (my language) this results in being forced to add strange sounding modifiers to the sentence. In the given example the English equivalent would be something like "Sometimes I eat the food product %s and sometimes I don't." The point of adding "the food product" is that in Finnish I could now use the basic form of %s i.e. no ending would be necessary.

An authentic example of the same can be seen on AMO, where the Finnish version replaces the composite string "Browse Extensions by Category" with the English equivalent of "Browse by Category the Add-on type Extensions" where Extensions is of course replaced by Themes etc. when appropriate. The Finnish translation could translate "Themes" and "Extensions" to their correct forms, but it's hard to find out all the places where a given "%s" string is used and there's really no way to know where it's going to be used in the future. Thus an ugly fix like the one above is a necessary eye-sore. BTW. I don't mean to belittle the quality of AMO here in any way.... the same applies for Firefox et al. too!

The fix would be to use only full sentences when ever possible. That would also fix the annoyance of capital letters mid sentence .... trying to guess where a "%s" string will be placed in sentence (and thus whether it should grammatically be with a capital letter or not) is a terrible pain and error prone.
-- Ville Pohjanheimo, 31 Jul 2007

Don't you wish these things were everyday knowledge? I'm consistently amazed by how many people know almost nothing about character encoding and the like.
-- matt, 05 Feb 2008

Some global tips.

1. UTF8 is it, there are few reasons to use any other encoding. If you don't know what these reasons are -- use UTF8.

2. PHP sucks, it's a web scripting language whose native functions lack support for multibyte strings. See point 1 and the multibyte tip above. If possible use a language where strlen returns the length of the string instead of the byte count. If you must use PHP, familiarize yourself with the multibyte or iconv functions.

3. Consider pre-processing localization strings unless an app requires allowing users to switch language at runtime.

4. If you serve UTF8 and store data as UTF8 -- make sure user submitted data is UTF-8 encoded.

$valid_utf8 = iconv('UTF-8', 'UTF-8', $user_data);
-- utf-8 guy, 06 Feb 2008

My tip would be this localization tool: https://poeditor.com. It helps you a lot more than you can help yourself.
-- Tessa, 02 Jul 2013
