Maintaining localization between Python and PHP (it's not fun)
I reached my hand into the barrel of problems our migration to Python is going to cause and came up with Localization. It figures.
First out of the chute were the .po files. It turns out the actual formatting is different between the two languages. PHP uses %1$s for its substitutions, but Python uses either named variables like %(num)s or integers like {0}. For the record, they both support %s when you don't need to order the substitutions.
PHP example:
I have %2$s apples and %1$s oranges
Python example:
I have {1} apples and {0} oranges
Since I've worked with the Translate Toolkit before, I decided to write a script to convert between the two formats. If you find yourself in the same unfortunate boat as me, behold: phppo2pypo and pypo2phppo, which convert between the two types.
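To give a feel for what the conversion involves, here is a minimal sketch of the placeholder rewrite in both directions. It is not the code from phppo2pypo or pypo2phppo; the function names are made up for illustration, and it only handles the simple %N$s and {N} forms shown above (PHP counts from 1, Python from 0).

```python
import re

# Matches PHP positional placeholders such as %1$s or %12$s.
PHP_PLACEHOLDER = re.compile(r"%(\d+)\$s")
# Matches Python positional placeholders such as {0} or {11}.
PY_PLACEHOLDER = re.compile(r"\{(\d+)\}")


def php_to_python(text):
    """Rewrite %N$s as {N-1}. Hypothetical helper, not the real script."""
    return PHP_PLACEHOLDER.sub(lambda m: f"{{{int(m.group(1)) - 1}}}", text)


def python_to_php(text):
    """Rewrite {N} as %(N+1)$s. Hypothetical helper, not the real script."""
    return PY_PLACEHOLDER.sub(lambda m: f"%{int(m.group(1)) + 1}$s", text)


print(php_to_python("I have %2$s apples and %1$s oranges"))
print(python_to_php("I have {1} apples and {0} oranges"))
```

A real converter also has to walk every msgid, msgstr, and plural form in the .po file, which is the part the Translate Toolkit takes care of.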
Crisis averted, right? Oh, that's just scratching the surface. Remember how happy I was that PHP finally started supporting msgctxt? Well, Python has had a patch for it since 2008 but no one has bothered to land it. I wrote a new ugettext() and ungettext() that recognize context in the .po files. To use them, simply put from l10n import ugettext as _ at the top of your file.
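For the curious, the usual trick for context lookups goes roughly like this. It is a sketch for Python 3 and not the code in l10n: the function name and signature are assumptions. It relies on the GNU gettext convention of storing a contextual entry in the compiled catalog under the key msgctxt + "\x04" + msgid.

```python
import gettext

# GNU gettext joins msgctxt and msgid with this byte in compiled catalogs.
CONTEXT_SEPARATOR = "\x04"


def ugettext_with_context(translations, message, context=None):
    """Look up message, optionally within a msgctxt (hypothetical helper)."""
    if context is None:
        return translations.gettext(message)
    key = context + CONTEXT_SEPARATOR + message
    translated = translations.gettext(key)
    # A failed lookup returns the key unchanged, separator included,
    # so fall back to the bare message in that case.
    if CONTEXT_SEPARATOR in translated:
        return message
    return translated


# With an empty catalog, every lookup falls back to the original string.
null = gettext.NullTranslations()
print(ugettext_with_context(null, "Open", context="menu"))
```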
Along with adding msgctxt support, those two functions also collapse consecutive white space. We're using Jinja2 with Babel and the i18n extension as our template engine. Jinja2 has a concept of stripping white space from the beginning or end of a string but does nothing about the middle. A paragraph of text in a Jinja2 template would look like:
{% trans -%}Mozilla is providing links to these applications
as a courtesy, and makes no representations regarding the
applications or any information related thereto. Any questions,
complaints or claims regarding the applications must be
directed to the appropriate software vendor.
{%- endtrans %}
That's a decent looking template, right? Yeah, well, when Babel extracts that, it includes all the line breaks too, giving you something like this. The localizers would revolt if I sent them that, so I added in auto white-space collapsing. Getting Babel to use the new functions means a new extraction script.
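The collapsing itself is the easy part. A minimal sketch, assuming all you want is every run of spaces, tabs, and newlines squeezed down to a single space (the function name is mine, not necessarily what the real code uses):

```python
def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace run and
    # drops empty pieces, so joining with " " does the collapsing.
    return " ".join(text.split())


extracted = """Mozilla is providing links to these applications
    as a courtesy, and makes no representations regarding the
    applications or any information related thereto."""
print(collapse_whitespace(extracted))
```

The catch is that the same collapsing has to happen at extraction time and at lookup time, otherwise the msgid in the .po file will not match the string the template asks for.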
At this point, we're extracting strings from our new code and we can convert between Python and PHP files. All we need now is a Frankenstein mix of xgettext functions to act as glue. Meet the amalgamate script, which uses the pypo2php scripts, concatenates the .pot files, and merges the updates into each locale's .po file. After that it's quick tweaks to the build scripts to create z-messages.po files and we're done.
So, all that said, the new process for L10n, while we're in this transitional phase, is:
- From the PHP code, run locale/extract-po-remora.sh. That pulls everything from all the PHP files, creates locale/r-keys.pot, updates the messages.po file for each locale, and compiles them. Life used to be so simple.
- From the Python code, make sure you're up to date, then run ./manage.py extract. That will pull everything from the Python code and templates and create locale/z-keys.pot.
- Run ./manage.py amalgamate. That will merge the z-keys.pot into the PHP messages.po files.
- Localizers can make their changes as usual, and commit back to messages.po.
- From PHP, locale/copy-to-zamboni.py locale will create z-messages.po files in the Python format. We could skip right to .mo files, but in case something goes wrong I want to see the .po files.
- Then, like today, locale/compile-mo.sh locale will compile all the .po files.
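If you'd rather not type all of that by hand, the steps above can be sketched as a small driver script. The two checkout directory names are assumptions, as is my guess at which checkout the amalgamate and compile steps run from; the commands themselves are the ones listed above. It defaults to a dry run that only prints what it would do.

```python
import subprocess

# Hypothetical checkout locations; adjust to your own layout.
PHP_REPO = "remora"
PYTHON_REPO = "zamboni"

# Steps to run before the localizers do their work.
EXTRACT_STEPS = [
    (PHP_REPO, ["locale/extract-po-remora.sh"]),
    (PYTHON_REPO, ["./manage.py", "extract"]),
    (PYTHON_REPO, ["./manage.py", "amalgamate"]),  # assumed to run here
]

# Steps to run after the localizers have committed messages.po.
COMPILE_STEPS = [
    (PHP_REPO, ["locale/copy-to-zamboni.py", "locale"]),
    (PHP_REPO, ["locale/compile-mo.sh", "locale"]),  # assumed to run here
]


def run_steps(steps, dry_run=True):
    """Print each step and, unless dry_run is set, execute it in its checkout."""
    for cwd, command in steps:
        print(f"[{cwd}] {' '.join(command)}")
        if not dry_run:
            # check=True stops the run at the first failing step.
            subprocess.run(command, cwd=cwd, check=True)


run_steps(EXTRACT_STEPS)
run_steps(COMPILE_STEPS)
```

The localizer step sits between the two lists on purpose, since that part can't be scripted.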
After all those steps are done, we've got .mo files that are duplicates aside from formatting, and each application can look at its own .mo to get the strings it needs. All this code is just a big band-aid and there are plenty of things that are more fun than juggling L10n between two applications across two RCSs. But we knew what we were getting into. I'll post something more positive later to help justify it. :)