Firstly, thanks for all the great feedback. Something as seemingly simple as tagging gets complex quickly once you think it through, and the varied perspectives of the community are always great to have.
Allowing full Unicode would let anyone use meaningful tags in their own character sets but would prevent us from offering similar matches and common misspellings. On the other hand, we support several languages on AMO that don't use the Latin alphabet. It stands to reason that users would search for tags in their own character sets and would get no results. There are pros and cons for each choice but we're essentially debating the value of normalization in tagging.
After distilling all the feedback and talking amongst ourselves, our overall feeling was that forcing people to convert their input into the Latin alphabet wasn't in the users' best interest. The Mozilla Manifesto talks about a global internet that fosters creativity and free expression. Not supporting a user's native language when we have the option to doesn't feel like the right path to take.
With that in mind our current plan is as follows:
Allow full Unicode in tags to the extent we do everywhere else on AMO.
Do no automatic character normalization. The option of manual normalization (essentially, marking some tags as equivalent) is left open as a future enhancement.
Do automatic white space and capitalization normalization. Spaces are displayed on an add-on's page but when searching or entering into a URL spaces are unnecessary. For example, newyork, new york and New YORk are all equivalent.
A list of suggestions will be provided as the user types. We may attempt some simplistic character normalization in the suggestions if we can find an approach (perhaps a per-language one) that provides enough value to keep.
White space is trimmed from the beginning and end of tags before they are saved into the database.
Tags are limited to 128 characters and add-ons are limited to 80 tags.
Tags will be comma delimited. To include a comma in your tag you must use quotation marks. Quotation marks, whether they are matched or not, are discarded. Example: "Portland, OR" will become Portland, OR whereas Portland", OR will become Portland and OR.
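To make the rules above concrete, here is a minimal sketch in Python of how the parsing and normalization could work. It is an illustration only, not the AMO implementation: the names parse_tags, tag_key and check_limits are made up for this example, and raising an error when a limit is exceeded is an assumption (the plan only says the limits exist).

```python
MAX_TAG_LENGTH = 128      # limit from the plan above
MAX_TAGS_PER_ADDON = 80   # limit from the plan above


def parse_tags(raw):
    """Split a comma-delimited string into trimmed display tags."""
    tags, current, i = [], [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == '"':
            closing = raw.find('"', i + 1)
            if closing != -1:
                # Matched pair: keep the quoted text (commas included)
                # and discard the quotation marks themselves.
                current.append(raw[i + 1:closing])
                i = closing + 1
                continue
            # Unmatched quotation mark: discard it and carry on.
        elif ch == ',':
            tags.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    tags.append(''.join(current))
    # Trim white space from both ends and drop empty entries.
    return [t.strip() for t in tags if t.strip()]


def tag_key(tag):
    """Key used for comparing tags: white space removed, lowercased.

    No character normalization is done, so accented letters stay as-is.
    """
    return ''.join(tag.split()).lower()


def check_limits(tags):
    """Hypothetical helper: reject input that exceeds the limits."""
    if len(tags) > MAX_TAGS_PER_ADDON:
        raise ValueError('too many tags')
    for tag in tags:
        if len(tag) > MAX_TAG_LENGTH:
            raise ValueError('tag too long: %r' % tag)
    return tags
```

With this sketch, parse_tags('"Portland, OR"') gives a single tag, parse_tags('Portland", OR') gives two, and tag_key returns the same key for newyork, new york and New YORk.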
Additional feedback, as always, is welcome.
Tags broke into the limelight around the time "Web 2.0" was becoming popularized. They provided a simple but effective way to categorize objects and many sites are using them now. Despite their proliferation, I haven't found any documentation on the internet regarding standards for implementing tags.
A tag library exists for CakePHP but it, like many others, is too simplistic for what we want.
We've written our tagging goals into a plan but have some technical details we still need to figure out. While reviewing what we have, a couple of questions arose that we thought people would have opinions on.
1) What should the range of allowed characters be? Our first instinct was simplicity, something like /[A-Za-z0-9-]/ (that is, all English letters and numbers and a dash). This is easy to handle on our end but leaves out everyone who wants to add tags in anything other than the English alphabet. There is some debate about how useful it would be to allow other Unicode characters, particularly when you think about #2 below.
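As a quick illustration of what the restrictive option would accept and reject, here is a small sketch in Python. The function name is_ascii_tag is made up for this example, and requiring at least one character is an assumption.

```python
import re

# Same character class as above: English letters, digits and a dash.
ASCII_TAG = re.compile(r'[A-Za-z0-9-]+')


def is_ascii_tag(tag):
    """Return True only if every character of the tag is in the allowed set."""
    return ASCII_TAG.fullmatch(tag) is not None
```

A tag like web-2 passes, while résumé or a tag written in Japanese is rejected outright, which is exactly the trade-off being debated.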
2) Tags are most useful when they are normalized. By allowing Unicode characters we run the risk of diluting our tag cloud. For example, resume and résumé are close enough that for our purposes they are equivalent. If we allow Unicode we'll have to deal with converting characters like é to e and vice versa for searches. At that point we'll need a list of "equivalent" characters - not impossible but it will slow things down (both development and speed of a search). The second question is: Assuming you think we should allow Unicode characters, what characters are equivalents? Here is a quick idea from php.net's strtr() documentation:
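As a rough sketch of what such equivalence could look like (in Python rather than PHP, and not the strtr() table referred to above), Unicode decomposition can fold many accented letters onto their base letters without a hand-written list. The function name strip_accents is made up for this example.

```python
import unicodedata


def strip_accents(text):
    """Fold accented letters onto their base letters where Unicode allows it."""
    # NFD splits a character like é into "e" plus a combining accent.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks and keep everything else.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This makes résumé and resume compare equal, but it is not a complete answer: letters such as ø have no decomposition and pass through unchanged, so a list of "equivalent" characters would still be needed for those.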
Bugzilla is an awesome bug tracker that is used by hundreds of companies. I've got accounts on several projects' trackers and I'm sure many others do also.
When I get mail from Bugzilla it's not obvious which project it's from. My email client (Gmail) only shows the "from name" so all I see for these projects is:
Mozilla: bugzilla-daemon
Pootle: bugzilla-daemon
Miro: bugzilla
kernel.org: bugme-daemon
Apache: bugzilla
Wouldn't it make sense to differentiate each project's emails in the from name? Maybe even by default (something like "%SITE_NAME% Bugzilla")?
Reed says it's a personal problem because his mail client shows the full address. Am I the only one? :(
One of the things that gets a lot of news time these days is XSS. There are a lot of places that explain what it is and how to prevent it but most are oversimplified or don’t provide real world examples. I thought I’d explain a couple of the ways AMO attempts to prevent it.
Translate Toolkit 1.3.0 was released a few days ago. I was following along with trunk on my development box and I wanted to upgrade our alpha install to take advantage of the new features (namely, speed improvements) and the Django framework.
I attempted this tonight and it was not a pretty upgrade (or install, for that matter). Among the medley of problems is Django ticket #6548. Django assumes it's not behind an SSL proxy so when it does any redirects it doesn't use https. This means logging in and logging out work on our server but the user is presented with a jarring "bad request" interstitial.
The current status is that user accounts are not migrated and, even if they were, I can't seem to set permissions for projects. Since there are some odd problems that we haven't seen elsewhere and this is an alpha install, I'm going to leave it as is and debug some of the issues over the next few days. Expect downtime. If you have questions, visit #verbatim on irc.mozilla.org.