Some considerations when adding Tags to AMO

Tags broke into the limelight around the time "Web 2.0" was becoming popularized. They provide a simple but effective way to categorize objects, and many sites use them now. Despite their proliferation, I haven't found any documentation on the internet regarding standards for implementing tags.

A tag library exists for CakePHP but it, like many others, is too simplistic for what we want.

We've written our tagging goals into a plan but still have some technical details to figure out. While reviewing what we have, a couple of questions arose that we thought people would have opinions on.

1) What should the range of allowed characters be? Our first instinct was simplicity, something like /[A-Za-z0-9-]/ (that is, all English letters and numbers and a dash). This is easy to handle on our end but leaves out everyone who wants to add tags in something other than the English alphabet. There is some debate about how useful it would be to allow other Unicode characters, particularly when you think about #2 below.

2) Tags are most useful when they are normalized. By allowing Unicode characters we run the risk of diluting our tag cloud. For example, resume and résumé are close enough that for our purposes they are equivalent. If we allow Unicode we'll have to deal with converting characters like é to e and vice versa for searches. At that point we'll need a list of "equivalent" characters - not impossible, but it will slow things down (both development and search speed). The second question is: assuming you think we should allow Unicode characters, which characters are equivalent? Here is a quick idea from php.net's strtr() documentation:

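<?php
// Lookup tables for strtr($tag, $a, $b): each character in $a is replaced
// by the character at the same position in $b.
// Note: strtr() is byte-oriented, so these tables will not work as-is on UTF-8 strings.
$a = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŔŕ';
$b = 'aaaaaaaceeeeiiiidnoooooouuuuybsaaaaaaaceeeeiiiidnoooooouuuuybyRr';
?>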
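To make question 1 concrete, here is a minimal sketch comparing the ASCII-only range with a Unicode-aware alternative. It is in Java rather than PHP, and the class name, method names and the Unicode pattern are assumptions made for illustration, not part of our plan.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of AMO: compares two candidate character ranges.
final class TagCharsets {
    // The ASCII-only range from question 1; matches() requires the whole tag to match.
    static final Pattern ASCII_TAG = Pattern.compile("[A-Za-z0-9-]+");
    // Assumed Unicode-aware alternative: any letter, combining mark or digit, plus a dash.
    static final Pattern UNICODE_TAG = Pattern.compile("[\\p{L}\\p{M}\\p{N}-]+");

    static boolean isAsciiTag(String tag) {
        return ASCII_TAG.matcher(tag).matches();
    }

    static boolean isUnicodeTag(String tag) {
        return UNICODE_TAG.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String resume = "r\u00e9sum\u00e9"; // "resume" written with acute accents
        System.out.println(isAsciiTag("mercedes-benz")); // true
        System.out.println(isAsciiTag(resume));          // false
        System.out.println(isUnicodeTag(resume));        // true
        System.out.println(isUnicodeTag("two words"));   // false
    }
}
```

The Unicode-aware pattern accepts tags in any script, which is exactly what raises the normalization question in #2.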

Some other aspects of our current plan are:

13 Comments

Can't really help with your unicode dilemma, since my solutions always revolve around ASCII and making non-English speakers learn some English. You can probably get help from users who can map localized tags into their English equivalents.

But to add to the considerations/plans, it would be nice if there was a list of tags, or if there was autocomplete. I regularly mix up tags, and am known to overlap. While trying to tag my flickr collection, I usually mess up on plurals (rim or rims?) and punctuation (Mercedes-Benz or "Mercedes Benz"?). flickr does something like this, but it's ugly (missing spaces) and overwhelming. Hence why autocomplete would be nice.
-- Cesar, 02 Mar 2009
Regarding unicode character canonical equivalence and normalization, some useful links:

Unicode Standard Annex 15
Unicode normalization FAQ
W3C Charlint - A Character Normalization Tool
Unicode normalization demo
International Components for Unicode (OSS library for Unicode support)

-- Daniel Einspanjer, 02 Mar 2009
I know it isn't that useful for PHP, but I also dug up the sample Java code I used elsewhere for storing normalized text for use in a search form:


import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NFD {
    public static void main(String[] args) {
        final String INPUT_ENC = "UTF-8";
        final String OUTPUT_ENC = "UTF-8";
        try {
            BufferedReader r = new BufferedReader(
                    new InputStreamReader(System.in, INPUT_ENC));
            PrintWriter w = new PrintWriter(
                    new OutputStreamWriter(System.out, OUTPUT_ENC), true);
            String s;
            while ((s = r.readLine()) != null) {
                // decompose and remove accents
                String decomposed = Normalizer.normalize(s, Form.NFD);
                String accentsGone =
                        decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
                w.println(accentsGone);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
-- Daniel Einspanjer, 02 Mar 2009
To add to the fun, not everything always maps to the same character, so strtr is not really safe. For example, German ö can be written as "oe", but not "o". I think Finnish has some other translations for umlauted characters, so it not only depends on the input script, but also the input language.
-- Jan!, 03 Mar 2009
As Firefox's Places uses commas and not spaces to separate tags (removing the need for quote-enclosed spaces) it would be nice to see this standardised across the Mozilla Project. Having to learn tagging rules for each part of the project just seems unnecessary, especially when each feature is designed from the ground-up.

A natural English speaker myself, I can't help but feel that limiting the input to US-ASCII (or normalising it to such) goes against the community spirit and essentially implies that non-English speakers are second-class users. Would there be any way we could add language notation to the tags, or do simple things such as hiding Cyrillic/Greek/Ideographs from English users by default and vice versa? I know this would be much harder within, for example, the Romance languages - how do you differentiate French/Spanish/Portuguese/Italian without looking at the characters they use and white- or black-listing them from a massive list?

On the tagging front, another idea is to follow Amazon's implementation where a product must be tagged with a tag a certain number of times by users before that tag is 'accepted'. These tags can be shown to the user to essentially approve (tick a check-box by each tag you agree with and add any of your own) so as to limit tags which may be obscure/incorrect.
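A minimal sketch of that threshold idea, assuming a tag counts as accepted once a fixed number of distinct users have applied or approved it. The class, its methods and the threshold value are hypothetical and say nothing about how Amazon actually implements this.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: a tag becomes visible only after enough distinct users vote for it.
final class TagVotes {
    private final int threshold;
    private final Map<String, Set<String>> votersByTag = new HashMap<>();

    TagVotes(int threshold) {
        this.threshold = threshold;
    }

    // Records a vote; a repeat vote by the same user does not count twice.
    void vote(String tag, String userId) {
        votersByTag.computeIfAbsent(tag, k -> new HashSet<>()).add(userId);
    }

    boolean isAccepted(String tag) {
        Set<String> voters = votersByTag.get(tag);
        return voters != null && voters.size() >= threshold;
    }

    public static void main(String[] args) {
        TagVotes votes = new TagVotes(2); // assumed threshold of two users
        votes.vote("rss", "alice");
        votes.vote("rss", "alice");
        System.out.println(votes.isAccepted("rss")); // false
        votes.vote("rss", "bob");
        System.out.println(votes.isAccepted("rss")); // true
    }
}
```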

As far as resume vs. résumé goes, could we not simply have a list of cognates in English which we apply at the point of tagging to catch those few words which have spelling variations? If not, we're still going to need a solution to British vs. American spellings. Or will people be encouraged to tag 'colour color' etc?

Just a couple of my thoughts.
-- Alan, 03 Mar 2009
You can't in general translate Unicode characters 1:1 to ASCII. For instance, ß doesn't translate to "s", it translates to "ss". Furthermore, you might want to handle multiple transliterations, to allow "ö" to match either "oe" or "o".
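A minimal sketch of what a one-to-many table could look like, as opposed to the one-to-one strtr() tables. The table entries and the class are illustrative assumptions only; a real table would be much larger and probably language-dependent.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: expands a tag into every ASCII spelling allowed by the table.
final class Transliterations {
    static final Map<Character, List<String>> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00df', Arrays.asList("ss"));      // sharp s
        TABLE.put('\u00f6', Arrays.asList("oe", "o")); // o with two dots: umlaut or diaeresis
        TABLE.put('\u00fe', Arrays.asList("th"));      // thorn
    }

    // Builds the variants by picking one alternative for each character in turn.
    static Set<String> variants(String tag) {
        Set<String> results = new LinkedHashSet<>();
        results.add("");
        for (char c : tag.toCharArray()) {
            List<String> options = TABLE.containsKey(c)
                    ? TABLE.get(c) : Arrays.asList(String.valueOf(c));
            Set<String> next = new LinkedHashSet<>();
            for (String prefix : results) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(variants("stra\u00dfe")); // [strasse]
        System.out.println(variants("sch\u00f6n"));  // [schoen, schon]
    }
}
```

The number of variants multiplies with every ambiguous character, so something like this would suit matching short tags better than rewriting long text.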
-- Anonymous, 03 Mar 2009
I'm a big advocate of pruning the tag space as it gets large with essentially "mark as duplicate". This can either rewrite the original tag or allow the original to point to the master version of the tag. This would solve your problem around localized versions as well.
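A minimal sketch of the "point to the master version" variant, assuming each duplicate simply stores the name of its master tag. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: duplicates point at a master tag, and lookups follow the pointers.
final class TagAliases {
    private final Map<String, String> master = new HashMap<>();

    // Marks one tag as a duplicate of another. The pointer is stored against the
    // current master of the target, and a tag is never pointed at itself.
    void markDuplicate(String duplicate, String canonical) {
        String target = resolve(canonical);
        if (!target.equals(duplicate)) {
            master.put(duplicate, target);
        }
    }

    // Follows pointers until reaching a tag that has no master.
    String resolve(String tag) {
        String current = tag;
        while (master.containsKey(current)) {
            current = master.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        TagAliases aliases = new TagAliases();
        aliases.markDuplicate("colour", "color");
        aliases.markDuplicate("coulour", "colour");
        System.out.println(aliases.resolve("coulour")); // color
        System.out.println(aliases.resolve("rims"));    // rims
    }
}
```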
-- AndyEd, 03 Mar 2009
Tagging to me falls under the same jurisdiction as URLs. The realm of possible tags across all languages defeats the purpose of tagging, really.

By agreeing on some common nomenclature we could set a precedent and promote simpler tags (you wouldn't tag something with strange English words). I don't think this would work for most other things like web writing, novels, movies, etc., but we're not talking about War and Peace here.

My opinion may be unpopular, but in many cases it doesn't make sense to destroy your "hit rate" just to be 100% inclusive. I think the altruistic approach of "make everything universal at all costs" is definitely situational. For things like tagging and URLs where uniqueness and visual recognition are paramount I'm not fond of fragmenting otherwise unique and simple phrases into 40 different synonymous alternatives.
-- Mike Morgan, 03 Mar 2009
From a pure linguistic view, I totally agree with comment #4 and #6's statement that letters do not translate one to one. "ß" is more of a ligature of the two letters "ss" (the "Eszett" as it is called in German), and "Þ" is the letter "thorn", which is transliterated to "th" most of the time. ("Thou" used to be spelt "Þu".) And then there's the problem of transliterating a letter like "ö". If used in context of diaeresis, the diacritic mark can simply be removed, but if used in German, it is an umlaut, in which case "ö" should be transliterated into "oe".

Reckless normalization can be evil. What can be done is suggestion through normalization; if two phrases / keywords normalize to something really similar, then the similar phrase can be shown, among other similar phrases, as suggestions in an autocomplete drop down.
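A minimal sketch of suggestion through normalization: tags are stored exactly as typed, and the normalized form is used only as a lookup key for finding similar tags to offer. It reuses the NFD approach from the Java code posted earlier; the class and method names are assumptions.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: groups stored tags by a normalized key and suggests siblings.
final class TagSuggester {
    private final Map<String, Set<String>> byKey = new LinkedHashMap<>();

    // Comparison key: decompose, drop combining marks, lowercase. Never displayed.
    static String key(String tag) {
        String decomposed = Normalizer.normalize(tag, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                .toLowerCase(Locale.ROOT);
    }

    // Stores the tag unchanged, filed under its key.
    void add(String tag) {
        byKey.computeIfAbsent(key(tag), k -> new LinkedHashSet<>()).add(tag);
    }

    // Returns stored tags that share the typed text's key, excluding exact matches.
    List<String> suggest(String typed) {
        List<String> out = new ArrayList<>();
        for (String tag : byKey.getOrDefault(key(typed), Collections.<String>emptySet())) {
            if (!tag.equals(typed)) {
                out.add(tag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TagSuggester suggester = new TagSuggester();
        suggester.add("r\u00e9sum\u00e9"); // accented spelling
        suggester.add("Resume");
        // Prints both stored spellings as candidates for an autocomplete drop-down.
        System.out.println(suggester.suggest("resume"));
    }
}
```

Nothing is rewritten here; the user still decides which spelling to use.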
-- kourge, 03 Mar 2009
Why make the separator a space and not a comma? In doing the tagging system for my personal PHP-based community system, I found it's easiest to use a comma as the separator, as it feels natural to most people (it's even the separator for lists of attributes in normal written language) and it easily allows for spaces within tags without the workaround of quoting.
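A minimal sketch of comma-separated input handling, in Java rather than PHP. Trimming, collapsing inner whitespace and dropping duplicates are assumptions about what a tag form would want, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns one comma-separated input field into a list of tags.
final class TagInput {
    // Splits on commas, trims each piece, collapses runs of whitespace inside a tag,
    // and drops empty pieces and exact duplicates while keeping the input order.
    static List<String> parse(String input) {
        List<String> tags = new ArrayList<>();
        for (String piece : input.split(",")) {
            String tag = piece.trim().replaceAll("\\s+", " ");
            if (!tag.isEmpty() && !tags.contains(tag)) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Prints [Mercedes Benz, rims, web 2.0]
        System.out.println(parse("Mercedes Benz, rims,, rims ,  web  2.0"));
    }
}
```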
-- Robert Kaiser, 04 Mar 2009
Here's an idea I talked about in the call earlier today:

When a user wants to tag an add-on, they are presented with a couple of text input fields, laid out vertically, where they can insert the tags, e.g. résumé. Next to each field, horizontally, there is a link saying "add alternative ASCII-only spelling to help searches". When the user clicks on it, it is replaced by another text input and a "+" sign that adds a further text input field. In those fields, users can type the alternative spelling for their tag, "resume" in this example.

There are two things that can happen next:

1. The add-on is now tagged with two tags: "résumé" and "resume", but only "résumé" is a primary tag, meaning that only "résumé" is displayed in the add-on's tag cloud. "Resume" is an auxiliary tag, used only for searches.

The problem here is the inverse scenario: I believe it is rather unlikely that someone typing "resume" will think of the alternative spelling: "résumé". Hence the second solution:

2. As soon as a user provides two spelling versions of one tag, we can use them to create a two-way mapping. Whenever someone searches for "résumé", their query will first be checked against the mapping, returning "resume", and the search results can now include add-ons tagged with either spelling. This works the other way round too.

So in fact, instead of mapping Unicode letters onto ASCII letters, we're mapping words onto words (in both directions: unicode->ascii and ascii->unicode). And that's generated by users, so we have good chances of covering the most popular words first.
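A minimal sketch of that two-way word mapping, assuming each user-supplied pair is stored in both directions and consulted when a search query comes in. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: user-supplied spelling pairs, usable in both directions.
final class SpellingMap {
    private final Map<String, Set<String>> alternatives = new HashMap<>();

    // Records that two spellings belong together, e.g. an accented and an ASCII form.
    void link(String a, String b) {
        alternatives.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        alternatives.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    // Returns the query itself plus every spelling directly linked to it.
    Set<String> expand(String query) {
        Set<String> terms = new TreeSet<>();
        terms.add(query);
        terms.addAll(alternatives.getOrDefault(query, Collections.<String>emptySet()));
        return terms;
    }

    public static void main(String[] args) {
        SpellingMap map = new SpellingMap();
        map.link("r\u00e9sum\u00e9", "resume");
        System.out.println(map.expand("resume").size());             // 2
        System.out.println(map.expand("r\u00e9sum\u00e9").size());   // 2
        System.out.println(map.expand("rims").size());               // 1
    }
}
```

Only direct links are followed in this sketch; whether "a is linked to b" and "b is linked to c" should also connect a and c is left open.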

Thoughts?
-- Staś Małolepszy, 04 Mar 2009
Other suggestions (not ideal, but just to add them):

* Tag/word suggestion feature to help with normalization (e.g. "Did you mean 'résumé'?")
* Tag editing or suggested changes after a tag is submitted (e.g. flag-a-tag, community suggesting/editing?)
* Use Unicode and then have a second step that asks the tagger to enter the tag again using the English set above? This would require friendly UI that somehow explains why we need to do both. This option essentially combines 1 and 2.
* Creating another way to identify a tag that doesn't rely on character. Can you look for context or usage in meta information? Or make sure the tag reflects the category where it will live and then ask the user if a potential conflict occurs?

Just some ideas that sprang up when chatting with others.
-- sethb, 04 Mar 2009
Tags and localization are a tough beast. Untested idea: what about separating tags per locale? This would encourage the use of common idioms within a locale and not pollute the display of other locales. Additionally, it may provide some performance benefits on the backend, where character set optimizations can be made.
-- Austin King, 04 Mar 2009
