
Top 50 searches on addons.mozilla.org

The flight from Portland to San Jose is just about the right length to write some scripts to analyze a bunch of data, make a pretty graph, and then write a blog post drawing fairly obvious conclusions. Someone on IRC said they were interested in the top search terms being used on addons.mozilla.org so here we are.

During the week of April 29 through May 5, 2009, there were around 150,000 queries. Of the top 20 queries on addons.mozilla.org (a quick estimate says they account for around 12% of the total queries on the site), only 7 actually have search terms. The rest just choose different search options, like category or number of results per page. If we filter the top queries for ones that include search terms we get a graph that looks like this:

All the searches on that page are for the en-US locale unless otherwise noted. It looks like the majority of searches are for specific add-ons but there are also some popular generic terms like download, gmail, and video. I think it's interesting that German was the only other locale to make the list (and fairly high up on the list). Maybe the next stats post will be about overall locale use.


addons.mozilla.org Celebrates 1000 (passing) Unit Tests

We started writing unit tests for AMO a few years ago with the best of intentions. As the tests grew we started running into memory/timeout problems that prevented us from running the tests. Other priorities took over and since we couldn't run the tests we quit writing them. The tests got put on the back burner, became stale, and were for the most part forgotten (an all too familiar story for most developers).

Over the past few months we've been turning that around. While it's certainly a team effort, it's not stretching the truth to say that Jeff Balogh has been the driving force behind making sure our framework can scale and getting our old tests running again. Thanks to his tireless efforts our latest numbers show over 1200 unit tests, 1065 of which are passing.

In an effort to prevent them from being forgotten again he also created an IRC bot named bosley who tracks the tests and reminds people when they fail. Expect to see bosley in #amo soon.

The number of tests and the continuous monitoring of them is a huge milestone for AMO and Mozilla WebDev.


The Tagging Plan for AMO

Firstly, thanks for all the great feedback. Something as seemingly simple as tagging gets complex quickly when thought out and the varied perspectives of the community are always great to have.

Allowing full Unicode would let anyone use meaningful tags in their own character sets but would make it harder for us to offer similar matches and catch common misspellings. On the other hand, we support several languages on AMO that don't use the Latin alphabet. If we restricted tags to Latin characters, it stands to reason that users would still search for tags in their own character sets and would get no results. There are pros and cons for each choice but we're essentially debating the value of normalization in tagging.

After distilling all the feedback and talking amongst ourselves our overall feeling was that forcing people to convert their input into the Latin alphabet wasn't in the users' best interest. The Mozilla Manifesto talks about a global internet that fosters creativity and free expression. Not supporting a user's native language when we have the option to doesn't feel like the right path to take.

With that in mind our current plan is as follows:

  • Allow full Unicode in tags to the extent we do everywhere else on AMO.
  • Do no automatic character normalization. The option of manual normalization (essentially, marking some tags as equivalent) is left open as a future enhancement.
  • Do automatic white space and capitalization normalization. Spaces are displayed on an add-on's page but when searching or entering into a URL spaces are unnecessary. For example, newyork, new york and New YORk are all equivalent.
  • A list of suggestions will be provided as the user types. We may attempt some simplistic character normalization in the suggestions if we can come up with a way that provides enough value to continue to use (perhaps something that is per-language).
  • White space is trimmed from the beginning and end of tags before they are saved into the database.
  • Tags are limited to 128 characters and add-ons are limited to 80 tags.
  • Tags will be comma delimited. To include a comma in your tag you must use quotation marks. Quotation marks, whether they are matched or not, are discarded. Example: "Portland, OR" will become Portland, OR whereas Portland", OR will become Portland and OR.
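
The parsing rules in the list above can be sketched roughly as follows. This is a minimal illustration in Python rather than AMO's PHP, and it is not the code we shipped: the function names and the error handling are made up for the example, and it assumes a quotation mark only protects commas when a matching closing mark follows it.

```python
MAX_TAG_LENGTH = 128  # characters per tag, from the plan above
MAX_TAGS = 80         # tags per add-on, from the plan above


def split_tags(raw):
    """Split comma-delimited input into display tags.

    A matched pair of quotation marks protects the commas between them.
    All quotation marks are discarded, matched or not.
    """
    tags, current, i = [], [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == '"':
            close = raw.find('"', i + 1)
            if close != -1:
                # Matched pair: keep the quoted text as-is, commas included.
                current.append(raw[i + 1:close])
                i = close + 1
                continue
            # Unmatched quote: drop it and carry on.
        elif ch == ',':
            tags.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    tags.append(''.join(current))
    # Trim surrounding white space and drop empty entries.
    return [t.strip() for t in tags if t.strip()]


def search_key(tag):
    """Remove white space and case differences; characters are untouched."""
    return ''.join(tag.split()).lower()


def validate(tags):
    """Enforce the two limits (hypothetical error handling)."""
    if len(tags) > MAX_TAGS:
        raise ValueError('an add-on is limited to %d tags' % MAX_TAGS)
    for tag in tags:
        if len(tag) > MAX_TAG_LENGTH:
            raise ValueError('tag is longer than %d characters' % MAX_TAG_LENGTH)
    return tags
```

With this sketch the display form keeps its spaces and capitalization, while searches and URLs would compare the value returned by search_key(), so newyork, new york and New YORk all land on the same key.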

Additional feedback, as always, is welcome.


Some considerations when adding Tags to AMO

Tags broke into the limelight around the time "Web 2.0" was becoming popularized. They provided a simple but effective way to categorize objects and many sites are using them now. Despite their proliferation, I haven't found any documentation on the internet regarding standards for implementing tags.

A tag library exists for CakePHP but it, and many others, are too simplistic for what we want.

We've written our tagging goals into a plan but have some technical details we still need to figure out. While reviewing what we have a couple questions arose that we thought people would have opinions on.

1) What should the range of allowed characters be? Our first instinct was simplicity, something like /[A-Za-z0-9-]/ (that is, all English letters and numbers and a dash). This is easy to handle on our end but leaves out everyone that doesn't want to add tags using the English alphabet. There is some debate how useful it would be to allow other Unicode characters, particularly when you think about #2 below.

2) Tags are most useful when they are normalized. By allowing Unicode characters we run the risk of diluting our tag cloud. For example, resume and résumé are close enough that for our purposes they are equivalent. If we allow Unicode we'll have to deal with converting characters like é to e and vice versa for searches. At that point we'll need a list of "equivalent" characters - not impossible but it will slow things down (both development and speed of a search). The second question is: Assuming you think we should allow Unicode characters, what characters are equivalents? Here is a quick idea from php.net's strtr() documentation:


$a = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûýýþÿŔŕ';
$b = 'aaaaaaaceeeeiiiidnoooooouuuuybsaaaaaaaceeeeiiiidnoooooouuuyybyRr';
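
As a point of comparison, here is a small sketch of a different technique: instead of a hand-written table, it leans on Unicode decomposition to strip accents. It is written in Python rather than PHP purely for illustration, the function name is made up, and it is not something we have decided to use.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so accented letters match their base letters."""
    # NFKD splits a character like e-acute into 'e' plus a combining accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and keep everything else.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This makes résumé and resume compare equal, but letters such as Æ, Ð, Ø, Þ and ß have no decomposition and pass through unchanged, so a list of "equivalent" characters like the one above would still be needed for those.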

Some other aspects of our current plan are:

  • Tags are not localizable in the same way as other strings on the site (like categories). There isn't anything stopping someone from using "WebDev" as a tag or creating a new tag with "WebDev" translated in their language. However, there won't be any relationship between the two translated tags.
  • Tags are separated by spaces. Spaces within tags are allowed with quotes.
  • Spaces will be preserved when displaying a tag on the add-on's page. However, they will be removed when displaying the tag in a URL and when doing logical operations on the back end, like searching. This means a search for "Portland OR" will actually be collapsed to "PortlandOR" and will match either "Portland OR" or "PortlandOR" tags. This is consistent with Flickr.
  • If Unicode is allowed we'll preserve characters as they are entered even if we are actually searching on their "equivalents."

Differentiate Bugzilla emails?

Bugzilla is an awesome bug tracker that is used by hundreds of companies. I've got accounts on several projects' trackers and I'm sure many others do also.

When I get mail from Bugzilla it's not obvious which project it's from. My email client (Gmail) only shows the "from name" so all I see for these projects is:

Mozilla: bugzilla-daemon
Pootle: bugzilla-daemon
Miro: bugzilla
kernel.org: bugme-daemon
Apache: bugzilla

Wouldn't it make sense to differentiate each project's emails in the from name? Maybe even by default (something like "%SITE_NAME% Bugzilla")?

Reed says it's a personal problem because his mail client shows the full address. Am I the only one? :(


How addons.mozilla.org defends against XSS attacks

One of the things that gets a lot of news time these days is XSS. There are a lot of places that explain what it is and how to prevent it but most are oversimplified or don't provide real world examples. I thought I'd explain a couple of the ways AMO attempts to prevent it.

I'm not trying to invite attackers by posting this. My goal is to provide a (hopefully) working example from a real world, high-traffic site. I think the people exploiting XSS have a fairly good idea what they are doing and, too often, the people attempting to secure their sites don't. Since AMO is open source I'm not sharing anything that isn't available already anyway (side note: please don't depend on security by obscurity).

Firstly, this chunk of code sits in CakePHP's bootstrap.php and runs very close to the start of every request:


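// Reject any request whose URL contains a character outside the whitelist.
// API requests (URLs containing /api/) are exempt from the check.
if (array_key_exists('url',$_GET) &&
    !preg_match('/\/api\//', $_GET['url']) &&
    preg_match('/[^\w\d\/\.\-_!: ]/u',$_GET['url'])) {
    header("HTTP/1.1 400 Bad Request");
    exit;
}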

Since a lot of XSS attacks are launched from the URL we implemented this simple whitelist of characters we'll allow. If anything outside of that whitelist is in the URL (API requests are exempt) we return a 400 Bad Request header and exit. This isn't a lot of protection but it does narrow the field of what our application expects and has to deal with (particularly control characters, high ASCII, etc.).
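
For readers who don't use PHP, the same check can be sketched in Python. This is only an illustration under a couple of assumptions: the function name is made up, and Python's \w is Unicode-aware for text patterns, which may or may not line up exactly with what the PHP pattern accepts on a given server.

```python
import re

# Same character class as the PHP pattern: word characters, digits,
# and / . - _ ! : plus the space. Anything else counts as disallowed.
DISALLOWED = re.compile(r'[^\w\d/.\-_!: ]')


def is_url_allowed(url):
    """Return True if the request may proceed, False if it should get a 400."""
    if '/api/' in url:
        # API requests skip the whitelist, as in the PHP version.
        return True
    return DISALLOWED.search(url) is None
```

A caller would respond with "400 Bad Request" and stop whenever is_url_allowed() returns False, before any of the application code sees the URL.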

The second, and more important section of code is in our app_controller class. We wrote a custom sanitize() function that any string going into one of our views gets run through:


$sanitize_patterns = array(
    'patterns'      => array("/%/u", "/\(/u", "/\)/u", "/\+/u", "/-/u"),
    'replacements'  => array("&#37;", "&#40;", "&#41;", "&#43;", "&#45;")
    );

........

$data = iconv('UTF-8', 'UTF-8//IGNORE', $data);
$data = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');
$data = preg_replace($sanitize_patterns['patterns'], $sanitize_patterns['replacements'], $data);

This code has several important parts and I'll start with the functions. The first function that modifies the actual data is iconv(). We ask it to convert our data from UTF-8 to UTF-8 which seems unnecessary but the "//IGNORE" part is important - that means it will throw out any characters it can't represent appropriately. This was added to prevent a proof of concept attack that exploited a C0 ASCII control code character to break the output (discovered on the sla.ckers.org forums).

The next function, htmlspecialchars(), is a pretty well known function that converts special characters (&, <, >, and quotes) to their HTML entity equivalents. The second parameter, ENT_QUOTES, asks it to encode single quotes as well as double quotes.

Lastly we use the array of patterns and replacements declared at the beginning to encode a few final symbols, like parentheses and the percent sign, into HTML entities.
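
To make the three steps concrete, here is a rough Python equivalent. It is a sketch for illustration only, not a port of our sanitize() function: the names are made up, it assumes the input arrives as raw bytes, and Python's html.escape() writes single quotes as &#x27; where PHP's htmlspecialchars() writes &#039;.

```python
import html

# The five extra symbols and the numeric entities they are encoded to.
EXTRA_ENTITIES = {
    '%': '&#37;',
    '(': '&#40;',
    ')': '&#41;',
    '+': '&#43;',
    '-': '&#45;',
}


def sanitize(data):
    """Escape a byte string for output in an HTML view."""
    # Step 1: throw away byte sequences that are not valid UTF-8.
    text = data.decode('utf-8', errors='ignore')
    # Step 2: escape &, <, > and both kinds of quotes.
    text = html.escape(text, quote=True)
    # Step 3: encode the remaining symbols one character at a time, so the
    # entities produced here are never escaped a second time.
    return ''.join(EXTRA_ENTITIES.get(ch, ch) for ch in text)
```

The order matters: the extra symbols are encoded after the general escaping, otherwise the & in each new entity would itself be turned into &amp;.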

This system has worked fairly well for a few years now and as issues are discovered we make changes to it. If you're looking for the latest code please be sure to check our repository. And, as always, if you find any kind of exploit on AMO please let me know! :)


Verbatim Server Downtime

Translate Toolkit 1.3.0 was released a few days ago. I was following along with trunk on my development box and I wanted to upgrade our alpha install to take advantage of the new features (namely, speed improvements) and the Django framework.

I attempted this tonight and it was not a pretty upgrade (or install, for that matter). Among the medley of problems is Django ticket #6548. Django assumes it's not behind an SSL proxy so when it does any redirects it doesn't use https. This means logging in and logging out work on our server but the user is presented with a jarring "bad request" interstitial.

The current status is that user accounts are not migrated and, even if they were, I can't seem to set permissions for projects. Since there are some odd problems that we haven't seen elsewhere and this is an alpha install I'm going to leave it as is and debug some of the issues over the next few days. Expect downtime. If there are questions visit #verbatim on irc.mozilla.org.


Add-on Statistics Status (part 2)

This is the second update about add-ons' statistics. Read part one.

Statistics for both update pings and download counts have been updated beginning with February 1 through today, February 6th. Some notes:

  • New statistics are stored in UTC and data processing happens shortly after the logs close. This means you can expect new data at around 8pm PST or shortly after.
  • Download numbers will drop dramatically. They have been recorded incorrectly[1] for the past several weeks. Bug 472538 has more details.
  • We'll begin replacing statistics back to 2008-11-15 over the next few weeks as processing time allows.
  • An aside that you may not know: When Firefox looks for an update to an add-on we count that as an "update ping." If it finds the update it will hit releases.mozilla.org directly for the new add-on. That means that in your current stats numbers updates are not counted as downloads. Put another way, "download counts" are the counts of someone actually clicking the "Install Now" button on addons.mozilla.org.

Since we're pulling these statistics from a team dedicated to crunching numbers we're getting richer and more reliable data now. This frees up our time to fix existing stats bugs and also to add additional data views (like what locale your users are using). Good things are coming; keep an eye on your stats!

Update 2009-02-07: HP issued a critical alert regarding potential data loss which affected our servers. Our IT team applied the fix but upon restart discovered it's been way too long since the file system had fsck run on it. Since there is so much data on the system it will take several more hours to finish, then IT will restore log files, and then we can begin to process the stats for this weekend. In short, stats won't be current for another day or two.

[1] The technical reason is that Firefox does 2 or 3 GET requests to a server when it installs an add-on. The filter we had to remove duplicate requests was broken.
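
A filter for those duplicate requests could look something like the sketch below. It is in Python for illustration and is not our actual filter. The tuple layout, the function name and the 30 second window are all assumptions made for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=30)  # assumed value, chosen for illustration


def count_downloads(requests):
    """Count installs, collapsing repeat GETs from one client for one file.

    `requests` is an iterable of (timestamp, client_ip, path) tuples sorted
    by timestamp. This layout is hypothetical.
    """
    last_seen = {}
    total = 0
    for when, client_ip, path in requests:
        key = (client_ip, path)
        previous = last_seen.get(key)
        # Only count the request if this client has not asked for the same
        # file within the window.
        if previous is None or when - previous > WINDOW:
            total += 1
        last_seen[key] = when
    return total
```

With a filter like this, the 2 or 3 GET requests Firefox makes during one install would be counted as a single download.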


Verbatim: going forward

According to the high level plan, we're currently on step 4. The Mozilla branch has been merged back into Pootle's trunk and work on the branch has been discontinued.

While writing code it became apparent that the framework Pootle was built on, jToolkit, had some shortcomings that were making it difficult to work with (not to mention development on it had been stopped since 2006). The decision was made to migrate the back end of Pootle from jToolkit to Django. This wasn't something I had counted on when I originally made the timeline for Mozilla using Pootle but it was a necessary delay. During the transition, forward progress, at least on the Mozilla side, was halted. In November and December, the translate.org.za team did some fantastic work and completely replaced jToolkit.

Thanks to a lot of work from everyone and a bunch of unit tests the Django-based system reached parity with the old system rapidly. The Pootle team is expecting to release a new version around the end of this month. At that time I'll upgrade our alpha version and re-enable the features I've had to disable. I'm expecting the upgrade to solve a lot of the scalability problems we've been having and then we can start advertising our install more and expanding the projects it works with.

Once I do the upgrade Mozilla will be running a stock version of Pootle which I expect to continue from this point forward. Any patches Mozilla contributes back will be generic enough to be useful to anyone and will land on trunk.

We've created a 2009 idea/goal wiki page which will be distilled into a project road map. There are some exciting features coming down the pipeline, bringing a lot of improvements (particularly with the user interface) with them. As an added bonus, the new Django framework will allow us to progress faster with new features and it will be easier for more people to contribute code.

Thanks for your patience.


Add-on Statistics Status

Add-on statistics have been intermittent for a couple months and are just recently getting the attention they need.

Our current process is to count download statistics once per day and update ping statistics once per week (update pings are a sampling of the complete set). The reliability of the script generating these statistics has been falling as our data size has grown and we've had several bugs filed regarding the numbers it's produced. Most of the time they are relatively small fixes and the script continued to limp along.

Currently we're facing questionable results in both sets of statistics (bug 468570 for update pings, bug 472538 for download counts). I've been debugging the update pings script and despite solving some problems we're continuing to see the script fail to run properly.

Parallel to AMO development, Daniel Einspanjer has been working on a larger statistics parser that will aggregate data from many Mozilla sites into a dashboard with easy visualizations. It turns out he's already processing the AMO logs and pulling out more data than us more often and in less time.

With a system like that available it doesn't make sense for us to continue to develop (and, in this case, heavily modify) our local statistics scripts. With that in mind, our next steps are:

  1. Verify the results we (used to) get with the AMO scripts match those of the new system
  2. Create a transformation script to push the data from Daniel's project to the AMO database
  3. Turn off the AMO scripts
  4. Back fill statistics through at least November 15th, 2008 to replace our flailing stats. If the comparisons in step 1 reveal miscounting from before that we'll back fill as far as we need to.

These steps will let us meet the immediate goal of getting the statistics we offer now to be reliable and complete. In the future we can look at pulling additional data from the new metrics system. The target date to switch to the new system is the end of next week, Jan 31 2009. Once we make the switch we can evaluate how long the parsing takes and give an estimate of how long back filling will take. As always, let me know if there are any concerns.

Update 2009-02-02: We compared the scripts' results and found a discrepancy among add-ons that have significant external download numbers. The current stats script verified the GUID matches and then counted the update. The new stats script verified the GUID and the version before counting the update. This means if a specific version isn't hosted on AMO the new script doesn't count it. I think the current method of verifying only the GUID is more useful to authors and the new script is being changed. That means we'll have to re-run and re-compare the numbers (a single day is taking about 5 hours now). Other numbers are showing early promise. I'll continue to update as we progress.
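
The difference between the two scripts can be shown with a small sketch. It is written in Python for illustration and is not taken from either script. The function name, the (guid, version) tuples and the `hosted` dictionary are hypothetical.

```python
def count_updates(pings, hosted, check_version=False):
    """Count update pings per add-on.

    `pings` is an iterable of (guid, version) tuples and `hosted` maps each
    GUID known to AMO to the set of versions hosted on AMO.
    """
    counts = {}
    for guid, version in pings:
        if guid not in hosted:
            # Unknown add-on: neither approach counts it.
            continue
        if check_version and version not in hosted[guid]:
            # Stricter approach: skip versions that AMO does not host.
            continue
        counts[guid] = counts.get(guid, 0) + 1
    return counts
```

Leaving check_version off corresponds to verifying only the GUID, which is the behavior we are keeping. Turning it on shows why add-ons with many externally hosted versions came out with lower numbers in the new system.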
