How AMO defends against XSS attacks

One of the things that gets a lot of news time these days is XSS. There are a lot of places that explain what it is and how to prevent it but most are oversimplified or don’t provide real world examples. I thought I’d explain a couple of the ways AMO attempts to prevent it.

I’m not trying to invite attackers by posting this. My goal is to provide a (hopefully) working example from a real world, high-traffic site. I think the people exploiting XSS have a fairly good idea what they are doing and, too often, the people attempting to secure their sites don’t. Since AMO is open source I’m not sharing anything that isn’t available already anyway (side note: please don’t depend on security by obscurity).

Firstly, this chunk of code sits in CakePHP’s bootstrap.php and runs very close to the start of every request:

// Reject any non-API request whose URL contains a character outside the whitelist.
if (array_key_exists('url', $_GET) &&
    !preg_match('/\/api\//', $_GET['url']) &&
    preg_match('/[^\w\d\/\.\-_!: ]/u', $_GET['url'])) {
    header("HTTP/1.1 400 Bad Request");
    exit;
}

Since a lot of XSS attacks are launched from the URL, we implemented this simple whitelist of characters we’ll allow. If anything outside of that whitelist is in the URL, we return a 400 Bad Request header and die. This isn’t a lot of protection, but it does narrow the field of what our application expects and has to deal with (particularly control characters, high ASCII, etc.).
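To make the behaviour of that check easier to see, here is a minimal sketch that wraps the same two regular expressions in a function. The name url_is_allowed() is hypothetical and not part of AMO's code. One difference worth noting: preg_match() returns false (not 0) when the /u flag meets invalid UTF-8, so this sketch treats that case as a rejection, which is an assumption on my part rather than what the snippet above does.

```php
<?php
// Hypothetical helper (not from the AMO codebase) wrapping the whitelist check.
function url_is_allowed(string $url): bool
{
    // API requests are exempt from the whitelist, as in the original check.
    if (preg_match('/\/api\//', $url)) {
        return true;
    }
    // preg_match() returns 1 if a disallowed character is found, 0 if none is,
    // and false on error (e.g. invalid UTF-8 with the /u flag).
    // Only a clean 0 counts as allowed here.
    return preg_match('/[^\w\d\/\.\-_!: ]/u', $url) === 0;
}

var_dump(url_is_allowed('en-US/firefox/addon/1865'));                 // bool(true)
var_dump(url_is_allowed('en-US/firefox/<script>alert(1)</script>'));  // bool(false)
```

Anything containing angle brackets, quotes, percent signs or parentheses is turned away before the rest of the application sees it.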

The second, and more important, section of code is in our app_controller class. We wrote a custom sanitize() function that any string going into one of our views gets run through:

$sanitize_patterns = array(
    'patterns'      => array("/%/u", "/\(/u", "/\)/u", "/\+/u", "/-/u"),
    'replacements'  => array("&#37;", "&#40;", "&#41;", "&#43;", "&#45;")
);

$data = iconv('UTF-8', 'UTF-8//IGNORE', $data);
$data = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');
$data = preg_replace($sanitize_patterns['patterns'], $sanitize_patterns['replacements'], $data);

This code has several important parts and I’ll start with the functions. The first function that modifies the actual data is iconv(). We ask it to convert our data from UTF-8 to UTF-8 which seems unnecessary but the “//IGNORE” part is important - that means it will throw out any characters it can’t represent appropriately. This was added to prevent a proof of concept attack that exploited a C0 ASCII control code character to break the output (discovered on the forums).

The next function, htmlspecialchars(), is a pretty well known function and converts special characters (&, <, > and quotes) to their HTML entity equivalents. The second parameter, ENT_QUOTES, specifically asks it to encode single quotes as well as double quotes.
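As a small illustration of what the ENT_QUOTES flag changes, the sketch below encodes the same string twice, once with ENT_COMPAT (double quotes only) and once with ENT_QUOTES. The input string is made up for the example.

```php
<?php
// Illustrative input only; not taken from AMO.
$input = "<a href='x' title=\"y\">";

// ENT_COMPAT encodes double quotes but leaves single quotes alone.
$compat = htmlspecialchars($input, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES encodes both kinds of quote.
$quotes = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $compat, "\n"; // &lt;a href='x' title=&quot;y&quot;&gt;
echo $quotes, "\n"; // &lt;a href=&#039;x&#039; title=&quot;y&quot;&gt;
```

The difference matters whenever a value ends up inside a single-quoted HTML attribute.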

Lastly, we use the array of patterns and replacements declared at the beginning to encode a few final symbols, like parentheses and the percent sign, into HTML entities.
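Putting the three steps together, a standalone version might look roughly like the sketch below. The function name sanitize_for_view() is hypothetical, and so are the fallbacks to an empty string when iconv() or preg_replace() fails. I am also assuming the fourth pattern is meant to match "+", since its replacement is the entity for character 43, and that the replacements are plain numeric entities.

```php
<?php
// Hypothetical standalone sketch of the three-step sanitizer described above.
function sanitize_for_view(string $data): string
{
    // Assumption: the fourth pattern matches "+" (its entity is &#43;).
    $patterns     = array('/%/u', '/\(/u', '/\)/u', '/\+/u', '/-/u');
    $replacements = array('&#37;', '&#40;', '&#41;', '&#43;', '&#45;');

    // Step 1: drop byte sequences that are not valid UTF-8.
    $data = iconv('UTF-8', 'UTF-8//IGNORE', $data);
    if ($data === false) {
        return ''; // Assumption: treat a failed conversion as empty output.
    }

    // Step 2: encode &, <, > and both kinds of quote.
    $data = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');

    // Step 3: encode the remaining symbols as numeric entities.
    return (string) preg_replace($patterns, $replacements, $data);
}

echo sanitize_for_view("a+b (50%) - c"), "\n";
// a&#43;b &#40;50&#37;&#41; &#45; c
```

The order matters: running htmlspecialchars() after the entity replacements would re-encode their ampersands and the entities would show up as literal text.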

This system has worked fairly well for a few years now and as issues are discovered we make changes to it. If you’re looking for the latest code please be sure to check our repository. And, as always, if you find any kind of exploit on AMO please let me know! :)


Good post, Wil! I think you should cross-post (or at least reference) it on the webdev blog.
-- Fred, 23 Feb 2009
The webdev blog doesn't have sexy syntax highlighting though.
-- Wil Clouser, 23 Feb 2009
Good post. I generally just use htmlspecialchars at the moment, but will consider adding a similar function to our codebase.

I think it's worth pointing out more clearly exactly where the sanitise function is run. Lots of developers know that they need to run strings through functions like this to be safe, but they don't really understand what they are doing, so they just run the strings through these functions wherever they see the string and remember to do it. That leads to some strings being sanitised multiple times, and others not being sanitised at all.

It can be tempting just to sanitise everything on its way in - once it is in your system it is trusted data (think PHP's Magic Quotes). But how do you sanitise it? SQL escaping? HTML escaping?

I like to think of it as converting a string from its original format, as supplied by the user, to a particular format for how I want to use it. If I want to use the string in a PostgreSQL query, then I convert it to PostgreSQL by running it through pg_escape_string. If I want to use it on a web page, then I convert it to HTML by running it through htmlspecialchars (or Wil's sanitise function). In both cases the end user (querying the database or browsing the web) will see the string in its original format, but the intermediate language (PostgreSQL or HTML) will ignore any special characters.
-- Ian Thomas (thelem), 23 Feb 2009
Good post. I would suggest turning this code into a separate library/project, so all of us can use your skills in our projects.
-- Fernando Hartmann, 03 Mar 2009
Would you like to say a little bit about what you are expecting the combination of the first two regexes to do? And how did you test whether your expectations are met? Thanks!
Otherwise, it is still very confusing that a framework like cakephp still does not give you xss protection right out of the box.
-- Garibaldo Persisto, 10 Mar 2009
Hi Wil,

Interesting stuff.

Would it be easier/simpler/the same (or is there an additional advantage?) to drop the iconv call and add this to the preg_replace

"/[\x00-\x1f]/" => ''
( OR "/[\x{0000}-\x{002f}]/u" => '' )

I ask primarily because this is what I do, and want to know if I'm missing out/opening things up ( :) ).


-- AD7six, 12 Mar 2009
What about cookies, HTTP headers and POST? It seems an architectural weakness to check these in other places in your app. The best thing would be to check ALL input channels in ONE file and with the same expressions, not scatter the checks over different files. Also tighten your app so that ALL input channels pass through ONE special file, where no side effects can occur. Also, of course, WHITELIST and not BLACKLIST.
I checked cakephp and it seems to be one of the weaker frameworks considering security architecture - it leaves certain important steps to the user, which means the framework does NOT make things more secure for you and you have to think for yourself. Besides it having a lot of development goodness, from a security pov the question arises "so wtf is the framework good for then?". Also the AUTH and especially the ACL system is not integrated into the backend; users have to configure and build it themselves - which shows that most people do not understand how to do it and open new security holes. As we have seen in the recent past, even mature cake devs are not able to implement secure auth themselves - how could users, if the documentation is extremely vague?
cake is a great tool for rapid dev - but it is counterproductive for security as it A) does not give you security for your apps out-of-the-box and B) complicates the way you might be used to securing your site because it injects another level of abstraction. A framework with these weaknesses in security can only be used by extremely experienced developers who are used to studying foreign code - you will have to study cake very deeply to make a secure site. Of course, if you are that experienced, you will already have your own framework that might be much more accurate and adopts common anti-xss measures to the max.
I predict: many, many middle- to low-skilled php developers, typically "designers with some web skillz", will adopt cakephp as a "cool framework" without knowing anything about its inner workings - and they will produce the next big wave of insecure php applications. That is why a framework MUST be secure out-of-the-box - every action that might endanger your site must be safe by default, and an experienced developer *might* disable this behaviour if she knows what she is doing, not the other way around.
-- Jones, 13 Mar 2009
Check out HTML purifier @

It is a php solution that uses whitelists, is recursive, and checks attributes as well as tags. I have never used HTML purifier, but I recently wrote a Java class for doing this exact thing for our web application and it has passed every test I've thrown at it thus far. It seems to be a pretty good solution.
-- Bryan Migliorisi, 18 Mar 2009
Oops, I posted the link to HTML Purified, a Wordpress plugin for sanitizing comments.

The HTML Purifier can be found @
-- Bryan Migliorisi, 18 Mar 2009
