
Javascript Madness: Query String Parsing

Jan Wolter
July 29, 2011

Recently I had a Javascript application that needed to parse the query parameters that were passed to it in the URL. That is, I have a page that is loaded by a URL something like:

http://site.com/page.html?search=Fruit+Bat&results=25

and I want the Javascript that is embedded in the HTML page to be able to retrieve those arguments, so that it will know that the search string is "Fruit Bat" and the number of results to display is "25".

1. What I Found on the Net

I figured this was a normal enough kind of thing to want to do, so I shouldn't have to write it from scratch. I should be able to google up something someone else wrote.

I did indeed find plenty of solutions to this problem generously proffered on the net, but my first impression, as I read through them, was that they were all wrong.

I noticed four distinct problems with their parsing of query strings. Among the dozens of solutions I saw, only a few handled even one of these issues completely correctly. (Since writing this page I found this solution which appears to do everything correctly.)

  1. Failure to Fully Decode Values. What if my search string is "Rock & Roll"? Then the URL will normally contain "search=Rock+%26+Roll". Most of the solutions I saw did no decoding of this at all. They would return "Rock+%26+Roll" as the value of the string. I saw one that tried to handle this by calling decodeURI() on the whole string before parsing, but that would decode neither the plus signs nor the "%26" in "Rock+%26+Roll".

  2. Failure to Decode Keys. It is possible for the key strings to be encoded as well. Something like "rock%26roll=here+to+stay" is perfectly legal, and should give the result that "rock&roll" is "here to stay". I didn't see a single solution that handled this correctly, though admittedly it is a case that doesn't come up often.

  3. Failure to Correctly Handle Null Values. Query strings can contain parameters with a null value, written as "key1=&key2=" or as "key1&key2". Some of the code I saw would not distinguish these cases from key1 and key2 being undefined.

  4. Failure to Correctly Handle Multiple Values. A query string can contain multiple values for the same key, like "key=dog&key=cat&key=mouse". This is the normal result from form elements like multiple selects. It should be possible to get all the values in such cases, not just the first or last.
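
To make the four problems concrete, here is a sketch of the kind of naive parser I kept running into. The name naiveParse is made up for illustration and the code is not taken from any particular site; it just shows how each failure shows up in practice.

```javascript
// A deliberately naive parser, typical of the flawed approach described
// above. naiveParse is a hypothetical name used only for illustration.
function naiveParse(qs)
{
    var result= {};
    var pairs= qs.split('&');
    for (var i= 0; i < pairs.length; i++)
    {
        var kv= pairs[i].split('=');
        result[kv[0]]= kv[1];      // no decoding, later values overwrite
    }
    return result;
}

var r= naiveParse("search=Rock+%26+Roll&rock%26roll=here+to+stay&key1=&key2&key=dog&key=cat");
// r["search"]      is "Rock+%26+Roll"     (problem 1: value not decoded)
// r["rock%26roll"] is "here+to+stay"      (problem 2: key not decoded)
// r["key1"] is "" but r["key2"] is undefined (problem 3: key2 looks unset)
// r["key"]         is "cat"               (problem 4: "dog" was lost)
```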

2. Query String Standards (or the Lack Thereof)

All this looked rather unsatisfactory to me, but, in a way, I was actually wrong. You see, really there is no standard defining the correct way to encode key/value pairs into a query string. As far as the URI standards go, the query string is just a string whose syntax is defined by the application. Maybe it contains key/value pairs, and maybe it doesn't. About the only hard rule is that it shouldn't ever contain a pound sign.

There is really only one case where there is a standard for how the query string is to be encoded. If you have an HTML form with method="GET" and submit it, then it will be sent with content type "application/x-www-form-urlencoded", and whatever was filled into the form is passed in the query string on the URL, instead of in the request body as is done with the more commonly used method="POST". There is a W3C standard for that. It says:

Forms submitted with this content type must be encoded as follows:
  1. Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
  2. The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.
So in the specific case that the query string you are parsing was generated by a form submission with method="GET", then this is how the arguments will be encoded.

In my case, the parameters are just generated by another one of my pages. In that case I can encode the parameters any darned way I please. If in my application my query strings never contain special characters or spaces, never have null values, and never have multiple values, then the parsers I saw on the net would be fine. I'm not required to follow the W3C recommendations. In fact, the W3C recommends that you make at least one change in such circumstances - they recommend that you use semicolons to separate the key/value pairs instead of ampersands.

However it is certainly standard practice to encode query strings using the "application/x-www-form-urlencoded" rules, and if we are offering a query string parser to the public, it ought to obey those rules, since they are the only ones around. So that's what I'm going to do in this article.
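
As an aside on the semicolon recommendation mentioned above, accepting both separators costs almost nothing. The following is only a sketch of the idea, with splitPairs being a hypothetical helper name, not something from any standard:

```javascript
// Split a query string into "key=value" pieces on either '&' or ';'.
// splitPairs is a hypothetical helper used only for illustration.
function splitPairs(qs)
{
    return qs.split(/[&;]/);
}

var a= splitPairs("search=Fruit+Bat&results=25");
var b= splitPairs("search=Fruit+Bat;results=25");
// Both a and b are ["search=Fruit+Bat", "results=25"]
```

The catch is that a parser doing this can no longer accept an unencoded semicolon inside a value, so it only makes sense if whatever generates your query strings encodes semicolons as "%3B".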

3. Javascript Encode/Decode Functions

Before we get down to the task of writing a query string parser, we need to closely examine the handy tools built into Javascript to help us with this task. Javascript provides three encoding functions and three decoding functions for URLs. Originally there was just escape() and unescape(). These have been deprecated and replaced with two pairs, encodeURI() and decodeURI(), and encodeURIComponent() and decodeURIComponent().

3.1. The Built-In Functions Are Not Meant for Query Strings

The first thing to understand about them is that none of the functions are actually meant for query string encoding. As noted in the quote above, query strings are encoded slightly differently than other parts of the URI, with spaces being represented by plus signs, not by "%20". Let's see what happens if you encode the string "A + B" with these functions:
String: "A + B"
Expected Query String Encoding: "A+%2B+B"
escape("A + B")             =  "A%20+%20B"      Wrong!
encodeURI("A + B")          =  "A%20+%20B"      Wrong!
encodeURIComponent("A + B") =  "A%20%2B%20B"    Acceptable, but strange
Now encoding the space as a "%20" instead of a plus sign is harmless. Any decent query string parser will accept that. But not encoding the plus as "%2B" is bad because then it will be interpreted as a space when we decode it. So you can't use escape() or encodeURI for query string encoding. You can use encodeURIComponent(), but if you really want to get the formally correct result, you have to do:
   encoded= encodeURIComponent(text).replace(/%20/g,'+');

But our interest is in decoding, not encoding, so let's see how the three available decode functions work on the correctly encoded version of the "A + B" string:

Encoded String: "A+%2B+B"
Expected Decoding: "A + B"
unescape("A+%2B+B")           =  "A+++B"      Wrong!
decodeURI("A+%2B+B")          =  "A+%2B+B"    Wrong!
decodeURIComponent("A+%2B+B") =  "A+++B"      Wrong!
So none of these decode standard query strings correctly. If you want to decode a query string component correctly, you need to do:
   text= decodeURIComponent(encoded.replace(/\+/g,' '));
This isn't really a bug in the functions. They just aren't really meant for query string encoding.
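
Wrapped up as a pair of helpers, the two one-liners above look something like this. The names formEncode and formDecode are my own inventions for this sketch; nothing built in goes by those names.

```javascript
// Hypothetical helper names; the bodies are the two one-liners above.
function formEncode(text)
{
    return encodeURIComponent(text).replace(/%20/g,'+');
}

function formDecode(encoded)
{
    // Pluses must become spaces BEFORE percent-decoding, otherwise a
    // literal plus that arrived as %2B would be turned into a space too.
    return decodeURIComponent(encoded.replace(/\+/g,' '));
}

var e= formEncode("A + B");          // "A+%2B+B"
var d= formDecode(e);                // "A + B"
var rr= formEncode("Rock & Roll");   // "Rock+%26+Roll"
```

Note that formDecode() still inherits the exception-throwing behavior of decodeURIComponent() on malformed input.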

3.2. The encodeURI() and decodeURI() Functions are Stupid

The second thing to note is that encodeURI() and decodeURI() are idiotic and worthless. I've strained my brain pretty hard trying to think of any application where it would be entirely sensible to use them, and haven't come up with much of anything.

Here's the concept. There are a few characters that clearly are the most important ones to encode, because if they weren't encoded then they could mess up the parsing of the URL. Things like '&' and '=' which could be confused with query string separators, and '#' which could be confused with the start of a location tag.

So what encodeURI() does is encode all sorts of other characters, but NOT those important-to-encode characters. So it encodes all the stuff that isn't really all that critical to encode, but leaves behind the important stuff. The list of characters it does not encode (but which encodeURIComponent() does encode) is:

    # $ & + , / : ; = ? @

How could this possibly be useful? Well, if you assemble a whole URL without doing any encoding, and you happen to know somehow that the only instances of those particular characters that appear in it are ones used for their special functions, then you could encode the whole URL at once by calling encodeURI() on the whole thing. I guess that's the theory anyway. It seems clearly more robust just to encode the bits as you assemble them.

The decodeURI() function is the inverse of this. It will decode much of the string, but it will leave things like "%26" unchanged, because that would decode into an ampersand, and you wouldn't want to decode those, now would you?

So these are basically useless and you should forget they exist. The versions you want are the ones with the ridiculously long names, encodeURIComponent() and decodeURIComponent().
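
A quick side-by-side shows the difference. The sample string is just something I made up to pack several of the troublesome characters into one value:

```javascript
var value= "Rock & Roll = #1?";

// encodeURI() leaves the separators alone, so the result is useless as
// a query string value
var a= encodeURI(value);            // "Rock%20&%20Roll%20=%20#1?"

// encodeURIComponent() encodes them
var b= encodeURIComponent(value);   // "Rock%20%26%20Roll%20%3D%20%231%3F"

// The decoders mirror this: decodeURI() refuses to touch "%26"
var c= decodeURI("Rock%20%26%20Roll");            // "Rock %26 Roll"
var d= decodeURIComponent("Rock%20%26%20Roll");   // "Rock & Roll"
```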

3.3. Non-ASCII Characters

URLs can only contain standard printable ASCII characters, so any Unicode characters must be encoded. All three encoding functions will do this for you, but escape() does it entirely differently from the newer functions.

For an explanation of the exact differences, I'll refer you to Javascripter.net which has a nice little table. I assume that the behavior of encodeURIComponent() is fashionable, but I can't say I've actually seen a standard on this.

The main thing to know is that they aren't interchangeable. Strings encoded by escape() can't be correctly decoded by decodeURIComponent(), and strings encoded by encodeURIComponent() can't be decoded by unescape(). Either way, every single non-ASCII character will be mangled beyond recognition. Things will work fine if you stick with one pair or the other, but they don't mix and match.
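
Here is a small sketch of the mismatch, using the copyright sign and a Greek capital psi as sample characters. It assumes an environment where the deprecated escape() and unescape() are still available, which is the case in browsers and in Node.

```javascript
var copyright= "\u00A9";                 // the copyright sign

var e1= escape(copyright);               // "%A9"     (one Latin-1 byte)
var e2= encodeURIComponent(copyright);   // "%C2%A9"  (two UTF-8 bytes)

// Characters above 0xFF get a nonstandard %uXXXX form from escape()
var psi= escape("\u03A8");               // "%u03A8"

// Decoding the UTF-8 form with unescape() yields two characters,
// capital A with circumflex followed by the copyright sign
var mixed= unescape(e2);                 // "\u00C2\u00A9"
```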

3.4. Stupid Error Handling

So, given that escape() and unescape() are deprecated, and encodeURI() and decodeURI() are stupid, then encodeURIComponent() and decodeURIComponent() must be the good ones to use, right?

Well, encodeURIComponent() is fine, but decodeURIComponent() has a serious problem (which it actually shares with decodeURI()).

Let's suppose that the encoded string isn't quite encoded correctly. Maybe it was generated on some other site that used a slightly different encoding standard. Maybe it was encoded by escape() so that there is a © symbol in it encoded as '%A9' instead of '%C2%A9'. Maybe it got mangled a bit when someone cut/pasted it or copied it incorrectly. What happens?

Well, when nasty old deprecated unescape() saw something it didn't understand, it just left it untouched and decoded everything else. That's actually quite sensible. You usually want parsers to be as forgiving as possible.

When decodeURIComponent() sees something it doesn't understand, it throws an exception. If you didn't embed the call in a try/catch block, your Javascript program crashes. If you did use the try/catch block, then you can save your program from crashing, but you don't get any kind of partial result back. You've lost the whole string because of one bad character. If you want to be super nit-picky and not accept any input that isn't perfect, then decodeURIComponent() is just what you need, but if you want something more forgiving you are out of luck.
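
The difference in behavior is easy to demonstrate. In this sketch tryDecode is a hypothetical wrapper of my own, and returning null on failure is just one arbitrary choice of what to do when the exception fires:

```javascript
// Hypothetical wrapper: returns null when decodeURIComponent() throws.
function tryDecode(s)
{
    try {
        return decodeURIComponent(s);
    } catch (e) {
        // e is a URIError, and no partial result is available
        return null;
    }
}

var good= tryDecode("%C2%A9%202011");   // "\u00A9 2011"
var bad= tryDecode("%A9%202011");       // null, the " 2011" part is lost too
var old= unescape("%A9%202011");        // "\u00A9 2011"
var stray= unescape("100%");            // "100%", the lone percent is kept
```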

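For situations like that, I wrote the decoder below. This should correctly decode anything encodeURIComponent() can generate for characters up to U+FFFF, just like decodeURIComponent() does, but instead of throwing an error when it sees things it doesn't understand, it just silently passes them through. (Four-byte UTF-8 sequences, which encode characters above U+FFFF, fall into the "doesn't understand" category and are passed through undecoded.)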

function decode(s)
{
    s= s.replace(/%([EF][0-9A-F])%([89AB][0-9A-F])%([89AB][0-9A-F])/gi,
        function(code,hex1,hex2,hex3)
        {
            var n1= parseInt(hex1,16)-0xE0;
            var n2= parseInt(hex2,16)-0x80;
            if (n1 == 0 && n2 < 32) return code;
            var n3= parseInt(hex3,16)-0x80;
            var n= (n1<<12) + (n2<<6) + n3;
            if (n > 0xFFFF) return code;
            return String.fromCharCode(n);
        });
    s= s.replace(/%([CD][0-9A-F])%([89AB][0-9A-F])/gi,
        function(code,hex1,hex2)
        {
            var n1= parseInt(hex1,16)-0xC0;
            if (n1 < 2) return code;
            var n2= parseInt(hex2,16)-0x80;
            return String.fromCharCode((n1<<6)+n2);
        });
    s= s.replace(/%([0-7][0-9A-F])/gi,
        function(code,hex)
        {
            return String.fromCharCode(parseInt(hex,16));
        });
    return s;
}
Note that like all the built-in functions this does not convert pluses into spaces. If you want that you can modify the function (as I do below) or just replace all pluses with spaces before (not after) passing the string in.

4. My Query String Parser

So, here's my version of a query parser. (As with all the code on these "Javascript Madness" pages, you may use this freely in any way without worrying about my copyright at all, and, as usual, it is offered without warranty.)
// Query String Parser
//
//    qs= new QueryString()
//    qs= new QueryString(string)
//
//        Create a query string object based on the given query string. If
//        no string is given, we use the one from the current page by default.
//
//    qs.value(key)
//
//        Return a value for the named key.  If the key was not defined,
//        it will return undefined. If the key was multiply defined it will
//        return the last value set. If it was defined without a value, it
//        will return an empty string.
//
//    qs.values(key)
//
//        Return an array of values for the named key. If the key was not
//        defined, an empty array will be returned. If the key was multiply
//        defined, the values will be given in the order they appeared in
//        the query string.
//
//    qs.keys()
//
//        Return an array of unique keys in the query string.  The order will
//        not necessarily be the same as in the original query, and repeated
//        keys will only be listed once.
//
//    QueryString.decode(string)
//
//        This static method is an error tolerant version of the builtin
//        function decodeURIComponent(), modified to also change pluses into
//        spaces, so that it is suitable for query string decoding. You
//        shouldn't usually need to call this yourself as the value(),
//        values(), and keys() methods already decode everything they return.
//
// Note: W3C recommends that ; be accepted as an alternative to & for
// separating query string fields. To support that, simply insert a semicolon
// immediately after each ampersand in the regular expression in the first
// function below.

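function QueryString(qs)
{
    this.dict= {};

    // If no query string was passed in use the one from the current page
    if (!qs) qs= location.search;

    // Delete leading question mark, if there is one
    if (qs.charAt(0) == '?') qs= qs.substring(1);

    // Parse it
    var re= /([^=&]+)(=([^&]*))?/g;
    var match;      // declared locally so it doesn't leak into a global
    while ((match= re.exec(qs)))
    {
        // Keys get the same error-tolerant decoding as values
        var key= QueryString.decode(match[1]);
        var value= match[3] ? QueryString.decode(match[3]) : '';
        // Test own properties only, so a key like "toString" doesn't
        // collide with something inherited from Object.prototype
        if (Object.prototype.hasOwnProperty.call(this.dict,key))
            this.dict[key].push(value);
        else
            this.dict[key]= [value];
    }
}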

QueryString.decode= function(s)
{
    s= s.replace(/\+/g,' ');
    s= s.replace(/%([EF][0-9A-F])%([89AB][0-9A-F])%([89AB][0-9A-F])/gi,
        function(code,hex1,hex2,hex3)
        {
            var n1= parseInt(hex1,16)-0xE0;
            var n2= parseInt(hex2,16)-0x80;
            if (n1 == 0 && n2 < 32) return code;
            var n3= parseInt(hex3,16)-0x80;
            var n= (n1<<12) + (n2<<6) + n3;
            if (n > 0xFFFF) return code;
            return String.fromCharCode(n);
        });
    s= s.replace(/%([CD][0-9A-F])%([89AB][0-9A-F])/gi,
        function(code,hex1,hex2)
        {
            var n1= parseInt(hex1,16)-0xC0;
            if (n1 < 2) return code;
            var n2= parseInt(hex2,16)-0x80;
            return String.fromCharCode((n1<<6)+n2);
        });
    s= s.replace(/%([0-7][0-9A-F])/gi,
        function(code,hex)
        {
            return String.fromCharCode(parseInt(hex,16));
        });
    return s;
};

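QueryString.prototype.value= function (key)
{
    // Ignore properties inherited from Object.prototype, like "toString"
    var a= Object.prototype.hasOwnProperty.call(this.dict,key) ?
        this.dict[key] : undefined;
    return a ? a[a.length-1] : undefined;
};

QueryString.prototype.values= function (key)
{
    var a= Object.prototype.hasOwnProperty.call(this.dict,key) ?
        this.dict[key] : undefined;
    return a ? a : [];
};

QueryString.prototype.keys= function ()
{
    var a= [];
    for (var key in this.dict)
        if (Object.prototype.hasOwnProperty.call(this.dict,key))
            a.push(key);
    return a;
};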
The constructor here parses the whole query string and stores the key/value pairs in an associative array, the dict property of the QueryString object. The functions to retrieve values then work off that data structure. Most of the examples I saw on the net didn't do this, but did a whole separate parse each time the value of a parameter was requested. Since I usually use all the parameters I pass to a page, I think it makes sense to parse the whole string first. But honestly, there are usually few enough parameters that efficiency isn't that important an issue here.

I used a regular expression to parse the string apart. Lots of people first use split('&') and then split('=') on each component. That's actually OK, but I prefer the regular expression, partly because I think it handles bad query strings like "search=E=mc2" better. Mine will interpret that as saying the value of "search" is "E=mc2" while most of the split-based solutions would say the value of "search" is "E".
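
The difference between the two approaches on that input can be seen in isolation. This sketch uses the same regular expression as the parser, without the global flag since there is only one pair to match:

```javascript
var pair= "search=E=mc2";

// Split-based approach: the value gets cut at the second equals sign
var parts= pair.split('=');          // ["search", "E", "mc2"]
var splitValue= parts[1];            // "E"

// Regular expression approach: everything after the first '=' is the value
var m= /([^=&]+)(=([^&]*))?/.exec(pair);
var regexKey= m[1];                  // "search"
var regexValue= m[3];                // "E=mc2"
```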

You could pretty easily add methods to this to set key values and add a toString method that would convert the hash table back into a properly encoded query string, and then it would be a pretty nice library for constructing query strings as well as parsing them, but I only needed parsing functionality.
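
For the curious, the encoding half might look roughly like the following. This is only a sketch, written as a standalone function rather than a method: buildQueryString is a hypothetical name, and it assumes its argument maps each key to an array of values, the same layout the parser keeps in its dict.

```javascript
// Hypothetical helper, not part of the QueryString object above.
// dict maps each key to an array of string values.
function buildQueryString(dict)
{
    var parts= [];
    for (var key in dict)
    {
        if (!Object.prototype.hasOwnProperty.call(dict,key)) continue;
        var k= encodeURIComponent(key).replace(/%20/g,'+');
        var values= dict[key];
        for (var i= 0; i < values.length; i++)
        {
            var v= encodeURIComponent(values[i]).replace(/%20/g,'+');
            parts.push(k + '=' + v);    // empty values come out as "key="
        }
    }
    return parts.join('&');
}

var built= buildQueryString({
    search: ["Rock & Roll"],
    key3: ["dog", "cat"],
    key1: [""]
});
// built is "search=Rock+%26+Roll&key3=dog&key3=cat&key1="
```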

A few days after I wrote all this, I decided I needed to redesign my application. Some of the data I was passing from page to page was just sensitive enough that I didn't want it showing in users' location bars. So I decided to pass data in a cookie instead of in query strings. Thus I've never actually used this code in a production application, and it could possibly have some bugs. Count yourself warned.

If you'd like to test this code, just add query parameters to the URL of this page. The parsed out values should appear below:

Key    Last Value    All Values
For example, try these arguments:
key1=&key2&search=Rock+%26+Roll&rock%26roll=here+to+stay&key3=dog&key3=cat&key3=mouse&weird=%26%CE%A8%E2%88%88

Last Update: Thu Sep 26 08:56:24 EDT 2013

Thanks to Nathan Alden for pointing out that the regular expressions for decoding hex strings in the decode functions should be case insensitive.