Recently I had a Javascript application that needed to parse the query parameters that were passed to it in the URL. That is, I have a page that is loaded by a URL something like:
http://site.com/page.html?search=Fruit+Bat&results=25
and I want the Javascript that is embedded in the HTML page to be able to retrieve those arguments, so that it will know that the search string is "Fruit Bat" and the number of results to display is "25".
I did indeed find plenty of solutions to this problem generously proffered on the net, but my first impression, as I read through them, was that they were all wrong.
I noticed four distinct problems with their parsing of query strings. Among the dozens of solutions I saw, only a few handled even one of these issues completely correctly. (Since writing this page I found this solution which appears to do everything correctly.)
There is really only one case where there is a standard for how the query string is to be encoded. If you have an HTML form with method="GET", and submit it, then it will be submitted with content type "application/x-www-form-urlencoded", where whatever was filled into the form is passed in the query string on the URL, instead of in the request body as is done with the more commonly used method="POST". There is a W3C standard for that. It says:
Forms submitted with this content type must be encoded as follows:
- Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
- The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.
So in the specific case that the query string you are parsing was generated by a form submission with method="GET", this is how the arguments will be encoded.
In my case, the parameters are just generated by another one of my pages. In that case I can encode the parameters any darned way I please. If in my application my query strings never contain special characters or spaces, never have null values, and never have multiple values, then the parsers I saw on the net would be fine. I'm not required to follow the W3C recommendations. In fact, the W3C recommends that you make at least one change in such circumstances - they recommend that you use semicolons to separate the key/value pairs instead of ampersands.
However it is certainly standard practice to encode query strings using the "application/x-www-form-urlencoded" rules, and if we are offering a query string parser to the public, it ought to obey those rules, since they are the only ones around. So that's what I'm going to do in this article.
String: "A + B"
Expected Query String Encoding: "A+%2B+B"

    escape("A + B")             = "A%20+%20B"    Wrong!
    encodeURI("A + B")          = "A%20+%20B"    Wrong!
    encodeURIComponent("A + B") = "A%20%2B%20B"  Acceptable, but strange

Now encoding the space as a "%20" instead of a plus sign is harmless. Any decent query string parser will accept that. But not encoding the plus as "%2B" is bad because then it will be interpreted as a space when we decode it. So you can't use escape() or encodeURI() for query string encoding. You can use encodeURIComponent(), but if you really want to get the formally correct result, you have to do:

    encoded= encodeURIComponent(text).replace(/%20/g,'+');
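As a quick check of that rule, here is a minimal sketch that wraps the one-liner in a helper and uses it to assemble the query string from the example URL at the top of this page. The names encodeQueryComponent and buildQueryString are made up for this illustration.

```javascript
// Hypothetical helper: encode one key or value using the form-urlencoded rules.
function encodeQueryComponent(text) {
  return encodeURIComponent(String(text)).replace(/%20/g, '+');
}

// Hypothetical helper: join [key, value] pairs into a query string.
function buildQueryString(pairs) {
  return pairs.map(function (pair) {
    return encodeQueryComponent(pair[0]) + '=' + encodeQueryComponent(pair[1]);
  }).join('&');
}

console.log(encodeQueryComponent('A + B'));
// A+%2B+B
console.log(buildQueryString([['search', 'Fruit Bat'], ['results', 25]]));
// search=Fruit+Bat&results=25
```

Encoding each piece as it is assembled means the separators never need escaping themselves.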
But our interest is in decoding, not encoding, so let's see how the three available decode functions work on the correctly encoded version of the "A + B" string:

Encoded String: "A+%2B+B"
Expected Decoding: "A + B"

    unescape("A+%2B+B")           = "A+++B"    Wrong!
    decodeURI("A+%2B+B")          = "A+%2B+B"  Wrong!
    decodeURIComponent("A+%2B+B") = "A+++B"    Wrong!

So none of these decode standard query strings correctly. If you want to decode a query string component correctly, you need to do:

    text= decodeURIComponent(encoded.replace(/\+/g,' '));

This isn't really a bug in the functions. They just aren't really meant for query string encoding.
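To see that the encoding and decoding one-liners are inverses of each other, here is a small round-trip sketch. The helper names and the sample strings are invented for this example.

```javascript
// Hypothetical helpers wrapping the two one-liners from this article.
function encodeQueryComponent(text) {
  return encodeURIComponent(text).replace(/%20/g, '+');
}
function decodeQueryComponent(encoded) {
  // Pluses must become spaces BEFORE percent-decoding, otherwise a literal
  // plus (encoded as %2B) would be turned into a space too.
  return decodeURIComponent(encoded.replace(/\+/g, ' '));
}

var samples = ['A + B', 'Fruit Bat', 'Rock & Roll', '100% = all', 'caf\u00E9'];
for (var i = 0; i < samples.length; i++) {
  var encoded = encodeQueryComponent(samples[i]);
  var decoded = decodeQueryComponent(encoded);
  console.log(JSON.stringify(samples[i]), '->', encoded, '->', JSON.stringify(decoded));
}
```

Doing the two steps in the opposite order (percent-decode first, then replace pluses) would turn "A+%2B+B" into "A   B".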
Here's the concept. There are a few characters that clearly are the most important ones to encode, because if they weren't encoded then they could mess up the parsing of the URL. Things like '&' and '=' which could be confused with query string separators, and '#' which could be confused with the start of a location tag.
So what encodeURI() does is encode all sorts of other characters, but NOT those important-to-encode characters. So it encodes all the stuff that isn't really all that critical to encode, but leaves behind the important stuff. The list of characters it does not encode (but which encodeURIComponent() does encode) is:
# $ & + , / : ; = ? @
How could this possibly be useful? Well, if you assemble a whole URL without doing any encoding, and you happen to know somehow that the only instances of those particular characters that appear in it are ones used for their special functions, then you could encode the whole URL at once by calling encodeURI() on the whole thing. I guess that's the theory anyway. It seems clearly more robust just to encode the bits as you assemble them.
The decodeURI() function is the inverse of this. It will decode much of the string, but it will leave things like "%26" unchanged, because that would decode into an ampersand, and you wouldn't want to decode those, now would you?
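A short sketch of the difference, using a made-up value that contains two of those special characters:

```javascript
var value = 'Rock & Roll #1';

// encodeURI leaves '&' and '#' alone, so this value would be split or
// truncated if it were pasted into a query string.
console.log(encodeURI(value));           // Rock%20&%20Roll%20#1
// encodeURIComponent escapes them.
console.log(encodeURIComponent(value));  // Rock%20%26%20Roll%20%231

// decodeURI refuses to decode the escapes for those same characters.
console.log(decodeURI('Rock%20%26%20Roll%20%231'));           // Rock %26 Roll %231
console.log(decodeURIComponent('Rock%20%26%20Roll%20%231'));  // Rock & Roll #1
```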
So these are basically useless and you should forget they exist. The versions you want are the ones with the ridiculously long names, encodeURIComponent() and decodeURIComponent().
For an explanation of the exact differences between escape() and encodeURIComponent(), I'll refer you to Javascripter.net which has a nice little table. I assume that the behavior of encodeURIComponent() is the fashionable one, but I can't say I've actually seen a standard on this.
The main thing to know is that they aren't interchangeable. Strings encoded by escape() can't be correctly decoded by decodeURIComponent(), and strings encoded by encodeURIComponent() can't be decoded by unescape(). Either way, every single non-ASCII character will be mangled beyond recognition. Things will work fine if you stick with one pair or the other, but they don't mix and match.
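Here is a minimal sketch of what goes wrong when the pairs are mixed, using two sample non-ASCII characters of my choosing. It assumes an environment where the old escape() and unescape() globals are still available.

```javascript
var copyright = '\u00A9';   // the copyright symbol
var psi = '\u03A8';         // Greek capital psi

// The two encoders disagree on every non-ASCII character.
console.log(escape(copyright));              // %A9
console.log(encodeURIComponent(copyright));  // %C2%A9
console.log(escape(psi));                    // %u03A8
console.log(encodeURIComponent(psi));        // %CE%A8

// Mixing the pairs: unescape() treats each UTF-8 byte as its own character,
// so one character comes back as two.
console.log(unescape(encodeURIComponent(copyright)) === '\u00C2\u00A9');  // true

// The other direction fails outright.
try {
  decodeURIComponent(escape(copyright));
} catch (e) {
  console.log(e.name);  // URIError
}
```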
Well, encodeURIComponent() is fine, but decodeURIComponent() has a serious problem (which it actually shares with decodeURI()).
Let's suppose that the encoded string isn't quite encoded correctly. Maybe it was generated on some other site that used a slightly different encoding standard. Maybe it was encoded by escape() so that there is a © symbol in it encoded as '%A9' instead of '%C2%A9'. Maybe it got mangled a bit when someone cut/pasted it or copied it incorrectly. What happens?
Well, when nasty old deprecated unescape() saw something it didn't understand it just left it untouched, decoding everything else. That's actually quite sensible. You usually want parsers to be as forgiving as possible.
When decodeURIComponent() sees something it doesn't understand, it throws an exception. If you didn't embed the call in a try/catch block, your Javascript program crashes. If you did use the try/catch block, then you can save your program from crashing, but you don't get any kind of partial result back. You've lost the whole string because of one bad character. If you want to be super nit-picky and not accept any input that isn't perfect, then decodeURIComponent() is just what you need, but if you want something more forgiving you are out of luck.
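The all-or-nothing behavior is easy to reproduce. In this sketch, strictDecode is a hypothetical wrapper and the damaged string is a made-up example of text that was encoded with escape():

```javascript
// Hypothetical wrapper: returns the decoded text, or null if decoding fails.
function strictDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (e) {
    return null;   // no partial result is available
  }
}

var good = 'Copyright%20%C2%A9%202013';
var damaged = 'Copyright%20%A9%202013';   // %A9 is what escape() produces

console.log(strictDecode(good));      // Copyright © 2013
console.log(strictDecode(damaged));   // null
console.log(unescape(damaged));       // Copyright © 2013
```

The try/catch keeps the program alive, but the perfectly readable "Copyright" and "2013" are lost along with the one bad sequence.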
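For situations like that, I wrote the decoder below. This should correctly decode anything encodeURIComponent() can generate for characters up to U+FFFF, just like decodeURIComponent() does, but instead of throwing an error when it sees things it doesn't understand, it just silently passes them through. (Characters above U+FFFF, which encodeURIComponent() emits as four-byte sequences, are among the things it passes through undecoded.)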
Note that like all the built-in functions this does not convert pluses into spaces. If you want that you can modify the function (as I do below) or just replace all pluses with spaces before (not after) passing the string in.

    function decode(s)
    {
        // Three-byte UTF-8 sequences
        s= s.replace(/%([EF][0-9A-F])%([89AB][0-9A-F])%([89AB][0-9A-F])/gi,
            function(code,hex1,hex2,hex3)
            {
                var n1= parseInt(hex1,16)-0xE0;
                var n2= parseInt(hex2,16)-0x80;
                if (n1 == 0 && n2 < 32) return code;
                var n3= parseInt(hex3,16)-0x80;
                var n= (n1<<12) + (n2<<6) + n3;
                if (n > 0xFFFF) return code;
                return String.fromCharCode(n);
            });
        // Two-byte UTF-8 sequences
        s= s.replace(/%([CD][0-9A-F])%([89AB][0-9A-F])/gi,
            function(code,hex1,hex2)
            {
                var n1= parseInt(hex1,16)-0xC0;
                if (n1 < 2) return code;
                var n2= parseInt(hex2,16)-0x80;
                return String.fromCharCode((n1<<6)+n2);
            });
        // Single-byte (ASCII) sequences
        s= s.replace(/%([0-7][0-9A-F])/gi,
            function(code,hex)
            {
                return String.fromCharCode(parseInt(hex,16));
            });
        return s;
    }
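For comparison, here is a much shorter sketch of the same idea that leans on the built-in decoder instead of decoding UTF-8 by hand. It is a different and coarser technique than the function above: it hands each unbroken run of %XX escapes to decodeURIComponent() and keeps the run as-is if that throws, so one bad byte spoils its whole run rather than just itself. The name tolerantDecode is invented for this example.

```javascript
// Hypothetical alternative: decode run by run, passing bad runs through.
function tolerantDecode(s) {
  return s.replace(/(?:%[0-9A-Fa-f]{2})+/g, function (run) {
    try {
      return decodeURIComponent(run);
    } catch (e) {
      return run;   // leave anything the built-in rejects untouched
    }
  });
}

console.log(tolerantDecode('%C2%A9%20ok'));      // © ok
console.log(tolerantDecode('%A9 ok %C2%A9'));    // %A9 ok ©
console.log(tolerantDecode('50% off'));          // 50% off
```

Like the function above, this does not convert pluses into spaces.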
The constructor here parses the whole query string and stores the key/value pairs in the associative array this.dict. The functions to retrieve values then work off that data structure. Most of the examples I saw on the net didn't do this, but did a whole separate parse each time the value of a parameter was requested. Since I usually use all the parameters I pass to a page, I think it makes sense to parse the whole string first. But honestly, there are usually few enough parameters that efficiency isn't that important an issue here.

    // Query String Parser
    //
    //   qs= new QueryString()
    //   qs= new QueryString(string)
    //
    //      Create a query string object based on the given query string. If
    //      no string is given, we use the one from the current page by default.
    //
    //   qs.value(key)
    //
    //      Return a value for the named key. If the key was not defined,
    //      it will return undefined. If the key was multiply defined it will
    //      return the last value set. If it was defined without a value, it
    //      will return an empty string.
    //
    //   qs.values(key)
    //
    //      Return an array of values for the named key. If the key was not
    //      defined, an empty array will be returned. If the key was multiply
    //      defined, the values will be given in the order they appeared
    //      in the query string.
    //
    //   qs.keys()
    //
    //      Return an array of unique keys in the query string. The order will
    //      not necessarily be the same as in the original query, and repeated
    //      keys will only be listed once.
    //
    //   QueryString.decode(string)
    //
    //      This static method is an error tolerant version of the builtin
    //      function decodeURIComponent(), modified to also change pluses into
    //      spaces, so that it is suitable for query string decoding. You
    //      shouldn't usually need to call this yourself as the value(),
    //      values(), and keys() methods already decode everything they return.
    //
    // Note: W3C recommends that ; be accepted as an alternative to & for
    // separating query string fields. To support that, simply insert a semicolon
    // immediately after each ampersand in the regular expression in the first
    // function below.

    function QueryString(qs)
    {
        this.dict= {};

        // If no query string was passed in use the one from the current page
        if (!qs) qs= location.search;

        // Delete leading question mark, if there is one
        if (qs.charAt(0) == '?') qs= qs.substring(1);

        // Parse it
        var re= /([^=&]+)(=([^&]*))?/g;
        var match;    // declared locally so it doesn't leak into the global scope
        while (match= re.exec(qs))
        {
            // Keys go through the error tolerant decoder too, so that a
            // malformed key can't throw an exception
            var key= QueryString.decode(match[1]);
            var value= match[3] ? QueryString.decode(match[3]) : '';
            if (this.dict[key])
                this.dict[key].push(value);
            else
                this.dict[key]= [value];
        }
    }

    QueryString.decode= function(s)
    {
        s= s.replace(/\+/g,' ');
        s= s.replace(/%([EF][0-9A-F])%([89AB][0-9A-F])%([89AB][0-9A-F])/gi,
            function(code,hex1,hex2,hex3)
            {
                var n1= parseInt(hex1,16)-0xE0;
                var n2= parseInt(hex2,16)-0x80;
                if (n1 == 0 && n2 < 32) return code;
                var n3= parseInt(hex3,16)-0x80;
                var n= (n1<<12) + (n2<<6) + n3;
                if (n > 0xFFFF) return code;
                return String.fromCharCode(n);
            });
        s= s.replace(/%([CD][0-9A-F])%([89AB][0-9A-F])/gi,
            function(code,hex1,hex2)
            {
                var n1= parseInt(hex1,16)-0xC0;
                if (n1 < 2) return code;
                var n2= parseInt(hex2,16)-0x80;
                return String.fromCharCode((n1<<6)+n2);
            });
        s= s.replace(/%([0-7][0-9A-F])/gi,
            function(code,hex)
            {
                return String.fromCharCode(parseInt(hex,16));
            });
        return s;
    };

    QueryString.prototype.value= function (key)
    {
        var a= this.dict[key];
        return a ? a[a.length-1] : undefined;
    };

    QueryString.prototype.values= function (key)
    {
        var a= this.dict[key];
        return a ? a : [];
    };

    QueryString.prototype.keys= function ()
    {
        var a= [];
        for (var key in this.dict)
            a.push(key);
        return a;
    };
I used a regular expression to parse apart the string. Lots of people first use split('&') and then split('=') on each component. That's actually OK, but I prefer the regular expression, partly because I think it handles bad query strings like "search=E=mc2" better. Mine will interpret that as saying the value of "search" is "E=mc2" while most of the split-based solutions would say the value of "search" is "E".
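The difference can be seen by running both approaches on that one field. This is just a sketch of the comparison; the variable names are mine.

```javascript
var pair = 'search=E=mc2';

// split('=') cuts at every equals sign, so taking element 1 as the value
// loses everything after the second '='.
var parts = pair.split('=');
console.log(parts[0], '->', parts[1]);   // search -> E

// The regular expression from the parser only treats the first '=' as a
// separator; the value runs up to the next '&' or the end of the string.
var match = /([^=&]+)(=([^&]*))?/.exec(pair);
console.log(match[1], '->', match[3]);   // search -> E=mc2
```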
You could pretty easily add methods to this to set key values and add a toString method that would convert the hash table back into a properly encoded query string, and then it would be a pretty nice library for constructing query strings as well as parsing them, but I only needed parsing functionality.
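As a rough idea of what the serializing half might look like, here is a sketch that converts a table of the same shape as the parser's dict (each key mapped to an array of values) back into a query string. The function name dictToQueryString is hypothetical, and this is untested beyond the example shown.

```javascript
// Hypothetical helper: turn a table of key -> array of values back into an
// encoded query string, using the form-urlencoded rules (space becomes '+').
function dictToQueryString(dict) {
  var encode = function (text) {
    return encodeURIComponent(text).replace(/%20/g, '+');
  };
  var fields = [];
  for (var key in dict) {
    // Skip anything inherited from Object.prototype
    if (!Object.prototype.hasOwnProperty.call(dict, key)) continue;
    // A multiply defined key is written out once per value, in order
    for (var i = 0; i < dict[key].length; i++) {
      fields.push(encode(key) + '=' + encode(dict[key][i]));
    }
  }
  return fields.join('&');
}

console.log(dictToQueryString({
  search: ['Rock & Roll'],
  key3: ['dog', 'cat', 'mouse']
}));
// search=Rock+%26+Roll&key3=dog&key3=cat&key3=mouse
```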
A few days after I wrote all this, I decided I needed to redesign my application. Some of the data I was passing from page to page was just sensitive enough that I didn't want it showing in users' location bars. So I decided to pass data in a cookie instead of in query strings. Thus I've never actually used this code in a production application, and it could possibly have some bugs. Count yourself warned.
If you'd like to test this code, just add query parameters to the URL of this page. The parsed out values should appear below:
For example, try these arguments:
key1=&key2&search=Rock+%26+Roll&rock%26roll=here+to+stay&key3=dog&key3=cat&key3=mouse&weird=%26%CE%A8%E2%88%88
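As a cross-check on what the decoded values should be, the same arguments can be fed to the URLSearchParams class that newer browsers and Node.js provide. That availability is an assumption about your environment, and URLSearchParams is a separate built-in API, not the parser from this article. In particular its get() returns the first value of a repeated key, where the value() method above returns the last.

```javascript
var query = 'key1=&key2&search=Rock+%26+Roll&rock%26roll=here+to+stay' +
            '&key3=dog&key3=cat&key3=mouse&weird=%26%CE%A8%E2%88%88';
var params = new URLSearchParams(query);

console.log(JSON.stringify(params.get('key1')));       // ""
console.log(JSON.stringify(params.get('key2')));       // ""
console.log(params.get('search'));                     // Rock & Roll
console.log(params.get('rock&roll'));                  // here to stay
console.log(params.getAll('key3').join(', '));         // dog, cat, mouse
console.log(params.get('weird') === '&\u03A8\u2208');  // true
```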
Last Update: Thu Sep 26 08:56:24 EDT 2013
Thanks to Nathan Alden for pointing out that the regular expressions for decoding hex strings in the decode functions should be case insensitive.