We have a PHP app with a dynamic URL scheme that requires certain characters to be percent-encoded, even “unreserved characters” like parentheses or apostrophes, which aren’t actually required to be encoded. URLs that the app deems to be encoded the “wrong” way are canonicalized and then redirected to the “right” encoding.
But Google and other user agents will canonicalize percent-encoding/decoding differently, meaning when Googlebot requests the page it will ask for the “wrong” URL, and when it gets back a redirect to the “right” URL, Googlebot will refuse to follow the redirect and will refuse to index the page.
Yes, this is a bug on our end. The HTTP specs require that servers treat percent-encoded and non-percent-encoded unreserved characters identically. But fixing the problem in the app code is not straightforward right now, so I was hoping to avoid a code change by using an Apache rewrite rule that would ensure URLs are encoded “properly” from the point of view of the app, meaning that apostrophes, parentheses, etc. are all percent-encoded and that spaces are encoded as + and not %20.
Here’s one example, where I want to rewrite the first and end up with the second form:
- http://www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+(Linux)
- http://www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+%28Linux%29
Here’s another:
- http://www.splunkbase.com/apps/All/4.x/app:Benford's+Law+Fraud+Detection+Add-on
- http://www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on
Here’s another:
- http://www.splunkbase.com/apps/All/4.x/app:Benford%27s%20Law%20Fraud%20Detection%20Add-on
- http://www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on
If the app sees only the second form of these URLs, then it won’t send any redirects and Google will be able to index the page.
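To be precise about the target form, the mapping in the three examples above can be sketched in Python, applied only to the part of the path after `app:`. This is a minimal sketch that assumes the app's "right" form is what `urllib.parse.quote_plus` produces (letters, digits and `_.-~` left alone, spaces as `+`, everything else percent-encoded). That matches the examples but may not match the app in every case, and `canonical_app_name` is a name made up for illustration.

```python
from urllib.parse import quote_plus, unquote_plus


def canonical_app_name(raw: str) -> str:
    """Re-encode one path segment into the form the app is assumed to expect."""
    # Decode first so that "%20", "+" and "%27" collapse to the same text.
    # Caveat: this also treats a literal "+" as a space.
    decoded = unquote_plus(raw)
    # Re-encode: spaces become "+", and "(", ")", "'" become %28, %29, %27.
    return quote_plus(decoded)


for raw in (
    "OPSEC+LEA+for+Check+Point+(Linux)",
    "Benford's+Law+Fraud+Detection+Add-on",
    "Benford%27s%20Law%20Fraud%20Detection%20Add-on",
):
    print(canonical_app_name(raw))
```

Each of the three inputs comes out in the second form listed above, and feeding an already canonical name back in leaves it unchanged, which is the property needed to stop the redirects.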
I’m a newbie with rewrite rules, and it was clear from my read of the mod_rewrite documentation that mod_rewrite does some automatic encoding/decoding, which may help or hurt what I want to do; I’m not sure which.
Any advice on rewrite rules to handle the above cases? I’m OK with a rule for each special character since there aren’t many of them, but a single rule (if possible) would be ideal.
The solution actually may be fairly simple, though it will only work in Apache 2.2 and later due to the use of the `B` flag. I’m not sure whether or not it takes care of every case correctly (admittedly I’m a bit skeptical it doesn’t involve more work than this), but I’m led to believe it should by the source code.
Keep in mind too that the value of `REQUEST_URI` is not updated by mod_rewrite transformations, so if your application relies on that value to determine the requested URL, the changes you make won’t be visible anyway.
The good news is that this can be done in .htaccess, so you have the option of leaving the main configuration untouched if that works better for you.
So, why is there a need to use the `B` flag instead of letting mod_rewrite escape the rewritten URL automatically? When mod_rewrite automatically escapes the URL, it uses `ap_escape_uri` (which apparently has been turned into a macro for `ap_os_escape_path` for some reason…), a function that escapes a limited subset of characters. The `B` flag, however, uses an internal module function called `escape_uri`, which is modeled on PHP’s `urlencode` function.
The implementation of `escape_uri` in the module suggests that alphanumeric characters and underscores are left as-is, spaces are converted to +, and everything else is converted to its escaped equivalent. This seems to be the behaviour that you want, so presumably it should work.
If not, you do have the option of setting up an external program `RewriteMap` that could manipulate your incoming URLs into the correct format. This requires manipulating the Apache configuration though, and a renegade script could cause problems for the server on the whole, so I don’t consider it an ideal solution if it can be avoided.
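Going back to the `B` flag, the escaping rule described above for `escape_uri` is small enough to model directly. The Python sketch below restates that description and is not the module's C code; the name `escape_uri_model`, the byte-wise UTF-8 handling and the uppercase hex digits are assumptions of the sketch. It operates on text that has already been decoded.

```python
import string

# Characters the description says are passed through unchanged.
PASS_THROUGH = set(string.ascii_letters + string.digits + "_")


def escape_uri_model(text: str) -> str:
    """Model of the rule: alphanumerics and "_" kept, space -> "+", rest -> %XX."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in PASS_THROUGH:
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            # Hex case is an arbitrary choice in this sketch.
            out.append(f"%{byte:02X}")
    return "".join(out)


print(escape_uri_model("Benford's Law"))   # Benford%27s+Law
print(escape_uri_model("Point (Linux)"))   # Point+%28Linux%29
print(escape_uri_model("Add-on"))          # Add%2Don
print(escape_uri_model("Law+Fraud"))       # Law%2BFraud
```

Under this rule a hyphen or a literal `+` in the matched text is escaped as well, so the result may differ from the desired URLs in the question wherever those characters appear. That is worth checking against a real server before relying on the flag.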