I would like to write a C# method that transforms any title into a URL-friendly string, similar to what Stack Overflow does:
- replace spaces with dashes
- remove parentheses
- etc.
I’m thinking of removing the reserved characters defined in RFC 3986 (as listed on Wikipedia), but I don’t know whether that would be enough. It would make the links work, but does anyone know which other characters are replaced here on Stack Overflow? I don’t want to end up with %-escapes in my URLs…
Current implementation
string result = Regex.Replace(value.Trim(), @"[!*'""`();:@&+=$,/\\?%#\[\]<>«»{}_]", "");
return Regex.Replace(result.Trim(), @"\s*[\-–—\s]\s*", "-");
My questions
- Which characters should I remove?
- Should I limit the maximum length of resulting string?
- Does anyone know which rules are applied to titles here on SO?
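Rather than looking for things to remove, allow only the unreserved characters and replace everything else. The list of unreserved chars in RFC 3986 is so short (letters, digits, "-", ".", "_" and "~") that it makes for a nice, clear regex.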
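As a minimal sketch of that allow-list idea (the class and method names `SlugSketch` and `ToSlug` are made up for illustration, and this is not necessarily what Stack Overflow itself does), something along these lines should work:

```csharp
using System.Text.RegularExpressions;

public static class SlugSketch
{
    // Matches one or more characters outside the RFC 3986 unreserved set.
    // The dash is deliberately NOT in the allowed list, so runs of dashes
    // and other disallowed characters collapse into a single dash.
    private static readonly Regex NotAllowed = new Regex(@"[^A-Za-z0-9_.~]+");

    public static string ToSlug(string title)
    {
        return NotAllowed.Replace(title.Trim(), "-");
    }
}
```

With this sketch, `SlugSketch.ToSlug("Is the Pope - Catholic?")` gives `Is-the-Pope-Catholic-`; the trailing dash comes from the question mark and is left in place here.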
(Note that I didn’t include the dash in the list of allowed chars; that’s so it gets gobbled up by the "1 or more" operator [+], so that multiple dashes (in the original, generated, or a combination of the two) are collapsed, as per Dominic Rodger’s excellent point.)
You may also want to remove common words ("the", "an", "a", etc.), although doing so can slightly change the meaning of a sentence. You probably want to remove any trailing dashes and periods as well.
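The clean-up steps just mentioned could be sketched roughly as follows. The name `SlugCleanupSketch` and the stop-word list are assumptions for illustration only; pick whatever words suit your content, or skip that step entirely if the change in meaning bothers you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SlugCleanupSketch
{
    // Hypothetical stop-word list, compared case-insensitively.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "an", "a" };

    public static string Clean(string slug)
    {
        // Split on dashes, dropping empty pieces and stop words.
        IEnumerable<string> kept = slug
            .Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !StopWords.Contains(word));

        // Rejoin with single dashes, then strip trailing dashes and periods.
        return string.Join("-", kept).TrimEnd('-', '.');
    }
}
```

For example, `SlugCleanupSketch.Clean("Is-the-Pope-Catholic-")` yields `Is-Pope-Catholic`.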
I also strongly recommend you do what SO and others do: include a unique identifier in addition to the title, and use only that unique ID when processing the URL. So
http://example.com/articles/1234567/is-the-pop-catholic (note the missing ‘e’) and http://example.com/articles/1234567/is-the-pope-catholic resolve to the same resource.
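To illustrate the ID-only lookup, here is a rough sketch that pulls the numeric ID out of a URL and ignores whatever slug follows it. The `/articles/{id}/{slug}` route shape is taken from the example URLs above; the `ArticleUrlSketch` class and `TryGetArticleId` method are hypothetical names, and a real application would more likely lean on its web framework's routing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ArticleUrlSketch
{
    // Assumed route shape: /articles/{numeric id}/{optional slug}
    private static readonly Regex ArticlePath =
        new Regex(@"^/articles/(?<id>[0-9]+)(?:/[^/]*)?/?$");

    public static bool TryGetArticleId(string url, out long id)
    {
        id = 0;
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return false;
        }

        // Only the ID segment matters; the slug is matched but never read.
        Match match = ArticlePath.Match(uri.AbsolutePath);
        return match.Success && long.TryParse(match.Groups["id"].Value, out id);
    }
}
```

Both example URLs produce the ID 1234567 with this sketch, so a misspelled or outdated title in the link still reaches the right article.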