After researching a bit how different people slugify titles, I’ve noticed that most approaches don’t say how to deal with non-English titles.
URL encoding is very restrictive. See http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
So, for example, how do folks build title slugs for things like
“Una lágrima cayó en la arena”
One can come up with a reasonable table for Indo-European languages, i.e. text that can be encoded in ISO-8859-1. For example, a conversion table would translate ‘á’ => ‘a’, so the slug would be
“una-lagrima-cayo-en-la-arena”
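A minimal sketch of that table-based idea, in Python 3. The names `CHAR_MAP` and `slugify_with_table` are made up for illustration, and the table is deliberately tiny; a real one would need many more entries.

```python
import re

# Illustrative, deliberately incomplete table (an assumption, not a complete mapping).
CHAR_MAP = {
    "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u",
    "ü": "u", "ñ": "n", "ç": "c",
}
TRANSLATION = str.maketrans(CHAR_MAP)


def slugify_with_table(title):
    # Lowercase first so the table only needs lowercase keys.
    text = title.lower().translate(TRANSLATION)
    # Collapse every run of characters outside a-z and 0-9 into one hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text)
    # Trim hyphens left over from leading/trailing punctuation or quotes.
    return text.strip("-")


print(slugify_with_table("Una lágrima cayó en la arena"))
# una-lagrima-cayo-en-la-arena
```

Any character that is missing from the table simply turns into a hyphen here, so the quality of the slug depends entirely on how complete the table is.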
However, I’m using Unicode (in particular the UTF-8 encoding), so there are no guarantees about what sort of code points I’m going to get (I have to be prepared for things that can’t be encoded in ISO-8859-1).
In a nutshell: how do I deal with this? Should I come up with a conversion table for characters in the ISO-8859-1 range (code points up to 255) and drop everything else?
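One way the "convert what you can, drop the rest" plan might look, as a rough sketch in Python 3. It swaps the hand-written table for the standard library's Unicode decomposition (`unicodedata.normalize`), which is a different technique from a lookup table; the function name `slugify_ascii` is hypothetical.

```python
import re
import unicodedata


def slugify_ascii(title):
    # NFKD splits an accented letter into base letter + combining mark,
    # e.g. "á" becomes "a" followed by U+0301.
    decomposed = unicodedata.normalize("NFKD", title)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # and every other character that has no ASCII form.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of anything outside a-z and 0-9 into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


print(slugify_ascii("Una lágrima cayó en la arena"))
# una-lagrima-cayo-en-la-arena
```

Two limits are worth keeping in mind. Letters with no decomposition, such as ‘ø’, are dropped rather than transliterated, and a title written entirely in a non-Latin script comes back as an empty string, so the caller would need a fallback (a numeric ID, for instance).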
EDIT: To give a bit more context: a priori, I don’t really expect to slugify data in non-Indo-European languages, but I’d like to have a plan in case I encounter such data.
A conversion table for extended ASCII would be nice. Any pointers?
Also, since people are asking: I’m using Python, running on Google App Engine.
A nearly complete transliteration table (for the Latin, Greek and Cyrillic character sets) can be found in the slughifi library. It is geared towards Django, but can easily be modified to fit general needs (I use it with a Werkzeug-based app on App Engine).