In Microsoft SQL Server, it’s possible to specify an “accent insensitive” collation (for a database, table or column), which means that it’s possible for a query like
SELECT * FROM users WHERE name LIKE 'João'
to find a row with a Joao name.
I know that it’s possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I’m wondering if PostgreSQL supports these “accent insensitive” collations so the SELECT above would work.
Update for Postgres 12 or later
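Postgres 12 adds nondeterministic ICU collations, enabling case-insensitive and accent-insensitive grouping and ordering. The manual notes that ICU locales can only be used if support for ICU was configured when PostgreSQL was built.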
If so, this works for you:
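A minimal sketch of how that could look, assuming a Postgres 12+ server built with ICU support. The collation name ignore_accent and the index name are illustrative; users and name are taken from the question.

```sql
-- Accent-insensitive collation: ks-level1 ignores accents,
-- kc-true keeps case distinctions. The name ignore_accent is illustrative.
CREATE COLLATION ignore_accent (
   provider = icu
 , locale = 'und-u-ks-level1-kc-true'
 , deterministic = false
);

-- Optional B-tree index that uses the same collation.
CREATE INDEX users_name_ignore_accent_idx ON users (name COLLATE ignore_accent);

-- Uses = rather than LIKE: Postgres 12 does not support LIKE
-- with nondeterministic collations.
SELECT * FROM users WHERE name = 'João' COLLATE ignore_accent;
```

With this collation, the comparison should also match a row whose name is stored as Joao.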
Read the manual for details.
This blog post by Laurenz Albe may help to understand.
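But ICU collations also have drawbacks. The manual for Postgres 12 names a performance penalty as the foremost one and adds that certain operations, such as pattern matching, are not possible with nondeterministic collations.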
My "legacy" solution is typically still superior:
For all versions
Use the unaccent module for that – which is completely different from what you are linking to.
Install once per database with:
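The statement for this, assuming the contrib files are already present on the server and your role is allowed to create extensions:

```sql
-- Run once in each database that needs unaccent().
CREATE EXTENSION unaccent;
```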
If you get an error saying that the extension control file unaccent.control could not be opened, install the contrib package on your database server as instructed in this related answer:
Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).
Index
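As a baseline, the query from the question could be written like this sketch. It assumes the extension is installed in a schema on the search_path, and it does not use any index yet.

```sql
-- Strip accents on both sides, then compare with plain equality.
SELECT *
FROM   users
WHERE  unaccent(name) = unaccent('João');
```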
To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.
unaccent() only STABLE not IMMUTABLE
Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons: it depends on the behavior of a dictionary, there is no hard-wired connection to that dictionary, and it therefore also depends on the current search_path, which can change easily.
Some tutorials on the web instruct to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.
Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).
There is an ongoing debate whether to make the variant with two parameters IMMUTABLE, which declares the used dictionary explicitly. Read here or here.
Another alternative would be this module with an IMMUTABLE unaccent() function by MusicBrainz, provided on GitHub. Haven’t tested it myself. I think I have come up with a better idea:
Best for now
This approach is more efficient than other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired, schema-qualified function and dictionary.
Since nesting a non-immutable function would disable function inlining, base it on a copy of the C function, (fake) declared IMMUTABLE as well. Its only purpose is to be used in the SQL function wrapper. Not meant to be used on its own.
The sophistication is needed as there is no way to hard-wire the dictionary in the declaration of the C function. (Would require to hack the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.
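A sketch of the C-function copy, assuming the extension is installed in schema public and that you have the privileges to create C-language functions. The name immutable_unaccent is illustrative; '$libdir/unaccent' and unaccent_dict are the library and symbol behind the module's own two-parameter unaccent().

```sql
-- Copy of the module's two-parameter C function, (fake) declared IMMUTABLE.
-- Helper only: meant to be called from the SQL wrapper, not on its own.
CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
  RETURNS text
  LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';
```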
Then:
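A sketch of the SQL wrapper, assuming the IMMUTABLE copy of the C function exists as public.immutable_unaccent and the dictionary lives in schema public. Both function names are illustrative.

```sql
-- Wrapper with hard-wired, schema-qualified function and dictionary.
-- The single-statement SQL body lets the planner inline it.
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;
```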
In Postgres 14 or later, an SQL-standard function is slightly cheaper still. Using the short form for a single statement:
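The SQL-standard variant might look like this sketch, under the same assumptions: a helper named public.immutable_unaccent (illustrative) and the dictionary in schema public.

```sql
-- Postgres 14+: SQL-standard function body, short form with RETURN.
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
RETURN public.immutable_unaccent(regdictionary 'public.unaccent', $1);
```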
See:
Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.
public being the schema where you installed the extension (public is the default).
The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.
Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.
And that was already twice as fast as the first version, which added SET search_path = public, pg_temp to the function, until I discovered that the dictionary can be schema-qualified, too. Still (Postgres 12) not too obvious from documentation.
If you lack the necessary privileges to create C functions, you are back to the second best implementation: an IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module.
Finally, the expression index to make queries fast:
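A sketch covering both the fallback wrapper and the expression index, assuming the extension lives in schema public. The names f_unaccent and users_unaccent_name_idx are illustrative.

```sql
-- Fallback wrapper around the module's STABLE function.
-- Only needed if you cannot create C functions; skip this statement
-- if a faster f_unaccent() already exists under the same name.
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualified function and dictionary
$func$;

-- Expression index on the wrapper; the definition is the same for either variant.
CREATE INDEX users_unaccent_name_idx ON users (public.f_unaccent(name));
```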
Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.
Adapt queries to match the index (so the query planner will use it):
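For example, assuming an IMMUTABLE wrapper named f_unaccent (an illustrative name) and an expression index on f_unaccent(name):

```sql
-- The left-hand expression has to match the indexed expression.
SELECT *
FROM   users
WHERE  f_unaccent(name) = f_unaccent('João');
```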
We don’t need the function in the expression to the right of the operator. There we can also supply unaccented strings like 'Joao' directly.
The faster function does not translate to much faster queries using the expression index. Index look-ups operate on pre-computed values and are very fast either way. But index maintenance and queries not using the index benefit. And access methods like bitmap index scans may have to recheck values in the heap (the main relation), which involves executing the underlying function. See:
Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name as demonstrated when used in any indexes. See:
Ligatures
In Postgres 9.5 or older, ligatures like ‘Œ’ or ‘ß’ have to be expanded manually (if you need that), since unaccent() always substitutes a single letter.
You will love this update to unaccent in Postgres 9.6:
Bold emphasis mine. Now we get:
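A quick check, assuming the extension is installed and on the search_path. With the rules shipped since Postgres 9.6, each ligature expands to its multi-letter form instead of a single letter.

```sql
SELECT unaccent('Œ Æ œ æ ß');
-- Postgres 9.6 or later: OE AE oe ae ss
```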
Pattern matching
For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GiST expression index. A GIN index of that kind can be used for queries like:
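A sketch of a GIN trigram index together with a query it can serve, assuming an IMMUTABLE wrapper named f_unaccent in schema public. The function and index names are illustrative.

```sql
-- Trigram support, once per database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN expression index with the trigram operator class.
CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (public.f_unaccent(name) gin_trgm_ops);

-- Pattern that is not left-anchored; a plain B-tree index could not help here.
SELECT *
FROM   users
WHERE  f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');
```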
GIN and GIST indexes are more expensive (to maintain) than plain B-tree:
There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:
pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:
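Two sketches of such queries, assuming a trigram expression index on f_unaccent(name), where f_unaccent is an illustrative name for the IMMUTABLE wrapper. The regular expression is only an example pattern.

```sql
-- Case-insensitive pattern matching.
SELECT *
FROM   users
WHERE  f_unaccent(name) ILIKE ('%' || f_unaccent('joão') || '%');

-- Simple case-insensitive regular expression.
SELECT *
FROM   users
WHERE  f_unaccent(name) ~* 'jo[a-z]o';
```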