I’m working on an internationalized database application that supports multiple locales in a single

Asked: May 21, 2026 (2026-05-21T21:11:59+00:00)

I’m working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in the applications built on top of the database, the database is supposed to sort the data using a collation appropriate to the locale of the data the user is viewing.

I’m trying to find sorted lists of words that meet two criteria:

1. The sorted order follows the collation rules for the locale.
2. The words listed allow me to exercise most or all of the specific collation rules for the locale.

I’m having trouble finding such trusted test data. Are such sort-testing datasets currently available, and if so, what / where are they?

“words.en.txt” is an example text file containing American English text:

Andrew
Brian
Chris
Zachary

I am planning to load the list of words into my database in randomized order, sort it, and check that the sorted output matches the original order of the file.
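As a rough sketch of that round-trip check: the helper below shuffles a trusted list, re-sorts it, and compares the result with the original order. The name `check_round_trip` and the `sort_fn` parameter are hypothetical. In a real test, `sort_fn` would load the words into the database and read them back with the locale's collation. Here Python's built-in `sorted` stands in so the example runs on its own.

```python
import random


def check_round_trip(expected, sort_fn, seed=0):
    """Shuffle a trusted list, re-sort it, and compare with the original order.

    expected: words already in the correct order for the locale.
    sort_fn:  callable that takes a list of words and returns them sorted
              (hypothetical stand-in for "load into the database and sort").
    Returns (ok, actual) so a failing test can report the order it got.
    """
    shuffled = list(expected)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps failures reproducible
    actual = list(sort_fn(shuffled))
    return actual == list(expected), actual


# Contents of words.en.txt from the question, in their trusted order.
english_expected = ["Andrew", "Brian", "Chris", "Zachary"]

# sorted() is only a placeholder for the database's locale-aware sort.
ok, actual = check_round_trip(english_expected, sorted)
```

Returning the actual order alongside the pass/fail flag makes it easier to see which collation rule was violated when a locale fails.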

Because I am not fluent in any language other than English, I do not know how to create sample datasets like the following sample one in French (call it “words.fr.txt”):

cote
côte
coté
côté

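French collation traditionally compares diacritical marks from right to left: when two words differ only in their accents, the accent difference nearest the end of the word decides the order. Sorting the list above in code-point order instead produces the following, which is an incorrect collation: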

cote
coté
côte
côté
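The code-point result is easy to reproduce. This small sketch relies on the fact that Python compares strings by code point, and it assumes the accented letters are precomposed (NFC), which the normalization step enforces:

```python
import unicodedata

# Expected French order from words.fr.txt in the question.
french_expected = ["cote", "côte", "coté", "côté"]

# Normalize to NFC so each accented letter is a single code point.
nfc_words = [unicodedata.normalize("NFC", w) for w in french_expected]

# Python's default string comparison is by code point, not by locale.
code_point_order = sorted(nfc_words)
# code_point_order == ["cote", "coté", "côte", "côté"]

matches_french_order = code_point_order == nfc_words  # False
```

The plain "o" (U+006F) sorts before "ô" (U+00F4) at the second letter, so the accent nearest the start of the word decides the order, which is the opposite of the French rule.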

Thank you for the help,
Chris

1 Answer

Editorial Team · Answer 1 · 2026-05-21T21:12:00+00:00

Here’s what I found.

The Unicode Common Locale Data Repository (CLDR) is the de facto authority on collation rules for international text. I found several word lists that conform to the CLDR rules in the ICU Project’s ICU Demonstration – Locale Explorer tool. ICU (International Components for Unicode) uses the CLDR rules to solve common internationalization problems, including locale-aware sorting. It’s a great library; check it out.

In some cases, it was useful to construct nonsense terms by reverse-engineering the CLDR rules directly. Search engines available in the United States were not well suited to finding foreign terms with the case, diacritic, and other nuances I needed for this testing. In retrospect, international search engines might have been better suited to the task.
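When constructing terms by hand, a small independent model of a single rule can help sanity-check the expected order. The sketch below models only the right-to-left accent comparison from the French example in the question. It is not CLDR or ICU: the helper name `french_accent_key` is hypothetical, the accent weights are simply the code points of the combining marks (an assumption, not CLDR's weights), and case, punctuation, and every other rule are ignored.

```python
import unicodedata


def french_accent_key(word):
    """Toy sort key: compare base letters first, then accents from the end.

    Illustration only; real collation should come from ICU/CLDR.
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = []
    accents = []  # one tuple of combining-mark code points per base letter
    for ch in decomposed:
        if unicodedata.combining(ch):
            if accents:  # ignore a stray mark with no preceding base letter
                accents[-1] += (ord(ch),)
        else:
            base.append(ch)
            accents.append(())
    # Reversing the accent list makes the last accent difference decide first.
    return ("".join(base), accents[::-1])


words = ["côté", "coté", "côte", "cote"]
backward_accent_order = sorted(words, key=french_accent_key)
# backward_accent_order == ["cote", "côte", "coté", "côté"]
```

A model this small is only useful for cross-checking hand-built lists for one rule at a time. The trusted expected order should still come from the CLDR rules themselves.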
