L10N: Trusted test data for Locale-Specific Sorting

Posted by Chris Betti on Stack Overflow
Published on 2011-01-13T19:05:33Z

I'm working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in the applications built on top of the database, the database is supposed to sort that data using a collation appropriate to the locale associated with the data the user is viewing.

I'm trying to find sorted lists of words that meet two criteria:

  1. the sorted order follows the collation rules for the locale
  2. the words listed will allow me to exercise most or all of the specific collation rules for the locale

I'm having trouble finding such trusted test data. Are such sort-testing datasets currently available, and if so, what are they and where can I find them?

"words.en.txt" is an example text file containing American English text:

Andrew
Brian
Chris
Zachary

I am planning to load the list of words into my database in randomized order, sort it there, and check whether the sorted result matches the order in the original file.
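The round-trip check described above could be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the real database, SQLite's built-in BINARY collation stands in for a locale-aware one, and the function name `sorts_as_expected` and the table name `words` are hypothetical.

```python
import random
import sqlite3


def sorts_as_expected(expected, collation="BINARY", seed=0):
    """Load `expected` in shuffled order, sort in the database, compare.

    Assumption: SQLite is only a stand-in for the real database, and
    `collation` must name a collation that the connection knows about.
    """
    shuffled = list(expected)
    random.Random(seed).shuffle(shuffled)  # randomized insertion order

    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE words (word TEXT)")
        conn.executemany(
            "INSERT INTO words (word) VALUES (?)",
            [(w,) for w in shuffled],
        )
        rows = conn.execute(
            f"SELECT word FROM words ORDER BY word COLLATE {collation}"
        ).fetchall()
    finally:
        conn.close()

    # The trusted file order is the oracle: the database must reproduce it.
    return [row[0] for row in rows] == list(expected)


# Contents of the hypothetical words.en.txt, inlined for the example.
english = ["Andrew", "Brian", "Chris", "Zachary"]
print(sorts_as_expected(english))  # True
```

The trusted word list acts as the test oracle here, so the check is only as good as the list itself, which is why the source of the data matters.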

Because I am not fluent in any language other than English, I do not know how to create sample datasets like the following French one (call it "words.fr.txt"):

cote
côte
coté
côté

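French collation traditionally compares diacritical marks from right to left: when two words differ only in their accents, the accent difference closest to the end of the word decides the order. Sorting the same list in code-point order instead produces the following, which is an incorrect collation for French: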

cote
coté
côte
côté
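To make the difference between the two orderings concrete, here is a minimal sketch that compares plain code-point sorting with a simplified "accents compared from the end of the word" key. The helper `french_key` is hypothetical and is not a full implementation of any collation standard: it only handles base letters plus combining marks, ignores case, and is meant purely to reproduce the four-word example.

```python
import unicodedata


def french_key(word):
    """Simplified sort key: base letters first, then accents read backward.

    Assumption: this is an illustration only, not a complete French collation.
    """
    primary = []    # base letters, accents stripped
    secondary = []  # one tuple of combining marks per base letter
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:
                # Attach the mark to the base letter it follows.
                secondary[-1] += (ord(ch),)
        else:
            primary.append(ch.lower())
            secondary.append(())
    # Reversing the accent list makes the last accent the most significant.
    return ("".join(primary), tuple(reversed(secondary)))


# Normalize to precomposed form so code-point order is well defined.
words = [unicodedata.normalize("NFC", w) for w in ["côté", "cote", "coté", "côte"]]

print(sorted(words))                  # ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=french_key))  # ['cote', 'côte', 'coté', 'côté']
```

A production test would rely on the database's own collation rather than a hand-written key; the sketch only shows why the two lists differ.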

Thank you for the help, Chris

© Stack Overflow or respective owner
