Java UTF-8 to ASCII conversion with supplements

Posted by bozo on Stack Overflow
Published on 2010-03-30T12:43:19Z

Hi,

We accept all sorts of national characters in UTF-8 strings as input, and we need to convert them to ASCII strings on output for some legacy use. (We don't accept Chinese or Japanese characters, only European languages.)

We have a small utility to get rid of all the diacritics:

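import java.io.UnsupportedEncodingException;
import java.text.Normalizer;

public static final String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;

    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);

    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        // NFD (not NFC) splits an accented letter into base letter + combining marks
        sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);

        try {
            // For letters that decompose, the first UTF-8 byte is the ASCII base letter.
            // Letters without a decomposition (such as ß) yield a non-ASCII lead byte here.
            byte[] bLetter = sLetter.getBytes("UTF-8");
            sb.append((char) bLetter[0]);
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: every JVM is required to support UTF-8
        }
    }
    return sb.toString();
}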
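
To see which characters get through, a quick check can compare each character with its NFD form. This is only a sketch: the class name DecompositionCheck and the method decomposes are made up for illustration. Letters such as ß and Đ have no canonical decomposition, so the normalizer returns them unchanged, while é and č split into a base letter plus a combining mark.

```java
import java.text.Normalizer;

final class DecompositionCheck {
    // True if NFD changes the character, i.e. it splits into base letter + combining marks
    static boolean decomposes(char c) {
        String s = String.valueOf(c);
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        // é, č, ß, Đ written as Unicode escapes to avoid source-encoding problems
        char[] samples = { '\u00E9', '\u010D', '\u00DF', '\u0110' };
        for (char c : samples) {
            System.out.println(String.format("U+%04X", (int) c) + " decomposes: " + decomposes(c));
        }
    }
}
```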

The question is how to replace characters such as the German sharp s (ß), Đ and đ, and any other characters that get through the above normalization method, with ASCII replacements ("supplements"). For ß the replacement would probably be "ss", and for Đ it would be either "D" or "Dj".

Is there a simple way to do it, without a million .replaceAll() calls?

For example: Đonardan = Djonardan, Blaß = Blass, and so on.

We could replace all "problematic" characters with an empty space, but we would like to avoid this so that the output stays as similar to the input as possible.
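
One way this could look, as a rough sketch rather than a finished solution: normalize the whole string once, drop the combining marks, and look up the remaining non-ASCII letters in a single table. The class name AsciiFolder, the method toAscii and every entry in the table are assumptions made for illustration; the table is neither complete nor any kind of standard, and the eth (Ð) entries are there only because eth is easily confused with Đ.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class AsciiFolder {
    // Hypothetical replacement table for letters that NFD does not decompose.
    // Entries are illustrative choices and would need extending per language.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<Character, String>();
    static {
        REPLACEMENTS.put('\u00DF', "ss"); // ß
        REPLACEMENTS.put('\u0110', "Dj"); // Đ (D with stroke)
        REPLACEMENTS.put('\u0111', "dj"); // đ
        REPLACEMENTS.put('\u00D0', "Dj"); // Ð (eth), assumed to stand in for Đ
        REPLACEMENTS.put('\u00F0', "dj"); // ð
        REPLACEMENTS.put('\u00C6', "AE"); // Æ
        REPLACEMENTS.put('\u00E6', "ae"); // æ
        REPLACEMENTS.put('\u00D8', "O");  // Ø
        REPLACEMENTS.put('\u00F8', "o");  // ø
        REPLACEMENTS.put('\u0141', "L");  // Ł
        REPLACEMENTS.put('\u0142', "l");  // ł
    }

    static String toAscii(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // Decompose accented letters into base letter + combining marks
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 128) {
                sb.append(c); // already ASCII
            } else if (Character.getType(c) == Character.NON_SPACING_MARK) {
                // combining diacritic left over from decomposition: drop it
            } else {
                // table lookup; unmapped characters fall back to a space
                String replacement = REPLACEMENTS.get(c);
                sb.append(replacement != null ? replacement : " ");
            }
        }
        return sb.toString();
    }
}
```

With this table, AsciiFolder.toAscii("Blaß") gives "Blass" and AsciiFolder.toAscii("Đonardan") gives "Djonardan". Anything missing from the table still ends up as a space, so the table decides how close the output stays to the input.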

Thank you for your answers,

Bozo

© Stack Overflow or respective owner
