Java UTF-8 to ASCII conversion with supplements
Posted by bozo on Stack Overflow
        Published on 2010-03-30T12:43:19Z
Hi,
We accept all sorts of national characters in UTF-8 strings on input, and we need to convert them to ASCII strings on output for some legacy use. (We don't accept Chinese or Japanese characters, only European languages.)
We have a small utility to get rid of all the diacritics:
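import java.io.UnsupportedEncodingException;
import java.text.Normalizer;

public static final String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        // Decompose into base letter + combining marks (NFC would compose instead).
        sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);
        try {
            // For letters that decompose, the first UTF-8 byte is the ASCII base letter.
            byte[] bLetter = sLetter.getBytes("UTF-8");
            sb.append((char) bLetter[0]);
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: every JVM is required to support UTF-8.
        }
    }
    return sb.toString();
}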
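A diacritic can only be stripped this way if the letter actually decomposes into a base letter plus a combining mark. A small check along these lines shows which characters will get through untouched; the class and method names here are hypothetical, and it assumes canonical decomposition (NFD):

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the utility above.
final class DecompositionCheck {

    // True if canonical decomposition (NFD) leaves the character unchanged,
    // i.e. there is no "base letter + combining mark" form to strip down to.
    static boolean survivesDecomposition(char c) {
        String s = String.valueOf(c);
        return Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }
}
```

With this check, a letter like é (U+00E9) is reported as decomposable, while sharp s (U+00DF) and D with stroke (U+0110) are reported as unchanged, which is why they need separate handling.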
The question is how to replace the German sharp s (ß), Đ, đ and other characters that get through the above normalization method with their supplements (for ß the supplement would probably be "ss", and for Đ it would be either "D" or "Dj").
Is there some simple way to do it, without a million .replaceAll() calls?
So for example: Đonardan = Djonardan, Blaß = Blass and so on.
We can replace all "problematic" chars with empty space, but would like to avoid this to make the output as similar to the input as possible.
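To make the goal concrete, here is a minimal sketch of a single-pass approach that uses one lookup table instead of many .replaceAll() calls. The class name AsciiFolder, the method name toAscii and the contents of the SUPPLEMENTS table are all hypothetical and deliberately incomplete. Mapping both U+0110 and the look-alike U+00D0 to "Dj", and falling back to a single space for anything unknown, are assumptions rather than a recommendation:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: names and table contents are illustrative only.
final class AsciiFolder {

    // Hand-maintained supplements for letters that have no Unicode decomposition.
    // Incomplete on purpose; extend it for the languages you accept.
    private static final Map<Character, String> SUPPLEMENTS = new HashMap<>();
    static {
        SUPPLEMENTS.put('\u00DF', "ss"); // sharp s
        SUPPLEMENTS.put('\u0110', "Dj"); // capital D with stroke
        SUPPLEMENTS.put('\u0111', "dj"); // small d with stroke
        SUPPLEMENTS.put('\u00D0', "Dj"); // capital Eth, treated like D with stroke (assumption)
        SUPPLEMENTS.put('\u00C6', "AE"); // capital ligature AE
        SUPPLEMENTS.put('\u00E6', "ae"); // small ligature ae
        SUPPLEMENTS.put('\u00D8', "O");  // capital O with stroke
        SUPPLEMENTS.put('\u00F8', "o");  // small o with stroke
        SUPPLEMENTS.put('\u0141', "L");  // capital L with stroke
        SUPPLEMENTS.put('\u0142', "l");  // small l with stroke
    }

    static String toAscii(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // NFD splits e.g. e-acute into 'e' followed by the combining accent U+0301.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 128) {
                sb.append(c); // already ASCII
                continue;
            }
            String supplement = SUPPLEMENTS.get(c);
            if (supplement != null) {
                sb.append(supplement); // known special letter
            } else if (Character.getType(c) != Character.NON_SPACING_MARK) {
                sb.append(' '); // unknown character: fall back to a space
            }
            // Non-spacing combining marks (the stripped diacritics) are dropped.
        }
        return sb.toString();
    }
}
```

Called as AsciiFolder.toAscii("Blaß"), this returns "Blass". The table stays the one place to maintain, and anything not listed there still degrades to a space rather than to garbage.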
Thank you for your answers,
Bozo
© Stack Overflow or respective owner