How do you get Matlab to write the BOM (byte order markers) for UTF-16 text files?

Posted by Richard Povinelli on Stack Overflow See other posts from Stack Overflow or by Richard Povinelli
Published on 2011-11-08T15:24:34Z Indexed on 2011/11/11 17:52 UTC
Read the original article Hit count: 666

Filed under:
|
|
|
|

I am creating UTF16 text files with Matlab, which I am later reading in using Java. In Matlab, I open a file called fileName and write to it as follows:

fid = fopen(fileName, 'w','n','UTF16-LE');
fprintf(fid,"Some stuff.");

In Java, I can read the text file using the following code:

FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16LE"); 
String s = scanner.nextLine();

Here is the hex output:

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13
00000000  73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  s.o.m.e. .s.t.u.f.f.

The above approach works fine. But, I want to be able to write out the file using UTF16 with a BOM to give me more flexibility so that I don't have to worry about big or little endian. In Matlab, I've coded:

fid = fopen(fileName, 'w','n','UTF16');
fprintf(fid,"Some stuff.");

In Java, I change the code to:

FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16");
String s = scanner.nextLine();

In this case, the string s is garbled, because Matlab is not writing the BOM. I can get the Java code to work just fine if I add the BOM manually. With the added BOM, the following file works fine.

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15
00000000  FF FE 73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  ÿþs.o.m.e. .s.t.u.f.f.

How can I get Matlab to write out the BOM? I know I could write the BOM out separately, but I'd rather have Matlab do it automatically.

Addendum

I selected the answer below from Amro because it exactly solves the question I posed.

One key discovery for me was the difference between the Unicode Standard and a UTF (Unicode transformation format) (see http://unicode.org/faq/utf_bom.html). The Unicode Standard provides unique identifiers (code points) for characters. UTFs provide mappings of every code point "to a unique byte sequence." Since all but a handful of the characters I am using are in the first 128 code points, I'm going to switch to using UTF-8 as Romeo suggests. UTF-8 is supported by Matlab (The warning shown below won't need to be suppressed.) and Java, and for my application will generate smaller text files.

I suppress the Matlab warning

Warning: The encoding 'UTF-16LE' is not supported.

with

warning off MATLAB:iofun:UnsupportedEncoding;

© Stack Overflow or respective owner

Related posts about java

Related posts about matlab