Re: Creating a UNICODE text file

From: Bob Crispen (bob.crispen_at_boeing.com)
Date: 18 March 2003


Found this news:b55gbf$25dcke$1_at_ID-103400.news.dfncis.de in comp.os.ms-windows.programmer.win32:

-= BEGIN forwarded message =-

Subject: Re: Creating a UNICODE text file
From: "Tim Robinson" <tim.at.gaat.freeserve.co.uk_at_nowhere.com>
Newsgroups: comp.os.ms-windows.programmer.win32

"Eric Bolstad" <eric.bolstad_at_sas.com> wrote in message news:b55b41$6ek$1_at_license1.unx.sas.com...
> Anybody out there have a more informed answer? I'd appreciate it.

The way to do this depends on what encoding you want the data to have. The ANSI/OEM file APIs are a bit of a red herring, since they only affect the way file *names* are translated, not the data in the files themselves.

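There are three main ways of encoding Unicode in text files: UCS-2, UTF-8 and UTF-7. UCS-2 is raw 16-bit character strings dumped to disk; for example, fwrite(L"Hello, world", sizeof(wchar_t), 12, file), which writes 12 characters of 2 bytes each on Windows, where wchar_t is 16 bits wide.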
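The fwrite call above writes whatever byte order the machine uses. As a rough sketch of the same idea with the byte order pinned down, the following writes each 16-bit code unit low byte first, which is what an Intel machine produces anyway. The helper names ucs2_to_le_bytes and write_ucs2_le are made up for illustration and are not part of any Windows or C library API.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: serialize 16-bit code units as little-endian bytes.
   `out` must have room for 2 * count bytes. Returns the bytes produced. */
size_t ucs2_to_le_bytes(const uint16_t *text, size_t count, unsigned char *out)
{
    size_t i;
    for (i = 0; i < count; i++) {
        out[2 * i]     = (unsigned char)(text[i] & 0xFF); /* low byte first */
        out[2 * i + 1] = (unsigned char)(text[i] >> 8);   /* then high byte */
    }
    return 2 * count;
}

/* Hypothetical helper: write the code units to an already open stream.
   Returns 0 on success, -1 on a short write. */
int write_ucs2_le(FILE *file, const uint16_t *text, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        unsigned char pair[2];
        ucs2_to_le_bytes(&text[i], 1, pair);
        if (fwrite(pair, 1, sizeof pair, file) != sizeof pair)
            return -1;
    }
    return 0;
}
```

The stream should be opened in binary mode ("wb"); in text mode the Windows C runtime expands any 0x0A byte to 0x0D 0x0A, which corrupts 16-bit data.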

UTF-8 and UTF-7 encode Unicode characters as variable-length sequences of 8-bit or 7-bit bytes. Encoding UCS-2 (i.e. LPWSTR strings) as UTF-8 or UTF-7 is trivial but fiddly; luckily, on Windows NT 4/98 and above, WideCharToMultiByte accepts the code page values CP_UTF8 and CP_UTF7 and does the conversion for you. Once you've done the conversion it's just a matter of writing the encoded bytes to disk.
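To show what the "fiddly" part looks like, here is a minimal portable sketch of the UTF-8 case for a single UCS-2 code unit. It is not how WideCharToMultiByte is implemented, just the standard UTF-8 bit layout for values up to 0xFFFF. The function name ucs2_to_utf8 is hypothetical, and surrogate pairs are deliberately ignored on the assumption that the input really is UCS-2.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one UCS-2 code unit as UTF-8.
   Writes 1 to 3 bytes to `out` and returns how many were written.
   Surrogates (0xD800-0xDFFF) get no special treatment here; a UTF-16
   aware encoder would have to combine them into one code point first. */
size_t ucs2_to_utf8(uint16_t c, unsigned char *out)
{
    if (c < 0x80) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

Calling it once per character and passing the resulting bytes to fwrite gives a UTF-8 file; plain ASCII text comes out unchanged, one byte per character.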

One further point is the Byte Order Mark (BOM), which some programs (including Unicode Notepad) use to identify the file. Recall that UCS-2 writes 16-bit characters to disk. On an Intel machine the bytes are stored low byte first (little-endian); on another platform they might be stored the other way round. The BOM character is written at the start of the file and lets programs identify the file as Unicode and work out which encoding and byte order it uses. I found this table in the Platform SDK (under Base Services/International Features/Unicode and Character Sets/Using Strings and Unicode/Using Special Characters):

Byte-order mark  Description

EF BB BF         UTF-8
FF FE            UTF-16/UCS-2, little-endian
FE FF            UTF-16/UCS-2, big-endian
FF FE 00 00      UTF-32/UCS-4, little-endian
00 00 FE FF      UTF-32/UCS-4, big-endian
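A reader can apply these signatures by comparing the first few bytes of the file. The sketch below is one possible way to do it; the bom_kind enum and detect_bom function are invented for this example. It relies on the Unicode standard's definition that FF FE marks little-endian and FE FF marks big-endian UTF-16. The UTF-32 checks come first because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE
} bom_kind;

/* Hypothetical helper: classify the first `len` bytes of a file. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char utf32_le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf32_be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char utf8[]     = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16_le[] = { 0xFF, 0xFE };
    static const unsigned char utf16_be[] = { 0xFE, 0xFF };

    /* Longest signatures first, so FF FE 00 00 is not taken for UTF-16. */
    if (len >= 4 && memcmp(buf, utf32_le, 4) == 0) return BOM_UTF32_LE;
    if (len >= 4 && memcmp(buf, utf32_be, 4) == 0) return BOM_UTF32_BE;
    if (len >= 3 && memcmp(buf, utf8, 3) == 0)     return BOM_UTF8;
    if (len >= 2 && memcmp(buf, utf16_le, 2) == 0) return BOM_UTF16_LE;
    if (len >= 2 && memcmp(buf, utf16_be, 2) == 0) return BOM_UTF16_BE;
    return BOM_NONE;
}
```

One caveat: FF FE 00 00 is also a valid start for a little-endian UTF-16 file whose first character is U+0000, so this ordering is a heuristic rather than a guarantee.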

If in doubt, paste various Unicode characters into Notepad and save the file using your encoding of choice, then open the file in a hex editor and examine it.

BTW: the ultimate authority on Unicode is the Unicode Consortium, who are at http://www.unicode.org/.

--
Tim Robinson (MVP, Windows SDK)
http://www.themobius.co.uk/
-= END forwarded message =-
-- 
Bob Crispen
bob.crispen_at_boeing.com
That which does not kill us has made its last mistake.

