From: Bob Crispen (bob.crispen_at_boeing.com)
Date: 18 March 2003
Found this news:b55gbf$25dcke$1_at_ID-103400.news.dfncis.de in comp.os.ms-windows.programmer.win32:
-= BEGIN forwarded message =-
Subject: Re: Creating a UNICODE text file
From: "Tim Robinson" <tim.at.gaat.freeserve.co.uk_at_nowhere.com>
Newsgroups: comp.os.ms-windows.programmer.win32
"Eric Bolstad" <eric.bolstad_at_sas.com> wrote in message
news:b55b41$6ek$1_at_license1.unx.sas.com...
> Anybody out there have a more informed answer...I'd appreciate it.
The way to do this depends on what encoding you want the data to have. The ANSI/OEM file APIs are a bit of a red herring since they only deal with the way file *names* are translated, not the data in the file itself.
There are three main ways of encoding Unicode in text files: UCS-2, UTF-8 and UTF-7. UCS-2 is raw 16-bit character strings dumped to disk; for example, fwrite(L"Hello, world", sizeof(wchar_t), 12, file), which writes 12 characters of 2 bytes each on Windows.
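The fwrite call above relies on wchar_t being 16 bits wide and on the machine's own byte order, which holds on Windows/Intel but not everywhere. As a rough sketch of the same idea that does not depend on either, the function below writes each 16-bit code unit low byte first. The name write_ucs2_le is made up for illustration; it is not a library function.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write n 16-bit code units to f as little-endian
   byte pairs, regardless of the host's byte order or wchar_t size.
   Returns the number of code units fully written. */
size_t write_ucs2_le(FILE *f, const uint16_t *units, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char pair[2];
        pair[0] = (unsigned char)(units[i] & 0xFF);  /* low byte first */
        pair[1] = (unsigned char)(units[i] >> 8);    /* then high byte */
        if (fwrite(pair, 1, 2, f) != 2)
            break;                                   /* stop on write error */
    }
    return i;
}
```

The file should be opened in binary mode ("wb"), otherwise the C runtime on Windows may translate any 0x0A byte it sees in the output.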
UTF-8 and UTF-7 encode Unicode characters as varying length strings of 8- or 7-bit bytes. The process of encoding UCS-2 (i.e. LPWSTR strings) to UTF-8 and UTF-7 is trivial but fiddly; luckily, Windows NT 4/98 and above accept the code page values CP_UTF8 and CP_UTF7 in WideCharToMultiByte, which then does the conversion for you. Once you've done the conversion it's just a matter of writing the encoded strings to disk.
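To give an idea of what "trivial but fiddly" means for UTF-8, here is a minimal sketch of the encoding done by hand in portable C. It covers UCS-2 input only (no surrogate pairs), and the name ucs2_to_utf8 is invented for this example. On Windows you would normally let WideCharToMultiByte with CP_UTF8 do this instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode n UCS-2 code units as UTF-8 into out,
   which has room for cap bytes. Returns the number of bytes written,
   or 0 if out is too small. Surrogate pairs are not combined, so this
   is only correct for characters in the range U+0000..U+FFFF. */
size_t ucs2_to_utf8(const uint16_t *in, size_t n,
                    unsigned char *out, size_t cap)
{
    size_t i, len = 0;
    for (i = 0; i < n; i++) {
        uint16_t c = in[i];
        if (c < 0x80) {                 /* 1 byte: 0xxxxxxx */
            if (len + 1 > cap) return 0;
            out[len++] = (unsigned char)c;
        } else if (c < 0x800) {         /* 2 bytes: 110xxxxx 10xxxxxx */
            if (len + 2 > cap) return 0;
            out[len++] = (unsigned char)(0xC0 | (c >> 6));
            out[len++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {                        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            if (len + 3 > cap) return 0;
            out[len++] = (unsigned char)(0xE0 | (c >> 12));
            out[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[len++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return len;
}
```

A buffer of 3 * n bytes is always large enough, since no UCS-2 character needs more than three UTF-8 bytes.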
One further point is the Byte Order Mark (BOM), which some programs (including Unicode Notepad) use to identify the file. Recall that UCS-2 writes 16-bit characters to disk. On an Intel machine the bytes will be stored low byte first; on another platform they might be stored the other way round. The BOM character is written at the start of the file and lets programs identify the file as Unicode and work out which encoding and byte order it uses. I found this table in the Platform SDK: (under Base Services/International Features/Unicode and Character Sets/Using Strings and Unicode/Using Special Characters)
Byte-order mark   Description
EF BB BF          UTF-8
FF FE             UTF-16/UCS-2, little endian
FE FF             UTF-16/UCS-2, big endian
FF FE 00 00       UTF-32/UCS-4, little endian
00 00 FE FF       UTF-32/UCS-4, big endian
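A program reading such a file might check the first few bytes along these lines. This is only a sketch: detect_bom is a made-up name, and it assumes the standard mark values, i.e. FF FE for little-endian UTF-16 and FE FF for big-endian. The 4-byte UTF-32 marks have to be tested before the 2-byte UTF-16 ones, because FF FE 00 00 also begins with FF FE.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a name for the encoding indicated by a
   byte-order mark at the start of buf, or "unknown" if there is none.
   Caveat: a UTF-16LE file whose first character after the BOM is
   U+0000 looks exactly like UTF-32LE and will be reported as such. */
const char *detect_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16le[] = { 0xFF, 0xFE };
    static const unsigned char utf16be[] = { 0xFE, 0xFF };

    /* Longest marks first, so UTF-32LE is not mistaken for UTF-16LE. */
    if (len >= 4 && memcmp(buf, utf32le, 4) == 0) return "UTF-32LE";
    if (len >= 4 && memcmp(buf, utf32be, 4) == 0) return "UTF-32BE";
    if (len >= 3 && memcmp(buf, utf8, 3) == 0)    return "UTF-8";
    if (len >= 2 && memcmp(buf, utf16le, 2) == 0) return "UTF-16LE";
    if (len >= 2 && memcmp(buf, utf16be, 2) == 0) return "UTF-16BE";
    return "unknown";
}
```

Many files carry no BOM at all, so "unknown" here does not mean the file is not Unicode.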
If in doubt, paste various Unicode characters into Notepad and save the file using your encoding of choice, then open the file in a hex editor and examine it.
BTW: the ultimate authority on Unicode is the Unicode Consortium, who are at http://www.unicode.org/.
--
Tim Robinson (MVP, Windows SDK)
http://www.themobius.co.uk/
-= END forwarded message =-
--
Bob Crispen
bob.crispen_at_boeing.com
That which does not kill us has made its last mistake.
This archive was generated by hypermail pre-2.1.8 : 26 April 2003 EDT