A UTF-16 Class for Reading and Writing Unicode Files

Introduction

As Unicode becomes more popular, programmers will find themselves performing more file-based operations using Unicode. Currently, familiar MFC classes such as CFile and CStdioFile do not properly handle reading and writing a Unicode file. The class file presented addresses the need to read and write files as UTF-16 Unicode files.

Downloads

There are two downloads associated with this article. OriginalCUtf16File.zip is the February 2005 release of the code. RevisedCUtf16File.zip is Jordan Walter's revision of the original code. Jordan's enhancements include bug fixes and improved support for Unicode projects. Both downloads provide a test harness.

Using the Code

During construction, or when Open() is called, the class checks the file size and then examines the first two bytes of the file. A byte order mark (BOM, U+FEFF) in either byte order indicates that the file is UTF-16 encoded: the bytes 0xFF, 0xFE on disk mark a little-endian file, and 0xFE, 0xFF mark a big-endian one. If either is found, m_bIsUnicode is set to TRUE, and m_bByteSwapped records whether the byte order is reversed relative to the host. If no BOM is present, the class calls CStdioFile::Seek( 0, CFile::begin ) to return the consumed bytes.

CStdioFile::Read( &wcBOM, sizeof( WCHAR ) );

if( wcBOM == UNICODE_BOM ) {

   m_bIsUnicode   = TRUE;
   m_bByteSwapped = FALSE;
}

if( wcBOM == UNICODE_RBOM ) {

   m_bIsUnicode   = TRUE;
   m_bByteSwapped = TRUE;
}

// Not a BOM mark - treat it as an ANSI file
//   and defer to CStdioFile...
if( FALSE == m_bIsUnicode ) {

   CStdioFile::Seek( 0, CFile::begin );
}
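
The same check can be expressed on raw bytes, without depending on how the host reads a WCHAR. The following is a minimal sketch in portable C++, not code from the class; the names Utf16Bom and DetectUtf16Bom are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not part of CUTF16File.
enum class Utf16Bom { None, LittleEndian, BigEndian };

// Classify the first two bytes of a buffer by the byte order of U+FEFF.
Utf16Bom DetectUtf16Bom(const std::uint8_t* data, std::size_t size)
{
    if (data == nullptr || size < 2) {
        return Utf16Bom::None;          // too short to hold a BOM
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return Utf16Bom::LittleEndian;  // reads as 0xFEFF on a little-endian host
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return Utf16Bom::BigEndian;     // reads as 0xFFFE on a little-endian host
    }
    return Utf16Bom::None;              // no BOM: treat the data as ANSI
}
```

A caller would treat None the way the class treats a missing BOM: rewind and defer to the ANSI path.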

ReadString(...) works as follows. If m_bIsUnicode is FALSE, the class defers to the corresponding CStdioFile::ReadString(...) overload. If the file is UTF-16 encoded, CUTF16File::ReadString( CString& rString ) draws from an internal accumulator until a "\r" or "\n" is encountered. The CUTF16File::ReadString( LPWSTR lpsz, UINT nMax ) overload duplicates the behavior of CStdioFile::ReadString(). See the fgets() documentation for the underlying semantics.
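
To illustrate the idea of draining an accumulator up to a line terminator, here is a rough sketch in portable C++. It is not the class's implementation: the helper name ReadLineFromAccumulator is hypothetical, char16_t stands in for WCHAR, and treating "\r\n" as a single terminator is an assumption made for this example.

```cpp
#include <list>
#include <string>

// Hypothetical helper; the real class keeps its accumulator as a member.
// Moves one line out of the accumulator into 'line', dropping the terminator.
// Returns false only when the accumulator was already empty.
bool ReadLineFromAccumulator(std::list<char16_t>& accumulator,
                             std::u16string& line)
{
    line.clear();
    if (accumulator.empty()) {
        return false;
    }
    while (!accumulator.empty()) {
        const char16_t ch = accumulator.front();
        accumulator.pop_front();
        if (ch == u'\n') {
            return true;
        }
        if (ch == u'\r') {
            // Assumption: "\r\n" counts as one terminator.
            if (!accumulator.empty() && accumulator.front() == u'\n') {
                accumulator.pop_front();
            }
            return true;
        }
        line.push_back(ch);
    }
    return true;  // final line had no terminator
}
```

Dropping the terminator matches what the CString overload of CStdioFile::ReadString() does; the buffer-based overload keeps the newline, as fgets() does.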

The read is accomplished through an accumulator, which is an STL list of WCHARs. When filling the accumulator, the class swaps the bytes of each WCHAR if the stream is big-endian (BOM bytes 0xFE, 0xFF on disk).
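
A sketch of filling such an accumulator is shown below, again in portable C++ with hypothetical names (FillAccumulator, the bigEndian parameter). Instead of swapping bytes in place, it assembles each code unit from its two bytes, which gives the same result as a swap on a little-endian host and also works on a big-endian one.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

// Hypothetical helper: append raw UTF-16 bytes to the accumulator as code units.
// 'bigEndian' plays the role of the class's byte-swapped flag.
// A trailing odd byte is ignored.
void FillAccumulator(std::list<char16_t>& accumulator,
                     const std::uint8_t* bytes, std::size_t size,
                     bool bigEndian)
{
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        const std::uint8_t first = bytes[i];
        const std::uint8_t second = bytes[i + 1];
        // Big-endian: high byte comes first. Little-endian: low byte comes first.
        const std::uint16_t unit = bigEndian
            ? static_cast<std::uint16_t>((first << 8) | second)
            : static_cast<std::uint16_t>((second << 8) | first);
        accumulator.push_back(static_cast<char16_t>(unit));
    }
}
```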

Writing to a file is accomplished by extending the normal function to WriteString( LPCTSTR lpsz, BOOL bAsUnicode ). If bAsUnicode is FALSE, CStdioFile handles the ANSI conversion internally, so CUTF16File simply defers to CStdioFile. If bAsUnicode is TRUE, the class writes the BOM (if the file position is 0) and then calls CFile::Write(...).
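
The Unicode write path can be sketched as follows. This is an illustration under stated assumptions, not the class's code: a std::vector of bytes stands in for the file, an empty vector stands in for "file position is 0", the output is always little-endian, and the name WriteUtf16Le is hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the file: a growing byte buffer.
// Appends 'text' as UTF-16LE, preceded by a BOM if nothing was written yet.
void WriteUtf16Le(std::vector<std::uint8_t>& file, const std::u16string& text)
{
    if (file.empty()) {
        // U+FEFF in little-endian byte order.
        file.push_back(0xFF);
        file.push_back(0xFE);
    }
    for (const char16_t unit : text) {
        file.push_back(static_cast<std::uint8_t>(unit & 0xFF));         // low byte
        file.push_back(static_cast<std::uint8_t>((unit >> 8) & 0xFF));  // high byte
    }
}
```

Writing the BOM only at the start keeps later calls from embedding stray U+FEFF characters in the middle of the file.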

The driver program opens two files on the hard drive, writes out both Unicode and ANSI text files, and then reads the files back in. It then uses OutputDebugString(...) to write messages to the debugger's output window.

CUTF16File output1( L"unicode_write.txt",
                    CFile::modeWrite | CFile::modeCreate );
output1.WriteString( L"Hello World from Unicode land!", TRUE );
output1.Close();

...

CString szInput;
CUTF16File input1( L"unicode_write.txt", CFile::modeRead );
input1.ReadString( szInput );

Figure 1 shows the result of writing a test file with the provided driver program. Notice that the BOM bytes appear swapped on disk: U+FEFF is stored as 0xFF, 0xFE because the file is little-endian.

Figure 1: Result of test program.

Figure 2 examines a similar file created with Notepad on Windows 2000 while saving the file as Unicode.

Figure 2: A Unicode sample created in Notepad.

Additional Reading

  • http://www.unicode.org/.
  • International Programming for Microsoft Windows by D. Schmitt, ISBN 1-57231-956-9.
  • Programming Windows with MFC by J. Prosise, ISBN 1-57231-695-0.
  • Programming Server-Side Applications for Microsoft Windows 2000 by J. Richter and J. Clark, ISBN 0-73560-753-2.

Revisions

  • 12.23.2006 Added Jordan Walter's Improvements and Bug Fixes
  • 02.10.2005 Original Release

Checksums

OriginalCUtf16File.zip
  MD5: 696F5C035367A70E5F53B1EC7572FDD2
  SHA-1: 180354760E120319F813EC618DDB0BC8BA96DEF2
  RIPEMD-160: AA9157FC548795237E15EC29B94E426892689A61
  SHA-256: 03D6E6C9E0D3C4EB3C0FB328F15C72006417393C31CAE1508EAFAB1165228E01

RevisedCUtf16File.zip
  MD5: 8F87A671C7EEB935A9833B32860189D9
  SHA-1: C3D7868C67EE9BF0EF888C8AC64433658DBC0172
  RIPEMD-160: 283F9D07F3AA904D42E5EE67D31822C1B95B61A1
  SHA-256: 3B13D5503420C4C9851C62A8951ED3388E2976F16E1342F0CFBF5FD8C4EAE9D8



About the Author

Jeffrey Walton

In the past, I have worked as an IT consultant for County Government (Anne Arundel County), the Nuclear Energy Institute, the Treasury Department, and Social Security Administration as a Network Engineer and System Administrator. Primary Administration experience includes Microsoft Windows and Novell Netware, with additional exposure and familiarity with Mac and Linux OSes. Previous to the US government, I was a programmer for a small business using Microsoft Visual Languages (Basic 5.0, 6.0, and C++ 5.0, 6.0) and Scripting Languages. An undergraduate degree (BS in Computer Science) was obtained from University of Maryland, Baltimore County. Graduate work includes a Masters of Science (Computer Science) from Johns Hopkins University (expected before 2009). Training and Certifications include Microsoft, Checkpoint, and Cisco.

Comments

  • Couple of questions

    Posted by Mike Pliam on 01/15/2009 03:20pm

    What is the 'provided driver program' and where can I find it? I can get the 'original' version to load and save Unicode to a CRichEditCtrl (RichEdit20W) using VC 2005 C++, but the 'revised' version doesn't work. Have I mistakenly exchanged the two? Notice that the OutputDebugString for Unicode fails in the 'revised' version, reflecting the fact that it only converts the BOM.

  • UNICODE compile flag may not be needed.

    Posted by jjwalters on 04/27/2005 11:39am

    I have made the class work so that it does not need the UNICODE pre-processor flag to be specified in order to work.  I have not checked that it will work in a UNICODE build, but I can confirm that it works in a non-Unicode build.  I make use of TCHARs and the conversion macros (T2W etc.) to allow the code to remain unchanged regardless of whether the UNICODE pre-processor flag is specified or not.
    I invite the author to contact me so that I can send him my source code so that he can a) verify that it works in both types of build, and b) if it does work, update the source code download available to other developers.
    
    Cheers,
    Jordan.
