Unicode, MBCS and Generic text mappings

of Dundas.

Environment: Unicode, MBCS

To allow your programs to be used in international markets, it is worth making your application Unicode or MBCS aware. Unicode is a "wide character" set (2 bytes per character on Windows) designed to cover the characters of every language, including technical symbols and special publishing characters. A multibyte character set (MBCS) uses either 1 or 2 bytes per character and is used for character sets that contain large numbers of different characters (e.g. Asian language character sets).
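To make the "1 or 2 bytes per character" point concrete, here is a minimal sketch that counts characters in a multibyte string. It assumes Shift-JIS (code page 932), where lead bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC; is_sjis_lead_byte and mbcs_char_count are hypothetical helpers written for illustration, not library functions.

```cpp
#include <cstddef>

// Assumption: Shift-JIS (code page 932) lead-byte ranges 0x81-0x9F and 0xE0-0xFC.
// Hypothetical helper, a simplified stand-in for a lead-byte test.
bool is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in a null-terminated Shift-JIS string.
std::size_t mbcs_char_count(const char* s)
{
    std::size_t count = 0;
    while (*s != '\0') {
        // A lead byte followed by a trail byte forms one 2-byte character.
        if (is_sjis_lead_byte(static_cast<unsigned char>(*s)) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        ++count;
    }
    return count;
}
```

For a string holding one ASCII letter followed by two 2-byte characters, the byte length is 5 but the character count is 3, which is why byte-oriented functions give misleading answers on MBCS text.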

Which character set you use depends on the language and the operating system. Unicode requires more space than MBCS, since every character takes 2 bytes. It is, however, faster on Windows NT, which uses Unicode natively: non-Unicode strings passed to and from the operating system must be translated, which incurs overhead. On the other hand, Unicode is not supported on Win95, so MBCS may be a better choice there. Note that applications for the Windows CE environment must be compiled in Unicode.

Using MBCS or Unicode

The best way to use Unicode or MBCS - or indeed even ASCII - in your programs is to use the generic text mapping macros provided by Visual C++. That way you can simply use a single define to swap between Unicode, MBCS and ASCII without having to do any recoding.

To use MBCS or Unicode you need only define either _MBCS or _UNICODE in your project. For Unicode you will also need to specify the entry point symbol in your Project settings as wWinMainCRTStartup. Please note that if both _MBCS and _UNICODE are defined then the result will be unpredictable.

 

Generic Text mappings and portable functions

The generic text mappings replace the standard char and LPSTR types with the generic TCHAR and LPTSTR macros. These macros map to different types and functions depending on whether you have compiled with _UNICODE or _MBCS (or neither) defined. The simplest way to use the TCHAR type is to use the CString class - it is extremely flexible and does most of the work for you.

In conjunction with the generic character type, there is a set of generic string manipulation functions prefixed by _tcs. For instance, instead of using the strrev function in your code, you should use the _tcsrev function which will map to the correct function depending on which character set you have compiled for. The table below demonstrates:

#define                                                  Compiled version            Example
_UNICODE                                                 Unicode (wide-character)    _tcsrev maps to _wcsrev
_MBCS                                                    Multibyte-character         _tcsrev maps to _mbsrev
None (the default: neither _UNICODE nor _MBCS defined)   SBCS (ASCII)                _tcsrev maps to strrev

Each str* function has a corresponding _tcs* function that should be used instead. See the TCHAR.H file for all the mappings and macros that are available, or look up the string function in question in the online help to find the equivalent portable function.

Note: Do not use the str* family of functions with Unicode strings, since Unicode strings are likely to contain embedded null bytes.
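The effect of those embedded null bytes can be sketched in portable C++. The example below uses char16_t as a stand-in for a 2-byte wide character (an assumption made so that it behaves the same on any compiler); strlen_on_wide_bytes and wide_length are hypothetical helpers, not library functions.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: applies strlen to the raw bytes of a 2-byte-per-character
// string, which is what happens if a wide string reaches a str* function via a cast.
std::size_t strlen_on_wide_bytes(const char16_t* wide)
{
    return std::strlen(reinterpret_cast<const char*>(wide));
}

// Counts 2-byte units up to the 2-byte terminator, as the wide-character
// functions do.
std::size_t wide_length(const char16_t* wide)
{
    std::size_t n = 0;
    while (wide[n] != u'\0')
        ++n;
    return n;
}
```

For u"Hello", wide_length reports 5, while the byte-oriented strlen stops at the first zero byte and reports at most 1 (the exact value depends on the byte order of the machine).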

The next important point is that each literal string should be enclosed in the TEXT() (or _T()) macro. This macro prepends an L to the literal string if the project is being compiled in Unicode, and does nothing if MBCS or ASCII is being used. For instance, the string _T("Hello") will be interpreted as "Hello" in MBCS or ASCII, and as L"Hello" in Unicode. If you are working in Unicode and do not use the _T() macro, you may get compiler warnings or errors.
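The mechanism behind the generic type, the generic functions and the literal macro can be sketched without TCHAR.H. In the example below, MY_UNICODE, tchar_t, MY_T and my_tcslen are hypothetical names that mimic _UNICODE, TCHAR, _T() and _tcslen; they are assumptions made for illustration and are not the real Visual C++ definitions.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the TCHAR.H machinery (not the real definitions).
// Define MY_UNICODE (for example with -DMY_UNICODE) to get the wide build.
#ifdef MY_UNICODE
typedef wchar_t tchar_t;
#define MY_T(x) L##x            // pastes an L prefix onto the literal
#define my_tcslen std::wcslen
#else
typedef char tchar_t;
#define MY_T(x) x               // leaves the literal untouched
#define my_tcslen std::strlen
#endif

// The same source line compiles for either character type.
std::size_t greeting_length()
{
    const tchar_t* greeting = MY_T("Hello");
    return my_tcslen(greeting);
}
```

Whichever way the switch is set, greeting_length returns the number of characters, because the type, the literal and the function all change together. That consistency is the main argument for using the generic mappings everywhere rather than mixing them with hard-coded char code.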

Note that you can use ASCII and Unicode within the same program, but not within the same string.

All MFC functions except for database class member functions are Unicode aware.

Converting between Generic types and ASCII

Visual C++ provides a set of very useful macros for converting between different character formats. The basic form of these macros is X2Y(), where X is the source format and Y is the destination format. The possible formats are shown in the following table.

String Type          Abbreviation
ASCII (LPSTR)        A
WIDE (LPWSTR)        W
OLE (LPOLESTR)       OLE
Generic (LPTSTR)     T
Const                C

Thus, A2W converts an LPSTR to an LPWSTR, OLE2T converts an LPOLESTR to an LPTSTR, and so on.

There are also const forms (denoted by a C) that convert to a const string. For instance, A2CT converts from LPSTR to LPCTSTR.

When using the string conversion macros you need to include the USES_CONVERSION macro at the beginning of your function:

void foo(LPSTR lpsz)
{
   USES_CONVERSION;

   ...

   // Convert the ASCII string to the generic character type
   LPTSTR szGeneric = A2T(lpsz);

   // Do something with szGeneric

   ...
}

Two caveats on using the conversion macros:

  1. Never use the conversion macros inside a tight loop. Each conversion allocates memory, so converting on every iteration allocates a lot of memory and results in slow code. It is better to perform the conversion once outside the loop and pass the converted value into the loop.

  2. Never return the result of the macros directly from a function, unless the return value implies making a copy of the data before returning, because the buffer holding the converted string is released when the function returns. For instance, if you have a function that returns an LPTSTR, then do not do the following:
    LPTSTR BadReturn(LPSTR lpsz)
    {
        USES_CONVERSION;

        // do something

        return A2T(lpsz);
    }

    Instead, you should return the value as a CString, which would imply a copy of the string would be made before the function returns:

    CString GoodReturn(LPSTR lpsz)
    {
        USES_CONVERSION;

        // do something

        return A2T(lpsz);
    }
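Both caveats can be illustrated outside ATL with standard C++. The sketch below is an analogue, not the macros themselves: widen and count_matches are hypothetical helpers, and the conversion uses the standard mbstowcs function, which follows the current C locale (plain ASCII input is assumed here).

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: converts a narrow string to a wide one and returns it
// by value, so the caller receives its own copy (the "GoodReturn" pattern).
std::wstring widen(const char* narrow)
{
    // A wide string never needs more characters than the narrow one has bytes.
    std::vector<wchar_t> buffer(std::strlen(narrow) + 1);
    const std::size_t written =
        std::mbstowcs(buffer.data(), narrow, buffer.size());
    if (written == static_cast<std::size_t>(-1))
        return std::wstring();              // input could not be converted
    return std::wstring(buffer.data(), written);
}

// Converts once, outside the loop, then reuses the result on every iteration.
std::size_t count_matches(const std::vector<std::wstring>& items, const char* key)
{
    const std::wstring wideKey = widen(key);   // hoisted conversion
    std::size_t matches = 0;
    for (const std::wstring& item : items)
        if (item == wideKey)
            ++matches;
    return matches;
}
```

Returning std::wstring plays the same role as returning CString in the example above: the data is copied into an object the caller owns before the temporary buffer goes away.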

 

Tips and Traps

- The TRACE statement

The TRACE macro has a few cousins, namely the TRACE0, TRACE1, TRACE2 and TRACE3 macros. These macros let you specify a format string (as in the normal TRACE macro) and either 0, 1, 2 or 3 parameters, without the need to enclose your literal format string in the _T() macro. For instance,

TRACE(_T("This is trace statement number %d\n"), 1);

can be written

TRACE1("This is trace statement number %d\n", 1);

 

- Viewing Unicode strings in the debugger

If you are using Unicode in your application and wish to view Unicode strings in the debugger, then you will need to go to Tools | Options | Debug and click on "Display Unicode Strings".

 

- The Length of strings

Be careful when performing operations that depend on the size or length of a string. For instance, CString::GetLength returns the number of characters in a string, NOT the size in bytes. If you were to write the string to a CArchive object, then you would need to multiply the length of the string by the size of each character in the string to get the number of bytes to write:

   CString str = _T("Hello, World");

   archive.Write( str, str.GetLength( ) * sizeof( TCHAR ) ); 
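The same arithmetic can be sketched in standard C++, with std::wstring and a binary stream standing in for CString and CArchive (an assumption made so the example runs without MFC). write_raw is a hypothetical helper.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes the raw character data of a wide string to a
// stream. The byte count is the character count times the size of one character.
std::size_t write_raw(std::ostream& out, const std::wstring& text)
{
    const std::size_t byteCount = text.length() * sizeof(wchar_t);
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(byteCount));
    return byteCount;
}
```

Passing text.length() alone as the byte count would write only part of the string whenever a character is wider than one byte.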

 

- Reading and Writing ASCII text files

If you are using Unicode or MBCS then you need to be careful when writing ASCII files. The safest and easiest way to write text files is to use the CStdioFile class provided with MFC. Just use the CString class and the ReadString and WriteString member functions and nothing should go wrong. However, if you need to use the CFile class and its associated Read and Write functions, and you use the following code:

   CFile file(...); 

   CString str = _T("This is some text"); 

   file.Write( str, (str.GetLength()+1) * sizeof( TCHAR ) ); 

instead of

   CStdioFile file(...); 

   CString str = _T("This is some text"); 

   file.WriteString(str); 

then the results will be significantly different.

[Figure: two lines of text from files created using the first and second code snippets respectively, viewed in WordPad]
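A likely reason for the difference, assuming a Unicode build, is that the CFile version writes the raw 2-byte characters, so a viewer expecting one byte per character sees a zero byte after every ASCII letter. The sketch below lists those bytes in little-endian order (low byte first); utf16le_hex is a hypothetical helper written for illustration.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: lists the bytes of a 2-byte-per-character string in
// little-endian order, as they would appear in a file written raw.
std::string utf16le_hex(const std::u16string& text)
{
    std::string hex;
    char buf[4];                                   // two hex digits, space, terminator
    for (char16_t unit : text) {
        const unsigned low = unit & 0xFF;          // low byte is written first
        const unsigned high = (unit >> 8) & 0xFF;  // high byte follows
        std::snprintf(buf, sizeof buf, "%02X ", low);
        hex += buf;
        std::snprintf(buf, sizeof buf, "%02X ", high);
        hex += buf;
    }
    if (!hex.empty())
        hex.pop_back();                            // drop the trailing space
    return hex;
}
```

For u"This" the result is "54 00 68 00 69 00 73 00": every letter is followed by a zero byte that a one-byte-per-character viewer cannot display properly.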

- Not all structures use the generic text mappings

For instance, the CHARFORMAT structure, if the RichEditControl version is less than 2.0, uses a char[] for the szFaceName field instead of a TCHAR[] as would be expected. You must be careful not to blindly change "..." to _T("...") without first checking. In this case, you would probably need to convert from TCHAR to char before copying any data to the szFaceName field.
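As a rough sketch of that conversion step in standard C++: NarrowFormat below is a hypothetical stand-in for a structure with a fixed-size char field (it is not the real CHARFORMAT definition), and set_face_name is a hypothetical helper that uses the standard wcstombs function, which follows the current C locale. In MFC code the conversion macros described earlier (for example T2A) could serve the same purpose.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a structure with a fixed-size narrow-character
// field; this is not the real CHARFORMAT definition.
struct NarrowFormat
{
    char szFaceName[32];
};

// Converts a wide face name to narrow characters and copies it into the field.
// Returns false if the name cannot be converted or does not fit.
bool set_face_name(NarrowFormat& fmt, const wchar_t* faceName)
{
    char buffer[sizeof fmt.szFaceName];
    const std::size_t written = std::wcstombs(buffer, faceName, sizeof buffer);
    if (written == static_cast<std::size_t>(-1) || written >= sizeof buffer)
        return false;   // unconvertible, or no room for the terminator
    std::memcpy(fmt.szFaceName, buffer, written + 1);   // include the terminator
    return true;
}
```

Checking the length before copying matters here because the destination field has a fixed size, whatever the character type of the source string.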

 

- Copying text to the Clipboard

This is one area where you may need to use ASCII and Unicode in the same program, since the CF_TEXT clipboard format uses ASCII only. NT systems also offer the CF_UNICODETEXT format if you wish to put Unicode text on the clipboard.

 

- Installing the Unicode MFC libraries

The Unicode versions of the MFC libraries are not copied to your hard drive unless you select them during a Custom installation. They are not copied during other types of installation. If you attempt to build or run an MFC Unicode application without the MFC Unicode files, you may get errors.

(From the online docs) To copy the files to your hard drive, rerun Setup, choose Custom installation, clear all other components except "Microsoft Foundation Class Libraries," click the Details button, and select both "Static Library for Unicode" and "Shared Library for Unicode."

 



Comments

  • Thank you...

    Posted by Legacy on 02/20/2004 12:00am

    Originally posted by: CLAW

    this bit of information...

    "To use MBCS or Unicode you need only define either _MBCS or _UNICODE in your project. For Unicode you will also need to specify the entry point symbol in your Project settings as wWinMainCRTStartup."

    saved me lots of time... Thanks much...

    CLAW

  • read/write unicode characters with a Cfile

    Posted by Legacy on 07/09/2003 12:00am

    Originally posted by: vincent

    how can i write and read japanese characters from a Cfile ?

  • Image manipulation problem

    Posted by Legacy on 02/24/2003 12:00am

    Originally posted by: Saiman Lau

    I need to load a .jpg file into the VC project and display this file in an edit box in the client area of a window (dialog window). In VC++6 I don't know what .h file from the MFC I should include to make my code passing the compiler complaint that the CImage class is not defined.
    I read the book "Teach yourself Visual C++.Net" which described to include "#include <atlimage.h>" in the .cpp file and compile. but this never worked. The compiler complained that the header file could not be found. It looked like that the "altimage.h" is only available in VC++ .Net version, not in VC++6. Any one can sugest how can I get around this?

    Saiman Lau

  • WideCharToMultiByte() vs. CStdioFile

    Posted by Legacy on 09/06/2002 12:00am

    Originally posted by: comiv

    Hi !
    Since I read your article, I thought that I have to use the WideCharToMultiByte() function.
    I use the preprocessor directive _UNICODE in my project. I have to write an ANSI text into a File.
    So is it better to use the WideCharToMultiByte() or the CStdioFile class ?
    Regards
    comiv

  • How do I programatically map ascii char set code points to unicode?

    Posted by Legacy on 08/22/2002 12:00am

    Originally posted by: Andy Pickersgill

    I need to be able to convert character codes from specified fonts in old data files to the new unicode ones .Is there a function to do this or do I have to explicitly convert each character code within the source code. I am working in C++ and mainly using Symbol and Latin characters. Can anyone point me in the right direction?
    
    Thanks
    A.Pickersgill

  • Resilience against Unicode Bugs with WinAPI

    Posted by Legacy on 08/16/2002 12:00am

    Originally posted by: David Hayes

    Every function I have come across in the WinAPI that uses a string based on char has a wide character varient. To protect yourselves against this, (and using MFC has to be avoided at all cost :D):

    Always write hard-coded strings using the _T() macro.
    Check the MSDN documentation for the WinAPI function call, there is almost always an _t macro for it, eg fopen becomes _tfopen, fprintf becomes _ftprintf.

    Even if you never ever expect to write a Unicode program, or cannot test for Unicode (W95/98 etc... peeps), it is always prudent to code in this way. You never know when you will sell your code and make millions, and then cause headaches when people do try and compile in Unicode.

    Of course, you could always use MFC............

    David Hayes
    Programmer
    ABM UK Ltd

    (recently converted a non Unicode program based on WinAPI to Unicode)

  • MSLU for writing single unicode app for all platform

    Posted by Legacy on 07/22/2002 12:00am

    Originally posted by: Crystal

    I hope this would solve some of your problems.

    http://msdn.microsoft.com/msdnmag/issues/01/10/MSLU/print.asp

    cheers
    Crystal

  • Unicode, W95, W98, WMe and IME

    Posted by Legacy on 07/17/2002 12:00am

    Originally posted by: Doctor Luz

    I am working under W98 and when I add unicode support to my program I can't run my program and I obtain the following message.

    "The application or DLL cannot be loaded on Windows 95 or Windows 3.1. It takes advantage of Unicode features available only on Windows NT." Then a second message box pops up saying "The MFC42UD.DLL file cannot start. Check the file to determine the problem."

    My program has an Edit control and I want to make it unicode because I want to activate the microsoft Input method editor in japanese for entering text into this edit control.

    How can I activate IME in an Edit control without using _UNICODE?

    Should I use another type of control?

    Can I add Unicode support to my program under W95, W98 and WMe?

  • Language Localization in VC++

    Posted by Legacy on 06/26/2002 12:00am

    Originally posted by: Seshadri

    I am unix C++ programer. I need to know how to localize the english into the selected language in VC++.Trying to do with that withe resource files but not confident about the concepts of resource files creation.Can you please help me.
