Converting from Code Page to Unicode

Environment: C++, MFC, databases

Overview

Some time ago, I was put in a very new and strange situation. My company built a new financial site based on an old skeleton site. It was necessary to transfer a lot of data from the old database (Access 97) to the new one (Microsoft SQL 7).

What was the problem? All the data from the old DB was written in Greek code page and the new DB must be written in Unicode. Another request from the customer was to have online data in both sites. That means that when someone introduces code page data in a table, this information has to be translated automatically into Unicode data (in the other table) and vice versa.

I lost four days finding in the Microsoft SQL transact language a few very simple functions that make translations between code pages and Unicode characters. Unfortunately, I didn’t find them… Microsoft SQL server supports both Unicode and code page formats, but does not have a translation function between them.

I was sure I would find on the Internet some very simple translation functions or programs. After another two days, I gave up. I found only the Microsoft API functions—WideCharToMultiByte and MultiByteToWideChar—and a very huge program that uses them to translate files.

That’s why I decided to make my own functions.

Details

WideCharToMultiByte and MultiByteToWideChar are well documented on the Microsoft MSDN. I made two simple wrappers for these functions, with some defaults parameters:

bool CDialogDlg::U2CP(
LPCWSTR pUSourceData,
char    pOutData[],
UINT    giUDestinationCodePage,
char    pErrorData[]
)
// pUSourceData           = This is Unicode multibyte string that
//                          has to be translated
// pOutData               = In this variable will be put
//                          the code page translated string
// giUDestinationCodePage = Integer that must contain the
//                          code page number (1253 for Greek)
// pErrorData             = A string that eventually contains
//                          an error.

bool CDialogDlg::CP2U(
LPCSTR   pSourceData,
WCHAR    pOutData[],
UINT     giSourceCodePage,
char     pErrorData[]
         )
// pUSourceData     = This is code page input string that has to
//                    be translated
// pOutData         = In this multibyte variable the unicode
//                    translated string will be put
// giSourceCodePage = Integer that must contain the code page
//                    number (1253 for Greek)
// pErrorData       = A string that eventually contains an error.

From the Microsoft sample, I kept a utility function that is used to allocate memory:

  LPVOID ManageMemory (UINT message, UINT sourcedestination,
                       DWORD nBytes, LPVOID p)

Now, if you want to test the functions, just try the C++ Dialog client:

Change the keyboard settings in the needed language (for example, Greek) and input some words characters on the first edit box. It is possible to adjust the settings of the edit box to see the needed code page characters correctly (by default, you will see some ASCII characters but this is correct—just copy and paste to Word to see that). On the right edit box, you have to enter the code page (1253 for Greek). Push the CP2U button.

Because the C++ edit boxes don’t know Unicode, you will obtain some strange characters in the Unicode edit box (“1 ; ? 1 “). To see the code page data in the third button again, push the U2CP button.

Of course, it is possible to use these functions in many situations:

  • Directly from C++, to manipulate strings from/to files or DB.
  • Encapsulate the functions in a COM component and use it from other programs or language platforms.
  • More…

To use them directly in the Microsoft SQL transact language, you need to encapsulate the functions in two C extended stored procedures. For some deployment reasons, I made two functions: xp_u2cp and xp_cp2u_web.

The Microsoft SQL server gives the user the possibility to make his own functions. To build one:

  • Build an Extended stored procedure Microsoft Visual C++ project.
  • In the wizard, give to the function the same name as the project name.
  • After completing this Wizard, copy XP_U2CP.dll to your SQL Server\Binn directory.
  • Add your new Extended Stored Procedure from Query Analyzer, executing the following SQL command:
  • sp_addextendedproc 'xp_u2cp', 'XP_U2CP.DLL'
  • Will find the function in the master database in the Extended Stored Procedures group . Give him rights if it is necessary!
  • You may drop the extended stored procedure by using the SQL command:
  • sp_dropextendedproc 'xp_u2cp'
  • You may release the DLL from the Server (to delete or replace the file), by using the SQL command:
  • DBCC xp_u2cp(FREE)
  • Access the functions from Query Analyzer or from one stored procedure:
  • Master.dbo.xp_u2cp

A. The Unicode to Code Page Translation

In the picture is a general script that proves the function:

To use it, you have to use a SQL stored procedure:

  Exec spDu2cp N'at?e?s ? ????s?a', 1253

The spDu2cp stored procedure is a wrapper for the xp_u2cp extended procedure. The first parameter is the nvarchar Unicode string and the second is the code page. The stored procedure will print the varchar result. In the spDu2cp stored procedure are used the xp_u2cp extended procedures with some settings.

About the Unicode to code page extended stored procedure parameters: The extended procedure must be used in this way:

  exec master.dbo.xp_u2cp @param1, @param2 OUTPUT, @cp

@param1 = must be varbinary(8000). This will contain the Unicode
          characters in hexadecimals format.
The CAST translation is needed because the extended procedure
         parameters don't know multibyte characters.
         set @param1 = CAST(@u_value AS varbinary(8000))
@param2 = must be  varchar(4000). This output variable will
          contain the code page translated string.
@cp     = must be a int. This parameter contains the needed
          code page (1253  Greek, for example).

B. The Code Page to Unicode Translation

In the picture is a general script that proves the function:

To use it, you have to use a SQL stored procedure:

  Exec spDcp2u 'gica in code page', 1253

The spDcp2u stored procedure is a wrapper for the xp_cp2u_web extended procedure. The first parameter is the varchar string and the second is the code page. The stored procedure will print the nvarchar result. In the spDcp2u stored procedure are used the the xp_cp2u_web extended procedure with some settings.

About the code page to Unicode extended stored procedure parameters: The extended procedure must be used in this way:

  exec master.dbo.xp_cp2u_web @param1, @param2 OUTPUT, @cp

@param1 = must be varchar (2000). This will contain the code
          page characters
@param2 = must be  varbinary(4000). This output variable will
          contain the code page translated string in hexadecimal
          format.
In order to use it, this must be translated in nvarchar Unicode
         format.
The CAST translation is needed because the extended procedure
         parameters don't know multibyte characters.
  set @param1 = CAST(@u_value AS varbinary(8000))
@cp     = must be a int. This parameter contains the needed code
          page (1253  Greek, for example).

These functions have to get their data and put them directly in/from Unicode (nvarchar)/codepage (varchar) data tables. The string variables in the example are only for testing. Microsoft Query Analyzer SQL editor only knows Unicode, so you cannot give a real codepage string parameter to these functions (the editor converts the input string into Unicode format).

Steps made inside of C extended procedure:

  • I looked at each parameter received and I made some tests regarding their type and dimension:
  • //Returns the data type of a remote stored procedure call
    //parameter
    srv_paramtype (srvproc, 1 );
    // Returns the maximum data length of a remote stored
    //procedure call parameter.
    srv_parammaxlen (srvproc, 1 );
    //Returns the data length of a remote stored procedure call
    //parameter.
    srv_paramlen (srvproc, 1 );
    
    
  • After that I retrieved the input parameter:
  • // Returns the value of a remote stored procedure call
    // parameter.
    srv_paramdata (srvproc, 1 );
    
  • I made the Unicode translation:
  •   U2CP( pUSourceData, pOutData, iCodePage, pErrorData )
  • I put the output parameter:
  • // Sets the value of a remote stored procedure call return
    // parameter.
    srv_paramset(srvproc, 2, (void*)pOutData, strlen(pOutData)
    

Inside of the extended procedure C code, I kept some utility functions samples from Microsoft:

// send XP usage info to client
void printUsage (SRV_PROC *srvproc)

// send szErrorMsg to client
void printError (SRV_PROC *srvproc, CHAR* szErrorMsg)

//send a simple message to the client sql console
void printMessage (SRV_PROC *srvproc, DBCHAR* szMsg)

Installation

  • Use the Dialog.exe sample C++ program directly
  • Copy the XP_U2CP.dll and XP_CP2U_web.dll over to your SQL Server \Binn directory
  • Add your new Extended Stored Procedures from Query Analyzer, executing the following SQL commands:
    • sp_addextendedproc ‘xp_u2cp’, ‘XP_U2CP.DLL’
    • sp_addextendedproc ‘xp_cp2u_web’, ‘XP_CP2U.DLL’
  • Access the functions from Query Analyzer or from one stored procedure:
    • Master.dbo. xp_u2cp
    • Master.dbo. xp_cp2u_web
  • Made the wrappers stored procedure for these extended procedure with spDcp2u.sql and spDu2cp.sql SQL transaction scripts in Query Analyzer.

Downloads

Download demo project – 90 Kb
It also provides a sample C++ client that tests the u2cp and cp2u conversions functions.

More by Author

Get the Free Newsletter!

Subscribe to Data Insider for top news, trends & analysis

Must Read