Internationalize and Localize Your C/C++ Code with ICU

To paraphrase Jane Austen: "It is a truth universally acknowledged, that a successful application in possession of a good customer base must be in want of an internationalization strategy."

All joking aside, internationalization and localization are key parts of the maturation process of an application, whether it is deployed throughout a single enterprise or through a diverse customer base.

The Problem

The history of computing is littered with good ideas for encoding characters from various languages. Even for something as "straightforward" as English, I've had to deal with several different encodings in my career: ASCII, EBCDIC, and FIELDATA. When you expand your domain of interest to Asian languages—for example, Japanese—you find a wide variety of choices, including SJIS and JEUC.

The basic problem is that the same "glyph" (or character) has a different representation (coding) depending on your language locale. The Unicode manifesto offers a helpful starting place to avoid the mire of this alphabet soup: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

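Unicode is often thought of as a unique 16-bit number in place of an 8-bit one, but there is a lot more to it than that: code points run from U+0000 to U+10FFFF, and ICU stores text as UTF-16, where most characters in common use fit in a single 16-bit unit and the rest take two. This article looks at how the International Components for Unicode (ICU) library can keep you from being hopelessly mired in the glyph alphabet soup.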
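To make the "more than 16 bits" point concrete, here is a minimal sketch of how a code point maps onto UTF-16 code units. The function name toUtf16 is made up for this illustration and is not part of ICU; it also assumes the input is a valid code point and does no error checking.

```cpp
#include <string>

// Hypothetical helper (not an ICU API): encode one Unicode code
// point as UTF-16. Assumes cp is a valid code point in the range
// U+0000..U+10FFFF and is not itself a surrogate value.
std::u16string toUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit is enough
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the remaining 20 bits into
        // a high surrogate and a low surrogate
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

With this sketch, U+00F6 ("ö") comes back as a single unit, whereas U+20000 (a CJK ideograph outside the first 65,536 code points) comes back as the pair D840 DC00. Any code that counts "characters" by counting 16-bit units has to keep that in mind.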

ICU: International Components for Unicode

ICU is a mature, portable set of open source C/C++ and Java libraries for Unicode support and software internationalization. (Don't be put off by the fact that IBM provides the funding and support for ICU; it really is an open source initiative in the best sense.) It gives applications the same results on all platforms. Its major features revolve around locale-sensitive string comparison, formatting, text boundary detection, and character set conversion. So, what exactly does ICU offer?

  • Text: Unicode text handling, full character properties, and character set conversions (500+ code pages)
  • Analysis: Unicode regular expressions, full Unicode sets, and character, word, and line boundaries
  • Comparison: Language-sensitive collation and searching
  • Transformations: Normalization, upper/lowercase, and script transliterations
  • Locales: Comprehensive data and resource bundle architecture (200+ locales)
  • Complex Text Layout: Arabic, Hebrew, Indic, and Thai
  • Formatting and Parsing: Multi-calendar and time zone, dates, times, numbers, currencies, and messages

Examples in this article require that you download a version of ICU. Although they show code in Win32 environments, rest assured that ICU has precompiled binaries for AIX 5.2, HPUX 11.11, Red Hat Linux 3.0, Solaris 9, and Visual Studio .NET 2003. With source code in hand, you can build your own binaries for these environments plus Mac OS X, Cygwin, MinGW, BSD, QNX, and many other popular platforms. Running the built-in test suite is highly recommended whenever creating your own binaries.

ICU Text Boundary Analysis

Although many people will use ICU just for codeset conversion and localization, both are difficult to show in a short demo. Instead, this article delves into another useful piece of functionality that ICU provides: text boundary analysis.

Text boundary analysis is the process of locating linguistic indicators when formatting or parsing text. For example, if the user double-clicks into the middle of a word in a word processor, you have to be able to figure out where the word starts and ends to highlight it. Other applications might include automatic capitalization, word counts, or constructing a concordance. The demo shows how a BreakIterator object can solve these problems independent of language encoding. An ICU BreakIterator can locate boundaries of characters, words, line-breaks, and sentences.

To begin, install ICU into a convenient location; this example uses D:\ICU. The Visual Studio .NET 2003 command line to build the application looks like this:

cl wrap.cpp /EHsc /Z7 /ID:\icu\include /link /LIBPATH:d:\icu\lib icuuc.lib /debug

And, of course, you must placate your old friend, the DOS PATH:

path=%path%;d:\icu\bin

The first thing the demo program will need to do is establish a locale. A "locale" includes information about the user's language, his or her country, and possibly other preferences. For example, "en_US" specifies English and USA conventions (for collation, currency, and calendar format), whereas "en_IE_PREEURO" specifies English in Ireland ("en" is the ISO 639 language code and "IE" is the ISO 3166 country code) with the pre-euro currency (the Irish pound, IEP). For this demo, simply hardcode the arguments to the Locale constructor, although the proper method would be to read the locale out of the machine environment:

Locale myLoc("en", "US");

Although you probably are used to thinking in terms of one character = one glyph (displayable symbol), this is not the case when you wander very far from English. For example, the glyph "ö" (lowercase "o" with an umlaut) could legally be represented by a single Unicode character (U+00F6) or, less obviously, by the "o" followed by a second code for the umlaut (U+0308, the combining diaeresis). Indeed, this method of construction is what allows many Asian languages to be represented without their permutations overrunning the 65,536 values available in a 16-bit representation.
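A short sketch shows why this matters in code. The two spellings of "ö" are different sequences of 16-bit units, so a plain comparison treats them as different strings. The function composeOUmlaut below is a toy written for this illustration only: it handles exactly one letter-plus-mark pair and ignores every other combination. In a real application you would let ICU's normalization services do this for all characters.

```cpp
#include <cstddef>
#include <string>

const char16_t kSmallO            = 0x006F;  // "o"
const char16_t kCombiningDiaeresis = 0x0308; // the umlaut mark
const char16_t kSmallOUmlaut      = 0x00F6;  // precomposed "ö"

// Toy composer (hypothetical, not an ICU API): replaces each
// "o" + combining diaeresis pair with the precomposed "ö".
// It is not a general Unicode normalizer.
std::u16string composeOUmlaut(const std::u16string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == kSmallO && i + 1 < in.size()
                && in[i + 1] == kCombiningDiaeresis) {
            out.push_back(kSmallOUmlaut);
            ++i;   // skip the combining mark just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Here u"Ko\u0308ln" has five code units and u"K\u00F6ln" has four, yet both display as "Köln". After passing the first through composeOUmlaut, the two compare equal. Note that U+0308 is a non-spacing mark, which is why the word-wrap code in this article treats that character type specially.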

This demo's code uses a line-break iterator to figure out how to word-wrap a paragraph:

#include "unicode/uchar.h"
#include "unicode/brkiter.h"
#include <iostream>     // for cout
using namespace std;    // for cout
using namespace icu;    // needed with newer ICU releases

// true for characters that are not visible at the end of a
// line: white space, control characters (which include
// newlines), and non-spacing marks
static bool isInvisible(UChar c) {
    return u_isspace(c)
        || u_charType(c) == U_CONTROL_CHAR
        || u_charType(c) == U_NON_SPACING_MARK;
}

int32_t wrapParagraph(const UnicodeString& s,
                   const Locale& locale,
                   int32_t lineStarts[],
                   int32_t trailingwhitespace[],
                   int32_t maxLines,
                   int32_t maxCharsPerLine,
                   UErrorCode &status) {

    int32_t        numLines = 0;
    int32_t        p=0, q;

    BreakIterator *bi =
       BreakIterator::createLineInstance(locale, status);
    if (U_FAILURE(status)) {
        delete bi;
        return 0;
    }
    bi->setText(s);
    // stop at the end of the text or when the result arrays
    // are full, whichever comes first
    while (p < s.length() && numLines < maxLines) {
        // jump ahead in the paragraph by the maximum number
        // of characters that will fit
        q = p + maxCharsPerLine;

        // if this puts us on an invisible character, seek
        // forward and stop on the next visible one (since
        // invisible characters do not show at the end of a
        // line, we can ignore them for the purposes of
        // figuring out how many characters will fit)
        while (q < s.length() && isInvisible(s[q])) {
            ++q;
        }

        // then locate the last legal line-break decision
        // at or before the current position
        // ("at or before" is what causes the "+ 1");
        // the end of the text is always a legal break
        if (q >= s.length()) {
            q = s.length();
        } else {
            q = bi->preceding(q + 1);
        }

        // if this causes us to wind back to (or past) where
        // we started, then the line has no legal line-break
        // positions. Record the start of this line, then
        // break it at the maximum number of characters
        if (q <= p) {
            lineStarts[numLines] = p;
            trailingwhitespace[numLines] = 0;
            ++numLines;
            p += maxCharsPerLine;
        }
        // otherwise, we got a good line-break position.
        // Record the start of this line (p) and then seek
        // back from the end of this line (q) until you find
        // a visible character (same criteria as above) and
        // record the number of white space characters at
        // the end of the line in the other results array
        else {
            lineStarts[numLines] = p;
            int32_t nextLineStart = q;
            for (q--; q > p; q--) {
                if (!isInvisible(s[q])) {
                    break;
                }
            }
            trailingwhitespace[numLines] = nextLineStart - q - 1;
            p = nextLineStart;
            ++numLines;
        }
    }
    delete bi;
    return numLines;
}



int main(int argc, char **argv)
{

  const int MAX_LINES=255;
  int32_t numLines, maxLines=MAX_LINES, lineStarts[MAX_LINES],
     trailingwhitespace[MAX_LINES];
  UErrorCode status=U_ZERO_ERROR;

  // adjacent string literals are joined by the compiler, so
  // each piece ends with the space that separates the words
  UnicodeString s1 = "Eschew obfuscation intentionally "
                     "recreating penultimate epistemological "
                     "valuation";
  UnicodeString s2 = s1 + s1 + s1;    // create a somewhat
                                      // longer string for
                                      // testing
  Locale myLoc("en", "US");
  numLines = wrapParagraph(s2, myLoc, lineStarts,
                           trailingwhitespace, maxLines,
                           70, status);
  if (U_FAILURE(status)) {
    cerr << "wrapParagraph failed: " << u_errorName(status)
         << endl;
    return 1;
  }
  for (int ii=0; ii<numLines; ii++)
    cout << "Line " << ii << " starts at pos "
         << lineStarts[ii] << endl;

  return 0;
}

The main() program simply builds up a string called s2, which contains the nonsense phrase that you are going to format into word-wrapped lines. Next, you create your locale as mentioned previously. Then, you call the wrapParagraph() function, which contains a lot of inline comments to explain its purpose.

You can see from the output below that the line lengths vary quite a bit from the requested 70 characters. This is because a line can break only between words, and the shortest word used was six letters:

D:\icu>wrap
Line 0 starts at pos 0
Line 1 starts at pos 56
Line 2 starts at pos 125
Line 3 starts at pos 195
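The function hands back only indices, so the caller still has to cut the text into lines. Here is a minimal sketch of that step. To keep it free of ICU dependencies it uses std::string as a stand-in for UnicodeString, which is only reasonable for plain ASCII text like the demo phrase; the name extractLines and the use of std::vector instead of raw arrays are choices made for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turn the two result arrays from
// wrapParagraph() into the visible text of each line.
// lineStarts[i] is where line i begins, and
// trailingWhitespace[i] is how many invisible characters
// sit at the end of line i.
std::vector<std::string> extractLines(
        const std::string& text,
        const std::vector<int>& lineStarts,
        const std::vector<int>& trailingWhitespace) {
    std::vector<std::string> lines;
    for (std::size_t i = 0; i < lineStarts.size(); ++i) {
        std::size_t start = lineStarts[i];
        // a line runs up to the start of the next line, or to
        // the end of the text for the last line
        std::size_t end = (i + 1 < lineStarts.size())
                              ? lineStarts[i + 1]
                              : text.size();
        // drop the white space that would hang off the end
        end -= trailingWhitespace[i];
        lines.push_back(text.substr(start, end - start));
    }
    return lines;
}
```

Fed the starting positions shown above, the first line comes out as "Eschew obfuscation intentionally recreating penultimate", which is 55 characters: the next word, "epistemological", would have pushed it past 70.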

Just the Tip of the ICU-berg

The ICU library can do a lot more than this, of course. It has a powerful regular expression library, a system for optimizing localization resources, a way to produce locale-friendly messages and numeric strings, normalization and transliteration services, and a whole lot more. Check it out.

Book Recommendation

Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard by Richard Gillam is a great way to start your education in globalization issues. The book covers the history of Unicode, normalization forms, storage, and serialization. It pays particular attention to implementation techniques such as conversion, searching and sorting, and rendering. Of course, no Unicode book would be complete without covering salient aspects of the world languages of Europe, the Middle East, Africa, and Asia.

About the Author

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.



