Internationalize and Localize Your C/C++ Code with ICU
To paraphrase Jane Austen: "It is a truth universally acknowledged, that a successful application in possession of a good customer base must be in want of an internationalization strategy."
All joking aside, internationalization and localization are key parts of the maturation process of an application, whether it is deployed throughout a single enterprise or through a diverse customer base.
The Problem
The history of computing is littered with good ideas for encoding characters from various languages. Even for something as "straightforward" as English, I've had to deal with several different encodings in my career: ASCII, EBCDIC, and FIELDATA. When you expand your domain of interest to Asian languages—for example, Japanese—you find a wide variety of choices, including SJIS and JEUC.
The basic problem is the same "glyph" (or character) has a different representation (coding) depending on your language locale. The Unicode manifesto offers a helpful starting place to avoid the mire of this alphabet soup: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
In practice, this means each character gets its own numeric code point, which ICU stores in 16-bit units rather than 8-bit bytes, though there is a lot more to it than just that. This article looks at how the International Components for Unicode (ICU) library can keep you from being hopelessly mired in the glyph alphabet soup.
ICU: International Components for Unicode
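One part of that "lot more": Unicode code points run up to U+10FFFF, so a single 16-bit unit cannot hold every character. In UTF-16, the form ICU's UnicodeString uses, code points above U+FFFF are stored as a pair of 16-bit units called surrogates. The following is a minimal sketch of that arithmetic in standard C++; toUtf16 is a hypothetical helper written for illustration, not an ICU function, and it assumes its input is a valid code point.

```cpp
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid code point (at most 0x10FFFF, not itself a surrogate).
std::vector<char16_t> toUtf16(char32_t cp) {
    std::vector<char16_t> units;
    if (cp < 0x10000) {
        // Fits in one 16-bit unit as-is.
        units.push_back(static_cast<char16_t>(cp));
    } else {
        // Subtract 0x10000, leaving a 20-bit value split into two 10-bit halves.
        char32_t v = cp - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

For example, U+0041 ("A") comes back as the single unit 0x0041, while U+1F600 comes back as the two units 0xD83D and 0xDE00. This is why code that indexes a UTF-16 string one unit at a time is not always stepping one character at a time.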
ICU is a mature, portable set of open source C/C++ and Java libraries for Unicode support and software internationalization. (Don't be put off by the fact that IBM provides the funding and support for ICU; it really is an open source initiative in the best sense.) It gives applications the same results on all platforms. Its major features revolve around locale-sensitive string comparison, formatting, text boundary detection, and character set conversion. So, what exactly does ICU offer?
- Text: Unicode text handling, full character properties, and character set conversions (500+ code pages)
- Analysis: Unicode regular expressions, full Unicode sets, and character, word, and line boundaries
- Comparison: Language-sensitive collation and searching
- Transformations: Normalization, upper/lowercase, and script transliterations
- Locales: Comprehensive data and resource bundle architecture (200+ locales)
- Complex Text Layout: Arabic, Hebrew, Indic, and Thai
- Formatting and Parsing: Multi-calendar and time zone, dates, times, numbers, currencies, and messages
Examples in this article require that you download a version of ICU. Although they show code in Win32 environments, rest assured that ICU has precompiled binaries for AIX 5.2, HPUX 11.11, Red Hat Linux 3.0, Solaris 9, and Visual Studio .NET 2003. With source code in hand, you can build your own binaries for these environments plus Mac OS X, Cygwin, MinGW, BSD, QNX, and many other popular platforms. Running the built-in test suite is highly recommended whenever creating your own binaries.
ICU Text Boundary Analysis
Although many people will use ICU just for codeset conversion and localization, both are difficult to demonstrate in a short demo. Instead, this article delves into another useful capability that ICU provides: text boundary analysis.
Text boundary analysis is the process of locating linguistic indicators when formatting or parsing text. For example, if the user double-clicks into the middle of a word in a word processor, you have to be able to figure out where the word starts and ends to highlight it. Other applications might include automatic capitalization, word counts, or constructing a concordance. The demo shows how a BreakIterator object can solve these problems independent of language encoding. An ICU BreakIterator can locate boundaries of characters, words, line-breaks, and sentences.
To begin, install ICU into a convenient location; this example uses D:\ICU. The Visual Studio .NET 2003 command-line to build the application looks like this:
cl wrap.cpp /EHsc /Z7 /ID:\icu\include /link
/LIBPATH:d:\icu\lib icuuc.lib /debug
And, of course, you must placate your old friend, the DOS PATH:
path=%path%;d:\icu\bin
The first thing the demo program needs to do is establish a locale. A "locale" includes information about the user's language, his or her country, and possibly other preferences. For example, "en_US" specifies English and USA conventions (for collation, currency, and calendar format), whereas "en_IE_PREEURO" specifies English in Ireland ("IE" is the ISO 3166 country code) with the pre-euro currency (the Irish pound, IEP). For simplicity, the demo hardcodes the arguments to the Locale constructor, although the proper method would be to read the locale from the machine environment:
Locale myLoc("en", "US");
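To make the structure of a locale identifier concrete, here is a small sketch in standard C++ that splits an ID such as "en_IE_PREEURO" into its language, country, and variant fields. LocaleParts and splitLocaleId are hypothetical names invented for this illustration; ICU's Locale class does its own (more thorough) parsing, and this sketch only assumes the simple underscore-separated form shown above.

```cpp
#include <cstddef>
#include <string>

// The three fields of a simple locale ID (names are illustrative only).
struct LocaleParts {
    std::string language;
    std::string country;
    std::string variant;
};

// Split "language_COUNTRY_VARIANT" on underscores; missing fields stay empty.
LocaleParts splitLocaleId(const std::string& id) {
    LocaleParts parts;
    const std::size_t npos = std::string::npos;

    std::size_t first = id.find('_');
    parts.language = id.substr(0, first);   // whole string if there is no '_'
    if (first == npos) {
        return parts;
    }

    std::size_t second = id.find('_', first + 1);
    if (second == npos) {
        parts.country = id.substr(first + 1);
    } else {
        parts.country = id.substr(first + 1, second - first - 1);
        parts.variant = id.substr(second + 1);
    }
    return parts;
}
```

With this helper, "en_US" yields language "en" and country "US" with an empty variant, and "en_IE_PREEURO" yields "en", "IE", and "PREEURO".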
Although you probably are used to thinking in terms of one character = one glyph (displayable symbol), this is not the case when you wander very far from English. For example, the glyph "ö" (lowercase "o" with an umlaut) could legally be represented by a single Unicode character or, less obviously, by the "o" followed by a second code for the umlaut. Indeed, this method of construction is what allows many Asian languages to be represented without their permutations overrunning the 65,536 values available in a 16-bit representation.
This demo's code uses a line-break iterator to figure out how to word-wrap a paragraph:
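The two spellings of "ö" can be written out directly as UTF-16 literals. The sketch below, in standard C++ with no ICU dependency, counts the characters that would occupy their own space on screen by skipping combining marks. Both function names are hypothetical, and isCombiningDiacritic is a deliberately rough stand-in that recognizes only the Combining Diacritical Marks block (U+0300 through U+036F); the demo program relies on ICU's u_charType() for the real classification.

```cpp
#include <cstddef>
#include <string>

// Rough stand-in for a character-property lookup: true only for the
// Combining Diacritical Marks block, U+0300..U+036F.
bool isCombiningDiacritic(char16_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Count code units that are not combining marks, as an approximation of
// how many glyph positions the string occupies.
std::size_t countVisibleCharacters(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t c : s) {
        if (!isCombiningDiacritic(c)) {
            ++count;
        }
    }
    return count;
}
```

The precomposed form u"\u00F6" holds one code unit and the decomposed form u"o\u0308" holds two, yet countVisibleCharacters() reports one for each. That gap between storage length and visible length is the reason the word-wrap code treats non-spacing marks specially.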
#include "unicode/utypes.h"
#include "unicode/uchar.h"
#include "unicode/unistr.h"
#include "unicode/locid.h"
#include "unicode/brkiter.h"
#include <iostream>   // for cout

using namespace std;  // for cout
using namespace icu;  // ICU's C++ classes live in namespace icu

// true for white space, control characters (which includes
// newlines), and non-spacing marks: none of these are visible
// at the end of a line
static UBool isInvisibleAtLineEnd(UChar c) {
    return u_isspace(c)
        || u_charType(c) == U_CONTROL_CHAR
        || u_charType(c) == U_NON_SPACING_MARK;
}

int32_t wrapParagraph(const UnicodeString& s,
                      const Locale& locale,
                      int32_t lineStarts[],
                      int32_t trailingwhitespace[],
                      int32_t maxLines,
                      int32_t maxCharsPerLine,
                      UErrorCode &status) {
    int32_t numLines = 0;
    int32_t p = 0, q;

    // a non-positive width would keep the loop below from advancing
    if (maxCharsPerLine <= 0) {
        status = U_ILLEGAL_ARGUMENT_ERROR;
        return 0;
    }

    BreakIterator *bi =
        BreakIterator::createLineInstance(locale, status);
    if (U_FAILURE(status)) {
        delete bi;
        return 0;
    }
    bi->setText(s);

    // stop at the end of the text or when the result arrays are full
    while (p < s.length() && numLines < maxLines) {
        // jump ahead in the paragraph by the maximum number
        // of characters that will fit
        q = p + maxCharsPerLine;

        // if this puts us on a white space character, a
        // control character, or a non-spacing mark, seek
        // forward and stop on the next character that is not
        // any of these things (since none of these characters
        // will be visible at the end of a line, we can ignore
        // them for the purposes of figuring out how many
        // characters will fit on the line); the bounds check
        // comes first so we never read past the end of s
        while (q < s.length() && isInvisibleAtLineEnd(s[q])) {
            ++q;
        }

        // then locate the last legal line-break decision
        // at or before the current position
        // ("at or before" is what causes the "+ 1")
        q = bi->preceding(q + 1);

        // if this causes us to wind back to (or before) where
        // we started, then the line has no legal line-break
        // positions. Record the start of this line, then break
        // it at the maximum number of characters
        if (q <= p) {
            lineStarts[numLines] = p;
            trailingwhitespace[numLines] = 0;
            ++numLines;
            p += maxCharsPerLine;
        }
        // otherwise, we got a good line-break position.
        // Record the start of this line (p) and then seek
        // back from the end of this line (q) until you find
        // a non-white space character (same criteria as
        // above) and record the number of white space
        // characters at the end of the line in the other
        // results array
        else {
            lineStarts[numLines] = p;
            int32_t nextLineStart = q;
            for (q--; q > p; q--) {
                if (!isInvisibleAtLineEnd(s[q])) {
                    break;
                }
            }
            trailingwhitespace[numLines] = nextLineStart - q - 1;
            p = nextLineStart;
            ++numLines;
        }
    }
    delete bi;
    return numLines;
}

int main(int argc, char **argv)
{
    const int MAX_LINES = 255;
    int32_t numLines, maxLines = MAX_LINES, lineStarts[MAX_LINES],
            trailingwhitespace[MAX_LINES];
    UErrorCode status = U_ZERO_ERROR;

    // adjacent string literals are joined by the compiler
    UnicodeString s1("Eschew obfuscation intentionally "
                     "recreating penultimate epistemological "
                     "valuation");
    UnicodeString s2 = s1 + s1 + s1;  // create a somewhat
                                      // longer string for
                                      // testing
    Locale myLoc("en", "US");

    numLines = wrapParagraph(s2, myLoc, lineStarts,
                             trailingwhitespace, maxLines,
                             70, status);
    if (U_FAILURE(status)) {
        cout << "wrapParagraph failed: " << u_errorName(status) << endl;
        return 1;
    }
    for (int ii = 0; ii < numLines; ii++)
        cout << "Line " << ii << " starts at pos "
             << lineStarts[ii] << endl;
    return 0;
}
The main() program simply builds up a string called s2, which contains the nonsense phrase that you are going to format into word-wrapped lines. Next, you create your locale as mentioned previously. Then, you call the wrapParagraph() function, which contains a lot of inline comments to explain its purpose.
You can see from the output below that the line lengths vary quite a bit from the requested 70 characters. This is because a line can break only between words, and the shortest word used was six letters:
D:\icu>wrap
Line 0 starts at pos 0
Line 1 starts at pos 56
Line 2 starts at pos 125
Line 3 starts at pos 195
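The start positions are easier to read once they are turned back into lines of text. The following sketch, in standard C++, slices a string at the reported positions. It uses std::string as a plain-ASCII stand-in for UnicodeString so that it runs without ICU, and sliceLines is a hypothetical helper, not part of the demo. It also assumes the test phrase is single-spaced with no trailing space, which is what the positions above imply.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Cut text into lines; line i runs from lineStarts[i] up to the next
// start position, and the last line runs to the end of the text.
// Assumes lineStarts is sorted and every entry is within the text.
std::vector<std::string> sliceLines(const std::string& text,
                                    const std::vector<int>& lineStarts) {
    std::vector<std::string> lines;
    for (std::size_t i = 0; i < lineStarts.size(); ++i) {
        std::size_t begin = static_cast<std::size_t>(lineStarts[i]);
        std::size_t end = (i + 1 < lineStarts.size())
            ? static_cast<std::size_t>(lineStarts[i + 1])
            : text.size();
        lines.push_back(text.substr(begin, end - begin));
    }
    return lines;
}
```

Called with the tripled test phrase and the positions 0, 56, 125, and 195, this returns four lines of 56, 69, 70, and 48 characters, counting the space that trails each of the first three. The second line begins with "epistemological", the 15-letter word that would not fit at the end of the first line.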
Just the Tip of the ICU-berg
The ICU library can do a lot more than this, of course. It has a powerful regular expression library, a system for optimizing localization resources, a way to produce locale-friendly messages and numeric strings, normalization and transliteration services, and a whole lot more. Check it out.
Book Recommendation
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard by Richard Gillam is a great way to start your education in globalization issues. The book covers the history of Unicode, normalization forms, storage, and serialization. It pays particular attention to implementation techniques such as conversion, searching and sorting, and rendering. Of course, no Unicode book would be complete without covering salient aspects of the world languages of Europe, the Middle East, Africa, and Asia.
About the Author
Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.
