Culture SENSITIVE regex split?


dy13
July 19th, 2005, 12:50 PM
Hi, I'm trying to split a database of English and Chinese sentences into arrays of individual words using Regex.Split. The problem is, English words get separated by spaces while Chinese words don't. This gets even more confusing when Chinese and English words exist in the same sentence.

Is there a way for Regex to automatically detect the language and perform the proper splits accordingly? Thanks so much!

dy13
July 19th, 2005, 04:58 PM
Took me a while but I figured it out! For anyone else interested in multilanguage splits, here's my way:

Use Match instead of Split
E.g.

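// [A-Za-z]+ grabs a whole English word; \w grabs any other single word character
Regex* rg = new Regex(S"[A-Za-z]+|\\w");
// Match returns only the first token; call match->NextMatch() to step through the rest
Match* match = rg->Match(yourString);

The [A-Za-z]+ part matches English words as whole runs of letters; the \w part matches any remaining word character one at a time, which picks up non-English characters (Chinese, Japanese, etc.) individually. You can also use [A-Za-z0-9]+ to keep attached numbers with the word.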
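
For anyone who wants to try the idea quickly, here is a rough sketch of the same match-instead-of-split approach in Python (chosen only because it is easy to run; it is not the .NET code). The helper name tokenize and the sample sentences are made up for illustration. Python's \w is Unicode-aware for str patterns, which is what this relies on. The letter class is written out as [A-Za-z] because the range A-z also covers a few punctuation characters such as [ and _. Note that this returns Chinese text one character at a time, not as multi-character words.

```python
import re

# English letters are grabbed as whole runs; any other word character
# (e.g. a Chinese or Japanese character) is returned one at a time.
TOKEN = re.compile(r"[A-Za-z]+|\w")

# Variant that keeps attached digits together with the English word.
TOKEN_WITH_DIGITS = re.compile(r"[A-Za-z0-9]+|\w")


def tokenize(text, pattern=TOKEN):
    """Return every token in text, in order (hypothetical helper)."""
    return pattern.findall(text)


print(tokenize("I love 北京 very much"))
print(tokenize("room101 在三楼", TOKEN_WITH_DIGITS))
```

findall collects every match in one call, which plays the same role as looping over the matches on the .NET side.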

Working so far...