The subject of this article is encoding Kazakh language characters in OS400. It may seem that there is no need to investigate this problem, because it must be covered by a national standard: leading vendors have developed their strategies, and all questions will be solved in time.
The problem of national language support in the former USSR republics grew more important as the new countries built their own state machinery. The main language in all republics was Russian. After the Soviet Union collapsed, some of the new states joined the EU; they developed new standards for national language encoding and registered them with ISO, giving software developers what they needed. Other countries moved from the Cyrillic alphabet to the Roman one, which turned the issue into an internal problem. The Republic of Kazakhstan also seriously considered moving to the Roman alphabet, but a special committee analyzed the expense and time involved in such a move and closely scrutinized other countries' experiences (for example, Turkey's); the project was postponed until conditions are more suitable. Thus, people remained in the Cyrillic universe and must use today's standard, RK 1048-2002, having nothing better at this time. This standard was created in 2002; its first goal was to remove the chaos that existed among the different encodings used for ASCII, ANSI, and UNICODE. This was well described.
The RK 1048-2002 standard defines two encodings: UNICODE and ANSI. The first (2-byte) encoding happily coincides with the existing ISO standard, but the second (1-byte) is local and not registered with ISO. Leading vendors (Microsoft and IBM) support Kazakh-UNICODE, but not Kazakh-ANSI. The latter is in practice supported by a group of volunteers who offer proprietary drivers, fonts, conversion tables, and procedures.
Additionally, there were few IBM AS400 computers in CIS countries in 2002. This explains why the RK 1048-2002 standard did not try to link with EBCDIC encoding and AS400 applications.
The company I work for bought an AS400 (IBM i-series with OS400) and a major banking system (Equation). All seemed okay until support for Kazakh was requested. All our applications and databases use a simple single-byte encoding, namely CCSID 1025. The main AS400 applications installed work via a 5250 terminal emulation program from the IBM Client Access package. This is a Windows application (more precisely, a Java program) that performs its own international support, and the Kazakh language is not on its list.
Here is what we have: On the AS400 side, data and modules use CCSID 1025 (EBCDIC Cyrillic); on the client side (workstation), the code page is CP_1251 (Windows Cyrillic). All the necessary conversions are made automatically, according to the type of interaction implemented (ODBC, ADO, JDBC, data transfer to and from the host, file data sharing, and so on).
Today's encoding of the Kazakh language is actually an extension of Cyrillic, and it seems reasonable to exploit this relation. I mean, one may reuse some CP1025 code points to represent Kazakh letters instead of the characters they originally stand for. I tested the RK 1048-2002 standard as follows. The C_1251.nls file on a workstation was replaced by the one needed for Kazakh standard support on Windows (it is included in the KAZWIN version 3.0 driver package), the necessary keyboard layout was added, and data was entered and transferred to an AS400 table. The data was then retrieved by a query on the 5250 terminal screen. It turned out that the Kazakh letters Ө and ө were mapped into control characters (called field modifiers), which corrupted the displayed information. The effect is shown in Figure 1. The standard RK 1048-2002 encoding is shown in Figure 2.
Figure 1: The impact of the control symbols.
Figure 2: Standard RK 1048-2002 encoding.
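The collision can be looked at away from the AS400. The sketch below assumes Python 3.5 or later, whose standard library includes a `kz1048` codec for RK 1048-2002 and a `cp1251` codec; the helper name `misread_as_cp1251` is made up for illustration. It encodes Kazakh letters under the standard and then reads the same bytes as Windows Cyrillic, which is roughly what a converter built around CP_1251 sees. Whether the characters it sees also exist in CP1025 is what decides if they survive the trip to the host; the code does not check that part.

```python
def misread_as_cp1251(text: str) -> str:
    """Encode text per RK 1048-2002, then read the same bytes as Windows Cyrillic."""
    raw = text.encode("kz1048")
    # errors="replace" guards against byte values that cp1251 leaves undefined.
    return raw.decode("cp1251", errors="replace")


if __name__ == "__main__":
    # Kazakh-specific letters; Russian letters are left out on purpose.
    for letter in "ӘҒҚҢӨҰҮҺәғқңөұүһ":
        raw = letter.encode("kz1048")
        seen = misread_as_cp1251(letter)
        print(f"{letter} U+{ord(letter):04X} -> byte 0x{raw[0]:02X} -> cp1251 reads {seen!r}")
```

Running it prints, for every Kazakh letter, the character a CP_1251-based conversion would take it for. Russian letters come back unchanged because both code pages place them at the same byte values.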
The investigation continued and a productive idea was found. The idea had already been used for encoding Kazakh letters in the KAZWIN driver version 2.5 package: Kazakh letters take the codes that normally hold letters of other Cyrillic subfamilies that Kazakh does not need, such as Serbian, Macedonian, and others. This works because those languages are supported by CP1025 (EBCDIC Cyrillic), so every letter is mapped to a letter during the conversion from CP_1251 (Windows Cyrillic) to CP1025 (EBCDIC Cyrillic) and vice versa. And, voilà, everything works perfectly. The only thing left to do was to create a sorting table and an uppercase table on the AS400 side; this is easily done with the corresponding OS400 service. Thus, my task can be resolved by installing CP 1251k (see Figure 3) from the KAZWIN version 2.5 package on every PC.
Figure 3: CP 1251k.
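To see which code points such a scheme has to choose from, one can list the Cyrillic letters in Windows Cyrillic that neither Russian nor Kazakh needs. This is only a sketch: the function name `spare_cyrillic_slots` and the `KEEP` set are made up for illustration, and the code does not reproduce the actual CP 1251k assignment, which is given only in Figure 3. It also cannot tell which of the candidates exist in CP1025, because Python's standard library has no codec for that code page.

```python
import unicodedata

# Letters that must stay where they are: the Russian alphabet plus І/і,
# which Kazakh Cyrillic also uses.
_RUSSIAN_UPPER = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
KEEP = set(_RUSSIAN_UPPER) | set(_RUSSIAN_UPPER.lower()) | {"І", "і"}


def spare_cyrillic_slots() -> dict:
    """Map cp1251 byte value -> Cyrillic letter that Russian and Kazakh do not need."""
    slots = {}
    for byte in range(0x80, 0x100):
        # Undefined byte values decode to an empty string and are skipped.
        ch = bytes([byte]).decode("cp1251", errors="ignore")
        if ch and ch not in KEEP and unicodedata.name(ch, "").startswith("CYRILLIC"):
            slots[byte] = ch
    return slots


if __name__ == "__main__":
    for byte, ch in spare_cyrillic_slots().items():
        print(f"0x{byte:02X} {ch} {unicodedata.name(ch)}")
```

The output is a list of candidates such as the Serbian and Macedonian letters. Picking the ones that CP1025 also contains, and pairing them with Kazakh letters, is the design decision that the KAZWIN 2.5 table embodies.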
- Supporting Kazakh as an extension of Russian, on both Windows and OS400, makes it possible to buy software created for Russia, and to draw on Russian experience, with no additional changes or programming.
- It gives access to the latest solutions for Russia released by IBM and third parties.
But using CP1251k on one PC and the standard RK 1048-2002 on the others at the same time is very uncomfortable. So there is a pressing need to correct the national standard, use CP1251k instead of the encoding offered in RK 1048-2002, and register this new standard with ISO! Of course, this may lead to some losses, but how large will they be?
Costs: ANSI-coded data will need to be converted. This may be done by saving it in UNICODE, making the transition, and saving it again in the new ANSI encoding. One may use conversion programs; there are many of them, and they are easy to write. It is expected that some programs using proprietary sorting procedures, case changing, and raster fonts will have to be rewritten.
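The conversion through UNICODE described above can be sketched in a few lines. The assumptions are stated plainly: Python's standard library has a `kz1048` codec for RK 1048-2002 but none for CP1251k, so the example uses `ptcp154`, another single-byte Kazakh Cyrillic code page that Python ships, purely as a stand-in target. The function name `transcode` is made up for illustration.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert single-byte data from one code page to another via Unicode."""
    text = data.decode(source)   # old single-byte layout -> Unicode
    # Strict error handling: a character missing from the target code page
    # raises UnicodeEncodeError instead of being silently replaced.
    return text.encode(target)


if __name__ == "__main__":
    sample = "Қазақстан, өнер, ұлт"
    old_bytes = sample.encode("kz1048")
    # "ptcp154" only stands in for CP1251k, which has no codec in Python.
    new_bytes = transcode(old_bytes, "kz1048", "ptcp154")
    print(old_bytes.hex(" "))
    print(new_bytes.hex(" "))
    print(new_bytes.decode("ptcp154") == sample)
```

For a real migration the target would be a conversion table built from the CP1251k layout. Keeping the error handling strict is a deliberate choice here, so that any character without a place in the new code page is noticed during conversion rather than afterwards.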
I should mention that IBM has its own vision of how to support the Kazakh language. It has created the EBCDIC CCSID 01166, and many applications support UNICODE. See the Kazakh language support by IBM. But Kazakh ANSI is not supported until it is registered by ISO.
Losses in the Case of Leaving Today's Standard as is
- There is no way to use OS400 software created for Russia.
- Such software has to be specially ordered.
- Such software needs additional support.
- There is an unavoidable lag behind the state of the art, and costs for software, development, and support rise, because the RK market is small.
- IBM and other vendors bear large expenses for Kazakh language support. (The list of IBM products mentioned above contains about 400 packages and a large number of third-party products.)
The ideas used here may also be applicable to the support of other languages.
Questions to the Reader
- What do you think about this solution?
- Do you recommend that RK accept these ideas and change their standard?
Your comments and advice are welcome here: email@example.com.
You may also send any questions and requests, and your arguments for and against changing the standard, to the same address.
Turmukhambetov Radmir N.
Monday, February 11, 2008