Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js
It is now rather exactly 25 years ago that I was plagued by the megalomaniac idea to get the whole German vocabulary into the (at that time) huge memory of 64 Kb of a Z80-processor. If you wonder now why I serve you such old cookies here, read on. This is the story of how I succeeded and why this can be of interest for you today.
At that time, I was looking for a commercial application for my (self developed, self constructed) Z80 'machine'. Because it was a rather expensive piece of hardware (a 8" floppy cost around 1000 bucks at that time and there were two of them, the 64k were even more expensive). I thought it ought to bring some money back. After thinking a while, I came up with the strange idea that developing a program that delivers crossword puzzles could be the solution to my problem. (They were made by hand in those days by home workers.) So, I decided to write a crossword puzzle program in assembler. There was no C compiler for the Z80 at that time.
The first and main problem was how to get all those words that make up crossword puzzles into 64 Kb of memory? An ordinary German dictionary has more than 10 Mb of single words, not to mention better dictionaries. It was easy to see that working on disk would make no sense; the output would be one puzzle a month. After thinking and puzzling a while, I found out that at first, happily, I didn't need to put the whole German vocabulary in memory but a rather restricted set of words—let's call them "the crossword puzzle set" of words—and second, that a sorted list of words of a natural language has a rather obvious redundancy: every following word is at least partially contained in the previous word. So, I devised a sheme that allowed me to store only the diverging rest of the following word by simply putting in front of every word the count of letters that stayed the same (later I learned that I had reinvented the wheel, as we have done often those days).
Although this was a perfect compression technique for the given task (compression around 10:1) you can imagine that it wasn't enough because the program itself needed some room and than RTS80 (a CP/M clone), which needed too around 8-10 Kb. So, at first I eliminated the delimiter by or'ing the length of the string into the first byte. Still, not enough. Because it was obvious that I didn't needed the whole ASCII set, but only upper case letters, I put every letter in 5 bits, saving 3 bits per letter. After some other tricks, such as restricting the set of words even further by allowing only words of a special range of length and handling long words separately (this is partly a specialty of German: compound words), I finally managed to get the whole "crossword puzzle set" of words which was around 1 Mb big into 48-50 Kb of memory! This is a compression ratio of 20:1, compared to the typical zip compression ratio of 2:1. Not bad for that time, no? By the way, the bigger the dictionary, the better is the reduction in size with this sort of compression.
Now, let's come to today's problems; they haven't changed much. Two years ago I had the problem of devising a fast search routine for on-disk searches. To search the whole disk physically was not practical; it took more than an hour to search the whole 300 Gb disk. In times of immediate responses by Google, this is in no way acceptable. So, I devised sort of a cache system for this search. But, it showed that the main directory file, which holds all the names of the files (around 3 Mio), grew to an unacceptable size of 168 Mb. So, in half an hour I wrote a C++ class that implements the above-described ancient compression technique, sorted the files on a per directory basis, and voila: The file size was reduced to 19.8 Mb. And, this was quite naturally without any further tricks such as reducing the character set or eliminating the delimiter.... These specialties would only have consumed time, and speed was my goal, after all. Because the search times with this file are below 5 seconds, I felt no need to reduce the size even further. But, if you feel you need better compression, see above. Extra benefit of the whole procedure: reading (and writing!) of the file is much faster than originally and, because disk actions are the most time consuming, this is essential.
Need It More General?
I always wondered why this sort of compression technique isn't used more generally. It's because it brings at least one aspect of all compression techniques to the point: The redundancy that is implicit in every natural language (and other natural givings) is eliminated to the greatest possible extent. Which other compression technique gives you compression ratios of 3:1-4:1 without any sophisticated big programming logic??
In a second pass, the structural redundancy then can be reduced by some well-known methods like LZW, to gain even further compression. Quite naturally, the above described situation is the straightforward application for this algorithm. To have a simple list of (natural) words alphabetically sorted gives the best results. Because English is very similar to German, this is also true for English. Spanish also gives very good results. French is slightly less suited to this compression technique because of the many accents.
But, besides these simple applications, it is possible to convert many other problems to such a format that you get nearly the same compression ratios.
Some examples: A dictionary for two languages. Instead of the conventional form—expression first language, translation second language—you put every language in a separate sorted list of words, compress them as described, and combine them by an index. After compressing the whole complex further by LZW or some other technique, you get the best possible compression in the west.
Second example: a text. Simply sort the words in the text and index them according to the occurence of the words in the text. Zip it. Done. The longer the text is, the better the results. By the way, if you think this to the end, you get an even stronger compression technique that is rather futuristic: a general—public—list of words. Texts are then only indices into this dictionary. Absolutely unbeatable compression ratio!! But, what if a foolish hacker gets hold of this public list of words? Hey, who volunteers to write this science fiction story?
The files you can download are the C++ classes that do the work: compressing by reading and writing in the above described manner. Examples how to use them are included. Hints for further enhancements also.