Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js
by Marcus Tucker
On the forums that I participate in (and at work) I often see ASP code making extensive use of string concatenation, and although I want to point out why it's bad and how to improve it, I usually bite my tongue and say nothing because to cover the topic in sufficient depth would take more time than I want to spend! It is disappointing to see that so many developers are unaware of the performance implications of doing so, and in other cases where inefficient techniques are used I would normally provide a link to a relevant article, but so far I haven't yet come across one that does this particular subject justice, which is why I thought that I'd write one myself!
Now, as you can see, the AddToDelimitedList() sub uses VBScript string concatenation to tack each value onto the end of the delimited list, and in use this function would be called many times to add items to the list. During development (and perhaps testing), this seems to perform fine, but that's because there's only a few items in the list - only test data is being used. When the code becomes used in a production environment, it could easily cause a major performance because unfortunately VBScript's string concatenation peformance has an N-squared cost, i.e. the time taken to add each additional substring is proportional to the length of the entire string, and thus the total time taken to perform N concatenations follows an exponential curve, which quickly comes to dwarf the execution time for the rest of the code in the script. More on this later.
Of course, the delimited string manipulation code above is merely one example of where you might need to concatenate strings, but it's an interesting example because it's all-to-easy to call code like this from within a loop without realising that it could become a major bottleneck in your application's performance once the usage and/or volume of data increases significantly. Testing during development rarely accurately simulates real-world usage, so performance hits like this may only become apparent once the system is in place... and then it can often be too late to fix! Therefore, you *must* keep an eye out for hidden gotchas like this throughout the development process. So what can you do about it? Read on...
Years ago when I first became aware of the problem I rolled my own VBScript class which used arrays and the Join() function to provide huge efficiency improvements over the native concatenation performance - a technique I had seen in a tutorial somewhere. It was a great improvement, but still had its own problems which were caused by the need to continually ReDim the class's internal array - this became the bottleneck. I later refined this technique by doubling the size of the array with each reallocation, which of course continually halves the frequency of reallocation (at the expense of greater memory use, but that's not usually a problem with web servers), and therefore improves performance quite a bit. This has since always been my preferred technique for concatenating strings together, and the one that I have recommended to others in my capacity as Sitepoint's "ASP Guru", a position I have held for the last couple of years.
However, with the advent of ADO v2.5, some new and interesting objects became available, most significantly the Stream object, which replaces the antiquated (and extremely inefficient) file reading and writing functionality provided by the FSO (FileSystem Object), and adds native binary support (as well as the ability to convert between different data types and character sets). I use the Stream object all the time for manipulation of bulk binary data and text, since it can easily be used as an in-memory buffer (in the same way that a disconnected Recordset object can be used as a simple in-memory flat file database), but I hadn't thought to compare it to my array-based string concatenation class to see if it offered any performance improvements. I therefore wrote a new class to encapsulate the functionality, and a benchmark script to compare the three approaches - native VBScript concatenation, array joining, and the stream object. The results of these tests form the basis for the rest of this article. I should also mention that there are other solutions in the form of 3rd party freeware COM+/ActiveX components (such as Michael Balloni's StrCat.Catter), but as we all know, getting components installed on shared hosting space is nigh on impossible (for good reasons, but that's another story), so it's probably only a viable option for dedicated servers that are yours to do with as you please. And besides, from a portability perspective, it's usually wise to leverage the standard toolset (ADO in this case) as much as possible rather than to use non-standard/custom solutions.
All testing was performed on a spare PC dedicated to running IIS on Windows 2000 Server, and connected to my desktop PC via 100Mb Ethernet. Further details of the machines are irrelevant to these tests since it is the relative performance of figures within each set of results that is important, but be assured that the machine acting as the web server was not modified in any way during the time it took to perform all the tests.
To start with, here are the results of the benchmarks with a substring length of 100 characters, going up to a maximum of 2000 iterations (i.e. for every iteration, a 100-character string was appended to the end of the string being built):
|Substring length: 100|
Note that the resolution of the PC clock is +/- approx. 15.6 milliseconds, and therefore timings in the same order of magnitude have a high innacuracy (i.e. accuracy is poor for times below approx. 150ms).
The results in Table A clearly demonstrate that the native VBScript concatenation is extremely inefficient, taking 1.3 seconds to do what the array method can do in a mere 16 milliseconds (approx.)! The stream method takes a similar amount of time as the array method, but due to the innacuracy of the measurements at such short timings, it would be unwise to attempt to infer any more from these results.
|Substring length: 100|
Note that because of the long execution times that the native VBScript concatenation incurs, my benchmark script detects step execution times over 2 seconds and skips subsequent timings for that method. Note also that the table does display a couple of timings over 2 seconds and has odd increments of iterations - this is because I have appended results from another run with a different iteration range (but the keeping the substring size the same so that the tests are fair and scientific).
By increasing the iterations, this picture becomes clearer - table B shows that once the number of iterations has been increased, the native method takes a ludicrous 43.5 seconds for 10000 iterations while the array method is 400x quicker, taking a mere 110ms to the same, and the stream method close behind with 172ms.
However, there is another important factor to consider - the size of the substring that is being appended each time. So far all the tests have used a substring 100 characters long - what happens if it's smaller?
|Substring length: 10|
Table C shows that the native method performs better with smaller substrings - 10000 iterations in this test took 1.3 seconds, compared to taking 43.5 seconds with a substring of 100 characters (see the previous table). This is because the main cause of the method's performance hit is the overhead of copying around large blocks of memory, so the smaller the substring, the less memory used / copied, and the quicker the time. The array method is still faster than the stream method, but it's now almost 2x as fast! This is most likely to be because the mere act of accessing or calling a method on a COM object incurs overhead, and at small substring sizes these overheads become proportionally significant when compared to the cost of actually storing the 10 character substring once the COM method has been invoked.
|Substring length: 2000|
When the substring length is increased, the tables are turned - table D shows that the stream-based method takes almost a third of the time of the array method to build the string. Again, lest we forget, the array method is approx 180x faster than native, and the stream method is approx 540x faster! Some quick math shows that the final string is 20 million characters long - which to be fair is quite a large volume of data to manipulate - and while it's safe to say that it must occupy a minimum of 9.5MB of memory, since each method stores it in a different way it's impossible to say *exactly* how much actually is used to store it (without extensive further investigation - unicode further complicates matters, for example). Regardless, what *is* clear is that with large substrings the continual ReDim'ing of the array becomes more expensive than simply writing more characters to the end of the in-memory stream. I suspect that the stream object uses a linked list internally to store data across non-contiguous blocks of dynamically allocated memory instead of reallocating memory as a single block each time, since the latter case is of course the Achilles' heel of the native concatenation.
So, we've compared the three approaches, but what have we learned? Well, if nothing else, it should be crystal clear that VBScript string concatenation should be avoided at all costs! That said, it is safe to use native string concatenation when you *know* that only a fixed number of substrings will be concatenated together, such as when you're dynamically constructing a dynamic SQL statement, or assembling a line of HTML to be written directly to an output (file, response, etc). However, as a general rule of thumb, in all other cases, and particularly when the volume of the source data (in data size or number of items) is likely to increase over time, if the number of records which are processed to generate the output are subject to control by the end user (e.g. a selectable search results limit), or you are incrementally generating output which isn't written directly to the output (Response object, disk, etc), you must give due consideration to this issue and use a scalable string building technique.
And which technique should you use? As the results demonstrate, it all depends on exactly *what* you are appending. For short substrings, the array-based methods have the edge, but as you increase the length of the substrings, the stream-based method becomes the winner. Therefore, you should choose your method depending on the nature and volume of the data that you are using. To generalise, since most applications are unlikely to require the frequent appending of thousands of substrings which are each thousands of characters in length, the array method will probably do you just fine in 99% of cases. Incidentally, if you are already using an array-based VBScript concatenation class, check the code to make sure that it is NOT ReDimming the array each time a new substring is added - it's significantly more expensive to use that approach (as I mentioned at the start of the article), so edit it so that it uses the lazy length-doubling method, or simply switch to using my code (see end).
As well as ensuring that you are now fully aware of the performance issues associated with using native string concatenation, hopefully this article has also made you realise that you should try to test your code with real-world data, and to examine alleged performance-enhancing techniques carefully, since whether or not you get a significant gain (or perhaps even a decrease in performance) all depends on the precise circumstances in which you use a given technique - in particular, what data you are passing through it, and how often it gets called.
Finally, the code for the benchmark and concatenation classes is available here: StringConcatenation.zip (2 KB), and you are free to use the code in your own work. I ask only that you leave my credits in place. I hope you've found this interesting reading, and it's always nice to know whether advice has paid off for people, so let me know your feedback, particularly if as a result you've implemented changes which have made a significant difference to the performance of your scripts / web applications.
An example of poorly performing code using the native method - CodingForums
The internal workings of VB/VBScript native string concatenation - Microsoft MSDN