XML: yes, it is an eXtremely Magic Language. It has created an industry-standard way to represent data, express data relationships, and interact with data, and that standard has been widely accepted. Every computing platform today has some sort of XML support. XML became a universal standard because XML + HTTP = SOAP and Web services, and because it makes the idea of interoperability very real.
WSE supports the Direct Internet Message Encapsulation (DIME) protocol that defines a mechanism for passing attachments in a SOAP message. It is often necessary for a Web service to send a large file of text or binary data, such as an image file, in a SOAP message. SOAP messages, by default, are not necessarily a good mechanism for transporting these large files because they are plain-text XML. To be added to a SOAP message, a file must be serialized into XML, which could be more than double the size of the original file. The DIME protocol solves that problem by defining a mechanism for placing the entire contents of the original file outside the SOAP envelope, eliminating the need to serialize the file into XML.
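In a baseline Web service, we use base64 encoding to embed a binary file in a SOAP message. This does not work well when we want to attach a large file such as a WAV file or a big image file, because the encoded text is larger than the original binary data and must travel inside the XML envelope.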
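To put a rough number on the encoding cost, here is a minimal Python sketch that measures how much base64 inflates a binary payload. The function name `base64_overhead` and the random 3 MB payload are illustrative assumptions, and the figure covers base64 alone, not the surrounding XML envelope.

```python
import base64
import os


def base64_overhead(payload: bytes) -> float:
    """Return the encoded size divided by the original size."""
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload)


# Random bytes stand in for a binary attachment such as an image or WAV file.
payload = os.urandom(3_000_000)
print(f"base64 size ratio: {base64_overhead(payload):.2f}x")
```

Base64 maps every 3 input bytes to 4 output characters, so the ratio is about 4/3 regardless of the file's content.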
There is another commonly used mechanism for packaging multiple pieces of data together besides DIME. That is the Multipurpose Internet Mail Extensions (MIME) specification and, in particular, MIME multipart. MIME multipart has been used for a number of purposes, but its most common use is sending e-mail messages with attachments. It makes perfect sense that the problem that DIME addresses could also be solved by using MIME. In fact, there is an earlier specification called SOAP with Attachments that was created with the help of Microsoft; it uses MIME multipart to solve the attachments problem for SOAP messages. So, why was DIME created in the first place?
First of all, it is important to understand some key differences between MIME multipart and DIME. Like DIME, MIME multipart has a number of "data records" that each have a header and a payload. Instead of using data lengths to indicate where the next header begins, MIME multipart uses a separator string. The separator string is provided at the beginning of a MIME message, and then is inserted between each data record. Code parsing a MIME message must look through the data until it finds the separator string, at which point it knows that it has found the next data record.
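The separator-string approach can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a conforming MIME parser: it ignores per-part headers and the rule that a boundary must start a line, and the function name `split_on_boundary` is hypothetical.

```python
def split_on_boundary(body: bytes, boundary: bytes) -> list[bytes]:
    """Split a multipart-style body into parts by scanning for the separator."""
    delimiter = b"--" + boundary
    parts = []
    pos = body.find(delimiter)
    while pos != -1:
        start = pos + len(delimiter)
        # "--boundary--" marks the end of the message.
        if body.startswith(b"--", start):
            break
        # The only way to find the end of this part is to search the data.
        nxt = body.find(delimiter, start)
        if nxt == -1:
            break
        parts.append(body[start:nxt].strip(b"\r\n"))
        pos = nxt
    return parts


message = b"--XYZ\r\nfirst\r\n--XYZ\r\nsecond\r\n--XYZ--\r\n"
print(split_on_boundary(message, b"XYZ"))
```

Note that each call to `find` has to walk the payload bytes, which is the cost the following paragraphs discuss.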
The MIME multipart approach is designed to be efficient for the sender, who writes a MIME message into some stream of data. The sender does not need to know how much data it is sending before it sends it. It simply streams the data until it reaches the end and then appends the separator string.
The problem with this approach is the inefficiency it can cause on the receiving side. If you are receiving a data record, you have no idea how large the data record is. You can guess how large a receiving buffer to allocate, but you will invariably have to deal with situations where there is more data than you allocated. Then you have to reallocate your buffer, still with no idea whether it will be large enough for the incoming data.
A related problem is the difficulty of simply finding the data record boundaries. For instance, an application reading a compound document may not even be interested in the next three data records in a particular stream. With a MIME approach, however, it still has to inspect each byte of those three data records. Checking each byte to see whether it begins the boundary string is excruciatingly painful compared to the DIME approach, where you can step directly from record to record because DIME includes the length of the data in each data record header.
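The record-hopping idea can be illustrated with a length-prefixed format. The sketch below assumes a hypothetical layout of one 4-byte big-endian length followed by the payload; it is not the actual DIME wire format, which carries additional header fields. The names `pack_records` and `read_nth_record` are illustrative.

```python
import struct

# Hypothetical header: a single 4-byte big-endian payload length.
HEADER = struct.Struct(">I")


def pack_records(payloads):
    """Concatenate payloads, each preceded by its length."""
    return b"".join(HEADER.pack(len(p)) + p for p in payloads)


def read_nth_record(buf: bytes, n: int) -> bytes:
    """Return record n (0-based), skipping earlier records by length alone."""
    offset = 0
    for _ in range(n):
        (length,) = HEADER.unpack_from(buf, offset)
        # Jump over the header and payload without examining the payload.
        offset += HEADER.size + length
    (length,) = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    return buf[start:start + length]


stream = pack_records([b"a" * 10, b"bb", b"ccc", b"target"])
print(read_nth_record(stream, 3))
```

Skipping three records here costs three header reads, however large the skipped payloads are, and the receiver also knows exactly how large a buffer to allocate before reading a payload.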
Certainly, the buffer size and boundary searching problems are solvable with a MIME-based solution. MIME is very flexible in how its data record headers can be used, and there are numerous situations where people have added content-length headers to their MIME data records. But now, the question arises: Why are there separator strings if we already know the length of the data record? Also, if you mandate adding a content-length MIME header to the data records, you now have lost the flexibility provided by the sender being able to simply stream the data until its end. It will have to know the complete size of the data record before it can even start to send it.
Of course, needing to know the size of the data before sending it is also a solvable problem. Just as DIME has support for chunking, a chunking mechanism could be defined within MIME multipart as well. But think about what the solution we are describing would now look like.
Basically, we have come to the conclusion that a MIME multipart solution should 1) include content lengths of the data records, which means that 2) the separator string delineating data records is no longer needed, and 3) a chunking mechanism would have to be defined to resolve the unknown data size issues. If you imagine trying to resolve all these issues within the MIME multipart framework, it is not much of a stretch to think that you would start to wonder about a simpler solution to these problems. If you throw in the added benefit of the ease of parsing the fixed length portion of a DIME record header and the simplicity of hopping between records in a DIME message, it is not hard to understand the appeal of the DIME approach.
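To make the chunking idea concrete, here is a minimal Python sketch in which a sender frames data as it arrives, without knowing the total size, while every chunk still carries its own length. The header layout (a 1-byte "more chunks follow" flag plus a 4-byte length) and the names `write_chunks` and `read_chunks` are assumptions for illustration, not the DIME chunking format.

```python
import struct
from typing import Iterable, Iterator

# Hypothetical chunk header: 1-byte "more chunks follow" flag, 4-byte length.
CHUNK_HEADER = struct.Struct(">BI")


def write_chunks(source: Iterable[bytes]) -> Iterator[bytes]:
    """Frame blocks as they arrive; the total size is never needed up front."""
    previous = None
    for block in source:
        if previous is not None:
            yield CHUNK_HEADER.pack(1, len(previous)) + previous
        previous = block
    # The final chunk is sent with the flag cleared.
    last = previous if previous is not None else b""
    yield CHUNK_HEADER.pack(0, len(last)) + last


def read_chunks(buf: bytes) -> bytes:
    """Reassemble the payload, stepping chunk to chunk by length."""
    out = bytearray()
    offset = 0
    while True:
        more, length = CHUNK_HEADER.unpack_from(buf, offset)
        offset += CHUNK_HEADER.size
        out += buf[offset:offset + length]
        offset += length
        if not more:
            return bytes(out)


framed = b"".join(write_chunks(iter([b"abc", b"de", b"f"])))
print(read_chunks(framed))
```

The sender only has to hold one block back to know whether it is the last one, so streaming is preserved, and no separator string is needed.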
It's important to note that simplicity has an additional benefit that is critical to Web services. Interoperability is a gigantic part of why Web services have become as popular and appealing as they are. The key to successful interoperability is simplicity. Code that is simple to write and straightforward to design can easily be reproduced on multiple platforms. When it comes down to sending a compound document from a Windows platform to a Unix platform, if the format of the compound document is straightforward, without a lot of extra rules and with little room for interpretation, the chances are that the document will be understood on both platforms.
A good measure of the simplicity is the number of lines of code required to implement the solution. Consider the logic required to implement a DIME solution compared to the logic of a MIME multipart solution. MIME multipart requires special logic to create separator strings that are unlikely to occur in the data being sent. It requires logic to walk through the data to find the separator strings. It requires header parsing logic that can handle a variable number of headers of variable lengths parsed via string comparisons. The most complex part of creating or parsing a DIME record is probably dealing with the bit-order of the integers being passed in the fixed-length portion of the data record headers. Very little complexity means very few bugs and very few interpretations. The result should be easy interoperability.
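The integer-ordering concern is small but real: both sides must agree on how a multi-byte length field is laid out. The sketch below shows the issue at the byte level, assuming a 4-byte length field in big-endian (network) order; the field width and the helper names `encode_length` and `decode_length` are illustrative assumptions.

```python
def encode_length(n: int) -> bytes:
    """Encode a length as 4 bytes, most significant byte first."""
    return n.to_bytes(4, byteorder="big")


def decode_length(field: bytes) -> int:
    """Decode a 4-byte length using the same agreed byte order."""
    return int.from_bytes(field, byteorder="big")


field = encode_length(1024)
print(decode_length(field))             # 1024
print(int.from_bytes(field, "little"))  # 262144: wrong order, wrong length
```

Fixing the order explicitly in the code, rather than relying on the host platform's native order, is what lets a Windows sender and a Unix receiver read the same header the same way.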