Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js
For some odd reason, I've spent a good bit of time in my career figuring out how other people did things. In this column, I want discuss how you can get started with your own reverse engineering tasks. I'll start out with the biggest mistakes that most people make reverse engineering. Finally, it takes quite a bit of knowledge to become really good at reverse engineering so I'll point out the areas were you can study to learn more. In my next column, I'll go through a real life example of something I reversed engineered so you can see the thought process in action.
Before we get started, I have to discuss a bit about the legal ramifications to reverse engineering. Most of your software licenses have clauses in them where you are not supposed to do any reverse engineering. What I discuss in this column might cause you to break those licenses. Therefore, caveat emptor, buyer beware. If a software manufacturer does sue you, you cannot hold me responsible as I am giving you plenty of warning. I am not a lawyer so check with your own legal council before you proceed. The final note about reverse engineering is that it can make your life very difficult in future releases of your product. While you might figure out how something works internally, if you rely on that internal knowledge in your product, you can easily break the next time the operating system or third party product you integrate changes. You should never rely on hacks you figured out through reverse engineering unless you are prepared to spend a considerable amount of time re-reverse engineering each time a new release comes out. Operating system writers and third party vendors spend a considerable amount of time working on documented interfaces for you to use. If you circumvent them, you can pay an exorbitant price down the road. They are not called undocumented interfaces or techniques for nothing!
The Big MistakesThe best way to show the first mistake is to start out with the first two lines of an email I received recently: "I need to figure out how Word does word wrapping with variable pitched fonts. How do I start?" The mistake is that people think they can reverse engineer their way to an algorithmic design for their product. While I'm sure if you were given enough resources and infinite time, you could probably figure it out. However, you would take the remainder of your 30+-year career looking at the same four billion assembly language instructions. Reverse engineering will never take the place of designing your application.
What You Have To KnowWhile most people think being an assembly language programming god is the first step to reverse engineering, it really isn't. It helps quite a bit, but I've figured out how many things work without ever cracking a disassembler. The most important thing when reverse engineering is to step back and figure out how you would implement the functionality you are reverse engineering. By writing out the algorithm you would use to solve a problem, you can many times "see" very quickly how something works.
An excellent example is when I needed to figure out how compiled VB binaries and p-code VB binaries called into the VB run time, MSVBVM60.DLL. My first thought was that if I were responsible for designing the VB run time, I would want the interfaces to be the same no matter how the VB code was compiled. That way I would have only one way of testing interface calling. I had heard that p-code executes directly and not run through a Just-In-Time (JIT) compilation process. Therefore, the p-code calls would have to go through some "thunk" to call the run time. In scripting languages, thunks allow the scripting language to call into actual CPU code. The interesting thing with thunks is that they are allocated memory that the programmer has the CPU instruction pointer jump to. With this thought, I figured that if I were writing the compiled VB portions, I would use the same technique.
When I was going through this thought process, I never once used the debugger or looked at a disassembly. In essence, I was making a hypothesis. The good old scientific method proves itself yet again. Armed with my hypothesis on how VB made the calls, loaded up one native compiled application and one p-code compiled application into two debuggers. I set a breakpoint on rtcBeep exported from MSVBVM60.DLL because I guessed that the VB intrinsic function, Beep, must call down into rtcBeep. When each compiled program stopped on rtcBeep, I looked up the call stack at the calling function. The Call stack window showed that the address for the caller did not have symbols. I then checked the address of the memory against the Modules dialog and noticed the address of the memory did not appear in any of the loaded modules. I then when through the same process with the p-code compiled application, so I could verify my hypothesis again. Therefore, memory containing the thunk callers came from allocated memory and both native compiled and p-code compiled VB both called through thunks the same way. It didn't take any knowledge of assembly language to figure out the solution, just a hypothesis on how I would have implemented the functionality if I were to write it, and a way to verify that hypothesis.
As you can see from the previous discussion, it also helps to have an idea how different problems can be solved using the facilities provided by the operating system. In the Windows world, that means knowing about how Windows itself works. The first book you need to read cover to cover is Charles Petzold's Programming Windows. Charles covers how the basics of Windows and shows you how it all fits together. Fundamentally, Windows is a simple messaging based system and if you know messaging like the back of your hand, you will have a much better chance at figuring out how to consider solving various reverse engineering challenges. You will learn more about Windows if you sit down and write Notepad in straight C programming than almost anything else. The second book you need to read from cover to cover is Jeffrey Richter's Programming Applications for Windows. Once you understand the fundamentals of Windows, Jeffrey's book will get you up to speed on things like memory management and DLLs. Once you have a good grasp of those two technologies, you will be able to see how many problems in Windows get solved. Depending on what you are doing, a few other books might be useful as well. David Solomon and Mark Russinovich's Inside Windows 2000 can give you insight as to how Windows 2000 works at the kernel level. If you want to learn how to take advantage of the debugger, my own Debugging Applications can show you how to do advanced things with the Visual C++ debugger.
As much as you would like to avoid it, you do need to know assembly language in order to do the most advanced reverse engineering. There are still a few books floating around on how to program Intel x86 assembly language. The one I used to learn with was Mastering Turbo Assembler by Tom Swan, which I am sure is out of print. Assembly language is still taught at the college level so there are good learning books out there. In order to learn assembly language you should look at using the Microsoft Assembler (MASM), which is available with your Universal MSDN subscription, to write either a few simple programs or a DLL with some routines in them. You don't have to get super proficient at assembly language, you just need to be able to read it.