As engineers and computer folks, we all like to figure out how things work. We just can’t leave well enough alone; we have to poke and prod at things until we can see exactly how the implementers did it. While we sometimes pull things apart just out of curiosity, sometimes we have to get in there and figure out how something was done so that we can take advantage of a feature or work around a bug in the implementation. Reverse engineering allows us to peel the layers of engineering back one at a time until we can see enough to understand how the item works.
For some odd reason, I’ve spent a good bit of time in my career figuring out how other people did things. In this column, I want to discuss how you can get started with your own reverse engineering tasks. I’ll start out with the biggest mistakes that most people make when reverse engineering. Finally, it takes quite a bit of knowledge to become really good at reverse engineering, so I’ll point out the areas where you can study to learn more. In my next column, I’ll go through a real-life example of something I reverse engineered so you can see the thought process in action.
Before we get started, I have to discuss a bit about the legal ramifications of reverse engineering. Most of your software licenses have clauses in them that prohibit reverse engineering. What I discuss in this column might cause you to break those licenses. Therefore, caveat emptor, buyer beware. If a software manufacturer does sue you, you cannot hold me responsible, as I am giving you plenty of warning. I am not a lawyer, so check with your own legal counsel before you proceed.
The final note about reverse engineering is that it can make your life very difficult in future releases of your product. While you might figure out how something works internally, if you rely on that internal knowledge in your product, you can easily break the next time the operating system or third-party product you integrate with changes. You should never rely on hacks you figured out through reverse engineering unless you are prepared to spend a considerable amount of time re-reverse engineering each time a new release comes out. Operating system writers and third-party vendors spend a considerable amount of time working on documented interfaces for you to use. If you circumvent them, you can pay an exorbitant price down the road. They are not called undocumented interfaces or techniques for nothing!
The Big Mistakes
The best way to show the first mistake is to start out with the first two lines of an email I received recently: “I need to figure out how Word does word wrapping with variable pitched fonts. How do I start?” The mistake is that people think they can reverse engineer their way to an algorithmic design for their product. I’m sure that, given enough resources and infinite time, you could probably figure it out. However, you would spend the remainder of your 30-plus-year career looking at the same four billion assembly language instructions. Reverse engineering will never take the place of designing your application.
The second, and most common, mistake is that people try to reverse engineer far more than they should. It’s the age-old case of biting off more than you can chew. To successfully reverse engineer, you need to have a clear and concise goal. My rule of thumb is to never embark on a reverse engineering task unless I feel it’s solvable in less than a day or two. It’s just not worth the effort to reverse engineer something for several weeks when you could spend a couple of days designing around the problem or issue right up front.
What You Have To Know
While most people think being an assembly language programming god is the first step to reverse engineering, it really isn’t. It helps quite a bit, but I’ve figured out how many things work without ever cracking open a disassembler. The most important thing when reverse engineering is to step back and figure out how you would implement the functionality you are reverse engineering. By writing out the algorithm you would use to solve a problem, you can often “see” very quickly how something works.
An excellent example is when I needed to figure out how compiled VB binaries and p-code VB binaries called into the VB run time, MSVBVM60.DLL. My first thought was that if I were responsible for designing the VB run time, I would want the interfaces to be the same no matter how the VB code was compiled. That way I would have only one way of calling the interfaces to test. I had heard that p-code executes directly and is not run through a Just-In-Time (JIT) compilation process. Therefore, the p-code calls would have to go through some “thunk” to call the run time. In scripting languages, thunks allow the scripting language to call into actual CPU code. The interesting thing about thunks is that they live in allocated memory that the programmer has the CPU instruction pointer jump to. With this thought, I figured that if I were writing the compiled VB portions, I would use the same technique.
When I was going through this thought process, I never once used the debugger or looked at a disassembly. In essence, I was making a hypothesis. The good old scientific method proves itself yet again. Armed with my hypothesis on how VB made the calls, I loaded up one native compiled application and one p-code compiled application into two debuggers. I set a breakpoint on rtcBeep, exported from MSVBVM60.DLL, because I guessed that the VB intrinsic function, Beep, must call down into rtcBeep. When the native compiled program stopped on rtcBeep, I looked up the call stack at the calling function. The Call Stack window showed that the address for the caller did not have symbols. I then checked that address against the Modules dialog and noticed it did not fall inside any of the loaded modules. I then went through the same process with the p-code compiled application so I could verify my hypothesis again. Therefore, the memory containing the thunk callers came from allocated memory, and native compiled and p-code compiled VB both called through thunks the same way. It didn’t take any knowledge of assembly language to figure out the solution, just a hypothesis on how I would have implemented the functionality if I were to write it, and a way to verify that hypothesis.
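The check against the Modules dialog is simple enough to express in a few lines. The following is only a minimal Python sketch of that reasoning, not anything the debugger itself runs: the module names, base addresses, and sizes are hypothetical values of the kind you would copy by hand from the Modules dialog, and find_module and describe_caller are illustrative helper names.

```python
from typing import List, NamedTuple, Optional


class Module(NamedTuple):
    name: str
    base: int  # load address of the module
    size: int  # size of the mapped image in bytes


def find_module(address: int, modules: List[Module]) -> Optional[Module]:
    """Return the module whose address range contains `address`, or None."""
    for mod in modules:
        if mod.base <= address < mod.base + mod.size:
            return mod
    return None


# Hypothetical values, as if copied by hand from a debugger's Modules dialog.
MODULES = [
    Module("PROJECT1.EXE", 0x00400000, 0x00008000),
    Module("MSVBVM60.DLL", 0x66000000, 0x00153000),
]


def describe_caller(address: int, modules: List[Module] = MODULES) -> str:
    """Summarize where a caller address from the call stack lives."""
    mod = find_module(address, modules)
    if mod is None:
        # No module owns the address, so it is a candidate for allocated memory.
        return f"0x{address:08X}: not in any loaded module (allocated memory?)"
    return f"0x{address:08X}: {mod.name}+0x{address - mod.base:X}"
```

Calling describe_caller with each caller address from the call stack tells you at a glance whether the caller lives in a loaded module or somewhere else, which is exactly the question the hypothesis needed answered.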
As you can see from the previous discussion, it also helps to have an idea of how different problems can be solved using the facilities provided by the operating system. In the Windows world, that means knowing how Windows itself works. The first book you need to read cover to cover is Charles Petzold’s Programming Windows. Charles covers the basics of Windows and shows you how it all fits together. Fundamentally, Windows is a simple message-based system, and if you know messaging like the back of your hand, you will have a much better chance of figuring out how to approach various reverse engineering challenges. You will learn more about Windows by sitting down and writing Notepad in straight C than from almost anything else. The second book you need to read from cover to cover is Jeffrey Richter’s Programming Applications for Windows. Once you understand the fundamentals of Windows, Jeffrey’s book will get you up to speed on things like memory management and DLLs. Once you have a good grasp of those two technologies, you will be able to see how many problems in Windows get solved. Depending on what you are doing, a few other books might be useful as well. David Solomon and Mark Russinovich’s Inside Windows 2000 can give you insight into how Windows 2000 works at the kernel level. If you want to learn how to take advantage of the debugger, my own Debugging Applications can show you how to do advanced things with the Visual C++ debugger.
As much as you would like to avoid it, you do need to know assembly language in order to do the most advanced reverse engineering. There are still a few books floating around on how to program Intel x86 assembly language. The one I learned from was Mastering Turbo Assembler by Tom Swan, which I am sure is out of print. Assembly language is still taught at the college level, so there are good learning books out there. In order to learn assembly language, you should look at using the Microsoft Assembler (MASM), which is available with your Universal MSDN subscription, to write either a few simple programs or a DLL with some routines in it. You don’t have to get super proficient at assembly language; you just need to be able to read it.
What You Have To Use
After reading the books, you need to start developing your toolkit. There are many tools you can use, but I thought I would list the tools that I have purchased or acquired and that I move from machine to machine when reverse engineering. I’ll start out with the free products and work my way to the commercial products.
PEDUMP
Matt Pietrek wrote PEDUMP and it’s available on the MSDN CD or MSDN Online. PEDUMP dumps all the information about a Portable Executable (PE) binary. You can get the same output with DUMPBIN from Visual Studio, but I like the format of PEDUMP better. When looking for imported and exported functions, you need PEDUMP.
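To give a feel for the kind of information a PE dumper starts from, here is a minimal Python sketch that reads only the first few header fields of a PE file. It is in no way how PEDUMP or DUMPBIN is implemented, and it stops well short of the import and export tables. The function name read_pe_summary and the shape of the returned dictionary are my own illustrative choices; the offsets and field layout follow the published PE/COFF format.

```python
import struct

# A small subset of the COFF machine types; anything else is reported as hex.
MACHINES = {0x014C: "x86", 0x8664: "x64"}
IMAGE_FILE_DLL = 0x2000  # Characteristics flag set for DLLs


def read_pe_summary(data: bytes) -> dict:
    """Summarize the COFF file header of a PE image held in `data`."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header immediately follows the signature.
    (machine, section_count, timestamp, _sym_ptr, _sym_count,
     optional_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, e_lfanew + 4)
    return {
        "machine": MACHINES.get(machine, hex(machine)),
        "sections": section_count,
        "timestamp": timestamp,
        "optional_header_size": optional_size,
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
    }
```

You would feed it the bytes of a binary, for example read_pe_summary(open(path, "rb").read()) where path is whatever file you are examining. The real tools go on from here to walk the optional header, the section table, and the data directories.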
REGMON and FILEMON
Mark Russinovich wrote both REGMON and FILEMON, which are free and downloadable from www.sysinternals.com. REGMON monitors and completely reports all registry access on your computer. FILEMON monitors all disk and file accesses on your computer. Both of these tools allow you to easily see who’s doing what to whom. One time I purchased a downloadable product and, as a challenge, I wanted to see if I could break its registration scheme before I entered my valid, purchased ID. A total of two minutes with REGMON and I broke the scheme.
DEPENDS
The DEPENDS program from the Platform SDK reports all imported functions used by a program. You can even run an application under DEPENDS and see what functions it acquires through GetProcAddress. DEPENDS is the tool for monitoring what exports are used out of a DLL.
BoundsChecker
BoundsChecker is a commercial error detection tool from Compuware/NuMega. You can get more information about BoundsChecker by visiting www.numega.com. What many people don’t realize about BoundsChecker is that it will monitor and record each and every API call a program makes and show them in the wonderful Event view. What makes it even more interesting is that BoundsChecker will record the complete parameter information and function return values as well. While you can’t see into the APIs, BoundsChecker makes it quite easy to see which API functions an algorithm called to get the work done. When I worked at NuMega, one of the demos we had was to show how the Solitaire game did the card magic at the end of the game.
SoftICE
SoftICE is also a commercial product from Compuware/NuMega. When you think of reverse engineering in Windows, SoftICE is right there because it’s used by more people to reverse engineer things than anything else. I described how to get started with SoftICE in a previous column, so you can turn there to get an idea of how to use it. What I’ve always found amusing is that SoftICE is one of the most heavily pirated pieces of software around today. The beauty of SoftICE is that it allows you to see anywhere and everywhere, as well as get more information about the operating system than anything else.
Disassemblers
The final tool you need for larger reverse engineering chores is a disassembler. You already have one with the -DISASM switch to DUMPBIN. What makes DUMPBIN a little more usable is that it will use any symbols it can find so you can get more information. What you will probably want to do is write a Perl script to process the output to make it more readable. While you can always use the debugger’s Disassembly window, you sometimes need the disassembly in a text file.
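As a rough idea of what such a cleanup script might do, here is a minimal sketch, written in Python rather than Perl. It assumes DUMPBIN disassembly lines look like an address, a colon, the raw instruction bytes, and then the instruction text, and that symbol labels sit on their own line ending in a colon. Check that assumption against the output of your own version before relying on it. The function name clean_disassembly is purely illustrative.

```python
import re
from typing import List

# Assumed line shape: "  00401000: 55                 push        ebp"
INSTRUCTION = re.compile(
    r"^\s*([0-9A-Fa-f]{8,16}):\s+((?:[0-9A-Fa-f]{2}\s+)+)(\S.*)$")
# Assumed label shape: a single token followed by a colon, e.g. "_main:"
LABEL = re.compile(r"^(\S+):\s*$")


def clean_disassembly(text: str) -> List[str]:
    """Drop raw opcode bytes and banner lines; keep labels and instructions."""
    cleaned = []
    for line in text.splitlines():
        match = INSTRUCTION.match(line)
        if match:
            address, _raw_bytes, instruction = match.groups()
            parts = instruction.split(None, 1)
            mnemonic = parts[0]
            operands = parts[1].strip() if len(parts) > 1 else ""
            cleaned.append(f"{address}  {mnemonic:<8}{operands}".rstrip())
        elif LABEL.match(line):
            cleaned.append(line.strip())
        # Everything else (banners, byte continuation lines) is skipped.
    return cleaned
```

Redirect the DUMPBIN output to a text file, pass the file’s contents to clean_disassembly, and write the returned lines back out. From there it is easy to add your own touches, such as a blank line before each label.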
I hope I’ve given you an idea of how to get started with your reverse engineering challenges and how to apply reverse engineering properly. It’s a big commitment to reverse engineer something, so do it only when you have no other choice. In my next column, I’ll apply the lessons and reverse engineer a few things in the operating system so you can see how they work.