SOS from Your Production Environment

Introduction

Developing large enterprise applications is a complex and difficult undertaking. Writing the code is just one of many tasks you have to do. You worry about requirements, designs, architecture, unit testing, daily builds, release builds to QC, and many more things. All this effort is spent to create a reliable, scalable, well-performing, and functioning application. Then comes the day when you move it into production (if you are lucky and it is hosted by your own organization) or the customer starts installing it on their servers. This is a big day, a day of celebration. You see the fruits of all your labor and you are excited to see users using the application, to get their feedback, and to improve the application. But too often it comes back to haunt you. The customer reports crashes, instability, or unpredictable behavior. You tell yourself, “But it works in our environments.” What is different between your test environments and the customer’s environments?

This is one of the most difficult challenges a development team can face. Your options are suddenly limited. In your development environment, you fire up the VS .NET debugger, set breakpoints, look at the application state, and so forth. Through that, you are finally able to figure out what is going on and then make your code change. But try telling the customer that you need to install VS .NET to debug this issue, and watch out for the reaction; it might be pretty nasty. Production environments are very locked down and only approved applications can be installed. Very often, any change applied to production needs to go through a stringent test process, which takes time. All this while, the end users have to bear with the stability, performance, or functional problems. If this goes on for too long, users will abandon the application and the organization has to fight an uphill battle to convince end users to come back. This creates lots of frustration, noise, and problems, and can result in large losses. This is the ultimate hell for every developer. You have no idea what is going on while everyone expects a resolution by yesterday.

Gather Data to Make Informed Decisions

Applications can behave very differently in various environments and under load. First, stop worrying about all the shouting. Concentrate on gathering the right data so you can narrow down what is going on. Start with basic information, like which OS and Windows patches are installed. Look at the event log to find out whether any system or application errors are reported. If it is not done automatically, run a virus check to make sure there is no virus infection. Enable your custom application logs and comb through them to find out what is happening. If all that does not uncover anything, understand how the application is used: which features are used heavily by users, how many concurrent users are on the system, and so on. Then, replicate a similar environment in house and run a load test against it; this simulates a usage scenario as closely as possible (see my article about concurrent users stress testing).

If all that does not bring you closer to a resolution, you need to take a snapshot of the application in production and analyze it. This article will introduce you to the basic approach for this and then point you to more advanced articles. It is easier than most people believe. Microsoft has built a very nice debugging story, in the unmanaged as well as the managed world.

The “Debugging Tools for Windows”

Microsoft provides debugging tools for Windows NT 4.0, Windows 2000, Windows XP, and Windows 2003. The homepage for the “Debugging Tools for Windows” can be found here. Follow the “Install Debugging Tools for Windows 32-bit Version” link to download the latest version of them (this article uses version 6.4.7.2). The tools, by default, are installed in the “c:\program files\debugging tools for windows” folder. The install also adds a “Debugging Tools for Windows” menu group under “All Programs.” This includes a “Debugging Help” that provides some very good information.

There are a number of debuggers that you can use to debug your application. This article will concentrate on how you can take a dump of your application and then analyze the dump in another environment rather than in the production environment itself. You will see how you can take a dump when the application hangs, crashes, or just while it is running. These dumps include a complete memory dump so you can see all the threads executing, all the objects on the stack, and the like. This is the least intrusive approach to really understanding what is happening in your application while it is used in production. It also does not require any files to be registered; this makes it easier to get permission to use it in production and also to remove it again when it is no longer needed (which the customer might request). Install the debugging tools on any machine you want and then copy the following five files from the “c:\program files\debugging tools for windows” folder to the production environment:

  • adplus.vbs
  • cdb.exe
  • dbgeng.dll
  • dbghelp.dll
  • tlist.exe

You don’t need to register the DLLs. The cdb.exe file is the “Microsoft Console Debugger” and the adplus.vbs file is a Windows scripting file that is used to automate the CDB debugger. This requires the Windows Scripting Host 5.6 to be installed (run cscript.exe to check the version number). If required, download the version from here and install it on the production server.
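Before asking for access to the production server, it can help to confirm that the folder you are about to copy really contains all five files. The following is a minimal sketch, not part of the debugging tools: the helper name `missing_debugger_files` is hypothetical, and the file names are simply the ones listed above.

```python
from pathlib import Path

# File names from the list above; this helper is a hypothetical convenience,
# not something shipped with the Debugging Tools for Windows.
REQUIRED_FILES = ["adplus.vbs", "cdb.exe", "dbgeng.dll", "dbghelp.dll", "tlist.exe"]


def missing_debugger_files(folder):
    """Return the required file names that are not present in `folder`.

    The comparison is case-insensitive because Windows file names are.
    A folder that does not exist is reported as missing every file.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return list(REQUIRED_FILES)
    present = {p.name.lower() for p in folder.iterdir() if p.is_file()}
    return [name for name in REQUIRED_FILES if name not in present]
```

An empty list means the folder is complete; anything else names the files you still need to copy.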

Always Create the Symbol Files for Your Binaries

A debugger needs symbol files to show you more than just class, method, and object addresses. Symbols enable debuggers to show you the class names, variable names, and so forth. You can debug an application without symbols, but it is much harder and needs a lot of experience. You want to make your life as easy as possible; therefore, always generate the symbol files. When you compile your application in debug mode, you will also see a PDB file in the same folder where the DLL or EXE gets generated. The PDB file is the symbol file that you need for debugging purposes. Of course, you do not want to release the debugging version of your binaries. You can tell the compiler to generate these symbol files when compiling in release mode as well. Open the project settings in your Visual Studio .NET IDE (menu Project | Settings). Select the Build tab, select “Release” in the Configuration drop-down box if not already selected, and then click on the Advanced button. In the “Debug Info” drop-down box, select “PDB-only.” Close your project settings and rebuild your project. You need to do that for all project files. Make it a habit that, when you release your application, you release not just the binaries (DLLs and EXEs) but also all their symbols. That way, you have the symbols ready any time you need them for debugging purposes.

Symbol files contain information such as all the class names, method names, global and local variable names, as well as source line numbers. They are kept separate so that your binaries are smaller and faster when running. Later in the article, I explain how you can load these symbols into the debugger. You can also obtain all the symbols for the Windows OS, the .NET framework, and many other Microsoft products. You can tell the debugger to download them as needed from the Internet or, if you do not have access to the Internet while debugging, you can download them from the Microsoft site (Windows symbols). The article will explain how to set up your debugger to download Microsoft symbol files as needed.

Using ADPlus to Take Application Dumps

Now, you are ready to take dumps. First, start your application. The article has a ThrowException .NET sample application attached; it allows you to generate two unhandled exceptions. You will use this sample application to walk through all the examples in this article. Next, open the task manager and go to the “Processes” tab. Select the “Show processes from all users” check box at the bottom so you can see all processes running. Next, find the process named “ThrowException.exe” and note down the process ID (shown in the PID column).
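If you would rather not read the PID off the Task Manager window, the lookup can be scripted. The sketch below is an assumption-laden alternative to the steps above: it presumes you capture the output of the Windows `tasklist /FO CSV /NH` command (not mentioned elsewhere in this article) and feed it to a hypothetical helper named `find_pids`. The sample text is made up for illustration.

```python
import csv
import io


def find_pids(tasklist_csv, image_name):
    """Return the PIDs of all rows whose image name matches (case-insensitive).

    `tasklist_csv` is assumed to be the text printed by `tasklist /FO CSV /NH`,
    where the first column is the image name and the second is the PID.
    """
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) >= 2 and row[0].lower() == image_name.lower():
            pids.append(int(row[1]))
    return pids


# Made-up sample output, for illustration only.
sample = (
    '"System Idle Process","0","Services","0","24 K"\n'
    '"ThrowException.exe","2828","Console","1","18,432 K"\n'
)
print(find_pids(sample, "ThrowException.exe"))  # prints [2828]
```

The function returns a list because several instances of the same executable may be running; each PID can then be passed to ADPlus separately.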

ADPlus has a number of command-line options. First, you need to decide whether you want to take a crash dump or a hang dump. A crash dump is for situations when your application unexpectedly terminates. Hang dumps can be used to take a dump when your application hangs or at any time while it is running. ADPlus cannot be used in scenarios where your application crashes while starting up. It can only be used for applications that are running and then crash. Use the CDB or WinDbg debuggers for scenarios where your application crashes during startup. ADPlus automates the CDB debugger and attaches it to your process. It also can be used to attach it to multiple processes; for example, when your application runs under IIS and also uses COM+. When CDB kicks in, it freezes all processes it has been attached to, takes a dump for each asynchronously, and then lets these processes continue to run.

Running ADPlus in Crash Mode

Open a command prompt and go to the folder where you installed or copied the debugging files. You need to provide at a minimum the following command line arguments when running ADPlus:

  • Mode: The mode you want the CDB debugger to run in. Add “-crash” for crash mode or “-hang” for hang mode.
  • Process to monitor: Add “-p <process id>” to tell CDB which process to attach to. You can repeat that option for each process you want to monitor. For each process, it spawns a separate instance of CDB.
  • Quiet mode: When you run ADPlus, it will show a dialog box at the beginning, telling you which mode has been chosen and where the log files will be created. When you run ADPlus on a remote machine, you need to suppress this dialog box; otherwise, ADPlus itself will hang (see later in the article). Add the option “-quiet”.
  • Location of log files: With the “-o <log file path>” option, you can specify the path where the log file will be created. The CDB debugger creates a unique folder each time it runs under that log file path. The folder name will be a combination of the mode and date and time the CDB has been started, for example:
    Crash_Mode__Date_04-01-2005__Time_19-57-18PM

    This guarantees that no dump will be overwritten with another dump. In that folder, you find the actual memory dump as well as a number of log files. The file “ADPlus_report.txt” contains information about the configuration the CDB debugger has been started up with. The “Process_List.txt” file lists information about all the processes running when CDB started. The “PID-<process id>__<process name>__<date>__<time>.log” file contains all the output of the CDB debugger while running. The actual dump generated by CDB gets placed in the “PID-<process id>__<process name>__<…>.dmp” file.

  • Symbol path: The option “-y <path>” specifies the path where the symbol files can be found. The path contains three pieces of information:
    • Symbol server: The symbol server to use. This should always be “srv” unless you have a custom symbol server you utilize.
    • Downstream store: The downstream symbol store; for example, “c:\symbols”. CDB will cache symbols from the upstream store to the downstream store, providing a cascading symbol store cache.
    • Upstream store: the upstream symbol store. This can be a local path, a network path, or a URL.

    All three pieces of the path should be separated by a “*”. The following example points to the public symbol store from Microsoft and uses a local downstream store:

    -y "srv*c:\local symbols*http://msdl.microsoft.com/download/symbols"
    

    This allows CDB to download the symbols to your local store, which makes any subsequent access to the symbol files much faster. Symbols are copied to the downstream store only as CDB requires them; it does not simply go ahead and copy every symbol file. You also can list multiple symbol stores by separating each with a semicolon. The next example points to the Microsoft public symbol store as well as the symbol files of your application:

    -y "srv*c:\local symbols*http://msdl.microsoft.com/download/symbols;srv*c:\local symbols*c:\ThrowException\bin\Release"
    

    You also can use the “_NT_SYMBOL_PATH” environment variable instead of using the “-y” option. As mentioned earlier in the article, you can download all the Microsoft symbols if the production environment does not have Internet access. This also means that all your application symbols should be copied to a folder on the production environment. The following article provides a much more comprehensive explanation of the symbol stores and symbol server.

  • Exception mode: Any exception can be raised to the debugger as a first-chance or second-chance exception. First-chance exceptions are non-fatal exceptions that are handled by the application. If a first-chance exception is not handled by the application, it gets raised as a second-chance exception. Only debuggers can handle second-chance exceptions. Second-chance exceptions normally cause the application to shut down unless a debugger is attached to it. By default, ADPlus takes a mini dump for all first-chance exceptions except unknown and EH exceptions (these are quite common and would generate too much overhead). To do so, it pauses the thread and then writes to the log file the exception, the thread ID, and the call stack of the thread that raised the exception, as well as the date and time when the exception occurred. Finally, it takes the mini dump and then resumes the process. The following four command-line options control what action is taken when a first-chance or second-chance exception happens:

    • Full dump on first-chance exceptions: The “-FullOnFirst” option tells ADPlus to take a full dump for first-chance exceptions.
    • No dump on first-chance exceptions: The “-NoDumpOnFirst” option tells ADPlus to take no dumps at all for first-chance exceptions.
    • Mini dump for second-chance exceptions: By default, ADPlus takes a full dump for second-chance exceptions. The “-MiniOnSecond” option tells ADPlus to take only mini dumps at second-chance exceptions. This is useful when you need to send the dump to someone to look at. These are small dumps, whereas full dumps can be hundreds of megabytes and are difficult to send around.
    • No dump on second-chance exceptions: The “-NoDumpOnSecond” option tells ADPlus not to generate any dumps on second-chance exceptions.
  • Notification: The “-notify <machine name>” option will send an alert to the machine when a crash dump is taken. This will bring up a message box on the machine and is useful so you don’t have to sit and wait until a crash happens.
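The exception-mode defaults and the four overriding options described above can be summarized as a small decision function. This is only a sketch of the rules as stated in this article, not ADPlus code: the function name `dump_type` is hypothetical, the precedence when two conflicting options are given together is an assumption, and the special handling of unknown and EH first-chance exceptions is deliberately left out.

```python
def dump_type(chance, options=()):
    """Return 'mini', 'full', or 'none' for an exception.

    chance  -- 1 for a first-chance exception, 2 for a second-chance one
    options -- ADPlus options in effect, e.g. ["-FullOnFirst"]

    Assumption: if conflicting options are passed together, the
    "no dump" option wins. The article does not say what ADPlus does.
    """
    opts = {o.lower() for o in options}
    if chance == 1:
        if "-nodumponfirst" in opts:
            return "none"
        if "-fullonfirst" in opts:
            return "full"
        return "mini"  # default for first-chance exceptions
    if chance == 2:
        if "-nodumponsecond" in opts:
            return "none"
        if "-minionsecond" in opts:
            return "mini"
        return "full"  # default for second-chance exceptions
    raise ValueError("chance must be 1 or 2")
```

Reading the function top to bottom gives the same picture as the list: mini dumps on first chance and full dumps on second chance unless an option says otherwise.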

For a complete list of all the ADPlus command line arguments, please refer to the topic “ADPlus Command-Line Options” in the “Debugging Help” section. It also explains how you can create a configuration file with all these settings and tell ADPlus with the “-c <configuration file path>” option to use the configuration file instead. Assuming that the application ThrowException runs under the process ID 2828, here is how to start ADPlus in crash mode, logging all information in the “c:\crashlogs” folder.

ADPlus -crash -p 2828 -o c:\crashlogs -y "srv*c:\symbols*c:\ThrowException;srv*c:\symbols*http://msdl.microsoft.com/download/symbols" -quiet -FullOnFirst

This spawns a new window that shows the CDB debugger attached to your application. You can press Ctrl+C in that window any time to take a hang dump if no crash happens. But, this will terminate the process. ADPlus cannot be run in crash mode through Terminal Server on Windows NT 4.0 and Windows 2000. The following article explains how to run in crash mode remotely. It also contains more detailed information about how to use ADPlus.
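Because the symbol path contains asterisks, semicolons, and possibly spaces, it is easy to mistype on the command line. One way to reduce that risk is to assemble the argument list in a script. The sketch below only builds the list and does not launch anything; the helper names `symbol_path` and `adplus_args` are hypothetical, and starting the script through `cscript.exe adplus.vbs` is an assumption about how you invoke ADPlus on your server.

```python
def symbol_path(stores):
    """Build a -y value from (downstream, upstream) pairs.

    Each pair becomes "srv*<downstream>*<upstream>"; pairs are joined
    with semicolons, as described in the symbol path section.
    """
    return ";".join(f"srv*{down}*{up}" for down, up in stores)


def adplus_args(mode, pids, output_dir, stores=(), extra=()):
    """Return an argument list for running ADPlus (hypothetical helper).

    mode       -- "crash" or "hang"
    pids       -- process IDs to attach to; -p is repeated for each one
    output_dir -- folder passed to -o
    stores     -- (downstream, upstream) symbol store pairs for -y
    extra      -- further options such as "-quiet" or "-FullOnFirst"
    """
    if mode not in ("crash", "hang"):
        raise ValueError("mode must be 'crash' or 'hang'")
    args = ["cscript.exe", "adplus.vbs", f"-{mode}"]
    for pid in pids:
        args += ["-p", str(pid)]
    args += ["-o", output_dir]
    if stores:
        args += ["-y", symbol_path(stores)]
    args += list(extra)
    return args


example = adplus_args(
    "crash",
    [2828],
    r"c:\crashlogs",
    stores=[
        (r"c:\symbols", r"c:\ThrowException"),
        (r"c:\symbols", "http://msdl.microsoft.com/download/symbols"),
    ],
    extra=["-quiet", "-FullOnFirst"],
)
```

Passing the resulting list to something like `subprocess.run` leaves the quoting of the symbol path to the library, so the path never has to be escaped by hand.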
