[RESOLVED] New Thread causes Server 2003 heap error


nice_guy_mel
June 28th, 2007, 10:40 AM
Using Visual Studio 6 with Service Pack 6 and the multithreaded run-time library.

I'm at my wits' end. For 2 weeks now we've been trying to find out why our software is crashing on the Server 2003 OS. I'll describe it as best I can:
Our software works like a router: it takes in multiple data inputs, combines them, and sends them out through multiple servers, which then send the combined data to any users connected to them. This application has been working flawlessly for months on all tested operating systems. The output communication is handled by a single thread that systematically goes through its container of servers and sends the data to each one. The servers are (so we thought) thread safe because of the following:

void CTopServer::SendMesgToAll(std::string& strMesg)
{
m_lockConx.lock();
///Goes through its container of connected users and sends data to each
m_lockConx.unlock();
}

So when my superiors approached me and asked if it would be possible to create a new thread that sends a separate data feed out through the servers, I didn't think it would be a problem. A new thread was created that reads in data from a file and calls the same SendMesgToAll for the servers. We tested it thoroughly on Windows 2000 and Windows XP, and the software ran for days. Days before the promised deadline, our customer informed us that the OS they want to use is Windows Server 2003. Just to be safe, we installed and tested the software on a PC running Server 2003. That's when the problems started. Suddenly our software that had been running for weeks on XP could only run for hours on Server 2003 before crashing.
We had to install Visual Studio on the Server 2003 PC to find out what the problem was. The software runs for about 2 hours, then the output window shows "heap corruption detected at _____" with a different address each time. We used a trial version of HeapAgent to see if it could catch anything while the software was running. HeapAgent reported nothing on the computers running XP, but on Server 2003 it was catching "Double Free" heap errors. The cause seemed to be 2 STL strings that were both pointing to the same heap memory. It reported that the second string that tried to free the same memory was found in the SendMesgToAll method.
To check if the problem was coming from the critical section, we ran the test on Server 2003 again, only this time with no users connected to the servers. The software ran for days again. Only once we connected a user to the server did the heap error reappear.

Does anyone know of any issues that Server 2003 has with STL strings? Or if there are any known issues with Server 2003 and multithreaded console applications? Any advice would be appreciated. Sorry for the long-winded description.

Arjay
June 28th, 2007, 12:18 PM
To make a long story short, it's not Win 2003 server. When you created the new thread, did you use the same lock object that was shared across the other threads (e.g. m_lockConx) or did you create a new one? If you didn't share the lock between the other threads, that could be the problem.

In addition, W2K3 handles std::strings fine, but if a string is shared across threads, you need to protect it with some sort of locking object (critical section, mutex, etc.).

Finally, don't assume that the code you've inherited was thread safe to begin with. It may not have been, but due to timing, luck, what have you (except for the rare crash here and there), it might have seemed to be thread safe.

I would go over the code with a fine-tooth comb and ensure that any resource that is shared between threads is properly protected.

JVene
June 28th, 2007, 12:36 PM
I doubt this is specifically Windows 2003, despite the localization of the symptom. I've had situations in years gone by where I swear that HAD to be it (different OS profiles, though) - and it wasn't.

It could be, for example, that because the server version favors background threads and schedules threads a little differently, you simply never trigger the problem under XP. Is it possible you have a single-core versus multi-core issue? Or, say, a hyperthreaded processor under 2003 while the XP machine has hyperthreading disabled?

My point is, when it comes to synchronization bugs, while they can appear and disappear in various configurations, it's almost always NOT the OS difference, but a bug that simply isn't triggered in one configuration the way it is in another.

Protecting the container with this critical section/mutex lock looks fine, but is that done everywhere? Even the subcontainer implications?

That is, while you've protected the container HERE, is there ANYWHERE else that you've accessed either that container, or any element it contains?

Are you sure your particular string class is thread aware? Is it possible the 'double delete' has occurred by an operation in a string class where THAT class doesn't have any provision for threaded ownership?

Is there any function/operation that hands off an element from that container for use in another thread? Is that thread's actions controlled by a synchronization object?

These are just about always what I find whenever this puzzle has hit me.

Alternate advice point - always use objects that control this stuff. That is, here you have an object representing the mutex/critical section, but why don't you have an object that represents the lock?



void CTopServer::SendMesgToAll(std::string& strMesg)
{
Locker l( m_lockConx );
///Goes through its container of connected users and sends data to each
/// -> m_lockConx.unlock() is no longer needed and can't be forgotten: the Locker destructor does it
}



An object that represents the lock simply acquires the lock at construction and always releases it (unlocks) at destruction. This is classic RAII, and it means your unlock still happens when an exception would otherwise bypass the unlock STEP.
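The thread never shows what Locker or the type of m_lockConx look like, so the following is only a minimal sketch in modern C++ under stated assumptions. TrackedLock and DemoServer are hypothetical stand-ins (the real m_lockConx is only known to expose lock() and unlock()), and std::mutex is swapped in for whatever primitive the original wraps. It shows the lock being released both on normal return and when an exception leaves the function.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the type of m_lockConx: same lock()/unlock()
// interface, plus a flag so the example can observe whether it is held.
// The flag is only read from the single thread used in this demonstration.
class TrackedLock {
public:
    void lock()   { m_.lock(); held_ = true; }
    void unlock() { held_ = false; m_.unlock(); }
    bool held() const { return held_; }
private:
    std::mutex m_;
    bool held_ = false;
};

// RAII guard: locks in the constructor, unlocks in the destructor.
class Locker {
public:
    explicit Locker(TrackedLock& lock) : lock_(lock) { lock_.lock(); }
    ~Locker() { lock_.unlock(); }
    Locker(const Locker&) = delete;
    Locker& operator=(const Locker&) = delete;
private:
    TrackedLock& lock_;
};

// Hypothetical reduction of CTopServer to its locking behaviour.
class DemoServer {
public:
    // Throws on an empty message to simulate a failure in mid-send.
    void SendMesgToAll(const std::string& strMesg) {
        Locker l(m_lockConx);
        if (strMesg.empty())
            throw std::runtime_error("nothing to send");
        ++m_sent;
    }   // l is destroyed here, or during stack unwinding, and unlocks

    bool LockHeld() const { return m_lockConx.held(); }
    int Sent() const { return m_sent; }
private:
    TrackedLock m_lockConx;
    int m_sent = 0;
};
```

With the manual lock()/unlock() pair, the throwing path would leave the lock held and the next caller would block forever. In current C++ the standard std::lock_guard plays the role of Locker.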

Similar object-based solutions to coordination can help reduce, and I submit outright eliminate, the potential for issues like double deletion.

Smart pointers with reference counting (where the count itself is updated under a lock) are an excellent way of solving the 'which thread should delete this' question. They effectively remove the question.
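As a rough illustration of that idea, here is a sketch using std::shared_ptr, which did not exist in Visual Studio 6 and is used here purely as a modern stand-in. Message, Worker, FanOut and g_destroyed are hypothetical names made up for the example. Each thread holds its own shared_ptr copy, the reference count is maintained atomically by the library, and whichever owner lets go last performs the single delete.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Counts Message destructions so the example can check there is exactly one.
std::atomic<int> g_destroyed{0};

struct Message {
    explicit Message(std::string t) : text(std::move(t)) {}
    ~Message() { ++g_destroyed; }
    std::string text;
};

// Each worker owns a reference for as long as it runs.
void Worker(std::shared_ptr<Message> msg, std::atomic<std::size_t>& total) {
    total += msg->text.size();
}   // this worker's reference is dropped here

// Shares one Message with several threads; returns the characters they read.
std::size_t FanOut(int workers) {
    std::atomic<std::size_t> total{0};
    auto msg = std::make_shared<Message>("payload");
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i)
        pool.emplace_back(Worker, msg, std::ref(total));
    msg.reset();  // the creator lets go early; the workers keep the object alive
    for (auto& t : pool) t.join();
    return total.load();
}
```

Note that shared_ptr only makes the ownership bookkeeping thread safe. Concurrent writes to the Message itself would still need a lock.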

The string class should have, in a way, done this for you. Most string classes use reference counting, and that would mean the character array underneath that should be managed in a thread-aware and safe fashion, but that assumes your string class was written to do that.

Similar issues apply to objects that contain other objects via pointers (which should be smart pointers, so this question never comes up), etc.

Best of luck, and I'd enjoy hearing how well you fare. A 'lesson' for all of us from an application production in the real world is exactly what visitors to this board need, so they know what's ahead.

nice_guy_mel
June 28th, 2007, 01:54 PM
Good suggestions so far.
Here's some more of what we know. The CTopServer class has a member object CFileDataThread. The CFileDataThread constructor receives a pointer to the CTopServer object that owns it: CFileDataThread(this). The CFileDataThread shares no resources with anyone. The only thing shared is the CTopServer object itself, which is used by the CFileDataThread's thread and the CCommMgr's thread. Both threads make the same call to SendMesgToAll() when they have a string of data to send. CFileDataThread's job is to continuously open text files and read in the data one line at a time. For each line that is read in, it makes a call to its owner with m_pServer->SendMesgToAll(strMesg). It was a pretty simple and straightforward add-on.
HeapAgent was complaining about one specific string problem over and over.

void CTopServer::SendMesgToAll(std::string& strMesg)
{
m_lockConx.lock();
if (m_bOpen)
{
std::string strFilter;
for (each connected user)
{
//Get user's filter, and check if allowed to send strMesg
strFilter.assign(UserIterator->m_strFilter);
//If strMesg is not in the user's filter, then it is sent out over
//the socket, otherwise we go onto the next user.
}
} // <-- HeapAgent reports the Double Free here
m_lockConx.unlock();
}


The strFilter string goes out of scope at the end of the if statement. That is when its destructor gets called, which eventually calls delete on the memory the string object was using to hold its characters. It's here that HeapAgent reports the double free errors. This baffles me because, for one, the operation takes place while the mutex is held. But also, strFilter should be a new object for each thread that comes into this function. It's created and destroyed locally in the function, which should make it impossible for the object to be shared by 2 or more threads.
Nobody believed it to be an OS-specific bug at first, but we are seriously considering the possibility now. Both the XP machines (6 of them) and the Server 2003 machines (3 of them) have dual-core processors. It is quite possible that timing is the issue, and the XP machines just aren't getting the stars to align in the right way.
HeapAgent has been a good tool to use, but we've had utter failure trying to tie it into Visual Studio so that we can use the call stack while debugging. We've also tried DevPartner, but its instrumentation of the cpp files causes the program to run at a snail's pace. If anyone has a favorite third-party heap checker, I'd like to hear about it.
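One caveat to the "local object, so nothing is shared" reasoning: if the string class is reference counted, as suggested earlier in the thread, a local string object can still point at heap memory owned jointly with a long-lived one. The sketch below is a toy model, not the Visual Studio 6 implementation. CowString and LocalSharesBuffer are hypothetical names, and std::shared_ptr is used for the count so the toy itself stays safe.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy reference-counted string: a model of the idea only.
class CowString {
public:
    explicit CowString(const char* s) : buf_(std::make_shared<std::string>(s)) {}

    // Sharing assign: afterwards both objects point at the same heap buffer.
    void assign(const CowString& other) { buf_ = other.buf_; }

    const char* data() const { return buf_->c_str(); }
    long owners() const { return buf_.use_count(); }

private:
    std::shared_ptr<std::string> buf_;
};

// Mirrors the loop body: a local string assigned from a long-lived member.
// Returns true if the local ends up sharing the member's buffer.
bool LocalSharesBuffer(const CowString& userFilter) {
    CowString strFilter("");
    strFilter.assign(userFilter);
    return strFilter.data() == userFilter.data() && userFilter.owners() == 2;
}
```

In such a design the destructor of the local only decrements a count that belongs to the shared buffer. If that count is not updated safely and another thread touches the same buffer through a different string object, two owners can each conclude they are the last one and free it.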

Arjay
June 28th, 2007, 02:55 PM
Is the problem related to "STL std::string class causes crashes and memory corruption on multi-processor machines" (http://support.microsoft.com/kb/813810)?

JVene
June 28th, 2007, 03:46 PM
I agree it's puzzling.

However, it's possible that some other thread, in otherwise unrelated code, is performing a std::string operation at the same time and happens to allocate at the same moment.

Try switching this code to another string object or a simple character array you manage yourself carefully.
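A minimal sketch of the character-array route, with assumptions stated up front: CopyToBuffer and IsFiltered are hypothetical helpers, and the filter rule (a message is blocked when its text occurs in the user's filter string) is a guess, since the thread does not define it. The point is only that the characters are copied into a buffer this function owns, so no string internals are shared.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Copies the characters of src, including the terminating NUL, into a
// buffer owned by the caller. Copying through a const char* moves bytes;
// it never shares storage with src.
std::vector<char> CopyToBuffer(const std::string& src) {
    const char* begin = src.c_str();
    return std::vector<char>(begin, begin + src.size() + 1);
}

// Hypothetical filter rule: blocked if the message occurs in the filter text.
bool IsFiltered(const std::string& userFilter, const std::string& strMesg) {
    const std::vector<char> filter = CopyToBuffer(userFilter);
    return std::strstr(filter.data(), strMesg.c_str()) != nullptr;
}
```

The copy still has to be made while holding whatever lock guards the user's filter string. It only removes the shared buffer, not the need for synchronization.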

nice_guy_mel
June 29th, 2007, 08:16 AM
Arjay, we are in your debt. We took a look at the Microsoft article, and it seems to be a lot like what we are experiencing. We tried setting the enum FROZEN to 0 as the article suggests, and so far it has been running for 8 hours on Server 2003! That's by far the longest run we've ever seen. If the heap error persists, then I'm going to try JVene's suggestion and use character buffers instead to see if that makes a difference.

nice_guy_mel
July 2nd, 2007, 07:10 AM
Thanks again Arjay, Server 2003 has been running our software now for over 3 days! 95% of the time bugs end up being attributed to programmer error. I can't believe this turned out to be one of the 5% where it was Microsoft's fault :)

Arjay
July 2nd, 2007, 01:07 PM
You're welcome. Well, this isn't really Microsoft's fault - you're using a compiler that's nearly 9 years old. This bug was fixed in the next version of the STL in VC7 sometime in 2001. Really glad this worked for you. :)