Click to See Complete Forum and Search --> : Is this safe? (double-checked locking)


Hermit
August 25th, 2007, 05:58 PM
Hey folks,

Suppose I have one thread which constantly reads from a structure and another which only occasionally modifies it. Since the reading thread is performance critical, I'd rather not have to acquire a lock every time the structure is read, so I've come up with the following strategy...

To simplify things I'll just say that there are two copies of the data, one which is modified by the writing thread and another which is accessed by the reading thread. The reading thread's copy should be updated after the writing thread's copy is modified, but it doesn't matter how immediate this transfer is... so I've used an unsynchronized flag to determine if this transfer should take place. The actual transfer process should be sufficiently thread safe. In detail:

- Before modifying the write-thread copy, a lock is acquired on the mutex. The data is modified and the transfer flag is set to true. The lock is then released.

- When the read-thread copy is accessed, the transfer flag is checked without locking the mutex. If it indicates that the this copy should be updated, the mutex is locked, the data is copied over, the transfer flag is reset to false, and the lock is released. In any case, the read-thread copy is returned without any further locks.

Seems like the only loss in terms of synchronization is that the copy-over might be delayed in the event of a race condition on the transfer flag.

Am I falling for the double-checked locking anti-pattern here? If not, is it as safe as I'm assuming it is?


EDIT: Upon further investigation, this almost definitely doesn't have the same problems as true double-checked locking (and probably isn't even related). Still... I'd like some thoughts on the safety implications of it.

JVene
August 26th, 2007, 12:47 AM
Your design recalls the notion of the 'copy-on-write' concept. You might find interesting material searching on that notion.

In particular, copying of the object's contents can be obviated through the use of pointers to switch access to the 'read' and 'write' copy.

Your post is constructed to indicate a thread devoted to writing, opposing a thread devoted to reading. From the viewpoint of a copy-on-write 'object' any thread may perform either function. Your specialization might apply for your current application, but you may find this notion applies to code you'll write in the future, and therefore a generalized solution may be preferable (that is, a template class that can perform these feats on whatever type you need, under circumstances where many threads read and/or write).

Consider this problematic scenario:

Threads (or your reading thread) have read the read only copy many times (assuming it was initialized with meaningful values).

A thread (your writing thread) now updates the write-enabled copy and trips the flag you propose. Now assume that no read has yet been performed before an intervening update occurs. In your design discussion it seems, unless I've missed something, just possible that a read may occur while this proposed 'second write' is in process. In other words, since your design depends on the read process ability to check the proposed transfer flag, and perform associated update/switching/copying as this flag indicates is appropriate, it now seems that the entire duration of the write processing must be under lock.

Conversely, since there are only two copies in your scenario, there must be at least one destination available at all times for the write operations or the write operation may have to stall. There could be a situation where two or more threads are reading material while a third is writing, and a condition may occur where, say, thread 1 is reading from the original read only copy, while thread 2 has sensed a transfer flag was turned on. This means that thread 1, not under lock, could be interrupted with a copy of data - which is one reason I don't recommend it. So, replace that notion with a pointer switching the identity of the read-only copy.

Let's assume that's so and proceed.

Now, thread 2, sensing the transfer flag, has switched this pointer. At this point thread 1 may still be processing data from the original while thread 2 is processing from the new(er) copy, and a 3rd thread has no destination to write to - but how would it 'know' that?

I realize your post indicated a thread dedicated to reading and another for writing, and while that may be applicable for now, it's not well suited for a generalized solution. A generalized solution would have to consider any number of threads with any vector of entrance (read or write) - otherwise your solution is based on this thread design (it's not scalable to more processors, for example).

I haven't timed the use of the Linux mutex for some ages now, but I know that critical sections in Windows pass by millions of times per second. It would take an odd application demand to exceed their use, but such conditions do occur. It may be wiser to check to see if you really have an optimization issue to solve, but I'll assume you know that it is.

It may be better to employ either a FIFO queue or a circular queue of at least as many entries as you have threads. If you switch to the pointer method, rather than an object copy, you can also place the duty of manipulating this pointer into the write process responsibility, taking care about unity across accesses. That is, if you must have all members of the object taken by a thread for the same object (such that various members aren't read 'out of sync') - you should obtain a pointer to the appropriate read object and use it for several accesses that require unity. Otherwise you might read one member from, say, the original and another member from the recently updated copy.

For any queue, circular or FIFO, you have the problem of either deleting the tail or the cycling of the circle. In either case there exists the potential for mismanagement. For a circular queue, it's possible the write process might 'circle back' onto an object still in use by a read-only process. Or, for the FIFO, knowing when it's safe to delete the tail is problematic unless something marks the read status.

I sense in most of this there are requirements for locks beyond your initial statement. By depending on a two thread only arrangement you may have some benefit to obviate that, albeit at the expense of a generalized solution. Thinking in favor of generalization, I personally would have to know there are serious performance implications with respect to locking that warrant the considerable logical puzzle this presents. There are lock free approaches, and most of them are aimed at highly parallel implementations - that is to say their benefit is most notable when many threads are otherwise likely in contention. Locks are fastest and lightest when contention is occasional, as your post suggests.

Once again, as with many of my posts of late, I've handled a number of interruptions while I write, and so I must admit I may easily have overlooked something.

MikeAThon
August 26th, 2007, 03:16 PM
I don't see a problem with your approach, so long as there is at most a single reading thread. Naturally, the approach will not work (since there's only one transfer flag) for more than one reading thread.

But I question whether there are any real performance gains. In your approach, when reading, you need to check the transfer flag. So, what is the performance gain over acquiring a critical section (for example), especially when you have told us that the writing thread will only occasionally be writing, such that the reading thread will almost certainly acquire the critical section immediately, without contention.

If so, then where are the performance gains? This might be one of those circumstances where you are optimizing too early, and without profiling to show if/where optimization is even needed.

Mike

Arjay
August 26th, 2007, 03:53 PM
Since there is only two threads acting on the shared resource, I wonder if the transfer flag is needed. You could just lock it to write and lock it to read w/o the flag. As MikeAThon indicated, because of the low write frequency, the read thread would typically be able to acquire a lock immediately. If you are using a critical section, there would be very little performance overhead for using a cs when the lock is obtained immediately.

If you change the design and decide to have multiple reading threads, you might benefit from using a reader/writer lock over the cs or mutex.

Hermit
August 26th, 2007, 08:51 PM
Your post is constructed to indicate a thread devoted to writing, opposing a thread devoted to reading. From the viewpoint of a copy-on-write 'object' any thread may perform either function. Your specialization might apply for your current application, but you may find this notion applies to code you'll write in the future, and therefore a generalized solution may be preferable (that is, a template class that can perform these feats on whatever type you need, under circumstances where many threads read and/or write).
Of course, this being programmed in C++, going the generic route is compulsory :). Unfortunately my solution doesn't account for multiple read threads (although it does apparently support multiple writing threads), which makes the rest of your post of great interest.

To demonstrate the sort of strategy I was considering, here's the untested, thrown-together-on-a-whim class I'm currently using as a model. Definitely not ideal (and perhaps needing a copy constructor):
template <typename T_data>
class seldom_write
{
public:
typedef T_data data_type;

friend class write_ref;
class write_ref
{
public:
write_ref(seldom_write & sw):
m_sw(sw),
m_lock(sw.m_mutex)
{
if (!m_sw.m_pwritecopy)
m_sw.m_pwritecopy = new data_type(m_sw.m_readcopy);
}
write_ref(const write_ref & rhs):
m_sw(rhs.m_sw),
m_lock(rhs.m_sw.m_mutex)
{
}
data_type & operator*(void)
{
return *m_sw.m_pwritecopy;
}
data_type * operator->(void)
{
return m_sw.m_pwritecopy;
}
private:
seldom_write & m_sw;
boost::recursive_mutex::scoped_lock m_lock;
};

friend class read_ref;
class read_ref
{
public:
read_ref(seldom_write & sw):
m_sw(sw)
{
if (m_sw.m_pwritecopy)
{
boost::recursive_mutex::scoped_lock lk(m_sw.m_mutex);
m_sw.m_readcopy = *m_sw.m_pwritecopy;
delete m_sw.m_pwritecopy;
m_sw.m_pwritecopy = 0;
}
}
read_ref(const read_ref & rhs):
m_sw(rhs.m_sw)
{
}
const data_type & operator*(void) const
{
return m_sw.m_readcopy;
}
const data_type * operator->(void) const
{
return &m_sw.m_readcopy;
}
private:
seldom_write & m_sw;
};

seldom_write(void):
m_pwritecopy(0)
{
}
seldom_write(const data_type & init):
m_readcopy(init),
m_pwritecopy(0)
{
}
~seldom_write(void)
{
delete m_pwritecopy;
}
read_ref read(void)
{
return read_ref(*this);
}
write_ref write(void)
{
return write_ref(*this);
}

private:
boost::recursive_mutex m_mutex;
data_type m_readcopy;
data_type * volatile m_pwritecopy;
};


Consider this problematic scenario:

Threads (or your reading thread) have read the read only copy many times (assuming it was initialized with meaningful values).

A thread (your writing thread) now updates the write-enabled copy and trips the flag you propose. Now assume that no read has yet been performed before an intervening update occurs. In your design discussion it seems, unless I've missed something, just possible that a read may occur while this proposed 'second write' is in process. In other words, since your design depends on the read process ability to check the proposed transfer flag, and perform associated update/switching/copying as this flag indicates is appropriate, it now seems that the entire duration of the write processing must be under lock.
Does this suppose that updating and reading are occurring simultaneously? The idea is that the update is performed implicitly by the reading thread (although ideally there would be a way to prevent this update between consecutive reads). This allows reading, in the absence of a need to update, to occur without acquiring a lock... but it also brings about the limitation of having only one reading thread.

I haven't timed the use of the Linux mutex for some ages now, but I know that critical sections in Windows pass by millions of times per second. It would take an odd application demand to exceed their use, but such conditions do occur. It may be wiser to check to see if you really have an optimization issue to solve, but I'll assume you know that it is.Very true, and I hope I'm not engaging in too much of a premature optimization... but since this isn't exactly an ugly performance hack, but a synchronization class suited for a very reasonable purpose (costless reading in the absence of writing), I'm not feeling too guilty about it. As the system I'm considering this for grows in size, the current lock-based implementation will easily need to test hundreds of thousands of critical sections per second, on top of some rather demanding DSP computation. Whether or not this is actually a problem, it makes the costless alternative seem very tempting. Still, if I find there is no way implement this without latent defects (if today's single reading thread could later become two, for example), then I won't hesitate to drop the idea entirely.

If you switch to the pointer method, rather than an object copy, you can also place the duty of manipulating this pointer into the write process responsibility, taking care about unity across accesses. That is, if you must have all members of the object taken by a thread for the same object (such that various members aren't read 'out of sync') - you should obtain a pointer to the appropriate read object and use it for several accesses that require unity. Otherwise you might read one member from, say, the original and another member from the recently updated copy. The problem of invalidating previously accessed members with consecutive reads is one I do recognize, even in the case of one reading thread. Though, I imagine providing a way to read the data without implicitly updating it would suffice to address this issue?

For any queue, circular or FIFO, you have the problem of either deleting the tail or the cycling of the circle. In either case there exists the potential for mismanagement. For a circular queue, it's possible the write process might 'circle back' onto an object still in use by a read-only process. Or, for the FIFO, knowing when it's safe to delete the tail is problematic unless something marks the read status.
Since access to the writable copy is already mutually exclusive, I'm not sure such a complicated structure would even be needed to handle multiple writing threads. Could this have been a detail you overlooked, or am I overlooking something crucial?

In any case, for N reading threads it seems there would have to be at least N copies of the data and N transfer flags, which might not call for a queue of any kind but has problems of its own... not to mention being a bloat on memory. In the general case it is not always clear what thread a segment of code is being executed by, yet this solution would require any code which accesses the data to use the copy associated with the current thread. There may or may not be a safe and efficient way to perform this task automatically, and my overall impression seems to be that any code which reads the data must be strictly associated with a particular thread.

But I question whether there are any real performance gains. In your approach, when reading, you need to check the transfer flag. So, what is the performance gain over acquiring a critical section (for example), especially when you have told us that the writing thread will only occasionally be writing, such that the reading thread will almost certainly acquire the critical section immediately, without contention.

If so, then where are the performance gains? This might be one of those circumstances where you are optimizing too early, and without profiling to show if/where optimization is even needed.Well... a bit of testing showed that, if no writing occurs, this solution is indistinguishable in performance from thread-unsafe code, where basic use of a mutex (with no contention whatsoever) caused about a 10% increase in execution time. The test was done on top of a fairly practical workload. Could be that the boost::mutex class is creating some of this overhead, but I'm fairly certain it's just a thinly wrapped critical section.

For now I'm going to stick with a normal locking policy, only because doing so is far less error prone than what I've proposed. Still, it's good to know I have an alternative that is not impossible to use safely, in case liberal use of locks becomes prohibitively expensive.

JVene
August 26th, 2007, 11:23 PM
Does this suppose that updating and reading are occurring simultaneously? The idea is that the update is performed implicitly by the reading thread (although ideally there would be a way to prevent this update between consecutive reads).


My thinking was along the lines that if there are two read threads, and only two objects to write to, and one thread is in process of reading from the 'original', then the second read sensed that an intervening update had been flagged, there could be collision in the update. The lock you've proposed is against a single read thread and a single write thread, but if two read threads are involved, and the read thread does the copy or transfer of the pending update, it would not know that the 'other' reading thread was in mid process. Strings would be particularly sensitive to disruption in this collision.

I have yet another proposal, though, one I have used myself. I let it wait in order to focus on your design thought process, but consider this alternative which follows some similar lines, loosely.

Begin by using a pair of reference counted smart pointers. One is the 'current' reading object, another a 'posted change' object (you might call it the write object). This notion lends itself to other arrangements, too, like the circular queue and the FIFO queue, all based on smart pointers.

Let's assume that the write thread has provided it's output, which it 'hangs' on the smart pointer representing the posted change. Until this point, any read thread operations recognize that both the 'current' pointer and the 'posted' pointer have been null, and have nothing to act upon.

Now a read comes along and notices the posted pointer is valid and, under lock, takes it. This lock is honored by all writes as well, meaning that this acquisition of the posted pointer happens by only one thread no matter want vector the approach.

This read thread posts the pointer of this newly acquired object to the 'current' pointer, nullifying the 'posted' pointer - an acquisition transaction. All subsequent reads occur on this pointer. Locks on this pointer occur only when the pointer itself is changed, not the object to which it points, so these reads happen without further locks.

This does imply a checkpoint, however, in that a test of the 'posted' pointer is required whenever a read operation should check for the freshest data. Finding it null, the 'current' pointer continues to suffice.

The function that performs the read has it's own, local smartpointer it will use to 'hold onto' the 'current' object from which it reads. It retains this 'hold' until some reason causes it to refresh (as in a test that it's no longer holding the same pointer as the 'current' pointer, or that a 'posted' pointer is non-null).

There are details regarding these tests I'll leave to you, I'm describing an otherwise general outline. The smart pointers are the key to this notion. Any number of read threads may be in operation, and containment is managed by the smart pointers.

It can even be designed to work on one pointer - the current pointer itself, which is compared against the local smart pointer held by each read thread. Any update is posted to the 'current' pointer under the smart pointer's internal lock. That is, a write thread uses a local smart pointer to create a new object, write to it's contents (perhaps even having copied the existing object from the 'current' smart pointer, if that's required), and when fully prepared it is posted to the 'current' pointer as a replacement. Any reading threads 'holding' the old object in their local smart pointers remain valid, until they notice their local copy's pointer is different than the 'current' - at which point that simply refresh their local copy, bumping the old copy out.

This works for any number of read threads, and any number of write threads, though for the write threads further coordination among them may be required (they may need to lock against each other for write privilege). If, however, you're object has a significant number of members, and your updates only affect a few members (leaving the others untouched) - like a 'blackboard' in public view - perhaps each member should be one of these smart pointers, instead of just the outer object itself.

Further performance enhancements may include custom allocators for the smart pointers and the objects, especially if the 'churning' of this data is significant.


Another way to think of the design is a typical construct we use every day.

The display.

One side of the display is constantly reading RAM, interpreting it as pixels (in GUI's we typically use, but it was characters in the old days) and sending through the RAMDAC to the monitor (or other output interface). The other side of the graphic RAM is subject to update from the OS. The key is a 'dual port' configuration (often a 'virtual' dual port, not a real one, especially in older hardware - using a lock).

Smart pointers can offer this dual port configuration. The lock internally when you assign them, so using one locally in your read thread means one lock for acquisition of the pointer, and no more locks until it senses the 'current' pointer has been changed. The reference counting nature of the smart pointer (when a reference counting version is selected) means your read thread 'doesn't know' something was altered 'out from under it'.

Until, of course, it checks for that....

Hermit
August 28th, 2007, 12:22 PM
Very cool, thanks. I haven't had much time to take a crack at implementing this yet, but at least now I have an idea of how using counted pointers will get rid of a lot of copying/assignment and redundant copies of identical data, and of where locking will actually occur under this approach.

Some pseudo-code to clarify for anyone interested (and for JVene to correct if I've got something wrong):

shared_object
smart_ptr posted_change
smart_ptr current_read

construct():
posted_change = 0
current_read = new data

reader_object
smart_ptr thread_read

update():
if posted_change
scoped_lock
thread_read = current_read = posted_change
posted_change = 0

else if thread_read != current_read
scoped_lock
thread_read = current_read

get():
return const *thread_read


writer_object
scoped_lock

construct():
if !posted_change
posted_change = new data(*current_read)

get():
return *posted_change

In all of this, the assignment or destruction of a smart_ptr implies a lock, as you mentioned towards the end of your post. Comparison, dereferencing, etc. would not, which means that the pointers themselves are not threadsafe; only their reference count is protected against race conditions.

One possible enhancement would be to have a wait_for_update method for the writer_object concept, in case a write is in fact urgent. The wait would also have to be released when all reader_objects fall out of scope (that is, no further updates will occur).

There's still this though:
...it is not always clear what thread a segment of code is being executed by, yet this solution would require any code which accesses the data to use the copy associated with the current thread. There may or may not be a safe and efficient way to perform this task automatically, and my overall impression seems to be that any code which reads the data must be strictly associated with a particular thread.
The solution would most likely be to use thread-specific storage for read accessors, but I'm not exactly sure what kind of runtime overhead this would incur... if boost's implementation is any indication, TSS on Windows is a hack and a half. That aside, If TSS is used in conjunction with the proposed structure, it seems like it would be less prone to deadlock where normal use of a mutex could result in simultaneous locks whose order is difficult or impossible to predict.

JVene
August 28th, 2007, 02:14 PM
In my thinking (and in practice for at least a dozen examples I've written in the last year), the local smart pointer is, effectively, a thread local storage of the object in the read status.

Perhaps I wasn't quite so clear about it - here's what I've done before myself. In many of these cases the subject is a list or a sorted container, and the subject of our outline would be the smart pointer to the container, but otherwise it's the same as a smart pointer to a single object, or a collection of smart pointers to various members.

In most cases there is some 'outer layer' object, probably a window or a control, in my examples. You might have this 'outer layer' in an application object, or perhaps as a global smart pointer. This 'outer layer' object would own 'the current' smart pointer (or it is in global scope, perhaps).

The write thread prepares an object using it's on local smart pointer, one created on the stack and therefore local to it's thread. Once the object is prepared (or, in my case, the container is filled), it is 'posted' using the smart pointer's own assignment overload, which means it looks like a simple pointer assignment. That step is under lock.

Once this is posted, the write thread 'nullifies' or otherwise drops it's local smart pointer from scope. At this moment the only 'hold' on the object created by the write thread is the 'current' smart pointer.

Read threads now recognize there's something new. Possibly by a simple comparison that their own local smart pointer (null initially) doesn't match the 'current' smart pointer's 'pointer value'.

Each read thread acquires a copy of the smart pointer, again through the smart pointer's own assignment. The read threads probably hold this smart pointer in local scope as a stack variable, or a member of some object representing their actions which itself is held on the stack in the read function.

Once acquired (under lock), the pointer may be reused in the read thread without further locks or problems, checking as required that it still matches the 'current' value. The read thread will acquire the update from the 'current' value in it's own time. The smart pointers themselves attend successfully to the deletion of the 'old' objects as the last read thread drops it in favor of the updated one.

My point is, as far as I've found in practice, the use of smart pointers just nearly solves the whole problem without further complications.

There's a minor qualification to that in my own case, though. I do have custom allocators for the smart pointers which scale well.