How to maximize I/O (write) performance in Win32?


nasenmann
October 13th, 2008, 03:21 AM
Dear Forum!

I am currently working on optimizing a continuous write operation to maximize the speed at which I can write to a file on a local hard drive.

The final result should write some data (a stream) to a HD at the maximum possible speed, at most until the HD is full (obviously :) ).

I have set up a simple test that allocates a buffer of configurable size and then repeatedly writes this buffer to a file, for a total of 1 GB per run.

I then tried several different sizes for this buffer, which also determines the size of each block passed to fwrite.
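For reference, a test along the lines described above might look roughly like this. It is only a sketch, not Andreas' actual code: the function name write_throughput and the convention 1 MB = 10^6 bytes are assumptions, and fsync is the POSIX call (on Win32 the rough equivalent would be FlushFileBuffers). Without a flush to the device, the timing can largely reflect copying into the OS file cache rather than disk speed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write total_bytes to path in blocks of buf_size bytes with fwrite and return
 * the average throughput in MB/s (1 MB = 10^6 bytes), or -1.0 on error.
 * The timed region includes fflush + fsync so the data has reached the device. */
double write_throughput(const char *path, size_t buf_size, uint64_t total_bytes)
{
    char *buf = malloc(buf_size);
    if (buf == NULL)
        return -1.0;
    memset(buf, 0xAB, buf_size);               /* arbitrary test pattern */

    FILE *f = fopen(path, "wb");
    if (f == NULL) {
        free(buf);
        return -1.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    uint64_t written = 0;
    int ok = 1;
    while (written < total_bytes) {
        uint64_t remaining = total_bytes - written;
        size_t chunk = remaining < buf_size ? (size_t)remaining : buf_size;
        if (fwrite(buf, 1, chunk, f) != chunk) {
            ok = 0;
            break;
        }
        written += chunk;
    }
    /* Push the data out of the C library buffer and the OS cache. */
    if (fflush(f) != 0 || fsync(fileno(f)) != 0)
        ok = 0;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (fclose(f) != 0)
        ok = 0;
    free(buf);
    if (!ok)
        return -1.0;

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)written / 1e6 / seconds;
}
```

Calling it repeatedly with buf_size values from a few KB up to 64 MB and total_bytes set to 1 GB would reproduce the comparison described; running each size several times also shows the run-to-run spread.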

I found out that a larger memory/buffer size (I tried a maximum of about 64MB) does seem to increase the transfer rate a bit.

I also found that the transfer speed varies enormously from one run to the next, which really surprises me. I chose a total transfer of 1 GB in order to get a meaningful average.

But then again, I don't know whether there is significant fragmentation between two runs or significant Windows activity on the HD at the same time.

Since I need a high transfer speed sustained over a long time, I would like to find out how to optimize this process, if that is possible at all.

I found differences between individual measurements of 50% and more, so I am very interested in reaching the maximum consistently. Also, because of the large size of the file being written (1 GB), I don't think the difference is due to cache effects (the HD should have at most 8 MB of cache). Finally, the maximum I ever measured was well below the maximum write speed given in the datasheets of the HDs I checked.

So... I am very grateful for any ideas, insights, hints or tips on how to

- achieve a higher transfer speed
- achieve a more constant transfer speed

for write operations.

Many thanks + best wishes,
Andreas

Codeplug
October 13th, 2008, 10:41 AM
There's not much you can do besides the following:
* Use a "large-enough" write buffer, usually some multiple of the sector size (typically 512).
* If you have any CPU-bound work to do, use overlapped I/O so that it can run while the disk is busy
* Use CreateFile/WriteFile to avoid the small overhead imposed by the CRT (needed if you want to use overlapped I/O)
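The first point can be made concrete with two small helpers: round the requested size up to a multiple of the sector size and allocate the buffer sector-aligned (alignment is mandatory if you later switch to unbuffered I/O via CreateFile with FILE_FLAG_NO_BUFFERING). This is a sketch under assumptions: the helper names are hypothetical, and it uses POSIX posix_memalign where a Win32 program would use _aligned_malloc or VirtualAlloc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round a requested buffer size up to the next multiple of the sector size. */
size_t round_up_to_sector(size_t n, size_t sector)
{
    assert(sector > 0);
    return ((n + sector - 1) / sector) * sector;
}

/* Allocate a write buffer whose size is a whole number of sectors and whose
 * start address is sector-aligned. The sector size must be a power of two
 * (512 and 4096 are typical). Release the buffer with free(). */
void *alloc_sector_buffer(size_t requested, size_t sector, size_t *actual_size)
{
    size_t size = round_up_to_sector(requested, sector);
    void *p = NULL;
    if (posix_memalign(&p, sector, size) != 0)
        return NULL;
    if (actual_size != NULL)
        *actual_size = size;
    return p;
}
```

For example, a request for 1000 bytes with 512-byte sectors becomes a 1024-byte buffer, while 64 MB is already a multiple and stays unchanged.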

If you have to share the disk with the OS and other processes, there is little you can do to make the performance consistent all the time. If your process is the only one touching the disk, then kicking up the priority of your process will only help slightly, since all the work is disk-bound.

gg

MikeAThon
October 13th, 2008, 01:41 PM
For Win32, use an I/O completion port architecture (IOCP) for the best I/O performance.

JamesSchumacher
October 13th, 2008, 06:29 PM
Calculate your output file size, use SetFilePointer() to set the file pointer, use SetEndOfFile(), then seek back to the beginning and write out all your data. It's much quicker.
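The preallocate-then-rewind idea might be sketched as follows. This is a POSIX analog, not the Win32 code being described: posix_fallocate stands in for SetFilePointer(Ex) + SetEndOfFile, lseek for the seek back to the start, and the helper names are hypothetical. Whether it actually speeds things up depends on the file system, so it is worth measuring rather than assuming.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve `size` bytes for the file behind fd, then seek back to offset 0 so
 * the data can be written sequentially. Returns 0 or an errno value. */
int preallocate_and_rewind(int fd, off_t size)
{
    int err = posix_fallocate(fd, 0, size);  /* returns an error number, not -1 */
    if (err != 0)
        return err;
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return errno;
    return 0;
}

/* Current size of the file behind fd, or -1 on error. */
off_t file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```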

TheCPUWizard
October 13th, 2008, 06:42 PM

Not necessarily. Plus it may have a negative impact on other aspects of the system.

MikeAThon's suggestion to use I/O completion ports is the proper approach for a Windows-based program. The system notifies you as soon as it is ready for more data. The limiting factor will be one of two things:

1) The sustained throughput of the system. [You can't directly control this.]
2) Your ability to make sure enough data is available to write when the completion port code gets triggered.

Codeplug
October 13th, 2008, 08:48 PM
If there's no CPU-bound work to be done in-between writes, then any asynchronous I/O method won't give any substantial improvement over synchronous I/O. Even if there is CPU-bound work to do, then in this case, any async method would work equally well - Overlapped I/O, WriteFileEx (Alertable I/O), or IOCP. Overlapped I/O is by far the easiest.

IOCP becomes more advantageous and more efficient when you have a *lot* of asynchronous I/O operations that can occur concurrently. This doesn't apply to writing a single file. IOCP is the most complex of the asynchronous I/O methods, but provides the best scalability. For that reason, I reserve the use of IOCP for only when scalability is a requirement.

gg

TheCPUWizard
October 13th, 2008, 09:21 PM

There is absolutely nothing wrong with the facts in your post(s) :wave:

But I come to a different conclusion. While there is rarely a "one-size-fits-all" answer to any question, I have found that for intense I/O it is best to develop and thoroughly test a good, solid object model that meets the widest range of requirements.

There are very (very, very) few conditions where IOCP provides lower performance (they are mainly parasitic cases). As a result, I have developed a complete suite of classes that wrap all of the complexity.

Once this is all encapsulated, using it in programs is very easy, and it protects against "breakage" if the program evolves into something more complex (especially writing to multiple files).

Once the program has been written for optimal (or near-optimal) performance using any of these methods, one further option would be to write the data "striped" across multiple physical disks (ideally on distinct I/O controllers and channels).
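At its core, striping is an address mapping from the logical stream to (disk, offset) pairs. A minimal sketch, assuming a fixed stripe unit and plain round-robin placement; the names stripe_map and stripe_location are hypothetical and not taken from the class suite mentioned in this thread:

```c
#include <assert.h>
#include <stdint.h>

/* Where one byte of the logical stream lands. */
typedef struct {
    unsigned disk;    /* index of the physical disk, 0 .. ndisks-1 */
    uint64_t offset;  /* byte offset within that disk's file */
} stripe_location;

/* Map a logical stream offset onto ndisks files, stripe_unit bytes at a time:
 * chunk 0 -> disk 0, chunk 1 -> disk 1, ..., chunk ndisks -> disk 0 again. */
stripe_location stripe_map(uint64_t logical, uint64_t stripe_unit, unsigned ndisks)
{
    assert(stripe_unit > 0 && ndisks > 0);
    uint64_t chunk = logical / stripe_unit;  /* which stripe unit */
    stripe_location loc;
    loc.disk = (unsigned)(chunk % ndisks);
    loc.offset = (chunk / ndisks) * stripe_unit + logical % stripe_unit;
    return loc;
}
```

With a 64 KB unit and two disks, bytes 0 to 65535 go to disk 0, the next 64 KB to disk 1, then back to disk 0 at offset 65536. Each disk then has its own stream of outstanding writes, which is where a single completion port servicing all of them fits in.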

If the architecture is already IOCP-based, the change is relatively trivial. If it is based on most of the alternatives, significant rework may be required.

More than ten years ago, Scott Meyers coined the phrase "Program in the future tense" [More Effective C++]. I am a big supporter of this philosophy, and it is one of the reasons my firm has remained competitive.

Investing the time ONCE will almost invariably pay off in the long run.

nasenmann
October 15th, 2008, 06:38 AM
So many answers.. many thanks to you all!! :)

I am planning to use the following architecture: one thread acquiring the data (from a PCI card via DMA), one thread processing the data, and one thread writing the processed data. (This is not what is implemented in the test version I described.)

For synchronization and data exchange I use two FIFOs of buffers, one between each pair of adjacent threads.
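One of those FIFOs could be sketched as a bounded queue of buffer pointers protected by a mutex and two condition variables. Assumptions: the capacity of 8, all names, and the use of POSIX threads instead of the Win32 equivalents (CRITICAL_SECTION plus CONDITION_VARIABLE, available from Vista on).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define FIFO_CAPACITY 8  /* assumed number of buffers allowed in flight per queue */

/* Bounded FIFO of buffer pointers shared between a producer and a consumer thread. */
typedef struct {
    void *slots[FIFO_CAPACITY];
    size_t head;               /* index of the oldest element */
    size_t count;              /* number of queued elements */
    pthread_mutex_t lock;
    pthread_cond_t not_empty;  /* signalled after a push */
    pthread_cond_t not_full;   /* signalled after a pop */
} buffer_fifo;

void fifo_init(buffer_fifo *q)
{
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Blocks while the queue is full, so a slow writer throttles the producer. */
void fifo_push(buffer_fifo *q, void *buf)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == FIFO_CAPACITY)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slots[(q->head + q->count) % FIFO_CAPACITY] = buf;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks while the queue is empty; returns the oldest buffer. */
void *fifo_pop(buffer_fifo *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *buf = q->slots[q->head];
    q->head = (q->head + 1) % FIFO_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return buf;
}

void fifo_destroy(buffer_fifo *q)
{
    pthread_cond_destroy(&q->not_full);
    pthread_cond_destroy(&q->not_empty);
    pthread_mutex_destroy(&q->lock);
}
```

In the three-thread design, the processing thread would pop from the acquisition queue and push to the write queue, and the writer thread would only pop. Because the queue is bounded, a disk that falls behind makes the upstream threads wait instead of letting memory use grow without limit.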

So I wonder whether asynchronous writes would give me any improvement in performance with a threaded system?!

Also, it seems there is no way around moving the system (including the swap file) to a different drive and repeating the test, hoping at least for more consistent results. Then, I suppose, I can start properly evaluating the write speeds with the ideas you gave me.

Codeplug
October 15th, 2008, 09:51 AM
Using multiple threads won't give any performance gains over using one thread with async I/O.

>> So I wonder whether asynchronous writes would give me any improvement in performance with a threaded system?
Async I/O allows you to do work (with the CPU) while the hardware is off processing read/write requests. So if a thread has work to do besides waiting for a read or write to finish, then it could use async I/O to do the work while the hardware is processing the read/write request.

gg

nasenmann
October 20th, 2008, 09:15 AM
Thx again!

I have one more question..

I am using a flash drive (from OCZ) for my tests, since its technical description lists a write speed of about 90 MB/s (and more than 100 MB/s for reads). The read speed is quite good, but the write speed leaves a lot of room for improvement.

I think it has to do with writing sequentially rather than randomly, since a write on a flash drive takes a long time (you first have to read, then modify the data that was read, then write it back). But I assume the system can tell when a read is not needed (when writing a large block, so that all the data simply gets overwritten). So maybe this is not the problem?
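Under a deliberately simplified model in which the flash must rewrite whole erase blocks, only the blocks that a write covers partially would need the read-modify-write step. A sketch of that bookkeeping, assuming a hypothetical 128 KB erase block; real drives differ, and their controllers usually remap writes internally, so treat this as a mental model rather than a description of any particular device:

```c
#include <assert.h>
#include <stdint.h>

/* Count the erase blocks that a write of len bytes at byte offset `offset`
 * covers only partially. In the simplified model these need read-modify-write;
 * fully covered blocks can just be erased and reprogrammed. */
unsigned partial_erase_blocks(uint64_t offset, uint64_t len, uint64_t erase_block)
{
    assert(erase_block > 0);
    if (len == 0)
        return 0;
    uint64_t end = offset + len;                     /* one past the last byte */
    uint64_t first = offset / erase_block;
    uint64_t last = (end - 1) / erase_block;
    int head_partial = (offset % erase_block) != 0;  /* starts mid-block */
    int tail_partial = (end % erase_block) != 0;     /* ends mid-block */
    if (first == last)                               /* write stays inside one block */
        return (head_partial || tail_partial) ? 1u : 0u;
    return (unsigned)(head_partial + tail_partial);
}
```

Since any single write has at most two partially covered blocks, large sequential writes such as the 64 MB blocks mentioned here would, if this model held at all, spend almost all of their time on fully covered blocks.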

I have also changed my boot device to a different (CF) device, so the flash drive is used only for writing, with no (?) operating system interference.

I operate this system (a test machine separate from my development system) through a VNC viewer. What I have found now is that with a large block size for fwrite (like 64 MB) I reach about 40 MB/s with the VNC window open and 60 MB/s with it closed. Obviously the VNC viewer takes some CPU resources (<5% with the mirror driver), but since the system is dual-core, that really should not matter, I think.

I am clueless again. Besides, I want the 60 MB/s all the time, even under moderately high load (a fast standard 7200 rpm HD should reach something like 90 MB/s for writes by now, shouldn't it?).

Any more ideas on the topic?

Codeplug
October 20th, 2008, 12:25 PM
It's probably not the VNC CPU usage. More likely it's the bus utilization of the network card. In other words, if the NIC and the disk controller are on the same (PCI) bus, then they have to share the bus.

gg

MikeAThon
October 20th, 2008, 06:59 PM
It's probably pointless to speculate about, and optimize performance for, a flash drive, when in fact your target media is a hard drive. For example, let's say you were able to achieve good performance with the flash drive even when the VNC window is open. Why do you think that such performance would translate over to a scenario with a hard drive?

Stick to your actual target configuration.