A Simple Method For Timing Code
Many programmers need to optimize some algorithms for speed from time to time. You cannot always afford to buy expensive code timing tools. This is mostly the case when learning to program. The poor man's approach for timing a piece of code is to bracket the code with GetTickCount() and subtract the resulting values. This works on most systems, but has some limitations: GetTickCount() comes with some overhead, since it involves a function call, and has a limited resolution of about 0.055s.
I have been looking for some possibility to time certain pieces of code without too much overhead on my Intel PIII, Win98 machine. Here is my approach:
Beginning with Pentium, all IA-Family processors have a Real Time-Stamp Counter. This counter can be read at any privilege level, when TSD flag in CR4 is clear. The command for reading the counter is RDTSC. This command will load EDX:EAX with a 64 bit value representing the current time stamp (the time stamp actually is the number of cycles elapsed since reset).
RDTSC is not a serializing instruction, i.e. "it does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed." (Intel Architecture Software Developers Manual Volume 2: Instruction Set Reference, p. 3-604). To avoid inaccuracies, RDTSC has to be bracketed by a serializing instruction. CPUID is such an instruction.
Although it may seem complicated to use RDTSC to time code, it is not. I've written following four macros for this purpose:
#ifdef _DEBUG
#define PERF_DECLARE \
__int64 MSRB, MSRE; \
void *mrsb = &MSRB; \
void *mrse = &MSRE; \
char perfmtrbuf[100];
#define PERF_START \
{_asm mov eax, 0 \
_asm cpuid \
_asm rdtsc \
_asm mov ebx, mrsb \
_asm mov dword ptr [ebx], eax \
_asm mov dword ptr [ebx+4], edx \
_asm mov eax, 0 \
_asm cpuid}
#define PERF_STOP \
{_asm mov eax, 0 \
_asm cpuid \
_asm rdtsc \
_asm mov ebx, mrse \
_asm mov dword ptr [ebx], eax \
_asm mov dword ptr [ebx+4], edx \
_asm mov eax, 0 \
_asm cpuid}
#define PERF_REPORT \
{_ui64toa(MSRE-MSRB, perfmtrbuf, 10); \
TRACE("Cycles needed: %s\n", perfmtrbuf);}
#else
#define PERF_DECLARE
#define PERF_START
#define PERF_STOP
#define PERF_REPORT
#endif
// _DEBUG
The function _ui64toa() is used, so stdlib.h has to be included prior defining these macros. Since I was using MFC, I used TRACE to output the results. Any other method for displaying and/or storing perfmtrbuf is good.
Using the macros is very simple:
void SomeFunction(void){
PERF_DECLARE;
//...code
PERF_START;
//...code to time
PERF_STOP;
PERF_REPORT;
//...code
}
PERF_START and PERF_STOP/PERF_REPORT can bracket as many pieces of code one wants. One cannot pair them overlapped or nested, though. If nesting is a must (although I cannot imagine why it could be), the macros can be changed to accept parameters (the variable names to use).
1. Accuracy
Actually, the timing's accuracy is less-than-expected, at a first glance. Two consecitive timings can show up a difference of many cycles. Executing following function 60 times shows the behaviour of this timing method:
void Test()
{
PERF_DECLARE;
int i, j, k;
PERF_START;
for(i= 1;i<0xff; i++){
for(j= 1;j<0xff; j++){
k = i+j;
}
}
PERF_STOP;
PERF_REPORT;
}
The mean execution cycle number is 463885.38, with a standard deviation of 4.84%. Following chart shows this:
Obviously, there is a lower threshold at 457313 cycles. The code mostly executes in this number of cycles, except some peaks when it needs more than that. How come? First, the interrupts are not off during code execution. Second, the code ran under Windows, which could preempt the process' execution. It does not apply for this piece of code, but the initial state of the processor's cache also can eat some cycles.
In conclusion, the code is being timed in it's "natural environment", not in a vacuum. If you need to know the bare execution time of a specific piece of code, repeated timing is a must. For short pieces of code, it is possible to get the exact number of cycles needed at minimun to execute.
2. Compatibility / Portability
The code clearly needs an Intel Pentium processor at minimum. At the time writing this, I have no information about the behavior of other processors. The compiler used must be able to handle the __int64 data type (not ANSI). The _asm is a Microsoft specific keyword; other compiler may define it different. The TRACE macro i have used is actually a question of context, and can be easily replaced with some equvalent. The macros actually do not need Windows at all. The only condition, except a Pentium+ processor is to have the TSD flag in CR4 cleared by the OS.
3. Pitfalls
The RDTSC instruction will cause a #GP fault, if the TSD flag in CR4 is set. At the time writing this I have no information about other operating systems regarding this point.
Both RDTSC and CPUID will generate an #UD exception on processors "older" than Pentium (486, 386 and so on).
If not paired properly, PERF_START and PERF_STOP will return bogus results, but no compile/runtime error.
Timing code that accesses any external device except the main memory, will probably produce totally inconsistent results. This is true for any simmilar timing method, by the way.

Comments
Is Windows real-time?
Posted by Legacy on 12/03/2002 12:00amOriginally posted by: Sam Hobbs
It is my understanding that Windows is not a real-time operating system so it cannot be relied upon for complete accuracy. The embedded Windows might be a real-time operating system but the others are not as far as I know. A device driver could be more accurate than an application but even a device driver cannot be relied upon for complete accuracy.
Right?
Replylinux
Posted by Legacy on 04/05/2002 12:00amOriginally posted by: Buddy Zdankiewicz
#define __int64 long long
#define TRACE printf
and this should work on linux too
ReplyRDTSC failed on Intel SpeedStep CPU
Posted by Legacy on 07/28/2001 12:00amOriginally posted by: LSF
Does anyone know the problem and solution?
It seems that the counter is wrong after Sleep() on Intel SpeedStep CPU.
Thanks
ReplyIn a nutshell...
Posted by Legacy on 07/26/2001 12:00amOriginally posted by: Tim
So, the situation we have here then is that we're not sure whether or not CPUID is necessary, correct?
I'm sure we all want the same thing -- accurate timing. Tell me how to acheive that ideally, and I'll adopt it in all of my code...
Thanks!
ReplyThat's what's wrong
Posted by Legacy on 07/26/2001 12:00amOriginally posted by: Georgy
ReplyDamn what's wrong ?
Posted by Legacy on 07/25/2001 12:00amOriginally posted by: Sanx
Reply'Optimized' version
Posted by Legacy on 07/25/2001 12:00amOriginally posted by: Georgy
ReplyQueryPerformanceCounter
Posted by Legacy on 07/24/2001 12:00amOriginally posted by: John
QueryPerformanceCounter might use this feature already...... Don't know for sure.
Does intel offer such library?
John
ReplyOther processors, other OSs
Posted by Legacy on 07/24/2001 12:00amOriginally posted by: Jean Cyr
I have tested your code. It operates correctly on AMD Athlon/Duron processors. Also works under Windows 2000. Thanks.
ReplyQueryPerformanceXxxx functions maybe more convenient
Posted by Legacy on 07/24/2001 12:00amOriginally posted by: Peter van Merkerk
Very nice macro's!
I've used the QueryPerformanceCounter() and QueryPerformanceFrequency() quite frequently to get quite accurate timing info. Those functions are not as accurate as your macro (function call overhead/resolution). But considering that the QueryPerformanceFrequency function returns at least 1.19 MHz (840 nS resolution), they are accurate enough for many cases. An advantage of the QueryPerformanceXxx functions is that you can easilly convert the result to (micro)seconds.
Nevertheless many thanks for the macro's, I will add them to my "toolkit"!
BTW: Even if you have profiler tool (like the one that shipped with MSVC++) macros like this are still an invaluable addition to obtain exact timing information of a piece of code. Information that a profiler tool cannot always provide.
Reply