Environment: VC6 SP5, Win98SE
Many programmers need to optimize some algorithms for speed from time to time. You cannot always afford to buy expensive code timing tools. This is mostly the case when learning to program.
The poor man’s approach for timing a piece of code is to bracket the code with GetTickCount() and subtract the resulting values. This works on most systems, but has some limitations:
GetTickCount() comes with some overhead, since it involves a function call, and has a limited resolution of about 0.055s.
I have been looking for some possibility to time certain pieces of code without too much overhead on my Intel PIII, Win98 machine. Here is my approach:
Beginning with Pentium, all IA-Family processors have a Real Time-Stamp Counter. This
counter can be read at any privilege level, when TSD flag in CR4 is clear. The command for reading the counter is RDTSC. This command will load EDX:EAX with a 64 bit value representing the current time stamp (the
time stamp actually is the number of cycles elapsed since reset).
RDTSC is not a serializing instruction, i.e. “it does not necessarily wait until
all previous instructions have been executed before reading the counter. Similarly, subsequent
instructions may begin execution before the read operation is performed.” (Intel Architecture
Software Developers Manual Volume 2: Instruction Set Reference, p. 3-604). To avoid inaccuracies, RDTSC has to be bracketed by a serializing instruction. CPUID is such an instruction.
Although it may seem complicated to use RDTSC to time
code, it is not. I’ve written following four macros for this purpose:
#ifdef _DEBUG
#define PERF_DECLARE \
__int64 MSRB, MSRE; \
void *mrsb = &MSRB; \
void *mrse = &MSRE; \
char perfmtrbuf[100];#define PERF_START \
{_asm mov eax, 0 \
_asm cpuid \
_asm rdtsc \
_asm mov ebx, mrsb \
_asm mov dword ptr [ebx], eax \
_asm mov dword ptr [ebx+4], edx \
_asm mov eax, 0 \
_asm cpuid}#define PERF_STOP \
{_asm mov eax, 0 \
_asm cpuid \
_asm rdtsc \
_asm mov ebx, mrse \
_asm mov dword ptr [ebx], eax \
_asm mov dword ptr [ebx+4], edx \
_asm mov eax, 0 \
_asm cpuid}#define PERF_REPORT \
{_ui64toa(MSRE-MSRB, perfmtrbuf, 10); \
TRACE(“Cycles needed: %s\n”, perfmtrbuf);}
#else
#define PERF_DECLARE
#define PERF_START
#define PERF_STOP
#define PERF_REPORT
#endif
// _DEBUG
The function _ui64toa() is used, so stdlib.h has to be included prior defining these macros.
Since I was using MFC, I used TRACE to output the results. Any other method for displaying and/or storing perfmtrbuf
is good.
Using the macros is very simple:
void SomeFunction(void){ PERF_DECLARE; //...code PERF_START; //...code to time PERF_STOP; PERF_REPORT; //...code }
PERF_START and PERF_STOP/PERF_REPORT can bracket as many pieces of code one wants. One cannot pair them overlapped or nested, though.
If nesting is a must (although I cannot imagine why it could be), the macros can be changed to accept parameters (the variable names to use).
1. Accuracy
Actually, the timing’s accuracy is less-than-expected, at a first glance. Two consecitive timings can show up a difference of many cycles. Executing following function 60 times shows the behaviour of this timing method:
void Test() { PERF_DECLARE; int i, j, k; PERF_START; for(i= 1;i<0xff; i++){ for(j= 1;j<0xff; j++){ k = i+j; } } PERF_STOP; PERF_REPORT; }
The mean execution cycle number is 463885.38, with a standard deviation of 4.84%. Following chart shows this:
Obviously, there is a lower threshold at 457313 cycles. The code mostly executes in this number of cycles, except some peaks
when it needs more than that. How come? First, the interrupts are not off during code execution. Second, the code ran under Windows,
which could preempt the process’ execution. It does not apply for this piece of code, but the initial state of the processor’s cache also can eat some cycles.
In conclusion, the code is being timed in it’s “natural environment”, not in a vacuum. If you need to know the bare execution time of a specific piece of code,
repeated timing is a must. For short pieces of code, it is possible to get the exact number of cycles needed at minimun to execute.
2. Compatibility / Portability
The code clearly needs an Intel Pentium processor at minimum. At the time writing this, I have no information about
the behavior of other processors. The compiler used must be able to handle the __int64 data type (not ANSI). The _asm is a Microsoft specific keyword; other compiler may define it different. The TRACE macro i have used
is actually a question of context, and can be easily replaced with some equvalent. The macros actually do not need Windows at all. The only condition, except a Pentium+ processor
is to have the TSD flag in CR4 cleared by the OS.
3. Pitfalls
The RDTSC instruction will cause a #GP fault, if the TSD flag in CR4 is set. At the time writing this I have no information about other operating systems regarding this point.
Both RDTSC and CPUID will generate an #UD exception on processors “older” than Pentium (486, 386 and so on).
If not paired properly, PERF_START and PERF_STOP will return bogus results, but no compile/runtime error.
Timing code that accesses any external device except the main memory, will probably produce totally inconsistent
results. This is true for any simmilar timing method, by the way.