Virtual Developer Workshop: Containerized Development with Docker

Environment: VC6 SP5, Win98SE

Many programmers need to optimize some algorithms for speed from time to time. You cannot always afford to buy expensive code timing tools. This is mostly the case when learning to program. The poor man's approach for timing a piece of code is to bracket the code with GetTickCount() and subtract the resulting values. This works on most systems, but has some limitations: GetTickCount() comes with some overhead, since it involves a function call, and has a limited resolution of about 0.055s.

I have been looking for some possibility to time certain pieces of code without too much overhead on my Intel PIII, Win98 machine. Here is my approach:

Beginning with Pentium, all IA-Family processors have a Real Time-Stamp Counter. This counter can be read at any privilege level, when TSD flag in CR4 is clear. The command for reading the counter is RDTSC. This command will load EDX:EAX with a 64 bit value representing the current time stamp (the time stamp actually is the number of cycles elapsed since reset).

RDTSC is not a serializing instruction, i.e. "it does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed." (Intel Architecture Software Developers Manual Volume 2: Instruction Set Reference, p. 3-604). To avoid inaccuracies, RDTSC has to be bracketed by a serializing instruction. CPUID is such an instruction.

Although it may seem complicated to use RDTSC to time code, it is not. I've written following four macros for this purpose:

#ifdef _DEBUG
#define PERF_DECLARE         \
  __int64 MSRB, MSRE;        \
  void *mrsb = &MSRB;    \
  void *mrse = &MSRE;    \
  char perfmtrbuf[100];    

#define PERF_START          \
  {_asm mov eax, 0          \
  _asm cpuid                \
  _asm rdtsc                \
  _asm mov ebx, mrsb        \
  _asm mov dword ptr [ebx], eax    \
  _asm mov dword ptr [ebx+4], edx  \
  _asm mov eax, 0           \
  _asm cpuid}

#define PERF_STOP           \
  {_asm mov eax, 0          \
  _asm cpuid                \
  _asm rdtsc                \
  _asm mov ebx, mrse        \
  _asm mov dword ptr [ebx], eax   \
  _asm mov dword ptr [ebx+4], edx \
  _asm mov eax, 0           \
  _asm cpuid}

#define PERF_REPORT          \
  {_ui64toa(MSRE-MSRB, perfmtrbuf, 10);    \
  TRACE("Cycles needed: %s\n", perfmtrbuf);}
#define PERF_START
#define PERF_STOP
// _DEBUG 

The function _ui64toa() is used, so stdlib.h has to be included prior defining these macros. Since I was using MFC, I used TRACE to output the results. Any other method for displaying and/or storing perfmtrbuf is good.

Using the macros is very simple:

void SomeFunction(void){
  //...code to time

PERF_START and PERF_STOP/PERF_REPORT can bracket as many pieces of code one wants. One cannot pair them overlapped or nested, though. If nesting is a must (although I cannot imagine why it could be), the macros can be changed to accept parameters (the variable names to use).

1. Accuracy

Actually, the timing's accuracy is less-than-expected, at a first glance. Two consecitive timings can show up a difference of many cycles. Executing following function 60 times shows the behaviour of this timing method:

void Test()
  int i, j, k;
  for(i= 1;i<0xff; i++){
    for(j= 1;j<0xff; j++){
      k = i+j;

The mean execution cycle number is 463885.38, with a standard deviation of 4.84%. Following chart shows this:

Obviously, there is a lower threshold at 457313 cycles. The code mostly executes in this number of cycles, except some peaks when it needs more than that. How come? First, the interrupts are not off during code execution. Second, the code ran under Windows, which could preempt the process' execution. It does not apply for this piece of code, but the initial state of the processor's cache also can eat some cycles.

In conclusion, the code is being timed in it's "natural environment", not in a vacuum. If you need to know the bare execution time of a specific piece of code, repeated timing is a must. For short pieces of code, it is possible to get the exact number of cycles needed at minimun to execute.

2. Compatibility / Portability

The code clearly needs an Intel Pentium processor at minimum. At the time writing this, I have no information about the behavior of other processors. The compiler used must be able to handle the __int64 data type (not ANSI). The _asm is a Microsoft specific keyword; other compiler may define it different. The TRACE macro i have used is actually a question of context, and can be easily replaced with some equvalent. The macros actually do not need Windows at all. The only condition, except a Pentium+ processor is to have the TSD flag in CR4 cleared by the OS.

3. Pitfalls

The RDTSC instruction will cause a #GP fault, if the TSD flag in CR4 is set. At the time writing this I have no information about other operating systems regarding this point.

Both RDTSC and CPUID will generate an #UD exception on processors "older" than Pentium (486, 386 and so on).

If not paired properly, PERF_START and PERF_STOP will return bogus results, but no compile/runtime error.

Timing code that accesses any external device except the main memory, will probably produce totally inconsistent results. This is true for any simmilar timing method, by the way.


  • Is Windows real-time?

    Posted by Legacy on 12/03/2002 08:00am

    Originally posted by: Sam Hobbs

    It is my understanding that Windows is not a real-time operating system so it cannot be relied upon for complete accuracy. The embedded Windows might be a real-time operating system but the others are not as far as I know. A device driver could be more accurate than an application but even a device driver cannot be relied upon for complete accuracy.


  • linux

    Posted by Legacy on 04/05/2002 08:00am

    Originally posted by: Buddy Zdankiewicz

    #define __int64 long long
    #define TRACE printf

    and this should work on linux too

  • RDTSC failed on Intel SpeedStep CPU

    Posted by Legacy on 07/28/2001 07:00am

    Originally posted by: LSF

    Does anyone know the problem and solution?
    It seems that the counter is wrong after Sleep() on Intel SpeedStep CPU.


  • That's what's wrong

    Posted by Legacy on 07/26/2001 07:00am

    Originally posted by: Georgy

    It tured out that the compiler can't inline such 'naked'
    function. There was still call/ret. BUT this function will
    be really inlined.

    // Read/Return the number of cycles elapsed since CPU reset
    #pragma warning( disable: 4035 ) // Disable warning message "no return value"
    __asm rdtsc

    The generated code for PERF_START; PERF_END contains this

    00401008 0F 31 rdtsc
    0040100A 8B CA mov ecx,edx
    0040100C 8B F0 mov esi,eax
    0040100E 0F 31 rdtsc

    and my Pentium x86 Family 6 Model 7 Stepping 2 448MHz
    spends 31 cycles for that.

  • In a nutshell...

    Posted by Legacy on 07/26/2001 07:00am

    Originally posted by: Tim

    So, the situation we have here then is that we're not sure whether or not CPUID is necessary, correct?

    I'm sure we all want the same thing -- accurate timing. Tell me how to acheive that ideally, and I'll adopt it in all of my code...


  • 'Optimized' version

    Posted by Legacy on 07/25/2001 07:00am

    Originally posted by: Georgy

    Good idea.
    I just didn't understand why you need 'cpuid'.
    My 'optimized' code seems to work well as well.

    #define PERF_DECLARE __int64 PERF_MSRB, PERF_MSRE;

    #define PERF_START \
    { _asm rdtsc \
    _asm lea ebx, PERF_MSRB \
    _asm mov dword ptr [ebx], eax \
    _asm mov dword ptr [ebx+4], edx }

    #define PERF_STOP \
    { _asm rdtsc \
    _asm lea ebx, PERF_MSRE \
    _asm mov dword ptr [ebx], eax \
    _asm mov dword ptr [ebx+4], edx }

    #define PERF_REPORT \
    { char perfmtrbuf[100]; \
    _ui64toa( PERF_MSRE - PERF_MSRB, perfmtrbuf, 10 ); \
    printf( "Cycles needed: %s\n", perfmtrbuf ); }

  • Damn what's wrong ?

    Posted by Legacy on 07/25/2001 07:00am

    Originally posted by: Sanx

    Simply use :

    * Read the number of cycles used
    * \return the number of cycles.
    inline __int64 __declspec(naked) rdtsc()

  • QueryPerformanceXxxx functions maybe more convenient

    Posted by Legacy on 07/24/2001 07:00am

    Originally posted by: Peter van Merkerk

    Very nice macro's!

    I've used the QueryPerformanceCounter() and QueryPerformanceFrequency() quite frequently to get quite accurate timing info. Those functions are not as accurate as your macro (function call overhead/resolution). But considering that the QueryPerformanceFrequency function returns at least 1.19 MHz (840 nS resolution), they are accurate enough for many cases. An advantage of the QueryPerformanceXxx functions is that you can easilly convert the result to (micro)seconds.

    Nevertheless many thanks for the macro's, I will add them to my "toolkit"!

    BTW: Even if you have profiler tool (like the one that shipped with MSVC++) macros like this are still an invaluable addition to obtain exact timing information of a piece of code. Information that a profiler tool cannot always provide.

  • Other processors, other OSs

    Posted by Legacy on 07/24/2001 07:00am

    Originally posted by: Jean Cyr

    I have tested your code. It operates correctly on AMD Athlon/Duron processors. Also works under Windows 2000. Thanks.

  • QueryPerformanceCounter

    Posted by Legacy on 07/24/2001 07:00am

    Originally posted by: John

    QueryPerformanceCounter might use this feature already...... Don't know for sure.

    Does intel offer such library?


  • You must have javascript enabled in order to post comments.

Leave a Comment
  • Your email address will not be published. All fields are required.

Most Popular Programming Stories

More for Developers

RSS Feeds

Thanks for your registration, follow us on our social networks to keep up-to-date