Performance in dependence of access patterns


A long long time ago, there existed computers where the access time for each cell of a random-access memory was independent from earlier accesses. This is no longer the case and in fact a non-trivial topic in which we will delve into in the coming weeks. Nonetheless, we will have a first look at this phenomenon which gives us the opportunity to see how performance can be measured in C.

The UNIX kernel maintains for each running process the CPU time and real time spent since the start of the process in clock ticks. Clock ticks are a system-dependent unit which give the actual granularity for time measurements on a system. The system call sysconf(_SC_CLK_TCK) tells the number of clock ticks per second which permits us to convert between clock ticks and seconds.

Here is a small test program demonstrating the function walltime which delivers the number of real time seconds (represented as double) since the program started:

#include <stdio.h>
#include <stdlib.h>
#include <sys/times.h>
#include <unistd.h>

/* return real time in seconds since start of the process */
double walltime() {
   static clock_t ticks_per_second = 0;
   if (!ticks_per_second) {
      ticks_per_second = sysconf(_SC_CLK_TCK);
   struct tms timebuf;
   /* times returns the number of real time ticks passed since start */
   return (double) times(&timebuf) / ticks_per_second;

int main() {
   double t0 = walltime();
   double t1 = walltime() - t0;
   printf("sleep(1) took actually %.4lf seconds\n", t1);
theon$ gcc -Wall -o walltime walltime.c
theon$ ./walltime
sleep(1) took actually 1.0100 seconds

We use the real time here under the assumption that our process has a dedicated CPU or CPU core while it is running. The real time is more realistic in case of virtual systems or if our process waits for some event or gets interrupted.

The system call times fills a struct tms that provides as with the CPU time spent for the process in user and in system mode (tms_utime and tms_stime). Both are also given as sums for all child processes (tms_cutime and tms_cstime). The real time is not stored into the struct tms but just returned. Hence, we are ignoring the contents of the struct tms and evaluate just the return value of times.


It is interesting to see how the performance scales using variable vector lengths for powers of two in the range from 8,192 to 67,108,864. To measure the performance for each of the lengths independently from each other, you should for each length