I was surprised to find that my code ran 50-80% faster on Unix/Linux than on M$ Window$ on the same hardware! As an example, just compare the results for the different operating systems on my Athlon-XP 1.3 GHz laptop:
"__MTOPS________CPU_@_freq.__(GHz)___________OS__________comments/flags_______",
" 0.20| Intel Pentium 0.10| Win95| Win32",
" 0.40| * Intel Pentium 0.18| Win95| Win32",
" 0.68| Motorola MPC8260 0.20| MontaVista Linux| no FPU",
" 0.73| Intel Pentium 0.16| FreeBSD 3.3| ",
" 0.96| Intel Celeron 0.40| Win95| Win32",
" 1.31| Intel Pentium Pro 0.20| OpenBSD 3.8|gcc335 -s pentiumpro -O3",
" 1.39| Intel PII 0.40| Win95| Win32",
" 1.64| UltraSPARC 10 0.44| Sun Solaris 2.7| ",
" 1.75| Intel Celeron 0.40| RH Linux| ",
" 1.92| Intel PII 0.40| FreeBSD 4.8| ",
" 3.25| Dual Intel PIII 0.75| RH Linux 7.3| ",
" 3.87| AMD AthlonXP 1.30| Win2000| Win32",
" 4.11| Intel PIII 0.66| FreeBSD 5.4| -O3",
" 4.27| Intel Celeron 1.70| Win2000| Win32",
" 4.33| Intel PIII 0.66| FreeBSD 5.4| -mtune=pentium3 -O3",
" 6.06| AMD AthlonXP 1.30| RH Linux 7.3/2.4| ",
" 6.57| Intel PIII 1.00| Knoppix/2.6.11| -O3",
" 6.82| AMD AthlonXP 1.30| RH Linux 7.3/2.4| -O3",
" 7.78| AMD AthlonXP 1.30| FreeBSD 5.4| -O0",
" 8.33| AMD AthlonXP 1.30| FreeBSD 5.4| -O3",
" 9.61| AMD AthlonXP 2.10| Fedora FC2 Linux| ?",
" 9.85| AMD AthlonXP 1.30| FreeBSD 5.4| -mtune=athlon-xp -O3",
" 10.31| PowerPC G4 1.33| Mac OS 10.4.6| gcc400 (Apple) -O3",
" 10.55| AMD Athlon XP 1.40| FreeBSD 6.1|gcc344 -s -athlon-xp -O3",
" 10.67| AMD AthlonXP 2.10| Fedora FC2 Linux| -O3",
" 10.68| Intel P4 2.20| Win XP| optimized Win32",
" 10.95| Intel P4 Xeon HT 2.80| FreeBSD 5.4| -O3",
" 11.15| Intel P4 Xeon HT 2.80| FreeBSD 5.4| -ffast-math -O3",
" 11.98| AMD AthlonMP 2400+ 2.00| FreeBSD 6.0 b5| gcc344 -O3",
" 12.67| Intel Celeron M380 1.60| Win XP| optimized Win32",
" 12.85| Intel Xeon HT/EM64 2.80|# FreeBSD 5.4| -ffast-math -O3",
" 13.23| Intel Xeon 3.00| SuSe 9.1| gcc333 -s nocona -O3",
" 13.44| Intel Xeon HT/EM64 3.00| FreeBSD 6.0| -mtune=nocona -O3",
" 14.08| AMD Sempron 2600+ 1.68| Win2000| optimized Win32",
" 14.30| AMD Athlon64 1.80|# FreeBSD 5.4| -ffast-math -O3",
" 14.94| Intel Core 2 Duo 2.16| Mac OS X 10.4.8| gcc401 -O3",
" 14.94| Dual AMD Opteron242 1.60|# FreeBSD 5.4?| -O3",
" 15.47| AMD Athlon64 1.80| Win XP| optimized Win32",
" 15.47| AMD Sempron 1.80| Win XP| optimized Win32",
" 15.83| Dual AMD Opteron242 1.60|# FreeBSD 5.4?| ?",
" 16.02| AMD Sempron64 3400+ 2.00| FreeBSD 6.1| gcc344 -mtune=k8 -O3",
" 16.84| AMD Athlon64 1.80|# FreeBSD 5.4| -O0",
" 17.05| AMD Athlon64 1.80|# FreeBSD 5.4| -ffast-math -O0",
" 17.73| AMD Sempron64 3400+ 2.00| Win XP SP2| optimized Win32",
" 17.97| AMD Sempron64 3400+ 2.00| FreeBSD 6.1|ffast-math -mtune=k8 -O3",
" 18.22| AMD Athlon64 1.80|# FreeBSD 5.4| -O3",
" 18.87| Intel Xeon HT/EM64 2.80|# FreeBSD 5.4| -O0",
" 19.00| AMD Sempron64 3400+ 2.00|# FreeBSD 6.1| gcc344 -mtune=k8 -O2",
" 20.15| Intel Xeon HT/EM64 2.80|# FreeBSD 5.4| -O3",
" 20.78| Intel Xeon HT/EM64 2.80|# FreeBSD 5.4| -mtune=nocona -O3",
" 21.11| AMD Turion X2 1.60|# SuSe 10.2| gcc412 -O3",
" 22.17| Intel Pentium D 945 3.40|# Linux FC 5| gcc410 -O3",
" 22.17| * AMD Sempron 2600+ 1.76|# SuSe 9.3| gcc335 -O3",
" 22.51| Intel Pentium D 945 3.40|# Linux FC 5| gcc410 -ffast-math -O3",
" 22.54| AMD Athlon64 X2 4400 2.25|# FreeBSD 5.4| gcc342 -O3",
" 22.67| Intel Xeon HT/EM64 3.00|# FreeBSD 6.0| -mtune=nocona -O3",
" 22.93| AMD Athlon64 X2 4400 2.25|# FreeBSD 5.4| gcc402 -O3",
" 23.20| AMD Athlon64 X2 4400 2.25|# FreeBSD 5.4| gcc402 nocona -O3",
" 23.47| AMD Athlon64 X2 6000 3.00| FreeBSD 6.2| gcc346 athlon64 -O3",
" 24.94| Intel Pentium D 945 3.40|# FreeBSD 6.2| gcc346 -O3",
" 24.94| Intel Xeon 3.00|# SuSe 10| gcc402 nocona -s -O3",
" 25.25| Intel Pentium D 840 2.80|# SuSe 10| -O3",
" 25.58| Intel P4 EM64 2.80|# SuSe 10| gcc402 nocona -O3",
" 25.58| * AMD Sempron 2600+ 2.00|# SuSe 9.3| gcc335 -O3",
" 26.60| Intel Pentium D 945 3.40|# FreeBSD 6.2|gcc346 -mtune=nocona -O3",
" 27.33| AMD Phenom X4 9600 2.30|# FreeBSD 6.3| gcc346 -O3",
" 29.34| Intel P4 640 HT 3.20|# SuSe 10| -mtune=nocona",
" 31.67| AMD Athlon X2 4600+ 2.40|# Linux Gentoo-r7| gcc441 -O3",
" 35.21| AMD Phenom X4 9600 2.30|# FreeBSD 7.0| gcc421 -O3",
" 36.50| AMD Phenom X4 9600 2.30|# FreeBSD 7.0| gcc421 -ffast-math -O3",
" 39.90| Intel Xeon E5420 2.50|# GNU/Linux 2.6.24| Intel 64bit compiler",
" 43.37|Intel Core2 Duo E8400 3.00|# Ubuntu 4.3.3-5| gcc433 -ffast-math -O3",
" Linux 2.6.28-16| ",
" 48.79|Intel Core2 Duo P8600 2.40|McBkPro McOSX 10.5| gcc401",
"103.59| * Intel Core i7 975 3.60|# OpenSUSE 11.2| gcc -march=native -O3",
" Linux 2.6.31|-fopenmp -ffast-math ...",
"_____________________________________________________________________________",
" ( *) Either CPU or bus is overclocked",
" ( #) 64 bit OS",
And the moral is: do not throw your old Pentium III 666 MHz box in the trash, because as a Unix calculator it might be as fast as a Pentium 4 1.2 GHz or a Celeron 1.7 GHz running Window$! There are no fools here; this is why true scientists do not hate Micro$oft, they love Unix instead.
Of course, the result of a performance measurement depends very much on what we measure and how we measure it: OS/kernel/scheduler response or pure hardware power (see the short description below).
Another interesting thing is how the GCC v. 3.4.2 optimization flags affect performance on different platforms. In general, the difference in performance between the -O0 and -O3 optimization levels is less than 10%. It seems that -ffast-math -O3 is much slower than even -O0 on AMD64 and Intel Xeon EM64 platforms. GCC v. 4.0.2 produces code with better performance and does not show this behavior. Also, -mtune=CPU is very important for x86 hardware and unnecessary for x86-64 CPUs.
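As a worked check of the figures above, the relative difference between two rows of the table can be computed directly. The sketch below is only illustrative: the helper name relative_gain_percent and the constant names are made up here, and the MTOPS values are copied from the AMD Athlon64 1.80 GHz, FreeBSD 5.4 rows of the table.

```cpp
#include <cassert>

// Relative gain of `mtops_new` over `mtops_base`, in percent.
// Hypothetical helper for illustration; it is not part of testfcpu.cpp.
double relative_gain_percent(double mtops_base, double mtops_new) {
    assert(mtops_base > 0.0);
    return (mtops_new / mtops_base - 1.0) * 100.0;
}

// MTOPS values copied from the table (AMD Athlon64 1.80 GHz, FreeBSD 5.4).
const double kAthlon64_O0 = 16.84;          // -O0
const double kAthlon64_O3 = 18.22;          // -O3
const double kAthlon64_FastMathO3 = 14.30;  // -ffast-math -O3
```

Here relative_gain_percent(kAthlon64_O0, kAthlon64_O3) comes out at roughly 8%, consistent with the "less than 10%" remark, while relative_gain_percent(kAthlon64_O0, kAthlon64_FastMathO3) is negative, about -15%, which is the -ffast-math slowdown described above.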
This program is written in C++, though it uses no objects or OO programming; I tried to follow the ANSI standard. The software consists of one file, testfcpu.cpp, which you can freely download, modify and redistribute under the terms of the GNU General Public License. The code is "platform independent", meaning that it can be compiled on Unix, Linux and Window$. I have successfully compiled testfcpu.cpp with the Borland C++ and VC++ compilers as a console Win32 application.
The program is not a sophisticated benchmark; it does not measure MIPS or xFLOPS. The main idea is to create a big array of random numbers (double precision, 8 bytes!) and then perform tens or hundreds of millions (or even more ...) of trigonometric operations such as sine and cosine, the kind of work found in FFT and other scientific computations. The performance is measured in MTOPS, or millions of trigonometric operations per second. The array is meant to be too big to fit in the CPU cache, though this might not be true for modern CPUs with huge caches.
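The idea described above can be sketched in a few lines. This is not the code from testfcpu.cpp, just a minimal illustration under my own assumptions: the function names mtops and run_trig_benchmark, the fixed seed, the range of the random numbers and the parameters n and passes are all invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <ctime>
#include <random>
#include <vector>

// Convert an operation count and elapsed seconds into MTOPS
// (millions of trigonometric operations per second).
double mtops(double operations, double seconds) {
    assert(seconds > 0.0);
    return operations / seconds / 1.0e6;
}

// Fill an array with `n` random doubles, then take the sine and cosine of
// every element `passes` times. Returns MTOPS based on CPU time.
// `n` and `passes` are illustrative, not the values used by testfcpu.cpp.
double run_trig_benchmark(std::size_t n, int passes) {
    std::mt19937_64 gen(12345);  // fixed seed, an assumption of this sketch
    std::uniform_real_distribution<double> dist(0.0, 6.283185307179586);
    std::vector<double> data(n);
    for (double& x : data) x = dist(gen);

    double checksum = 0.0;
    const std::clock_t start = std::clock();
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < n; ++i) {
            checksum += std::sin(data[i]) + std::cos(data[i]);
        }
    }
    const std::clock_t stop = std::clock();

    // Store the sum in a volatile so the compiler cannot drop the loop.
    volatile double sink = checksum;
    (void)sink;

    double seconds = static_cast<double>(stop - start) / CLOCKS_PER_SEC;
    // Very short runs can fall below the clock resolution; use one tick then.
    if (seconds <= 0.0) seconds = 1.0 / CLOCKS_PER_SEC;
    return mtops(2.0 * n * passes, seconds);  // one sin + one cos per element
}
```

Calling run_trig_benchmark(1u << 22, 10) from main() would work on a 32 MB array of doubles, which is larger than the caches of the older CPUs in the table; on a newer machine n may have to be increased to get the same effect.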
Basically, this program measures the overall performance of the CPU, the FPU and the data transfer rate between them and RAM. There is no benefit in running it on a multiprocessor/cluster platform, because the code will be executed on one CPU/node only.
To measure time, the library function clock() is used, which determines the amount of processor time used since the invocation of the calling process (man 3 clock). The way time is measured is very important in benchmarking. Early versions of the program measured real time, so other tasks and CPU interrupts affected the results very much. This method was not very good for estimating hardware performance, but it characterized the OS, and the kernel/scheduler in particular, quite well. For example, Window$-based platforms showed really bad results because of their poor task scheduler. In the latest version the measured MTOPS value should be more or less OS independent, and even a heavy load on the CPU should not affect the result, because only the CPU time spent on testfcpu is measured. For comparison, the number of operations per real second is also computed.
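The difference between CPU time and real time can be illustrated with a small sketch. It is not taken from testfcpu.cpp: the names Timing, time_work and time_sleep are invented here, and std::chrono::steady_clock is my own choice for the real-time clock. One caveat, as far as I know: Microsoft's C runtime documents clock() as returning elapsed wall-clock time rather than CPU time, so on that platform the two numbers may coincide.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double cpu_seconds;   // processor time charged to this process (std::clock)
    double real_seconds;  // elapsed wall-clock time (std::chrono::steady_clock)
};

// Run `work` once and report both CPU time and real time.
template <typename Work>
Timing time_work(Work work) {
    const auto real_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_stop = std::clock();
    const auto real_stop = std::chrono::steady_clock::now();

    Timing t;
    t.cpu_seconds = static_cast<double>(cpu_stop - cpu_start) / CLOCKS_PER_SEC;
    t.real_seconds =
        std::chrono::duration<double>(real_stop - real_start).count();
    // steady_clock is monotonic, so elapsed real time is never negative.
    assert(t.real_seconds >= 0.0);
    return t;
}

// Illustration: a sleeping process uses real time but almost no CPU time
// (on POSIX systems, where clock() counts processor time).
Timing time_sleep(int milliseconds) {
    return time_work([milliseconds] {
        std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
    });
}
```

On a Unix-like system time_sleep(50) should report at least 0.05 real seconds and close to zero CPU seconds. This is why a benchmark based on real time also measures whatever else the machine is doing, while one based on clock() mostly does not.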
g++ -mtune=athlon-xp -O3 testfcpu.cpp -o testfcpu
The -mtune flag has to contain a different value for a different CPU; read the manual for your compiler! If in doubt, just skip this flag. On AMD64/Opteron with a 64-bit OS and the GCC compiler you do not need this flag at all.
strip testfcpu
./testfcpu