Abstract

The Fortran Whetstone programs were the first general purpose benchmarks that set industry standards of computer system performance. Whetstone programs also addressed the question of the efficiency of different programming languages, an important issue not covered by more contemporary standard benchmarks. Results are provided for computers produced from the 1960s to present day systems, including results obtained via different languages. The benchmark, a UK product, was based on work by Brian Wichmann ** of the National Physical Laboratory. It was developed by Harold Curnow ** of HM Treasury Technical Support Unit (TSU, later part of the Central Computer and Telecommunications Agency or CCTA). This document was produced by Roy Longbottom (TSU/CCTA 1960 to 1993), who carried out further development.



** Download Whetstone.pdf, a copy of their original research paper - kindly supplied by Brian.

Contents

In The Beginning
Whetting The Stone
Rolling The Stone
Throwing The Stone
Compiler Optimisation
Table Headings and Explanation
Index of Results
PC Results Same C Compiler
PC Comparison Various Compilers
PC Efficiency %MWIPS/MHz
References and Source Code
More Historic Data

In The Beginning

Before the introduction of high level languages, general computer performance comparisons were usually based on instruction execution times. These were combined to produce an overall rating using a mix of instructions, the best known being the Gibson Mix for scientific applications, devised by J Gibson of IBM. In 1957, the UK Government formed the Technical Support Unit to evaluate and advise on computers, employing engineers from the telecommunications service. This unit eventually became part of the central procurement body later known as the Central Computer and Telecommunications Agency (CCTA). TSU engineers produced numerous calculations between 1966 and 1973, using an ADP Mix, the Gibson Mix and a Process Control Mix.




Whetting The Stone

During the late 1960s, the UK National Physical Laboratory had an English Electric (ICL) KDF9 scientific computer with one of the first implementations of Algol 60, the Whetstone translator-interpreter. Brian Wichmann modified the interpreter to record statistics on the intermediate Whetstone instructions and produced a suite of simple statements which could be used to evaluate the efficiency of compilers and the overall performance of a processor (see ICL KDF9 benchmark results in the table - the first is for the Whetstone Interpreter).

In 1971 Roy Wickens, one of the founding members of TSU, abandoned producing a portable benchmark using real programs, as it was becoming too expensive. He asked Harold Curnow to produce modular synthetic benchmark suites. Harold produced the COPRXX suite for COBOL and a scientific program based on Brian Wichmann's work. The first Whetstone benchmark, known as HJC11 (later ALPR12), was written in Algol 60 and completed in November 1972. The Fortran codes (HJC12 and HJC12D) were published in April 1973 as FOPR12 and FOPR13. The first results published were for IBM and ICL mainframes in 1973. The speed rating was calculated in terms of Kilo Whetstone Instructions Per Second or KWIPS. Later, Millions or MWIPS was used.




Rolling The Stone

During the 1970s, I was head of the CCTA Scientific Systems Branch, with responsibilities for evaluating new systems, advising on procurements and supervising acceptance trials at both Government Departments and Universities. This provided the means of obtaining numerous results on minicomputers and mainframes. At this time, versions were available in various programming languages (see results).

Taking personal responsibility for state of the art systems including supercomputers, in 1978 I produced a fully vectorisable version, FOVP12 (using arrays instead of simple variables). This provides MWIPS ratings at different vector lengths (array dimensions). At the time, results of the Livermore Kernels benchmark were available for top of the range scientific systems, but it was considered useful to have rough performance comparisons with less glamorous systems. Results are given later in the tables at vector length 256. Also during 1978, the standard versions were modified to calculate MWIPS, using CPU timers.

[Vector version reference - R Longbottom, "Performance of Multi-user Supercomputing Facilities", 4th International Conference on Supercomputing, April 1989]

It appears that my vectorisable version was used long after I departed from the supercomputer scene. Later results, mainly for workstations, are also available.

In 1980, I added facilities to time each of the eight loops to produce speed ratings in Millions of Integer Instructions and Floating point Operations Per Second (MIPS and MFLOPS). MIPS represent a relative measurement where DEC VAX 11/780 = 1. This was to identify the tricks that some compilers were getting up to and to provide more meaningful measures for supercomputers. The last alterations to the benchmark were in 1987, in conjunction with Bangor University, who made slight changes intended to avoid over optimisation whilst still executing identical functions. The benchmarks were also converted to Fortran 77 standards.

At a later stage, I produced compatible versions using Fortran, Basic, C and Java programming languages for use on PCs (see PC results). These included further changes to repeat the tests via outer loops to prevent speed calculation inaccuracy due to timer resolution.

2005 - The Whetstone Benchmark has been compiled to run as a 64 bit program via Windows XP Pro x64 and modified to demonstrate performance of Dual Core CPUs. Also available are 32 bit versions that use SSE floating point instructions via the latest Microsoft compiler. See Win64.htm and DualCore.htm
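The idea of repeating the tests via outer loops, so that a coarse timer does not distort the speed calculation, can be illustrated with a minimal sketch. This is not the benchmark source: the names `workload`, `calibrate_repeats` and `passes_per_second`, the doubling strategy and the 0.1 second target are all assumptions made for illustration.

```python
import time


def workload(n):
    """Hypothetical stand-in for one pass of a timed benchmark loop."""
    x = 1.0
    for _ in range(n):
        x = (x + 1.0 / x) * 0.5
    return x


def calibrate_repeats(run_once, min_seconds=0.1, max_repeats=1 << 20):
    """Double the outer repeat count until the timed run lasts long enough.

    Returns (repeats, elapsed_seconds) for the first run that reaches the
    target time, or for the run at max_repeats if the target is never met.
    """
    repeats = 1
    while True:
        start = time.perf_counter()
        for _ in range(repeats):
            run_once()
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds or repeats >= max_repeats:
            return repeats, elapsed
        repeats *= 2


def passes_per_second(repeats, elapsed):
    """Speed rating: passes completed divided by the measured time."""
    return repeats / elapsed


if __name__ == "__main__":
    repeats, elapsed = calibrate_repeats(lambda: workload(1000))
    print(repeats, elapsed, passes_per_second(repeats, elapsed))
```

The longer the measured interval is relative to the timer resolution, the smaller the relative error in the calculated rating, which is the reason for scaling the repeat count rather than timing a single pass.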


Throwing The Stone

The benchmark results were published within CCTA as "Commercial in Confidence" and supplied to customers when required for a particular procurement. By 1979, results were available for about 200 systems from 30 suppliers. Although the main emphasis was on comparing speeds via Fortran, limited results were also available via Algol, PL/I, APL, Pascal, Basic, Simula and Coral, as well as with varying optimisation options. Along with results in single and double precision (and extended precision where appropriate), more than 500 measurements were available.

By this time, the Whetstone benchmark speed rating had become the default definition of minicomputer MIPS (Millions of Instructions Per Second), its significance being exaggerated when a minicomputer supplier somehow acquired the table of Whetstone benchmark results and published some of them in the computer press with the heading "Now who has the fastest minicomputer". Whetstone performance ratings are known to have been a serious consideration in the design of the Digital VAX systems and other minicomputers of the same vintage, where some suppliers were reluctant to publish double precision results which did not match VAX speeds. DEC benchmarking publications show that Whetstone results were given serious consideration until 1986. The benchmark was still being run by DEC in 1996, with results for Alpha-based systems available on www.digital.com.

The Intel microprocessors were designed at the height of popularity of the Whetstone benchmark. Examining the instruction set of the math coprocessor, with instructions for sin, cos, atan, sqrt and log, possibly indicates a complete hardware implementation (the one and only?) to match the benchmark. The design also includes 80 bit registers which ensure fast double precision operation. Although rightly not used as one of the main performance measurement tools, the Whetstone benchmark was still run by Intel in 1996, with results for 486 systems, DX4 and Pentium overdrive processors being available on www.intel.com. The benchmark also formed a small part (2%) of the Intel iComp benchmark.

As can be seen in PC results, the Intel P4 processor obtains poor results relative to CPU MHz. This might be due to the length of the P4’s execution pipelines and the relatively few instructions in the benchmark’s timing loops.




Compiler Optimisation

The benchmark is very simple, comprising some 150 statements with eight active loops, three of which execute via procedure calls. Three loops carry out floating point calculations, two execute functions, one performs assignments, one fixed point arithmetic and one branching statements. The dominant loop, usually accounting for 30% to 50% of the time, carries out floating point calculations via procedure calls. The tests only reference a small amount of data, which will fit in the L1 cache of any CPU. Hence, L2 cache and memory speed should have no influence on performance ratings. Speeds are invariably proportional to CPU MHz on a given type of processor.

The code was designed to be non-optimisable, and optimising compilers did not have a significant impact until the introduction of in-lining of subroutine instructions. Although this produces code outside the definition of Whetstone instructions, which include a specific proportion of procedure calls, it is a valid technique to obtain the best performance out of modern systems and may well be the compiler default optimisation level. As reflected in the PC results, a good compiler can halve the execution time by in-lining, careful choice of instructions and sequence, and omission of intermediate stores/loads.

With in-lining and global optimisation, a small number of compilers identified that the dominant loop did not have to be executed, which immediately led to an apparent more than doubling of MWIPS speeds. This was identified by the 1980 enhancements and fixed in 1987, essentially by changing the name of one variable.

Unlike some other standard benchmarks, Whetstone results were generally verified as part of the CCTA system appraisal, in project related benchmarking sessions or during acceptance trials. It was also standard practice to run the tests with different levels of optimisation, and obviously over optimised results were not published.

Besides the global optimisation problem, two other areas of complication have been observed. The first is where loop variables can be too large for index registers, yet the program still runs with a truncated count. This is catered for by having a double loop to control the running time. The second complication becomes apparent as systems become faster, where underflow can slow down the execution rate of the first two loops. This can be fixed by changing the values of variables t and t1 to be closer to 0.5.

The latest Intel compiler appears to over optimise the loop with integer arithmetic. Here, a series of variables are calculated which produce array indices of constant values and therefore only need to be calculated once. It would seem that the only way the problem would arise was if the compiler carried out the indexing calculations, maybe to determine that array accesses are not going to be out of bounds.
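The first complication, a loop count too large for an index register, can be shown with a small worked sketch. It is illustrative only and not taken from the benchmark: the 16 bit unsigned register width and the names `truncate_to_register`, `split_count` and `run_double_loop` are assumptions.

```python
def truncate_to_register(count, bits=16):
    """Value left when a count is stored in an unsigned register of `bits` bits."""
    return count & ((1 << bits) - 1)


def split_count(total, bits=16):
    """Split `total` iterations into outer * inner + remainder.

    Each of the three parts is small enough to fit in a `bits`-bit register.
    """
    limit = (1 << bits) - 1
    inner = limit if total > limit else max(total, 1)
    outer, remainder = divmod(total, inner)
    if outer > limit:
        raise ValueError("total is too large for a two-level loop")
    return outer, inner, remainder


def run_double_loop(total, body, bits=16):
    """Execute `body` exactly `total` times using only register-sized counts."""
    outer, inner, remainder = split_count(total, bits)
    for _ in range(outer):
        for _ in range(inner):
            body()
    for _ in range(remainder):
        body()


if __name__ == "__main__":
    # A requested count of 100000 silently becomes 34464 in a 16 bit register.
    print(truncate_to_register(100000))  # 34464
    print(split_count(100000))           # (1, 65535, 34465)
```

With a single loop, the program in this example would run roughly a third of the intended passes while still reporting a result, which is why the truncation is easy to miss.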




Table Headings and Explanation

Supplier or System - The supplier's full name or earlier/later names may be shown. Hardware options may be included with the system name.

CPU and Precision - This includes the CPU chip type and an indication of the precision as shown in the original CCTA results. The precision is given as Base:Precision. For example, 2:23 indicates 23 binary digits and 16:6 six hexadecimal digits. These were important when considering accuracy, as hexadecimal single precision was not that good. Precision numbers are as follows (single and double):



       SP      DP
P1     16:6    16:14
P2     10:8    10:16
P3     8:13    8:26
P4     2:48    2:96
P5     2:23    2:30
P6     2:27    2:62
P7     2:23    2:55
P8     2:24    2:56
P9     2:23    2:46
P10    2:23    2:38
P11    2:23    2:39
P12    2:28    2:64
P13    2:22    2:38
P14    2:24    2:53
P15    2:39    2:78
P16    2:39    2:74
P17    2:23    2:47
P18    2:27    2:60
P19    2:31
P20    2:32

P1 also had Extended Precision of 16:28 and P10 2:69.

For vector processors, an indication is given to show whether the result represents scalar or vector performance, the latter being for vector length 256. In comparing supercomputer performance, both scalar and vector results should be taken into account. In this case a weighted harmonic mean is used, based on 90% of the code being vectorisable. The weighted average is calculated as 1/(0.1/S + 0.9/V), where S is the scalar speed and V the vector speed.

Clock MHz - This may be derived from the clock period for older systems.

MWIPS - Generally the single precision Whetstone rating in Millions of Whetstone Instructions Per Second. Differences between Single Precision (MWIPS SP) and Double Precision (MWIPS DP) should be noted.

MFLOPS - The geometric mean of the three floating point results in Millions of Floating Point Operations Per Second.

VAX MIPS - The geometric mean of Millions of Operations Per Second for the sections covering fixed point arithmetic, if then else and assignments, multiplied by five. Such a calculation for the DEC VAX 11/780, accepted as running at 1 Million Instructions Per Second, produces approximately 1.0 MIPS.

Lang - A short code (two to four characters) to indicate the programming language:



For    Fortran
Alg    Algol
PL1    PL/I
Cor    Coral
Pas    Pascal
Apl    APL
Bas    Basic
BasI   Interpreter
VB     Visual Basic
C++    C or C++
Sim    Simula

Opt - An indication of relative optimisation levels within a given range of systems. IL might be shown to indicate in-lining of procedures or subroutines. xxx indicates an unknown level, for versions run by suppliers, where part of the benchmark may not have been executed because of over-optimisation.

Cost $K - Most of the prices shown were obtained from US sources. Some of the costs are, of course, not particularly accurate. Mainframe and larger minicomputer prices are generally only for a processor with minimum memory capacity.

Intr Date - Approximate year of first delivery.
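The derived ratings described under the headings above (the weighted harmonic mean for vector processors, the MFLOPS geometric mean and the VAX MIPS calculation) can be sketched as follows. The function names and the sample figures are hypothetical and are not taken from the results tables; only the formulas come from the text.

```python
import math


def weighted_vector_rating(scalar, vector, vector_fraction=0.9):
    """Weighted harmonic mean 1/((1 - f)/S + f/V), with f = 0.9 as in the text."""
    return 1.0 / ((1.0 - vector_fraction) / scalar + vector_fraction / vector)


def geometric_mean(values):
    """Geometric mean of a sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mflops_rating(fp_results):
    """MFLOPS: geometric mean of the three floating point loop results."""
    return geometric_mean(fp_results)


def vax_mips_rating(fixed_point, if_then_else, assignments):
    """VAX MIPS: geometric mean of the three section speeds, multiplied by five."""
    return 5.0 * geometric_mean([fixed_point, if_then_else, assignments])


if __name__ == "__main__":
    # Hypothetical machine: scalar speed 10 and vector speed 100.
    print(round(weighted_vector_rating(10.0, 100.0), 2))  # 52.63
    # Sections averaging 0.2 million operations per second rate as 1.0 VAX MIPS.
    print(round(vax_mips_rating(0.2, 0.2, 0.2), 3))       # 1.0
```

In the first example a tenfold vector speed advantage yields a combined rating of about 52.6 rather than 100, because the 10% of scalar code dominates the run time once the vector part is fast.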


