Throwing Stones
Bill Weinberg, Sr. Product Manager, Microtec Research
In the world of real-time embedded applications, both the performance requirements and limited system resources argue for the selection of tools that produce the fastest and the most compact code. Finding compilers that produce such code is not always an easy task, however. Of the many factors in judging a compiler, benchmarks appear to be the most readable, quantifiable indicators of code quality and performance.
Benchmarks and their interpretation, however, are among the most contentious subjects in the computer industry. Through careful choice (and omission) of particular tests and cavalier treatment of competing tools, vendors can place their compilers at the head of the pack and back up their claims in writing! Unless you know about the circumstances under which benchmarks were performed and the characteristics of the benchmark programs, benchmarks will often obscure more than they reveal and offer a potentially skewed view of competing tool sets.
I doubt that anyone actually believes that benchmarks alone determine the quality and usefulness of a tool set. Rather, lacking time and resources to fully evaluate the options at hand, we look to benchmarks to provide an easy-to-understand guideline in a complex decision.
In a perfect world, you would be able to test-drive the alternatives and choose the tools that yield the best performance for your application. Two factors, however, conspire to make in-house benchmarking prohibitive. The first is that compilers are usually purchased early in the development cycle, when no significant amount of code is available for testing.
The second factor is that even if code is available, such a comprehensive benchmarking project consumes expensive resources and valuable time. So, practical concerns dictate looking to outside sources for performance figures and competitive benchmarks. One such source lies in competitive reviews published in magazines like this one. Another, more generally available source, is vendor-supplied benchmarks found in advertisements and supplier literature.
Ideally, then, a responsible compiler manufacturer should supply you with the most accurate benchmark data. Unfortunately, compiler technology and the marketplace are changing so quickly that current benchmarks for all the tools on your shopping list are seldom available.
When it comes to vendor presentations, reader beware! Ask yourself questions such as: What does the vendor want me to believe? Do the benchmark criteria track my application? Do they track any real application? What methods were used to arrive at these numbers? This article will help you answer these and other questions and become a better benchmark consumer by stacking up the various stones and marks and deciding which tool set meets your project's needs.
Whetstone, Dhrystone, Nullstone--what's with all the rocks? Where do these and other benchmarks come from, and what do they really do? Will the benchmark code be representative of your actual application?
The origin of the test programs offers the biggest challenge to the very use of compiler benchmarks: most of the benchmarks were conceived not to measure software efficiency, but to grade computer systems relative to one another. Silicon vendors inherit SPECs and "stones" thrown down to them from workstation and mainframe vendors. Embedded systems compiler vendors follow in their footsteps by using the same benchmarks, whether or not they are indicative of compiler quality and whether or not the test programs represent realistic cases of embedded application code.
The long and short of it is that those programs provide a traditional basis for benchmark testing industry-wide. Although these tests aren't particularly appropriate for embedded applications benchmarking, as a collection they can give you a reasonable estimation of compiler code generation quality.
So what is a Dhrystone, anyway? Table 1 provides brief descriptions of some of the more popular benchmarks used in today's marketplace. Note that the benchmarks include a variety of tests, from simple programs such as Fibonacci through comprehensive test suites such as SPEC and Stanford.
Once the test programs have been determined, the C code must be passed through the compilers in question. Although the principal focus of benchmark testing is the output of code generation tools, we always start out by considering the input characteristics of the compiler under test. Remember--you can't test code if the tools won't compile it!
"First pass" compiler checks include the capability of compiling large functions and modules, the ability to correctly compile complex C expressions, and the availability of run-time libraries to execute the programs in question. Do not underestimate the importance of code acceptance and compliance. In a recent round of tests at my company, one of the top performing compilers (in Dhrystones) proved to be the least compliant in terms of ANSI Standard C (using Plum Hall). The moral here is that fast code is great; code that compiles and works correctly is better.
Two basic schools of thought exist regarding compiler invocation options. The first I'll call naive testing. Basically, this method assumes that all compilers default to producing optimal or near-optimal code. Hence, you can test any two compilers right out of the box with no tuning. In reality, few compilers default to a full complement of optimizations partly because of debugging considerations and the inherently conservative nature of embedded programming.
The sensible alternative to naive testing is to study the code generation and optimization switches for all the compilers involved in a benchmark comparison and derive a reasonable base line for code generation and optimization options. If you are performing the benchmarks, this base-line set of characteristics should conform to the way in which you will build your application production code.
Areas to investigate include:
Data item sizes
Global data access models
Code models
Stack and bus bandwidth and parameter passing
Optimization repertoire
Run-time libraries used.
A good example of establishing a base-line lies in 680x0 family integer data sizes. Most 680x0 family C compilers standardize on 32-bit storage for int and long int types. A few compilers, however, default to 16-bit integers. If ignored in a test, the 16-bit defaulting compilers will appear to generate more compact code, regardless of actual efficiency or real-world usage.
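One quick way to confirm this part of the base line is to have each compiler report its own integer widths before any code sizes are compared. The following is a minimal sketch, not a vendor-supplied check; the function names are hypothetical and chosen only for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Width in bits of this compiler's int type (hypothetical helper). */
unsigned int_bits(void)
{
    return (unsigned)(sizeof(int) * CHAR_BIT);
}

/* Width in bits of this compiler's long int type (hypothetical helper). */
unsigned long_bits(void)
{
    return (unsigned)(sizeof(long) * CHAR_BIT);
}

/* Storage needed for a table of n ints: a compiler with 16-bit ints
   needs half the bytes of one with 32-bit ints for the same source. */
size_t int_table_bytes(size_t n)
{
    return n * sizeof(int);
}
```

If two compilers under test report different values from int_bits(), their size figures are not comparable until the integer-size options are brought into line.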
Another 680x0 example can be found in the addressing modes used for accessing global data, sometimes referred to as the memory model (yes, 680x0 CPUs do have models). Various compilers give you the option of addressing global data through absolute addressing, PC-relative addressing, or via an address register (commonly A5) with either 16- or 32-bit offsets. A compiler that offers only A5-relative addressing and only 16-bit offsets would certainly seem to generate smaller and sometimes faster code than one with 32-bit offsets. In reality, however, 16-bit offset relative addressing can only handle global data objects totaling 64 kbytes in size.
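The 64-kbyte ceiling follows directly from the width of the offset. As a rough sketch, assuming a signed 16-bit displacement and a base register pointed at the middle of the data area (both are assumptions for illustration, as are the names below), the arithmetic can be written as:

```c
#include <stdint.h>

/* Bytes reachable with a signed 16-bit displacement from a base register:
   offsets -32768 through +32767, or 65536 bytes in all. */
#define OFFSET16_SPAN ((long)INT16_MAX - (long)INT16_MIN + 1L)

/* Returns 1 if global data totaling `bytes` can be reached with 16-bit
   offsets, 0 otherwise (hypothetical helper; assumes the base register
   points at the middle of the data area). */
int fits_16bit_offsets(long bytes)
{
    return bytes >= 0 && bytes <= OFFSET16_SPAN;
}
```

An application whose global data exceeds that span cannot use the smaller addressing form at all, so a benchmark built with it says little about that application.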
| Benchmark | Description | Emphasis |
| Ackerman | A small recursive program | Parameter passing and return |
| Benche | Embedded string search/match | Array manipulation |
| Bezier | Curve splitting for cubic Bezier curves | Integer arithmetic |
| Blit | Bit "blitting" - typical graphics operations | Multiplication and bit-wise operations |
| Bubble | Bubble Sort algorithm | Array manipulation |
| Compaq | Hodge-podge program | C compiler run-time functions |
| Dhrystone | Standard Dhrystone benchmark | Structures and pointers |
| Fibo | Recursive Fibonacci series generator | Function calls and parameter passing |
| Hanoi | Towers of Hanoi puzzle | Heavy recursion and array usage |
| Matrix Multiply | Algebraic matrix multiplication | Array manipulation |
| Nullstone | A test suite | Wide variety of optimizations |
| Opt | PC Tech Journal test program | Optimizations |
| Puzzle | Compute-bound program | Integer arithmetic, loops and arrays |
| Queens | Solves the eight queens chessboard problem | Loops and array manipulation |
| Qsort | Quick sort algorithm | Array manipulation |
| Sieve | Sieve of Eratosthenes prime number program | Nested loops and arrays |
| SPEC | Industry-standard suite designed for system benchmarking | Wide range of code, including gcc |
| Stanford | John Hennessy's benchmark suite | Includes many programs in this table |
| Whetstone | Standard Whetstone benchmark | Floating point arithmetic |
Compilers vary greatly in terms of default optimization behaviors. Some default to practically no optimization, assuming that you will want to tune your application only near the end of your project. Others emphasize realism and access to production code (the code that you actually ship in your application) from the start. These compilers default to full optimization out of the box. Most default to a useful but conservative mix of compiler optimizations rather than all-or-nothing settings.
It is important to remember that optimization is not binary. Individual types of optimizations offer different speed and size benefits. Moreover, compilers do not exist in a vacuum. They exist as components of entire tool chains, and compiler output may eventually be directed to a debugger. You may find it useful to employ the available debuggers to monitor the benchmarking process. When some compilers emit the necessary debugging information, however, important optimizations are disabled and performance seriously curtailed.
Come benchmarking time, it is all too easy for tool vendors to overlook the nifty options available in competing compilers. If disparities in code generation defaults and options are not taken into account by the tester, the results of a comparison will be dubious at best and dishonest at worst. When in doubt, ask for examples of the command lines employed and explanations of options in force for each compiler in a ranking.
Although it is probably a good idea for the tester to use all the tool sets in question on the same development host, it is not absolutely critical unless you want to focus on the host-based characteristics of the tools. For embedded systems applications, target-based performance is usually of much greater importance than how the tools perform on the host.
If build-speed is truly critical to your development effort, you may have to compromise on production code performance; compilers that run faster often don't do as well in terms of code generation and optimization. Full-function optimizing compilers may run slower, but the superior code produced is absolutely worth the wait.
When perusing data for build times, try to keep the following few questions in mind:
Are build speeds reported for all tools with comparable invocations? Debug and optimization switches can alter results dramatically.
Were the compilers invoked with full ANSI C checking? ANSI standard and lint-like checks improve code quality but can slow compile time.
Are the published times for complete builds (compile/assemble/link) or for compile only? Full-function linkers are often build-speed bottlenecks, but also need only be run once as opposed to multiple compiler invocations.
Remember, realism is the key: how would you actually use the tools on a day-to-day basis?
Another consideration is the execution environment. To measure benchmark code performance, we have to choose a platform on which to run the programs. Three viable types of execution "engines" exist: instruction-set simulators (Microtec Research XRAY Debugger/Simulator or Avocet AvSim products), native/hosted systems utilizing the same processor (PCs or workstations), and target boards (VME boards or other SBCs).
Simulation is by far the most convenient; no messy hardware is involved and tests can be run in a friendly development environment. Timings are taken not from system clocks but from built-in cycle-count mechanisms. Unfortunately, accurate simulators are not always available for all target architectures; cache, pipeline, and superscalar effects are notoriously difficult to simulate. Moreover, not all compiler vendors' code will load into available simulation engines.
Native execution is also convenient. Benchmarks for 80x86, 68030/40, and SPARC systems can be run native on PCs, Sun 3, and Sun SPARCstations, respectively. However, native/hosted benchmarking also suffers from a variety of shortcomings: execution timing on a workstation platform is subject to the vagaries of the machine installation (network traffic, background tasks running, speed of memory installed, and so on); compatible PCs or workstations may be hard to come by; not all tools being tested provide for native execution; and (forced) use of native run-time libraries would level timings. So, it would seem that native execution is not ideal either--workstations are not embedded systems.
Eliminating simulators and native execution brings us to actual target hardware, usually an industry-standard VME board or silicon-vendor-supplied evaluation system. Although using real hardware offers the greatest realism, the practice is not without problems: obtaining and calculating accurate timings with on-board or on-chip timers isn't always easy. Some evaluation boards have limited memory (and the boards are hard to come by), and hardware-based performance testing is inherently more time consuming (time needed to build downloadable executables, tailor start-up code to enable caches, write timing and I/O routines, and so on).
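The timing routines mentioned above can be sketched in portable C. This is a minimal sketch that assumes the standard clock() function is available and has usable resolution; on a bare target board it would have to be replaced by reads of an on-board or on-chip timer. The names bench_fn, bench_result, time_benchmark, and sample_workload are hypothetical.

```c
#include <time.h>

typedef void (*bench_fn)(void);

struct bench_result {
    double best_sec;   /* fastest single run */
    double mean_sec;   /* average over all runs */
    int    runs;       /* number of runs actually timed */
};

/* Stand-in workload so the harness can be exercised; a real test would
   call the benchmark's main loop here. */
static volatile unsigned long workload_sink;

void sample_workload(void)
{
    unsigned long i, sum = 0;
    for (i = 0; i < 100000UL; i++)
        sum += i;
    workload_sink = sum;   /* volatile store keeps the loop's result live */
}

/* Time fn() `runs` times and report the best and mean elapsed time. */
struct bench_result time_benchmark(bench_fn fn, int runs)
{
    struct bench_result r = {0.0, 0.0, 0};
    double total = 0.0;
    int i;

    for (i = 0; i < runs; i++) {
        clock_t start = clock();
        double elapsed;

        fn();
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (i == 0 || elapsed < r.best_sec)
            r.best_sec = elapsed;
        total += elapsed;
    }
    if (runs > 0) {
        r.mean_sec = total / runs;
        r.runs = runs;
    }
    return r;
}
```

Reporting both the best and the mean time makes it easier to spot runs disturbed by interrupts or other background activity.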
Target execution platforms are often chosen for convenience or appearance's sake rather than realism or applicability to end-user development. For example, many 68000-family tool vendors show performance times on 68040 VME boards; 68040 is high-end, sexy, and exhibits a great ability to plow through Dhrystones or other small benchmarks (which often fit inside the CPU cache). The fact that 68040 designs represent only a small portion of real applications doesn't deter publication of 68040-based benchmarks, but does leave users of more "humble" members of the processor family to guess at figures for their devices.
Before actually running tests or attempting to interpret benchmark results, a review of some finer points of methodology is in order. First of all, good benchmarking methodology specifically prohibits modifying benchmark code to suit a specific compiler or using a priori knowledge about a benchmark program to unfair advantage. Ideally, benchmarks should be "blind."
Despite that prohibition on "peeking under the hood," quite a few compilers have been tuned to generate peak-performing code for the more popular benchmarks, especially for Dhrystone. Because Dhrystone does little as a program, it can theoretically be optimized down to practically nothing, and "nothing" runs very fast! The prevalence of such benchmark-specific tuning is one more reason to regard Dhrystone numbers with suspicion.
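The "optimized down to nothing" effect is easy to reproduce in miniature. The sketch below is an illustration only, not Dhrystone code, and the names are hypothetical: the first function computes a result nobody uses, so an optimizer is free to delete the loop; the second stores its result in a volatile variable, which obliges the compiler to produce the value.

```c
/* Result is discarded: an optimizing compiler may remove the whole loop,
   so timing this function can measure almost nothing. */
void busy_unused(unsigned long n)
{
    unsigned long sum = 0, i;
    for (i = 0; i < n; i++)
        sum += i * i;
    (void)sum;
}

/* Volatile sink: the store below must happen, so the sum must be produced. */
volatile unsigned long benchmark_sink;

/* Same work, but the result is kept live through the volatile store. */
unsigned long busy_used(unsigned long n)
{
    unsigned long sum = 0, i;
    for (i = 0; i < n; i++)
        sum += i * i;
    benchmark_sink = sum;
    return sum;
}
```

Even the second form does not stop a clever compiler from computing the sum some other way; it only guarantees that the result is not thrown away.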
One of the more aggressive optimizations employed by modern compilers is function inlining. In-lining expands a function call to include the function body itself, instead of using a call to a (nonlocal) version of the function. In-lining, which resembles macro expansion, speeds execution by eliminating the calling interface and associated parameter passing overhead and exposes the in-lined code to additional local optimizations.
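The transformation can be shown by hand. In this minimal sketch (all names hypothetical), the first loop calls a small leaf function on every iteration, while the second shows what the code looks like once the body has been expanded at the call site, which is roughly what an inlining compiler does automatically.

```c
/* Small leaf function: a typical candidate for inlining. */
static int square(int x)
{
    return x * x;
}

/* Call version: each iteration pays for the call, return, and parameter
   passing unless the compiler inlines square() on its own. */
long sum_squares_call(const int *v, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += square(v[i]);
    return total;
}

/* Hand-inlined equivalent: the function body replaces the call, leaving
   the loop open to further local optimization. */
long sum_squares_inlined(const int *v, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += v[i] * v[i];
    return total;
}
```

Both versions return the same result; only the calling overhead differs.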
Compilers let users control this optimization in a variety of ways: some offer per-function in-lining specification (for example, command-line options or the inline keyword in C++ and GNU gcc) while others employ special heuristics to decide whether a function can or should be expanded in-line (leaf function, local variables, and so on) without user intervention (that is, blind in-lining).
The problem with in-lining and benchmarking is that ANSI C contains no inline keyword. To benefit from this powerful optimization, test code will either need to be modified or will require certain command-line options. Unless a compiler supports "blind" in-lining, the need to tinker with the benchmark code violates the previous basic prohibition.
I have recently encountered both questionable abuse of in-lining and quite virtuous abstinence from even blind in-lining by commercial compiler vendors. In the first case, a competitor invoked function-specific in-line options on the command line (while failing to enable our blind in-line switch). In the second case, a zealous vendor/tester disabled all in-lining for maximum fairness, even to the detriment of their own code performance.
It is possible to take either a local or a global view of compiler code performance. The quality of code at a block or function level is easy to inspect and simple to test because it is limited in scope. Short sequences of code are easier to eyeball and run. The user can easily spot nonoptimal code sequences that may impact performance. Application-wide performance is harder to measure on the test bench and in the real world due to the complex interaction of local and global execution and real-time interactions. Because of the complexity of benchmarking across larger scopes, benchmark programs have typically been short and sweet, as are most of the benchmark programs listed in this article. Thus, the emphasis in benchmarking is usually on short function bodies and mainline code to the exclusion of everything else.
But what about everything else? How can we account for the performance characteristics of an application's hidden parts, specifically, start-up code and run-time libraries? Since start-up runs only once and doesn't participate in the benchmark, it can be ignored. Run-time libraries, however, are another matter entirely.
At least half of the benchmark programs that are commonly employed make no use of run-time libraries or at least do so outside their critical paths (for example, printf("Start main loop . . .\n");). Several standard benchmarks, however, make heavy use of libraries: Whetstone employs the floating-point libraries for CPUs without math co-processors, and others make extensive use of heap routines malloc() and free().
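A small prime sieve illustrates the first case. This is a simplified sketch in the spirit of the Sieve benchmark, not the published benchmark source; SIEVE_MAX, count_primes, and run_sieve are hypothetical names. The library calls sit outside the loops that dominate execution time, so the measured work is almost entirely compiler-generated code.

```c
#include <stdio.h>

#define SIEVE_MAX 8192   /* largest limit this sketch supports */

static unsigned char composite[SIEVE_MAX + 1];

/* Critical path: nested loops over an array, with no library calls. */
int count_primes(int limit)
{
    int i, j, count = 0;

    if (limit > SIEVE_MAX)
        limit = SIEVE_MAX;
    for (i = 0; i <= limit; i++)
        composite[i] = 0;
    for (i = 2; i <= limit; i++) {
        if (!composite[i]) {
            count++;
            for (j = i + i; j <= limit; j += i)
                composite[j] = 1;
        }
    }
    return count;
}

/* Library use (printf) stays outside the repeated, measured work. */
int run_sieve(int iterations, int limit)
{
    int n, count = 0;

    printf("Start main loop . . .\n");
    for (n = 0; n < iterations; n++)
        count = count_primes(limit);
    printf("%d primes\n", count);
    return count;
}
```

A benchmark that called malloc() and free() inside the inner loop would instead be measuring the vendor's heap routines as much as the compiler's code generation.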
Benchmark tests that emphasize local code generation quality often seek ways to eliminate the performance effects of run-time libraries. These ways include linking in common libraries for all compilers tested (easiest in native environments) and the choice of benchmarks. Holistic benchmarking practices will dictate the complete construction of each program with each tool set involved.
Which method is superior? Which is the most honest? While emphasis on main-line code generation and performance is laudable from a pure compilation standpoint and reflects the main-line nature of many benchmarks, elimination of run-time libraries is highly unrealistic from an end-user standpoint. Besides, if the run-time libraries involved are written in C, they too will benefit from a compiler's optimizer. "Leveling" of calls to run-time libraries also presents significant technical challenges in terms of forcing the objects produced by various compilers to link with some "neutral" run-time library.
The effect of caching on execution times should also be considered. On high-end CPUs, the processors owe 50% or more of their performance ratings to caches. Running a benchmark without cache or with caches incorrectly enabled can put a compiler out of the running in competitive tests. For example, in a recent test at Microtec Research, various compilers shown to offer between 30,000 and 40,000 Dhrystones on a 68040 yielded no more than 5,000 to 10,000 without a cache. This dramatic disparity has led our staff to consider the inclusion of cache-specific board support code in our tool sets to ensure fair benchmarking on the open market.
Although good cache design speeds real application code execution, misuse of benchmarks vis-a-vis cache effects is endemic, especially in the world of 386- and 486-based PC-compatibles. Many vendors choose benchmarks that just fit inside their on-chip or on-board caches to portray their hardware in the best light, even if real-world applications are guaranteed to cause cache spills.
In the end, actually running the test programs is not very different from building, downloading, and running your embedded applications. Each compiler's code is put through its paces, with each test run a statistically meaningful number of times. Results are gathered and tabulated for presentation and publication.
This is where the fun begins. All seems to be fair in the compiler wars, so the discriminating consumer must beware. Benchmark reports can show results in terms of target code size and execution time for each compiler tested. Tools can be compared by showing raw counts (bytes vs. bytes, cycles vs. cycles, stones per second vs. stones per second, and so on) or relative numbers (Compiler A produces N% smaller/faster code than Compiler B). A compiler (usually your own) can also be established as a baseline and all others measured against it.
How benchmark data is reported depends on what the tester is trying to prove and to whom. Sales people want to be able to say "Our tools produce N% smaller/faster code," or that "Brand X is M% less efficient." End users usually like to also see "raw" data to judge for themselves. Everyone wants to see individual test results and aggregate scores, despite some inherent contradictions.
"Brand X is N% smaller/faster than Brand Y" reporting only works if X's code is consistently smaller/faster. Losing just one test will result in negative percentages, a truly confusing grade. Relative results, however, are essential to producing aggregate scores (on the average, Brand X produces N% smaller/faster code). You can't average raw results across benchmarks because code size and execution speed vary greatly. (Try averaging Dhrystones and SPECmarks together!) Worst of all, some "statistically correct" compiler gurus claim that results should not be averaged, but that a geometric mean should be used instead--a real problem with negative test grades!
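The arithmetic behind these complaints can be sketched in a few lines of C. This is an illustration under stated assumptions, not a recommended scoring method: the function names are hypothetical, the geometric mean is taken over ratios rather than over percentages (one common way around the negative-grade problem), and the root is found by bisection only to avoid depending on a math library.

```c
/* Percent by which result a (Brand X) is smaller/faster than result b
   (Brand Y). The value goes negative as soon as X loses a test. */
double percent_better(double a, double b)
{
    return 100.0 * (b - a) / b;
}

/* n-th root of a positive value by bisection (no math library needed). */
static double nth_root(double value, int n)
{
    double lo = 0.0;
    double hi = value > 1.0 ? value : 1.0;
    int iter;

    for (iter = 0; iter < 200; iter++) {
        double mid = 0.5 * (lo + hi);
        double p = 1.0;
        int k;

        for (k = 0; k < n; k++)
            p *= mid;
        if (p < value)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* Geometric mean of the ratios a[i]/b[i] over n tests (n >= 1, all values
   positive). Ratios are always positive, so the mean is well defined;
   a result below 1.0 means X is smaller/faster overall. */
double geomean_ratio(const double *a, const double *b, int n)
{
    double product = 1.0;
    int i;

    for (i = 0; i < n; i++)
        product *= a[i] / b[i];
    return nth_root(product, n);
}
```

For example, if X scores 50 against Y's 100 on one test and 200 against 100 on another, percent_better() gives 50 and -100, which average to -25, while the geometric mean of the two ratios is 1.0, a tie. The choice of aggregate changes the story told by the same raw data.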
To get around the confusion, results should be presented graphically whenever possible, and in two forms: raw data (times or bytes) for each test for each compiler, and relative to your own compiler, with negative results dipping below the X axis. The choice of arithmetic vs. geometric mean is a matter of personal taste.
If you are really concerned about size and speed, you should note that most benchmark programs are built to emphasize speed of target code. Size statistics for these same benchmarks should be quoted in kind, that is, as built for speed, not rebuilt for size.
Facts never lie, right? Presentation and use of data, however, is another matter. Having said this, I do not wish to impugn the honesty of vendor benchmarks, but as a marketer of compilers myself, I will summarize the most common tricks and omissions I have seen that can obscure scores and hide less-than-optimal results:
Set tools' defaults to small values such as 16-bit int types. With no options set, your code will be (unrealistically) smaller and likely faster too!
"Accidentally" leave off the key (non-default) optimizations of your competitor.
Ingenuously assume your competitor sets cache as you do in your start-up code.
Test against competitors' older versions and leave out revision numbers.
Quote results from your own "latest and greatest tools," which of course aren't shipping yet.
Omit bad grades. Showing poorly? Leave out that (unfriendly) test!
Tune the hell out of one benchmark (for example, Dhrystone) then report either just those results or aggregate numbers skewed by one great score.
In describing the previous reporting "methods," I speak from both experience and temptation: they have all been perpetrated in the marketplace, but I've never been tempted to follow suit. (Well, almost never.)
In closing, I must disqualify myself from leading you to certain conclusions about compiler benchmarking. After all, since I work for a tools vendor, I am a suspect, a non-neutral. I have, however, attempted to outline the benchmarking process and illuminate various traps and pitfalls with the following rules of thumb:
Do not be intimidated by exotic benchmark programs and CPUs.
Look for a mix of standard and application-specific benchmark code.
Emphasize realism both in tool usage and target performance.
Perform your own tests whenever possible.
Read the fine print--know how vendors' benchmarks were performed.
As for the benchmarks, the appropriateness of individual tests for real-world code is open to debate; the test suites in toto, however, give a reasonable picture of compiler quality. If anyone has a truly representative embedded systems benchmark suite, I'd like to hear about it. Until then, we'll continue throwing SPECs and stones at one another. Please don't use benchmarks as the sole criterion in choosing a tool set. Other important factors to consider are embedded systems features, tool integration, inter-tool communication, access to optimized production code at debug time and integration time, and vendor responsiveness.
As with any purchase, buyer beware! Choice of tools has long-term impact on the success of your project. Take benchmark data seriously, but keep the salt shaker handy.
William Weinberg is the senior product marketing manager at Microtec Research, where he drives the evolution of embedded C and C++ development tools, including compilers, debuggers, and class libraries. He also manages Microtec Research's line for Motorola, AMD, and TRON processors. Weinberg has over 15 years of experience with embedded applications, CAD, and computational linguistics.