
Correct benchmarking

Benchmarks and microbenchmarks have been a staple of computer science for the better part of its history.

We spent a lot of time learning about fast code and how to measure it.

You can't control what you can't measure
— Tom DeMarco

To control software runtime we have to be able to measure it as accurately as possible.

Benchmarks, profiling and code coverage

Make it work, make it clean, make it fast. That is the three-step process to efficient software. It is critical to have software that does the correct thing, so test your software and make sure you cover all the cases.

Before you even think about speed, make sure your code gives the correct results.

Then, once you are confident your code works, run some profiling with synthetic and expected loads.

Code coverage and actual execution

In certain cases we want to prevent the compiler from optimizing results away during tests, and we always want to do that in benchmarks. Here is how this is achieved in Google Benchmark:

template <class Tp>
inline void DoNotOptimize(Tp const& value) {
  // An empty asm statement that claims to read `value` from a register or
  // memory, forcing the compiler to actually materialize it.
  asm volatile("" : : "r,m"(value) : "memory");
}

template <class Tp>
inline void DoNotOptimize(Tp& value) {
  // The "+" constraint marks `value` as both read and written, so the
  // compiler cannot assume it is unchanged; the "memory" clobber prevents
  // reordering memory accesses across this point.
#if defined(__clang__)
  asm volatile("" : "+r,m"(value) : : "memory");
#else
  asm volatile("" : "+m,r"(value) : : "memory");
#endif
}
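
As a minimal sketch of how this is typically used inside a Google Benchmark microbenchmark (the function name BM_Accumulate and the Arg value are hypothetical; benchmark::DoNotOptimize and the BENCHMARK macros come from the library):

#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

static void BM_Accumulate(benchmark::State& state) {
  std::vector<int> data(state.range(0), 1);
  for (auto _ : state) {
    int sum = std::accumulate(data.begin(), data.end(), 0);
    // Without this, the compiler could delete the whole loop body.
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(BM_Accumulate)->Arg(1 << 16);
BENCHMARK_MAIN();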

Code coverage

On Clang you can use the options --coverage -fprofile-instr-generate -fcoverage-mapping to prepare the build for coverage testing. You should also turn optimizations off with -O0 and leave debug data with -g.

Then you need to run your software with prepared environment variables:

LLVM_PROFILE_FILE=MyProgram.profraw ./MyProgram 

From there you can use llvm-profdata to process the output and llvm-cov to present the data in a human-readable form:

llvm-profdata merge -sparse MyProgram.profraw -o MyProgram.profdata
llvm-cov show MyProgram \ 
	-instr-profile=MyProgram.profdata \
	-show-line-counts-or-regions \
	-output-dir=MyProgramOutput \
	-format="html"

This outputs the data as HTML. Another solution is to generate a machine-readable format for further processing. For that we use llvm-cov with the export command; its text export format is JSON and hence a fairly obvious choice for automation.
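
For example, something along these lines produces a JSON report (the output file name is arbitrary):

llvm-cov export MyProgram \
	-instr-profile=MyProgram.profdata \
	-format=text > MyProgram.coverage.json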

For more details, see the llvm-cov documentation.

You can also use CMake to build with code coverage, see George Cave's code coverage module.

Profiling with a sampling profiler

A sampling profiler is a profiler that samples production-grade binaries by stopping them every once in a while to inspect what their call stacks look like. The goal is to find the most common call stacks and functions and to focus our attention on those.

Sampling profilers are great because they actually log the optimized code and can, over time, pinpoint the exact bit of code that costs you the most.

gprof is treacherously easy to use: just build with the -pg -g options in either Clang or GCC, and running the resulting executable will produce a file called gmon.out. You can then call gprof <executable file> <gmon.out file> and it will output a text report of what your application is doing that you can analyse.
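
A minimal sketch of the whole flow (assuming a Linux toolchain; file names are hypothetical):

clang++ -pg -g main.cpp -o my_app
./my_app                  # writes gmon.out in the current directory
gprof my_app gmon.out > report.txt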

Frequency/Time analysis

The first section of the output is the time analysis and looks like this:

  %   cumulative   self              self     total             
 time   seconds   seconds    calls  ms/call  ms/call  name      
100.00      0.01     0.01      984     0.01     0.01  void random<unsigned int>::chacha_do_rounds<20>(array<unsigned int, 16ul>&)  
  0.00      0.01     0.00  2800660     0.00     0.00  array<unsigned int, 16ul>::operator[](unsigned long)  
  0.00      0.01     0.00    21227     0.00     0.00  array<unsigned int, 32ul>::size() const  
  0.00      0.01     0.00    10776     0.00     0.00  array<unsigned int, 32ul>::operator[](unsigned long)  
  0.00      0.01     0.00    10776     0.00     0.00  random<unsigned int>::current()  
  0.00      0.01     0.00    10541     0.00     0.00  std::numeric_limits<unsigned long>::max()  
  0.00      0.01     0.00    10451     0.00     0.00  random<unsigned int>::next()  
  0.00      0.01     0.00    10001     0.00     0.00  optional<unsigned long>::value()  
  0.00      0.01     0.00    10001     0.00     0.00  optional<unsigned long>::optional(unsigned long const&)  
  0.00      0.01     0.00    10000     0.00     0.00  map<(anonymous namespace)::TestE::start(non_owning_view<non_owning_view<char const*>*>)::{lambda(unsigned int)#1}>::map_impl<limit::limit_impl<random<unsigned int> > >::next()  
  0.00      0.01     0.00    10000     0.00     0.00  map<(anonymous namespace)::TestE::start(non_owning_view<non_owning_view<char const*>*>)::{lambda(unsigned int)#1}>::map_impl<limit::limit_impl<random<unsigned int> > >::current()  
  0.00      0.01     0.00    10000     0.00     0.00  limit::limit_impl<random<unsigned int> >::next()  
  0.00      0.01     0.00    10000     0.00     0.00  limit::limit_impl<random<unsigned int> >::current()

To sum it up, the higher a function appears in this list, the more time is spent in it. Super easy and also super helpful to direct optimization work.

Function cost analysis

Function cost analysis gives you, for each function, the time spent in the function itself and in its children. Here is a partial example:

index % time    self  children    called     name  
                                                 <spontaneous>  
[1]    100.0    0.00    0.01                 main [1]  
                0.00    0.01       1/1           run_all_applications(int, char**) [5]  
                0.00    0.00       1/1           mediocre_allocator::mediocre_allocator() [312]  
                0.00    0.00       1/1           optional<abstract_memory_allocator*>::optional(abstract_memory_allocator* const&) [377]  
-----------------------------------------------  
                0.00    0.01     984/984         random<unsigned int>::reset() [4]  
[2]    100.0    0.00    0.01     984         array<unsigned int, 16ul> random<unsigned int>::chacha_block<20>(array<unsigned int, 16ul>&) [2]  
                0.01    0.00     984/984         void random<unsigned int>::chacha_do_rounds<20>(array<unsigned int, 16ul>&) [3]  
                0.00    0.00   62976/2800660     array<unsigned int, 16ul>::operator[](unsigned long) [105]  
-----------------------------------------------  
                0.01    0.00     984/984         array<unsigned int, 16ul> random<unsigned int>::chacha_block<20>(array<unsigned int, 16ul>&) [2]  
[3]    100.0    0.01    0.00     984         void random<unsigned int>::chacha_do_rounds<20>(array<unsigned int, 16ul>&) [3]  
                0.00    0.00 2204160/2800660     array<unsigned int, 16ul>::operator[](unsigned long) [105]  
-----------------------------------------------

The function labeled with the [NUMBER] is the evaluated function. Above it are the callers/parents of that function, and below are the callees/children.

You can look up function names in the index at the end of the report.

Knowing that, read your own output for the exact meaning of the different fields, as it is very well explained there.

Profiling with a code injection based profiler

Code injection profilers are designed to inject some code at compile time (or some placeholder) to monitor execution at runtime. They can be used to trace the execution time of functions like a microbenchmark would.

Code injection profilers are great because they give you a human-readable measure of the time taken and an exact count of how many times that cost was paid for a given task.

LLVM XRay, which sits somewhere between straightforward and complex to use, is a good fit for that purpose. It can be enabled for all functions when compiling with recent versions of Clang with the options -fxray-instrument -fxray-instruction-threshold=1 (we will look at the threshold in more detail later).
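
For instance, with a hypothetical single-file build:

clang++ -O2 -g main.cpp -o executable_name \
	-fxray-instrument -fxray-instruction-threshold=1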

Once compiled, you can enable profiling with environment variables just like llvm-cov:

XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0' ./executable_name  
llvm-xray account --instr_map=./executable_name --format=csv ./xray-log.executable_name.*

llvm-xray takes the generated file, which will have a name similar to xray-log.executable_name.RaNd0m.

Bell curve and the dangers of measuring only once

If you need to remember 2 things from this post, they are the following:

  • if you hit an outlier 1/16th of the time (which is likely the case), measure at least 8 times that many runs
  • measure on real hardware with real cases, because the cache is a fickle mistress

You may wonder why something as close to the hardware as a small function would not take the same amount of time every time, to which I can say: linkage order, caches, interrupts, and any number of weird but very real shenanigans like memory timings, floating point operations, power jitter...

Randomizing the unstable

The best way to deal with the linkage order part is to test with several executables linked in a random order.

For that, we are going to use a different linker than the standard one. We will be using mold for its great section randomization feature. This feature can be seeded, which is also very useful: it allows us to reproduce and select noticeably faster layouts that benchmarks have shown to be better.

mold is a drop-in replacement for lld or any such linker. For example, you can use it with Clang by specifying -fuse-ld=mold if you have mold installed on your system.

You can then perform the linking stage of your executable or library build with -Wl,--shuffle-sections to shuffle sections with a random PRNG seed.


for (( c=1; c<=10; c++ ))  
do
	clang++ $OBJECTS -o my_exe_$c \
		-O3 -fPIC \
		-Wl,--shuffle-sections -fuse-ld=mold
done

This will proceed to build 10 executables that you can then run your benchmarks on. The more performance sensitive your software, the more randomization will affect it.

It is also possible to pick the seed (which can be useful to reproduce bad cases or to specifically build the best case) by specifying it in the --shuffle-sections flag like so:


for (( c=1; c<=10; c++ ))  
do
	seed=$RANDOM
	clang++ $OBJECTS -o my_exe_$seed \
		-O3 -fPIC \
		-Wl,--shuffle-sections=$seed -fuse-ld=mold
done

Monitoring a real workload

To deal with the rest of the issues, the only way is to face the real thing: monitoring the application in production with randomized linkage and several tested executables. Luckily, we already know all the tools we need.

For that, we can enable instrumentation when building for production with -fxray-instrument -fxray-instruction-threshold=<N>, optimization with -O2 -fPIC, and section randomization with -Wl,--shuffle-sections -fuse-ld=mold, for which you can also pick a seed if your tests have found particularly advantageous ones.

Let's get back to the instrumentation, where we have a parameter N. It is generally not useful to instrument every non-inlined function. Remember that while XRay doesn't cost CPU execution time, it pollutes the cache as much as a pretty large nop. For that reason, you will want to measure the trade-off and decide at which point you consider a function big enough to benchmark. Replace <N> in the above expression with the minimum number of instructions a function must have to be eligible for instrumentation.

A value that I like for that is 128, which generally selects pretty big functions involving large amounts of inlining or loop unrolling, both of which signal large operations that are generally characteristic of data processing.

THRESHOLD=20
for (( c=1; c<=100; c++ ))  
do
	clang++ $OBJECTS -o sample_exe \
		-O3 -fPIC \
		-fxray-instrument -fxray-instruction-threshold=$THRESHOLD \
		-Wl,--shuffle-sections -fuse-ld=mold
	XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0' ./sample_exe  
	llvm-xray account --instr_map=./sample_exe --format=csv ./xray-log.sample_exe.* \
	| sed '/^f/d' >> log.csv  
	rm ./xray-log.sample_exe.*
done

Okay, so that is good for some easy-to-build software. What about a more complicated load?

Scoring ledger-cli

#!/usr/bin/env bash

# Pick a threshold
THRESHOLD=20  

# Delete old data
rm -r ./bench_build  
mkdir -p ./bench_build  

# Start building with the options we want
cd ./bench_build || exit  
cmake \  
  -DCMAKE_BUILD_TYPE=Release \  
  -DCMAKE_C_COMPILER=clang-13 \  
  -DCMAKE_CXX_COMPILER=clang++-13 \  
  -DCMAKE_C_LINKER=mold \  
  -DCMAKE_CXX_LINKER=mold \  
  -DCMAKE_C_FLAGS="${CMAKE_C_FLAGS} -Wl,--shuffle-sections -fuse-ld=mold -fxray-instrument -fxray-instruction-threshold=$THRESHOLD" \  
  -DCMAKE_CXX_FLAGS="${CMAKE_CXX_FLAGS} -Wl,--shuffle-sections -fuse-ld=mold -fxray-instrument -fxray-instruction-threshold=$THRESHOLD" \  
  -DLDFLAGS="--shuffle-sections" \  
  -DCMAKE_VERBOSE_MAKEFILE=ON ..  

make -j12 ledger  

# Prepare the environment
export XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0'  

# Put a header to the log to make it easier to import to GUIs
echo "funcid,count,min,median,90%ile,99%ile,max,sum,debug,function" > log.csv  

# We do 25 passes over all the software tests
for (( c=1; c<=25; c++ ))  
do
  # Run the tests, they all make quite a mess of files
  make test

  # Process the first mess
  n=0  
  for file in test/xray-log.ledger.*;  
  do  
    n=`expr $n + 1`  
    llvm-xray-13 account --instr_map=./ledger --format=csv "$file" \
    | sed '/^f/d' >> log.csv  
    rm "$file"  
  done  
  echo "Processed $n files in test"  
  

  # Process the second mess
  n=0  
  for file in ../xray-log.ledger.*;  
  do  
    n=`expr $n + 1`  
    llvm-xray-13 account --instr_map=./ledger --format=csv "$file" \
    | sed '/^f/d' >> log.csv  
    rm "$file"  
  done  
  echo "Processed $n files in parent"  
  

  # Force the relink by pretending to change a small-ish file from the build
  touch ../src/scope.cc  
  make -j2 ledger  
done

This generates a large log.csv file (for ledger it is around 220MB).

Now we can process these logs:

# Headers for readability in GUIs
echo "time,function" > processed_log.csv  

# For each function present in the log-file, we run the fragment analysis script
cat log.csv \  
  | sed '/^f/d' \  
  | gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"}{print $10}' \  
  | sort \  
  | uniq \  
  | xargs -I{} ../fragment_analysis.sh log.csv "{}" \  
  >> processed_log.csv

fragment_analysis.sh

#!/usr/bin/env bash  
  
grep -s "$2" "$1" \  
  | gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"} {printf "%.9f,%d,%s\n",$4,$2,$10}' \  
  | sort \  
  | gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"}{name=$3;cd=$2;while(cd != 0) {list[acc]=$1;acc=acc+1;cd--}}END {if(length(name) != 0) {if(acc%2) {printf "%.9f,%s\n",list[(acc+1)/2],name} else {printf "%.9f,%s\n", (list[(acc/2)]+list[(acc/2)+1])/2.0,name}}}'

The first part extracts the functions from the file, then cherry-picks the number of runs and the median along with the name. We then sort that and compute the median of medians with GNU awk (split out below for clarity):

BEGIN {
	# We split by csv
	FPAT="([^,]+)|(\"[^\"]+\")"
}
{
	# Collect the name for later
	name=$3;
	# cd = countdown because we have to add to the array as many medians as exist
	cd=$2;
	while(cd != 0) {
		list[acc]=$1;
		acc=acc+1;
		cd--
	}
}
END {
	# I had a lot of trouble with empty inputs reaching here so we filter that
	if(length(name) != 0) {
		if(acc%2) {
			# Median for odd is right in the middle
			printf "%.9f,%s\n",list[(acc+1)/2],name
		} else {
			# Median for even is the average of the 2 middle values
			printf "%.9f,%s\n", (list[(acc/2)]+list[(acc/2)+1])/2.0,name
		}
	}
}

This outputs a csv from which the following extract was collected:

0.000061820,"boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::~basic_format()"  
0.000005538,"boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::make_or_reuse_data(unsigned long)"  
0.000012501,"boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::parse(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)"  
0.000006786,"boost::date_time::gregorian_calendar_base<boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>, unsigned int>::from_day_number(unsigned int)"  
0.000007285,"boost::date_time::month_formatter<boost::gregorian::greg_month, boost::date_time::iso_extended_format<char>, char>::format_month(boost::gregorian::greg_month const&, std::ostream&)"  
0.000005853,"boost::detail::lcast_ret_unsigned<std::char_traits<char>, unsigned short, char>::convert()"  
0.000040052,"boost::filesystem::basic_ifstream<char, std::char_traits<char> >::~basic_ifstream()"  
0.000010118,"boost::function1<ledger::value_t, ledger::call_scope_t&>::~function1()"  
0.000007141,"boost::function1<ledger::value_t, ledger::call_scope_t&>::move_assign(boost::function1<ledger::value_t, ledger::call_scope_t&>&)"  
0.000007206,"boost::function1<ledger::value_t, ledger::call_scope_t&>::swap(boost::function1<ledger::value_t, ledger::call_scope_t&>&)"  
0.000011250,"boost::gregorian::date::date(boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day)"  
0.000009347,"boost::io::basic_altstringbuf<char, std::char_traits<char>, std::allocator<char> >::overflow(int)"  
0.000000000,"boost::io::basic_altstringbuf<char, std::char_traits<char>, std::allocator<char> >::seekpos(std::fpos<__mbstate_t>, std::_Ios_Openmode)"  
0.000008337,"boost::io::detail::format_item<char, std::char_traits<char>, std::allocator<char> >::format_item(boost::io::detail::format_item<char, std::char_traits<char>, std::allocator<char> > const&)"  
0.000038230,"boost::optional_detail::optional_base<ledger::date_specifier_or_range_t>::destroy()"  
0.000007835,"boost::optional_detail::optional_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::assign(boost::optional_detail::optional_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)"  
0.000009140,"boost::variant<boost::blank, boost::intrusive_ptr<ledger::expr_t::op_t>, ledger::value_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::function<ledger::value_t (ledger::call_scope_t&)>, boost::shared_ptr<ledger::scope_t> >::destroy_content()"  
0.000010321,"boost::variant<boost::blank, boost::intrusive_ptr<ledger::expr_t::op_t>, ledger::value_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::function<ledger::value_t (ledger::call_scope_t&)>, boost::shared_ptr<ledger::scope_t> >::variant_assign(boost::variant<boost::blank, boost::intrusive_ptr<ledger::expr_t::op_t>, ledger::value_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::function<ledger::value_t (ledger::call_scope_t&)>, boost::shared_ptr<ledger::scope_t> >&&)"  
0.000007918,"int boost::io::detail::upper_bound_from_fstring<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::ctype<char> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::value_type, std::ctype<char> const&, unsigned char)"  
0.000107349,"ledger::bind_scope_t::define(ledger::symbol_t::kind_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::intrusive_ptr<ledger::expr_t::op_t>)"  
0.000009056,"ledger::bind_scope_t::lookup(ledger::symbol_t::kind_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)"  
0.000061951,"ledger::child_scope_t::define(ledger::symbol_t::kind_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::intrusive_ptr<ledger::expr_t::op_t>)"  
0.000007579,"ledger::date_interval_t::~date_interval_t()"

Scoring and comparing apples to oranges statistically

Automated benchmarking is, like automated testing, fuzzing, code coverage analysis and other practices, a great source of confidence in software and a good measure of expectations and regressions.

So let's make a standard for our measures. We cannot reliably measure in seconds, as not all CPUs run at exactly the same speed, so we will define a reference function that we will compare everything to.

#include <cstdint>
#include <numeric>
#include <random>
#include <set>
#include <string>

struct Standard {
	std::set<uint32_t> source;

	Standard() {
		std::random_device dev;
		// 250MB worth of uint32_t
		while(source.size() < 65536000) {
			source.insert(dev());
		}
	}

	double operator()() {
		std::set<std::string> data;

		for(auto elem : source) {
			data.insert(std::to_string(elem));
		}

		// Return the average length of the strings
		return std::accumulate(
			data.begin(),
			data.end(),
			0.0,
			[](double acc, const std::string& s) {
				return acc + s.size();
			}
		) / data.size();
	}
};

And of course, we run this several times, feeding the result to a DoNotOptimize function, until we have enough samples for a stable median. And we do not forget to randomize the sections!

# Take 100 samples of random linkages
for (( c=1; c<=100; c++ ))  
do
	clang++ standard.cpp -o standard_exe \
		-O3 -fPIC \
		-fxray-instrument -fxray-instruction-threshold=1 \
		-Wl,--shuffle-sections -fuse-ld=mold
	XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0' ./standard_exe  
	llvm-xray account --instr_map=./standard_exe --format=csv ./xray-log.standard_exe.* \
	| sed '/^f/d' >> log.csv  
	rm ./xray-log.standard_exe.*
done

Of course, running this a hundred times will take a noticeable amount of time and may not actually be required (10 to 15 runs are probably sufficient to derive a standard).

From there, running the following line will give us an average of the medians between all executions:

grep operator log.csv \
| awk -F ',' '{sum+=$4;cnt+=1} END{print sum/cnt}'

This value is the standard on the system that executes this. We express that time as 1S (capital S because this is not seconds).

For example, on my system, we have 1S = 31.3735.

A function like the "Hello world" below is measured at 0.00002994 seconds, which gives 0.00002994 / 31.3735 ≈ 9.54E-7 S = 0.954µS:

[[gnu::noinline]] auto& execute_test() {  
    return std::cout << "Hello world\n";  
}

The test is performed here in a similar way, but it can also be done with data from production or, for automated benchmarking, with either a replay of production inputs or unit tests. What is critical is that performance expressed relative to the standard is much closer to an objective measure of performance.
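
As a hypothetical sketch, assuming the 1S = 31.3735 value measured above and the processed_log.csv produced earlier, the medians can be converted into S units with the same FPAT trick:

gawk -v standard=31.3735 '
	BEGIN { FPAT="([^,]+)|(\"[^\"]+\")" }
	NR > 1 { printf "%.9f,%s\n", $1/standard, $2 }
' processed_log.csv > scored_log.csv

From there, regressions show up as changes in these relative scores rather than in raw seconds, which makes results comparable across machines.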
