Correct benchmarking
Benchmarks and microbenchmarks have been a staple of computer science for the better part of its history.
We spent a lot of time learning about fast code and how to measure it.
You can't control what you can't measure
— Tom DeMarco
To control software runtime we have to be able to measure it as accurately as possible.
Benchmarks, profiling and code coverage
Make it work, make it clean and make it fast. That is the three-step process to efficient software. It is critical to have software that does the correct thing, hence you should test your software and ensure you cover all the cases.
Before you even think about speed, make sure your code gives the correct results.
Then, once you are confident your code works, run some profiling with synthetic and expected loads.
Code coverage and actual execution
In certain cases we want to prevent the compiler from optimizing away results during tests, and we always want to prevent that in benchmarks. Here is how this is achieved in Google Benchmark:
// Read-only variant: pretends to read `value` in a way the compiler cannot
// see through, so the computation that produced it must actually happen.
template <class Tp>
inline void DoNotOptimize(Tp const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

// Read-write variant: the value is treated as both read and clobbered.
template <class Tp>
inline void DoNotOptimize(Tp& value) {
#if defined(__clang__)
  asm volatile("" : "+r,m"(value) : : "memory");
#else
  asm volatile("" : "+m,r"(value) : : "memory");
#endif
}
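As a quick illustration, here is a minimal sketch of a microbenchmark loop using this helper; compute_something is a hypothetical workload, and DoNotOptimize is assumed to be the function defined above:

#include <cstdint>

// Hypothetical workload whose result we want to keep alive.
[[gnu::noinline]] std::uint64_t compute_something(std::uint64_t x) {
  return x * 2654435761u;
}

int main() {
  for (std::uint64_t i = 0; i < 1'000'000; ++i) {
    auto result = compute_something(i);
    // Without this call, the optimizer is free to delete the whole loop.
    DoNotOptimize(result);
  }
}

Feeding every result into such a sink is what keeps a microbenchmark honest: the work is forced to happen, but nothing else is added to the measured path.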
Code coverage
On Clang you can use the options --coverage -fprofile-instr-generate -fcoverage-mapping to prepare the build for coverage testing. You should also turn optimizations off with -O0 and leave debug data in with -g.
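For a single-file program, a coverage build could look like the following sketch (MyProgram.cpp is a placeholder name):

clang++ MyProgram.cpp -o MyProgram \
    -O0 -g \
    --coverage -fprofile-instr-generate -fcoverage-mapping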
Then you need to run your software with prepared environment variables:
LLVM_PROFILE_FILE=MyProgram.profraw ./MyProgram
From there you can use llvm-profdata to process the output and llvm-cov to present the data in a human-readable form:
llvm-profdata merge -sparse MyProgram.profraw -o MyProgram.profdata
llvm-cov show MyProgram \
-instr-profile=MyProgram.profdata \
-show-line-counts-or-regions \
-output-dir=MyProgramOutput \
-format="html"
This outputs the data as HTML. Another option is to generate a machine-readable format for further processing. For that we use llvm-cov with the export command: its text export format is JSON and therefore a fairly obvious choice for automation.
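A minimal sketch of such an export, assuming the same MyProgram binary and profile data as above:

llvm-cov export MyProgram \
    -instr-profile=MyProgram.profdata \
    -format=text > MyProgram.coverage.json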
For more details, see the llvm-cov documentation.
You can also use CMake to build with code coverage, see George Cave's code coverage module.
Profiling with a sampling profiler
A sampling profiler samples production-grade binaries by stopping them every once in a while to inspect what their call stacks look like. The goal is to find the most common call stacks and functions and to focus our attention on those.
Sampling profilers are great because they observe the actual optimized code and can, over time, pinpoint the exact bit of code that costs you the most.
gprof is treacherously easy to use: just build with the -pg -g options in either Clang or GCC, and running the program will produce a file called gmon.out. You can then call gprof <executable file> <gmon.out file> and it will output a text report of what your application is doing, which you can analyse as described in the two sections below.
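As a sketch of the whole workflow, assuming a single translation unit called my_program.cpp:

clang++ my_program.cpp -o my_program -pg -g
./my_program                     # running it writes gmon.out in the current directory
gprof ./my_program gmon.out > gprof_report.txt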
Frequency/Time analysis
The first section of the output is the time analysis and looks like this:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
100.00 0.01 0.01 984 0.01 0.01 void random<unsigned int>::chacha_do_rounds<20>(array<unsigned int, 16ul>&)
0.00 0.01 0.00 2800660 0.00 0.00 array<unsigned int, 16ul>::operator[](unsigned long)
0.00 0.01 0.00 21227 0.00 0.00 array<unsigned int, 32ul>::size() const
0.00 0.01 0.00 10776 0.00 0.00 array<unsigned int, 32ul>::operator[](unsigned long)
0.00 0.01 0.00 10776 0.00 0.00 random<unsigned int>::current()
0.00 0.01 0.00 10541 0.00 0.00 std::numeric_limits<unsigned long>::max()
0.00 0.01 0.00 10451 0.00 0.00 random<unsigned int>::next()
0.00 0.01 0.00 10001 0.00 0.00 optional<unsigned long>::value()
0.00 0.01 0.00 10001 0.00 0.00 optional<unsigned long>::optional(unsigned long const&)
0.00 0.01 0.00 10000 0.00 0.00 map<(anonymous namespace)::TestE::start(non_owning_view<non_owning_view<char const*>*>)::{lambda(unsigned int)#1}>::map_impl<limit::limit_impl<random<unsigned int> > >::next()
0.00 0.01 0.00 10000 0.00 0.00 map<(anonymous namespace)::TestE::start(non_owning_view<non_owning_view<char const*>*>)::{lambda(unsigned int)#1}>::map_impl<limit::limit_impl<random<unsigned int> > >::current()
0.00 0.01 0.00 10000 0.00 0.00 limit::limit_impl<random<unsigned int> >::next()
0.00 0.01 0.00 10000 0.00 0.00 limit::limit_impl<random<unsigned int> >::current()
To sum it up, the higher something is, the more time is spent there. Super easy and also super helpful to direct optimization work.
Function cost analysis
Function cost analysis gives you, for each function, the time spent in the function itself and in its children. Here is a partial example:
index % time self children called name
<spontaneous>
[1] 100.0 0.00 0.01 main [1]
0.00 0.01 1/1 run_all_applications(int, char**) [5]
0.00 0.00 1/1 mediocre_allocator::mediocre_allocator() [312]
0.00 0.00 1/1 optional<abstract_memory_allocator*>::optional(abstract_memory_allocator* const&) [377]
-----------------------------------------------
0.00 0.01 984/984 random<unsigned int>::reset() [4]
[2] 100.0 0.00 0.01 984 array<unsigned int, 16ul> random<unsigned int>::chacha_block<20>(array<unsigned int, 16ul>&) [2]
0.01 0.00 984/984 void random<unsigned int>::chacha_do_rounds<20>(array<unsigned int, 16ul>&) [3]
0.00 0.00 62976/2800660 array<unsigned int, 16ul>::operator[](unsigned long) [105]
-----------------------------------------------
0.01 0.00 984/984 array<unsigned int, 16ul> random<unsigned int>::chacha_block<20>(array<unsigned int, 16ul>&) [2]
[3] 100.0 0.01 0.00 984 void random<unsigned int>::chacha_do_rounds<20>(array<unsigned int, 16ul>&) [3]
0.00 0.00 2204160/2800660 array<unsigned int, 16ul>::operator[](unsigned long) [105]
-----------------------------------------------
The function labeled with the [NUMBER] is the evaluated function. Above it are the callers/parents of that function, and below are the callees/children.
You can check the function name in the index situated at the end of the file.
Knowing that, read the explanation of the different fields included in your own output; their exact meaning is very well documented there.
Profiling with a code injection based profiler
Code injection profilers are designed to inject some code at compile time (or some placeholder) to monitor execution at runtime. They can be used to trace the execution time of functions like a microbenchmark would.
Code injection profilers are great because they give you a human-readable measure of the time taken and an exact count of how many times that time was spent on a given task.
Using LLVM XRay for that purpose sits somewhere between straightforward and complex. LLVM XRay can be enabled for all functions when compiling with recent versions of Clang using the options -fxray-instrument -fxray-instruction-threshold=1 (we will see it in more detail later).
Once compiled, you can enable profiling with environment variables, just like with llvm-cov:
XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0' ./executable_name
llvm-xray account --instr_map=./executable_name --format=csv ./xray-log.executable_name.*
llvm-xray takes the generated file, which will have a name similar to xray-log.executable_name.RaNd0m.
Bell curve and the dangers of measuring only once
If you need to remember 2 things from this post, it should be the following:
- if you hit an outlier 1/16th of the time (which is likely the case), measure at least 8 times that
- measure on real hardware with real cases, because the cache is a fickle mistress
You may wonder why something as close to the hardware as a small function would not take the same amount of time every time, to which I can say: linkage order, caches, interrupts, and any number of weird but very real shenanigans like memory timings, floating point operations, power jitter...
Randomizing the unstable
The best way to deal with the linkage order part is to test with several executables linked in a random order.
For that, we are going to use a different linker than the standard one. We will be using mold for its great section randomization feature. This feature can be seeded, which is also very good: it allows us to select noticeably faster layouts that we know are better from benchmarks.
mold is a drop-in replacement for lld or any such linker. For example, you can use it with Clang by specifying -fuse-ld=mold, provided mold is installed on your system.
You can then perform the linking stage of your executable or library build with -Wl,--shuffle-sections to shuffle the sections with a random PRNG seed.
# Build 10 differently linked copies of the same objects
for (( c=1; c<=10; c++ ))
do
    clang++ $OBJECTS -o my_exe_$c \
        -O3 -fPIC \
        -Wl,--shuffle-sections -fuse-ld=mold
done
This will build 10 executables that you can then run your benchmarks on. The more performance-sensitive your software, the more randomization will affect it.
It is also possible to pick the seed (which can be useful to reproduce bad cases or to specifically build the best case) by specifying it in the --shuffle-sections flag like so:
for (( c=1; c<=10; c++ ))
do
    seed=$RANDOM
    clang++ $OBJECTS -o my_exe_$seed \
        -O3 -fPIC \
        -Wl,--shuffle-sections=$seed -fuse-ld=mold
done
Monitoring a real workload
To deal with the rest of the issues, the only way is to face the real thing: monitor the application in production with randomized linkage and several tested executables. Luckily, we already know all the tools we need.
For that, we can enable instrumentation when building for production with -fxray-instrument -fxray-instruction-threshold=<N>, optimization with -O2 -fPIC, and section randomization with -Wl,--shuffle-sections -fuse-ld=mold, for which you can also pick a seed if your tests have found a particularly advantageous layout.
Let's get back to the instrumentation, where we have a parameter N. It is generally not useful to instrument every non-inlined function. Remember that while XRay doesn't cost CPU execution time, it pollutes the cache as much as a pretty large nop. For that reason, you will want to measure the tradeoff and decide at which point you consider a function big enough to benchmark. To do so, replace <N> in the expression above with the minimum number of instructions a function must contain to be eligible for instrumentation.
A value that I like for that is 128, which generally selects pretty big functions that involve inlining shenanigans in large quantities or loop unrolling, both of which signal large operations that are generally characteristic of data processing.
THRESHOLD=20
for (( c=1; c<=100; c++ ))
do
clang++ $OBJECTS -o sample_exe \
-O3 -fPIC \
-fxray-instrument -fxray-instruction-threshold=$THRESHOLD \
-Wl,--shuffle-sections -fuse-ld=mold
XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0' ./sample_exe
llvm-xray account --instr_map=./sample_exe --format=csv ./xray-log.sample_exe.* \
| sed '/^f/d' >> log.csv
rm ./xray-log.sample_exe.*
done
Okay, so that is good for some easy-to-build software. What about a more complicated load?
Scoring ledger-cli
#!/usr/bin/env bash
# Pick a threshold
THRESHOLD=20
# Delete old data
rm -r ./bench_build
mkdir -p ./bench_build
# Start building with the options we want
cd ./bench_build || exit
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=clang-13 \
-DCMAKE_CXX_COMPILER=clang++-13 \
-DCMAKE_C_LINKER=mold \
-DCMAKE_CXX_LINKER=mold \
-DCMAKE_C_FLAGS="${CMAKE_C_FLAGS} -Wl,--shuffle-sections -fuse-ld=mold -fxray-instrument -fxray-instruction-threshold=$THRESHOLD" \
-DCMAKE_CXX_FLAGS="${CMAKE_CXX_FLAGS} -Wl,--shuffle-sections -fuse-ld=mold -fxray-instrument -fxray-instruction-threshold=$THRESHOLD" \
-DLDFLAGS="--shuffle-sections" \
-DCMAKE_VERBOSE_MAKEFILE=ON ..
make -j12 ledger
# Prepare the environment
export XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0'
# Put a header to the log to make it easier to import to GUIs
echo "funcid,count,min,median,90%ile,99%ile,max,sum,debug,function" > log.csv
# We do 25 passes over all the software tests
for (( c=1; c<=25; c++ ))
do
# Run the tests, they all make quite a mess of files
make test
# Process the first mess
n=0
for file in test/xray-log.ledger.*;
do
n=`expr $n + 1`
llvm-xray-13 account --instr_map=./ledger --format=csv "$file" \
| sed '/^f/d' >> log.csv
rm "$file"
done
echo "Processed $n files in test"
# Process the second mess
n=0
for file in ../xray-log.ledger.*;
do
n=`expr $n + 1`
llvm-xray-13 account --instr_map=./ledger --format=csv "$file" \
| sed '/^f/d' >> log.csv
rm "$file"
done
echo "Processed $n files in parent"
# Force the relink by pretending to change a small-ish file from the build
touch ../src/scope.cc
make -j2 ledger
done
This generates a large log.csv file (for ledger it is around 220MB).
Now we can process these logs:
# Headers for readability in GUIs
echo "time,function" > processed_log.csv
# For each function present in the log-file, we run the fragment analysis script
cat log.csv \
| sed '/^f/d' \
| gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"}{print $10}' \
| sort \
| uniq \
| xargs -I{} ../fragment_analysis.sh log.csv "{}" \
>> processed_log.csv
fragment_analysis.sh
#!/usr/bin/env bash
grep -s "$2" "$1" \
| gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"} {printf "%.9f,%d,%s\n",$4,$2,$10}' \
| sort -n \
| gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")";acc=0}{name=$3;cd=$2;while(cd != 0) {list[acc]=$1;acc=acc+1;cd--}}END {if(length(name) != 0) {if(acc%2) {printf "%.9f,%s\n",list[(acc-1)/2],name} else {printf "%.9f,%s\n", (list[acc/2 - 1]+list[acc/2])/2.0,name}}}'
The first part extracts the lines for the given function from the file, then cherry-picks the number of runs and the median along with the name. We then sort that and compute the median of medians with GNU awk (split out below for clarity):
BEGIN {
    # We split by CSV fields (quoted fields may contain commas)
    FPAT="([^,]+)|(\"[^\"]+\")"
    acc=0
}
{
    # Collect the name for later
    name=$3;
    # cd = countdown, because we have to add the median to the array once per run
    cd=$2;
    while(cd != 0) {
        list[acc]=$1;
        acc=acc+1;
        cd--
    }
}
END {
    # I had a lot of trouble with empty inputs reaching here, so we filter that
    if(length(name) != 0) {
        if(acc%2) {
            # Median for an odd count is the middle element
            printf "%.9f,%s\n", list[(acc-1)/2], name
        } else {
            # Median for an even count is the average of the 2 middle elements
            printf "%.9f,%s\n", (list[acc/2 - 1] + list[acc/2])/2.0, name
        }
    }
}
This outputs a csv from which the following extract was collected:
0.000061820,"boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::~basic_format()"
0.000005538,"boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::make_or_reuse_data(unsigned long)"
0.000012501,"boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::parse(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)"
0.000006786,"boost::date_time::gregorian_calendar_base<boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>, unsigned int>::from_day_number(unsigned int)"
0.000007285,"boost::date_time::month_formatter<boost::gregorian::greg_month, boost::date_time::iso_extended_format<char>, char>::format_month(boost::gregorian::greg_month const&, std::ostream&)"
0.000005853,"boost::detail::lcast_ret_unsigned<std::char_traits<char>, unsigned short, char>::convert()"
0.000040052,"boost::filesystem::basic_ifstream<char, std::char_traits<char> >::~basic_ifstream()"
0.000010118,"boost::function1<ledger::value_t, ledger::call_scope_t&>::~function1()"
0.000007141,"boost::function1<ledger::value_t, ledger::call_scope_t&>::move_assign(boost::function1<ledger::value_t, ledger::call_scope_t&>&)"
0.000007206,"boost::function1<ledger::value_t, ledger::call_scope_t&>::swap(boost::function1<ledger::value_t, ledger::call_scope_t&>&)"
0.000011250,"boost::gregorian::date::date(boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day)"
0.000009347,"boost::io::basic_altstringbuf<char, std::char_traits<char>, std::allocator<char> >::overflow(int)"
0.000000000,"boost::io::basic_altstringbuf<char, std::char_traits<char>, std::allocator<char> >::seekpos(std::fpos<__mbstate_t>, std::_Ios_Openmode)"
0.000008337,"boost::io::detail::format_item<char, std::char_traits<char>, std::allocator<char> >::format_item(boost::io::detail::format_item<char, std::char_traits<char>, std::allocator<char> > const&)"
0.000038230,"boost::optional_detail::optional_base<ledger::date_specifier_or_range_t>::destroy()"
0.000007835,"boost::optional_detail::optional_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::assign(boost::optional_detail::optional_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)"
0.000009140,"boost::variant<boost::blank, boost::intrusive_ptr<ledger::expr_t::op_t>, ledger::value_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::function<ledger::value_t (ledger::call_scope_t&)>, boost::shared_ptr<ledger::scope_t> >::destroy_content()"
0.000010321,"boost::variant<boost::blank, boost::intrusive_ptr<ledger::expr_t::op_t>, ledger::value_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::function<ledger::value_t (ledger::call_scope_t&)>, boost::shared_ptr<ledger::scope_t> >::variant_assign(boost::variant<boost::blank, boost::intrusive_ptr<ledger::expr_t::op_t>, ledger::value_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::function<ledger::value_t (ledger::call_scope_t&)>, boost::shared_ptr<ledger::scope_t> >&&)"
0.000007918,"int boost::io::detail::upper_bound_from_fstring<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::ctype<char> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::value_type, std::ctype<char> const&, unsigned char)"
0.000107349,"ledger::bind_scope_t::define(ledger::symbol_t::kind_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::intrusive_ptr<ledger::expr_t::op_t>)"
0.000009056,"ledger::bind_scope_t::lookup(ledger::symbol_t::kind_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)"
0.000061951,"ledger::child_scope_t::define(ledger::symbol_t::kind_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::intrusive_ptr<ledger::expr_t::op_t>)"
0.000007579,"ledger::date_interval_t::~date_interval_t()"
Scoring and comparing apples to oranges statistically
Automated benchmarking is, like automated testing, fuzzing, code coverage analysis and other practices, a great source of confidence in software and a good measure of expectations and regressions.
So let's make a standard for our measures. We cannot reliably compare raw seconds, as not all CPUs run exactly the same, so we will define a reference function that we will compare everything to.
#include <cstdint>
#include <numeric>
#include <random>
#include <set>
#include <string>

struct Standard {
    std::set<uint32_t> source;
    Standard() {
        std::random_device dev;
        // 250MB worth of uint32_t
        while(source.size() < 65536000) {
            source.insert(dev());
        }
    }
    double operator()() {
        std::set<std::string> data;
        for(auto elem : source) {
            data.insert(std::to_string(elem));
        }
        // Return the average length of the strings
        return std::accumulate(
            data.begin(),
            data.end(),
            0.0,
            [](double acc, const std::string& s){
                return acc + s.size();
            }
        ) / data.size();
    }
};
And of course, we run this several times, with the result given to a DoNotOptimize function, until we have enough samples for a stable median. And we do not forget to randomize the sections!
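Before the build loop, here is a minimal sketch of what the standard.cpp driver could look like, assuming the Standard struct above and the DoNotOptimize helper from earlier are in scope (the sample count of 32 is an arbitrary choice):

int main() {
    Standard standard;
    for (int i = 0; i < 32; ++i) {
        double average_length = standard();
        // Keep each result alive so the calls cannot be optimized away
        DoNotOptimize(average_length);
    }
}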
# Take 100 samples of random linkages
for (( c=1; c<=100; c++ ))
do
clang++ standard.cpp -o standard_exe \
-O3 -fPIC \
-fxray-instrument -fxray-instruction-threshold=1 \
-Wl,--shuffle-sections -fuse-ld=mold
XRAY_OPTIONS='patch_premain=true xray_mode=xray-basic verbosity=0' ./standard_exe
llvm-xray account --instr_map=./standard_exe --format=csv ./xray-log.standard_exe.* \
| sed '/^f/d' >> log.csv
rm ./xray-log.standard_exe.*
done
Of course, running this a hundred times will take a noticeable amount of time and may not actually be required (10 to 15 times is probably sufficient to derive a standard).
From there, running the following line will give us an average of the medians between all executions:
grep operator log.csv \
| awk -F ',' '{sum+=$4;cnt+=1} END{print sum/cnt}'
This value is the standard on the system that executes this. We express that time as 1S (capital S because this is not seconds).
For example, on my system, we have 1S = 31.3735.
A function like the "Hello world" one below is measured at 0.00002994, which is 0.00002994 / 31.3735 ≈ 9.54E-7 S, or 0.954µS:
[[gnu::noinline]] auto& execute_test() {
return std::cout << "Hello world\n";
}
The test is performed here in a similar way, but it can also be done on data from production or, for automated benchmarking, with either a replay of production inputs or unit tests. What is critical is that performance expressed relative to the standard unit is much closer to an objective measure of performance than raw seconds.
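As an illustration, converting the processed medians into S units is a one-liner; the 31.3735 below is the standard measured on my system and would be replaced by your own value, and processed_log_S.csv is just a hypothetical output name:

gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"; standard=31.3735}
      NR==1 {print "time_in_S,function"; next}
      {printf "%.9f,%s\n", $1/standard, $2}' processed_log.csv > processed_log_S.csv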
If you like my content and want more, please donate. If everyone that finds my content useful paid $1 every week, I would be able to produce content for 1 week a month without relying on other sources of income for that week.