Implementing x² in C++
Many equations require to compute the square of some quantity x
.
Unfortunately, C++ does not provide an operator for exponentiation like Julia or Fortran.
One can rely on two solutions:
- Just multiply the value by itself:
x * x
; - Use the
std::pow()
function, defined in<cmath>
.
The first solution has the advantage of requiring no separate #import
, but the second one is better if the value you are going to square is not stored in a variable:
double result1{(a + b) * (a + b)}; // Ugly to read & write
double result2{std::pow(a + b, 2)}; // Much better
However, are the two solutions really the same? Are x * x
and std::pow(x, 2)
equivalent, or is there some difference?
Let’s write some code to benchmark the two approaches. We compute the sum of the squares of all the integer numbers between 1 and 10 millions with both methods to determine which implementation is the fastest. Here are the two implementations:
double pow_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += pow(i, 2.0);
}
return sum;
}
double times_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += i * i;
}
return sum;
}
A good method when profiling code is to run every benchmark a few times and to take the minimum value among the elapsed time measurements.
So let’s write a function that does this and uses <chrono>
to measure the time:
void run_benchmark(const char *name, int N, auto fn) {
array<std::chrono::duration<double>, 5> samples;
double sum{};
for (int k{}; k < ssize(samples); ++k) {
const auto start = std::chrono::steady_clock::now();
sum = fn(N);
const auto end = std::chrono::steady_clock::now();
samples[k] = end - start;
}
auto result{std::ranges::min_element(samples)};
cout << format("{}: {} s (result is {})\n", name, *result, sum);
}
The full source code of the program is the following:
#include <algorithm>
#include <array>
#include <chrono>
#include <cmath>
#include <format>
#include <iostream>
using namespace std;
double pow_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += pow(i, 2.0);
}
return sum;
}
double times_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += i * i;
}
return sum;
}
void run_benchmark(const char *name, int N, auto fn) {
array<std::chrono::duration<double>, 5> samples;
double sum{};
for (int k{}; k < ssize(samples); ++k) {
const auto start = std::chrono::steady_clock::now();
sum = fn(N);
const auto end = std::chrono::steady_clock::now();
samples[k] = end - start;
}
auto result{std::ranges::min_element(samples)};
cout << format("{}: {} s (result is {})\n", name, *result, sum);
}
int main() {
const int N{10'000'000};
run_benchmark("pow", N, pow_test);
run_benchmark("x * x", N, times_test);
}
(The code uses <format>
and std::ranges::min_element
, so be sure to use the -std=c++23
flag and a reasonably recent C++ compiler; mine is GCC 14.2.1.)
The results on my laptop are the following:
pow: 0.184192s s (result is 333333283333717098496)
x * x: 0.0386399s s (result is 333333283333717098496)
Wow! So x * x
is nearly five times faster than pow
! What’s happened?
Using GCC 14.2, the line sum += pow(i, 2.0)
produces the following machine code:
movsd xmm0, QWORD PTR .LC7[rip]
mov rax, QWORD PTR [rbp-16]
movapd xmm1, xmm0
movq xmm0, rax
call pow
movsd xmm1, QWORD PTR [rbp-8]
addsd xmm0, xmm1
movsd QWORD PTR [rbp-8], xmm0
So, as expected, the machine code loads the parameters to pow
into registers and then calls the pow
function.
And calling a function is expensive!
The line sum += x * x
produces this:
movsd xmm0, QWORD PTR [rbp-16]
mulsd xmm0, xmm0
movsd xmm1, QWORD PTR [rbp-8]
addsd xmm0, xmm1
movsd QWORD PTR [rbp-8], xmm0
which has no call
instruction: it is just the plain implementation of the operation x * x
, using mulsd
.
The take-away message is: when computing the square of a number in C++ you should always use x * x
.
But what about the ugly line we saw above:
double result1{(a + b) * (a + b)};
We can improve the code via a template function:
template<typename T>
T square(T x) {
return x * x;
}
so that the line becomes
double result1{square(a + b)};
Adding a corresponding case to the benchmark above produces the following results:
pow: 0.176948s s (result is 333333283333717098496)
x * x: 0.0394336s s (result is 333333283333717098496)
square: 0.0399737s s (result is 333333283333717098496)
which shows no significant loss of performance when using a template
.
If you do not want to use templates because you are sure you will always use double
values, you can define inline double square(double x)
, and the benchmark result will be the same.