Implementing x² in C++
Many equations require to compute the square of some quantity x.
Unfortunately, C++ does not provide an operator for exponentiation like Julia or Fortran.
One can rely on two solutions:
- Just multiply the value by itself:
x * x; - Use the
std::pow()function, defined in<cmath>.
The first solution has the advantage of requiring no separate #import, but the second one is better if the value you are going to square is not stored in a variable:
double result1{(a + b) * (a + b)}; // Ugly to read & write
double result2{std::pow(a + b, 2)}; // Much better
However, are the two solutions really the same? Are x * x and std::pow(x, 2) equivalent, or is there some difference?
Let’s write some code to benchmark the two approaches. We compute the sum of the squares of all the integer numbers between 1 and 10 millions with both methods to determine which implementation is the fastest. Here are the two implementations:
double pow_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += pow(i, 2.0);
}
return sum;
}
double times_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += i * i;
}
return sum;
}
A good method when profiling code is to run every benchmark a few times and to take the minimum value among the elapsed time measurements.
So let’s write a function that does this and uses <chrono> to measure the time:
void run_benchmark(const char *name, int N, auto fn) {
array<std::chrono::duration<double>, 5> samples;
double sum{};
for (int k{}; k < ssize(samples); ++k) {
const auto start = std::chrono::steady_clock::now();
sum = fn(N);
const auto end = std::chrono::steady_clock::now();
samples[k] = end - start;
}
auto result{std::ranges::min_element(samples)};
cout << format("{}: {} s (result is {})\n", name, *result, sum);
}
The full source code of the program is the following:
#include <algorithm>
#include <array>
#include <chrono>
#include <cmath>
#include <format>
#include <iostream>
using namespace std;
double pow_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += pow(i, 2.0);
}
return sum;
}
double times_test(int N) {
double sum{};
for (double i{}; i < N; i += 1.0) {
sum += i * i;
}
return sum;
}
void run_benchmark(const char *name, int N, auto fn) {
array<std::chrono::duration<double>, 5> samples;
double sum{};
for (int k{}; k < ssize(samples); ++k) {
const auto start = std::chrono::steady_clock::now();
sum = fn(N);
const auto end = std::chrono::steady_clock::now();
samples[k] = end - start;
}
auto result{std::ranges::min_element(samples)};
cout << format("{}: {} s (result is {})\n", name, *result, sum);
}
int main() {
const int N{10'000'000};
run_benchmark("pow", N, pow_test);
run_benchmark("x * x", N, times_test);
}
(The code uses <format> and std::ranges::min_element, so be sure to use the -std=c++23 flag and a reasonably recent C++ compiler; mine is GCC 14.2.1.)
The results on my laptop are the following:
pow: 0.184192s s (result is 333333283333717098496)
x * x: 0.0386399s s (result is 333333283333717098496)
Wow! So x * x is nearly five times faster than pow! What’s happened?
Using GCC 14.2, the line sum += pow(i, 2.0) produces the following machine code:
movsd xmm0, QWORD PTR .LC7[rip]
mov rax, QWORD PTR [rbp-16]
movapd xmm1, xmm0
movq xmm0, rax
call pow
movsd xmm1, QWORD PTR [rbp-8]
addsd xmm0, xmm1
movsd QWORD PTR [rbp-8], xmm0
So, as expected, the machine code loads the parameters to pow into registers and then calls the pow function.
And calling a function is expensive!
The line sum += x * x produces this:
movsd xmm0, QWORD PTR [rbp-16]
mulsd xmm0, xmm0
movsd xmm1, QWORD PTR [rbp-8]
addsd xmm0, xmm1
movsd QWORD PTR [rbp-8], xmm0
which has no call instruction: it is just the plain implementation of the operation x * x, using mulsd.
The take-away message is: when computing the square of a number in C++ you should always use x * x.
But what about the ugly line we saw above:
double result1{(a + b) * (a + b)};
We can improve the code via a template function:
template<typename T>
T square(T x) {
return x * x;
}
so that the line becomes
double result1{square(a + b)};
Adding a corresponding case to the benchmark above produces the following results:
pow: 0.176948s s (result is 333333283333717098496)
x * x: 0.0394336s s (result is 333333283333717098496)
square: 0.0399737s s (result is 333333283333717098496)
which shows no significant loss of performance when using a template.
If you do not want to use templates because you are sure you will always use double values, you can define inline double square(double x), and the benchmark result will be the same.