Small speed-up in row-major sparse dense product
This MR changes the implementation of `sparse_time_dense_product_impl` for RowMajor matrices to use two accumulation variables instead of one. This breaks up the dependency chain of the additions and gives the CPU an opportunity to exploit instruction-level parallelism.
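To illustrate the idea (this is a simplified sketch, not Eigen's actual code): the inner product of one sparse row with the dense vector is split across two independent accumulators, so the two chains of floating-point additions can execute in parallel instead of each addition waiting on the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of the two-accumulator pattern: dot product of a
// sparse row (index/value pairs) with a dense vector. acc0 and acc1 form two
// independent addition chains that the CPU can overlap.
float sparse_row_dot(const std::vector<int>& indices,
                     const std::vector<float>& values,
                     const std::vector<float>& rhs) {
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    std::size_t n = values.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += values[i] * rhs[indices[i]];
        acc1 += values[i + 1] * rhs[indices[i + 1]];
    }
    if (i < n) {
        acc0 += values[i] * rhs[indices[i]];  // odd leftover element
    }
    return acc0 + acc1;
}
```

Note that this changes the order of the floating-point additions, which is why the result can differ from the single-accumulator version in the last bits.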
I have run the following benchmark on my system (AMD Threadripper 1950, single-threaded):
```cpp
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

#include <Eigen/Sparse>

#include <chrono>
#include <random>
#include <vector>

void benchmark_sparse_dense_vector(ankerl::nanobench::Bench& bench) {
    const Eigen::Index SIZE = 100'000;
    const Eigen::Index NNZ = 10'000'000;
    Eigen::SparseMatrix<float, Eigen::RowMajor> matrix(SIZE, SIZE);

    std::default_random_engine rng;
    std::uniform_int_distribution<Eigen::Index> dist(0, SIZE - 1);
    std::vector<Eigen::Triplet<float>> triplets;
    triplets.reserve(NNZ);
    for (Eigen::Index i = 0; i < NNZ; ++i) {
        triplets.emplace_back(dist(rng), dist(rng), 1.0f);
    }
    matrix.setFromTriplets(begin(triplets), end(triplets));

    Eigen::VectorXf rhs = Eigen::VectorXf::Random(SIZE);
    Eigen::VectorXf dst(SIZE);
    bench.run("Sparse * Dense -> Dense", [&] { dst = matrix * rhs; });
}

int main() {
    ankerl::nanobench::Bench bench{};
    bench.warmup(10).epochs(100);
    bench.minEpochTime(std::chrono::nanoseconds{1'000'000});
    bench.minEpochIterations(10);
    benchmark_sparse_dense_vector(bench);
}
```
with the following results:
SIZE = 1M
New
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 15,099,473.67 | 66.23 | 0.4% | 89,126,601.18 | 14,625,689.82 | 7.2% | 16.55 | Sparse * Dense -> Dense |
Old
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 16,052,703.27 | 62.29 | 1.4% | 90,625,231.30 | 14,125,126.00 | 7.3% | 18.99 | Sparse * Dense -> Dense |
SIZE = 100k
New
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 8,723,373.15 | 114.63 | 1.1% | 62,853,098.64 | 10,452,488.20 | 1.0% | 9.73 | Sparse * Dense -> Dense |
Old
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 9,254,136.85 | 108.06 | 0.4% | 71,994,036.60 | 10,402,745.36 | 1.0% | 10.45 | Sparse * Dense -> Dense |
On the other hand, in my real application (which spends about 60% of its time on sparse-matrix × dense-vector products), I don't see any change. I suspect that is because I'm running many of these products in parallel, and the computation is severely memory-bound anyway.
Edited by Rasmus Munk Larsen