Parallel filling of a sparse matrix

I have an OpenMP-parallelized finite-element code that computes the matrix entries with up to 128 threads on an AMD EPYC CPU. Memory for the sparse matrix is pre-allocated using m_matrix.reservePerColumn(nnz), where nnz is only an estimate of the number of nonzeros.

When filling the matrix with

#pragma omp critical
m_matrix.coeffRef(ii,jj) += localMatrix(i,j);

everything works correctly, but of course the performance is poor because every update is serialized.

Replacing the critical section with an atomic update,

#pragma omp atomic
m_matrix.coeffRef(ii,jj) += localMatrix(i,j);

compiles fine but yields a segmentation fault when run with multiple OpenMP threads.

What is the recommended way to fill a sparse matrix from multi-threaded code?
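
For discussion, here is a minimal sketch of one commonly suggested pattern: each thread collects its contributions as (row, col, value) triplets in a private buffer, and the buffers are merged after the element loop. Everything in it is an assumption made for illustration, not the code from the post: the Triplet struct, the assembleElement kernel (a 1D two-node stencil), and the std::map that stands in for the sparse matrix are all hypothetical. If m_matrix is an Eigen::SparseMatrix (also an assumption, since the library is not named above), the merge step would probably be replaced by a call to setFromTriplets, which sums duplicate entries.

```cpp
#include <map>
#include <utility>
#include <vector>

// One contribution to the global matrix: value to be added at (row, col).
struct Triplet {
    int row;
    int col;
    double value;
};

// Hypothetical stand-in for the sparse matrix: (row, col) -> accumulated value.
using SparseMap = std::map<std::pair<int, int>, double>;

// Hypothetical element kernel: element e couples global dofs e and e+1
// through the 2x2 local matrix [[1, -1], [-1, 1]].
void assembleElement(int e, std::vector<Triplet>& out) {
    const double local[2][2] = {{1.0, -1.0}, {-1.0, 1.0}};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            out.push_back({e + i, e + j, local[i][j]});
}

SparseMap assemble(int numElements) {
    SparseMap matrix;
    #pragma omp parallel
    {
        // Private to each thread, so the element loop needs no locking.
        std::vector<Triplet> localTriplets;

        #pragma omp for nowait
        for (int e = 0; e < numElements; ++e)
            assembleElement(e, localTriplets);

        // One critical section per thread instead of one per matrix entry.
        #pragma omp critical
        for (const Triplet& t : localTriplets)
            matrix[{t.row, t.col}] += t.value;
    }
    return matrix;
}
```

Calling assemble(4) yields the 5x5 tridiagonal matrix of the stencil. Compiled without OpenMP support the pragmas are ignored and the same code runs serially, which makes it easy to compare results. The trade-off is extra memory for the triplet buffers in exchange for a lock-free element loop.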