
Draft: Fix bottlenecks for MPI with facetting mode

Duncan Kampert requested to merge cortex-5-mpi-pinning into master

Running facetting mode in wsclean typically does not make good use of large numbers of CPU cores, because threads migrate across NUMA domains. This can be solved by pinning threads to specific cores, which allows more efficient parallelism in parts of the wgridding step (e.g., the FFTs). Pinning threads through numactl is difficult with the normal wsclean run configuration, because workers are spawned recursively and in turn spawn threads inside ducc0's wgridding.
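For illustration, the obvious numactl approach would be a sketch like the one below (the options and node choice are placeholders, not the configuration used here); because all gridding workers are threads inside a single wsclean process, the binding applies to the whole process and cannot give each worker its own NUMA domain:

# Naive pinning sketch: bind the entire wsclean process to one NUMA domain.
# All worker threads (including those spawned inside ducc0's wgridder) end up
# restricted to that single domain, so this does not fix the migration problem.
numactl --cpunodebind=0 --membind=0 wsclean $args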

We can pin workers more easily by using the MPI implementation instead, restricting each process to a specific subset of cores through a rankfile.
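For reference, the Open MPI rankfile format used below looks like this (a toy sketch for a single 16-core node with two 8-core gridding workers; the hostname is a placeholder):

rank 0=node001 slot=0-15
rank 1=node001 slot=0-7
rank 2=node001 slot=8-15

Rank 0 is the wsclean-mp master and may roam over all cores, while each worker rank is confined to its own contiguous 8-core slice; the run script below generates the real rankfile from the Slurm node list.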

All of the following tests were done on dual-socket AMD Rome machines with 128 (2x64) cores:

Baseline for our dataset with 16 channels of data, processing ~10 million rows (2024-Mar-08 12:13:28):
inversion 00:12:02.927086, prediction 00:03:35.867093, deconvolution 00:02:32.236038

Using MPI to pin processes to cores (2024-Mar-08 11:44:04):
inversion 00:08:11.582919, prediction 00:10:25.912108, deconvolution 00:02:42.754009

Using MPI to pin processes and removing the write lock on a per-line basis (2024-Mar-08 14:33:50):
inversion 00:08:02.996095, prediction 00:03:09.016500, deconvolution 00:02:42.607004

Pinning the gridding processes to cores gives a ~33% speedup for inversion; with the per-line write lock, prediction also improves by ~12%. These are the two parts of the process that use the wgridder. Some parts of the code are still suboptimal for now (to move the write lock I made some properties public instead of private).

I will open it as a draft for now, and can provide more performance figures (or the dataset) if necessary.

Below are the MPI run script and the normal run script, respectively:

MPI run script:
# Load environment
source ~/Cortex/env.sh
module load OpenMPI/4.1.4-GCC-11.3.0

# Create WSClean args
TEMP_DIR=/dev/shm/
NUM_PROCS=$SLURM_CPUS_ON_NODE  # The machine's core count
H5_DB="$HOME/Cortex/2channels/master_merged.h5"
MS_NAME="test.ms"
PARALLEL_GRIDDING=$(( NUM_PROCS / 8 ))

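# Note: -no-work-on-master keeps rank 0 out of the gridding work, so the facet
# gridding is done by the worker ranks pinned via the rankfile. Unlike the
# baseline script, -j and -parallel-gridding are not passed here.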
args="-update-model-required -gridder wgridder -mem 7 -no-work-on-master -minuv-l 80.0 -size 22500 22500 -weighting-rank-filter 3 -weight briggs -1.5 -reorder -parallel-reordering $NUM_PROCS -mgain 0.65 -data-column DATA -auto-mask 2.5 -auto-threshold 1.0 -pol i -name 1.2image -scale 0.4arcsec -taper-gaussian 1.2asec -niter 50000 -log-time -multiscale-scale-bias 0.7 -parallel-deconvolution 2600 -multiscale -multiscale-max-scales 9 -nmiter 1 -facet-regions facets.reg -apply-facet-solutions $TEMP_DIR/$(basename -- $H5_DB) amplitude000,phase000 -apply-facet-beam -facet-beam-update 600 -use-differential-lofar-beam -even-timesteps -temp-dir $TEMP_DIR $MS_NAME"

WORK_DIR=/scratch-shared/duncank/cortex
#WORK_DIR=/scratch-node/duncank.$SLURM_JOB_ID
#WORK_DIR=/dev/shm

mkdir -p $WORK_DIR
for file in "$MS_NAME" "$HOME/Cortex/2channels/facets.reg"; do
        cp -r $file $WORK_DIR
done

# Copy the h5_db for higher read performance
cp $H5_DB $TEMP_DIR

cd $WORK_DIR

# Create a rankfile so mpirun knows where to place the (oversubscribed) processes.

FULL_NODELIST=$(scontrol show hostname $SLURM_JOB_NODELIST | paste -d" " -s)
echo "Slurm Nodes: $FULL_NODELIST"


echo "rank 0=`echo $FULL_NODELIST | cut -d ' ' -f1` slot=0-$(( NUM_PROCS - 1 ))" > rankfile
NODE_IDX=0
for node in $FULL_NODELIST; do
  for grid in `seq $PARALLEL_GRIDDING`; do
    # Each gridding worker gets its own contiguous 8-core slot on its node.
    echo "rank $(( NODE_IDX * PARALLEL_GRIDDING + grid ))=$node slot=$(( 8 * (grid - 1) ))-$(( 8 * grid - 1 ))" >> rankfile
  done
  NODE_IDX=$(( NODE_IDX + 1 ))
done
cat rankfile
cat rankfile



# Parallel run
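# -rf (--rankfile) makes Open MPI bind each rank to the core range listed in
# the rankfile above; -np counts the master plus the gridding worker ranks.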
cmd="$PROFILING mpirun -np $(( PARALLEL_GRIDDING + 1 )) -rf rankfile -oversubscribe wsclean-mp $args"
echo $cmd
eval $cmd
Normal run script:
# Load environment
source ~/Cortex/env.sh
module load likwid/5.2.2-GCC-11.3.0

# Create WSClean args
TEMP_DIR=/dev/shm/
NUM_PROCS=128  # Machine core count
H5_DB="$HOME/Cortex/2channels/master_merged.h5"
MS_NAME="test.ms"
PARALLEL_GRIDDING=16

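# Baseline (non-MPI) run: parallelism is purely thread-based via -j and
# -parallel-gridding, so gridding threads can migrate across NUMA domains.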
args="-update-model-required -gridder wgridder -minuv-l 80.0 -size 22500 22500 -weighting-rank-filter 3 -weight briggs -1.5 -reorder -parallel-reordering $NUM_PROCS -mgain 0.65 -data-column DATA -auto-mask 2.5 -auto-threshold 1.0 -pol i -name 1.2image -scale 0.4arcsec -taper-gaussian 1.2asec -niter 50000 -log-time -multiscale-scale-bias 0.7 -parallel-deconvolution 2600 -parallel-gridding $PARALLEL_GRIDDING -multiscale -multiscale-max-scales 9 -nmiter 1 -facet-regions facets.reg -apply-facet-solutions $TEMP_DIR/$(basename -- $H5_DB) amplitude000,phase000 -j $NUM_PROCS -apply-facet-beam -facet-beam-update 600 -use-differential-lofar-beam -even-timesteps -temp-dir $TEMP_DIR $MS_NAME"

WORK_DIR=/scratch-shared/duncank/cortex
#WORK_DIR=/scratch-node/duncank.$SLURM_JOB_ID
#WORK_DIR=/dev/shm

mkdir -p $WORK_DIR
for file in "$MS_NAME" "$HOME/Cortex/2channels/facets.reg"; do
	cp -r $file $WORK_DIR
done

# Copy the h5_db for higher read performance
cp $H5_DB $TEMP_DIR

cd $WORK_DIR

wsclean $args