More buffering of accumulation in shared memory
Buffer more of the accumulation in shared memory within each thread block, since atomic operations on shared memory are less expensive than atomic operations on global memory. The per-block partial results must still be written back to global memory by one of the block's threads at some point, typically once the block has finished accumulating.
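As a rough illustration of this idea, here is a minimal sketch using a byte histogram as a stand-in for the accumulation. The workload, the names histogram_shared, run_histogram and NUM_BINS, and the launch configuration of 64 blocks of 256 threads are all assumptions made for the example, not part of the original code. Each block accumulates into its own shared-memory copy and then flushes it to global memory, with each bin pushed back by exactly one thread of the block.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Assumed bin count for this sketch; it must be small enough to fit in shared memory.
constexpr int NUM_BINS = 256;

// Hypothetical kernel: accumulate into block-private shared memory, then merge.
__global__ void histogram_shared(const unsigned char* data, size_t n,
                                 unsigned int* global_bins)
{
    __shared__ unsigned int block_bins[NUM_BINS];

    // Cooperatively zero the block-private bins.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        block_bins[b] = 0;
    __syncthreads();

    // Grid-stride loop: these atomics only touch shared memory.
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&block_bins[data[i]], 1u);
    __syncthreads();

    // Flush: at most one global atomic per bin per block,
    // each bin handled by a single thread of the block.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (block_bins[b] != 0)
            atomicAdd(&global_bins[b], block_bins[b]);
}

// Hypothetical host helper; most error checking is omitted for brevity.
std::vector<unsigned int> run_histogram(const std::vector<unsigned char>& input)
{
    unsigned char* d_data = nullptr;
    unsigned int* d_bins = nullptr;
    cudaMalloc(&d_data, input.size());
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), input.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histogram_shared<<<64, 256>>>(d_data, input.size(), d_bins);
    cudaError_t err = cudaGetLastError();  // catches launch failures only
    assert(err == cudaSuccess);
    (void)err;

    // This blocking copy also waits for the kernel to finish.
    std::vector<unsigned int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return bins;
}
```

With this layout the number of global atomics drops from one per input element to at most NUM_BINS per block. The trade-off is the shared memory consumed per block and the two __syncthreads() barriers, so whether it pays off depends on how many accumulations each block performs before flushing.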