Draft: Support for Collective Copy Operations in Realm

Mike Bauer requested to merge realm-collective into master

This branch adds initial support for collective copies in Realm.

Introduction

Realm is a low-level task-based runtime providing a high-performance asynchronous task execution model to higher-level systems such as cuNumeric and Omniverse. In addition to asynchronous tasks, it also offers the ability to perform asynchronous data movement across different kinds of memories. Originally, Realm only supported point-to-point memory copies from one source memory to one destination memory. In this branch, we extended Realm's copy interface to handle collective communication.

Broadcast Algorithms

  • Naive 1-to-n: call find_fastest_path n times and concatenate all the resulting paths.
  • 1-to-n Aggregation: call find_fastest_path n times and merge the common predecessors of the resulting paths.
  • Hierarchical Radix Tree Broadcast Algorithm: find an "entry buffer" on every node, perform a radix tree broadcast across those entry buffers (see the sketch after this list), and then perform an intra-node broadcast on every node. There are three intra-node broadcast algorithms:
    • Intra-node aggregation: 1-to-n aggregation within the same node.
    • Intra-node split: distinguish CPU and GPU buffers, then perform 1-to-n aggregation within each group.
    • Intra-node radix tree: distinguish CPU and GPU buffers, then perform a radix tree broadcast within each group.
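
The following is a minimal sketch of one common radix tree construction over the per-node entry buffers: a k-nomial tree, which reduces to the classic binomial tree at radix 2. The function name and the exact tree shape are illustrative assumptions, not the code in this branch.

    // Compute the ranks that `rank` forwards the broadcast data to,
    // assuming ranks are renumbered so the root is rank 0 and radix >= 2.
    #include <cstddef>
    #include <vector>

    std::vector<std::size_t> radix_tree_children(std::size_t rank,
                                                 std::size_t num_ranks,
                                                 std::size_t radix)
    {
      std::vector<std::size_t> children;
      for(std::size_t mask = 1; mask < num_ranks; mask *= radix) {
        // A rank owns this level if it is a multiple of mask*radix; it
        // then forwards to up to radix-1 peers spaced mask apart.
        if(rank % (mask * radix) == 0) {
          for(std::size_t j = 1; j < radix; j++) {
            std::size_t child = rank + j * mask;
            if(child < num_ranks)
              children.push_back(child);
          }
        }
      }
      return children;
    }

With this shape, a broadcast over N entry buffers completes in roughly log_radix(N) rounds instead of the single (N-1)-edge fan-out of the naive 1-to-n algorithm.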

Implementation Notes

Important Arguments

  • -ll:coll_copy_algo: control which broadcast algorithm to use. Possible options:
    • 0: naive 1-to-n.
    • 1: 1-to-n aggregation.
    • 2 (default): inter-node radix tree.
  • -ll:coll_copy_radix_inter: for the inter-node radix tree, control the radix to use. The default is 2. When set to 1, the algorithm becomes inter-node 1-to-n aggregation.
  • -ll:coll_copy_radix_intra: for the intra-node radix tree, control the radix to use. 0 is a special value selecting intra-node aggregation, and 1 is a special value selecting intra-node split. The default is 2.
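
For example, a run that selects the radix tree algorithm with an inter-node radix of 4 might look like the following; the binary name is an assumption based on test/realm/collective.cc, so adjust it to your build layout.

    ./collective -ll:coll_copy_algo 2 -ll:coll_copy_radix_inter 4 -ll:coll_copy_radix_intra 2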

Important Files

  • runtime/realm/transfer/collective.h: for the convenience of development, the major part of the source code relevant to collective communication resides here. This file is included in runtime/realm/transfer/transfer.cc via #include "collective.h". Eventually, it should be removed and all of its source code moved directly into transfer.cc.
  • test/realm/collective.cc: a test for collective communication (currently only broadcast). We mainly used two tests, memspeed.cc and scatter.cc, as references when writing this test.
  • tools/realm_collcopy_analysis.py: a Python script that parses the log output of collcopy (enabled by the command-line argument -level collcopy=2) and draws the transfer graph using Graphviz (see the workflow sketch after this list).
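
A hypothetical end-to-end workflow tying these pieces together; the script's exact command-line interface is an assumption here, while -level collcopy=2 is the logging switch mentioned above.

    # Run the test with collcopy logging enabled and capture the log.
    ./collective -ll:coll_copy_algo 2 -level collcopy=2 -logfile collcopy.log
    # Parse the log and render the transfer graph with Graphviz.
    python tools/realm_collcopy_analysis.py collcopy.log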

Modifications to the Runtime

  • IB Buffer Management: originally, Realm only supported point-to-point copy operations, so an IB buffer always had one producer data movement engine (XferDes) and one consumer XferDes. IB buffers are allocated after the analysis phase and before the actual data movement phase, and are deallocated by the consumer right after it completes the data transfer. With broadcast copy operations, an IB buffer still has one producer but may have multiple consumer XferDeses, so we associate a reference count with every IB buffer through a hashmap, IBMemory::block_ref_count_map, which maps the offset of each IB buffer to its reference count; the buffer is freed only when the last consumer releases it (see the first sketch after this list). The major code changes reside in IBMemory::do_alloc, IBMemory::do_free, and TransferOperation::allocate_ibs.
  • Transfer Descriptor Control Flow: Realm transfers data through IB memories in a pipelined manner: to save IB memory, the maximum size of an IB buffer is capped at 16MB, and a XferDes forwards data as soon as its input has gathered more than 4KB. A signaling mechanism therefore coordinates every upstream/downstream XferDes pair. (a) When an upstream XferDes finishes a write to its output IB buffer, it calls its own update_bytes_write method, which calls the downstream XferDes's update_pre_bytes_write method ("I have written this data to the buffer; you can now read it."). (b) When a downstream XferDes finishes a read from its input IB buffer, it calls its own update_bytes_read method, which calls the upstream XferDes's update_next_bytes_read method ("I have read this data from the buffer; you can now reuse the space to write more data."). With broadcast copies, a XferDes can have multiple downstream peers: update_bytes_write must call every peer's update_pre_bytes_write, and update_next_bytes_read can be called by multiple peers, so the producer must take the intersection of the reported ranges, reusing only the space that all consumers have read past (see the second sketch after this list). The major code changes reside in XferDes::update_pre_bytes_write and XferDes::update_next_bytes_read.
  • Use RegionInstance as IB: originally, a RegionInstance could be either the input or the output of a XferDes, but not both, so there was no signaling mechanism built around RegionInstance to coordinate its upstream and downstream XferDeses. In the radix tree broadcast algorithm, a RegionInstance can also be selected as an entry buffer, so we extend the IB memory's signaling mechanism to handle RegionInstance correctly. Because a RegionInstance always has enough space to hold the entire message, we tell the upstream XferDes that it can always write by setting the ib_size to a very large value, (~size_t(0)) >> 2. We also add a new variable, inst_serve_as_ib, so that the instance is not accidentally deallocated as an IB buffer in XferDes::mark_completed. The major code change resides in TransferOperation::create_xds.
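
The first sketch below illustrates the reference-counted IB deallocation described above, assuming a map from buffer offset to the number of consumers still reading (the branch uses IBMemory::block_ref_count_map; the types and method names here are illustrative).

    #include <cstddef>
    #include <map>
    #include <mutex>

    class IBMemorySketch {
      std::mutex mutex;
      // offset of each IB buffer -> consumers that have not released it yet
      std::map<std::size_t, unsigned> block_ref_count_map;

    public:
      // Called at allocation time (after path analysis, before data
      // movement) with the number of consumer XferDeses of this buffer.
      void register_block(std::size_t offset, unsigned num_consumers)
      {
        std::lock_guard<std::mutex> lg(mutex);
        block_ref_count_map[offset] = num_consumers;
      }

      // Called by each consumer when it finishes reading; returns true
      // only for the last consumer, which triggers the actual free.
      bool release_block(std::size_t offset)
      {
        std::lock_guard<std::mutex> lg(mutex);
        auto it = block_ref_count_map.find(offset);
        if(--(it->second) > 0)
          return false; // other consumers are still reading
        block_ref_count_map.erase(it);
        return true;    // caller performs the real deallocation
      }
    };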
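The second sketch shows the fan-out side of the control flow: with several downstream peers, the producer may only reuse buffer space that every consumer has read past, i.e. the minimum (intersection) of the per-peer progress counters. Member names are illustrative, not the actual Realm fields.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct UpstreamXferDesSketch {
      std::vector<std::size_t> peer_bytes_read; // one counter per downstream peer
      std::size_t reusable_bytes = 0;           // prefix that is safe to overwrite

      // Downstream peer `idx` reports it has consumed `bytes` so far.
      void update_next_bytes_read(std::size_t idx, std::size_t bytes)
      {
        peer_bytes_read[idx] = std::max(peer_bytes_read[idx], bytes);
        // The slowest reader bounds how much buffer space can be recycled.
        reusable_bytes = *std::min_element(peer_bytes_read.begin(),
                                           peer_bytes_read.end());
      }
    };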