GPU cell calculations should accomidate tile edges
As part of #6, transitions between tiles needed to be incorporated into the way cells are calculated.
The initial idea was to simply add another row to the top and bottom of matrix to allow for seed data to be positioned. It quickly became evident that this would result in a massive number of exceptions around not processing the far left and far right of the tile. A second idea was conceived whereby the calculation of the cell is modified to assume that the present values represent a prior candidate, and if the current calculated candidate is not superior, it is abandoned.
While this idea shows promise, it is going to be difficult to implement. Given the the small exposure to the edges of tiles that the overall problem experiences, the current imperfect solution is assumed to be "good enough" to defer the "real" solution (this issue) to the future.
graph LR
prev((Previous Tile)) --> seed
seed --> here
here --> N
here --> W
here --> NW
N --> nCand(Candidate)
W --> wCand(Candidate)
NW --> nwCand(Candidate)
here --> hereCand(Candidate)
nCand --> compare{Pick Best}
wCand --> compare
nwCand --> compare
hereCand --> compare
compare --> result[New `here`]
result -.-> seed
Problem
The individual cell consists (primarily) of four values:
- Match Score: the score of this cell based on a comparison of the string tokens
- Terminus: a running fuse, indicating that we should give up on the chain
- Directionality: the direction of previous cell in the chain (
horizontal
,vertical
,diagonal
) - Running Score: the accumulated score for the chain
The directionality (dir
) is the key value we are seeking as this indicates what the "next" element in a chain is. Running Score (score
) is critical to this because it helps to assess which branch of a chain to follow.
This definition gets interesting around the "edges" of the workspace. Cells along the west (w
) and north (n
) axis are the very start of our universe, and therefore will not have all three candidate parent cells existing. In the past, this was solved by simply assuming out of bounds values to have zero for all score values. This worked by allowing the ma thematic operations to be performed, but with very low scores resulting the values being discarded as irrelevant, or tie-breaking logic being applied.
With the recent addition of feature #6, what was a single large workspace, has been broken into several smaller "tiles" which fit into memory regardless of how large the dataset becomes. Each of these tiles is calculated independantly, with their edges stitched together when chains cross the boundary. This act of "stiching" is performed by seeding the values of a newly formed tile with the cell values from the preceding tile.
As we have seed values, we can no longer assume that "out of bounds" is equal to "zero". We must have some mechanism to accomodate or retain the information from the prior tile. Further, this mechanism must account for GPU tendency to be cell agnostic: it wants to perform the same calculation for every cell instance... no exceptions.
Solution
In order to achieve this, the seeding process is allowed to set perform a cell calculation to determine which of the seed chains should be used for seeding which cell and the value of the cell is set prior to the GPU calculations being applied.
The GPU can independently calculate each cell by calculating a complete candiate value for each of the parents, and then selecting the best candidate to replace the local location. Using this technique we can add the current location as one of the candidates, which accommodate a record introduced by the intra-tile seeding process.
Under these conditions, an out-of-bounds null
value may be introduced, but will not score higher than the existing introduced value.
Parent | pScore |
[v,h] | mScore |
dir |
score |
---|---|---|---|---|---|
here | 7 | [6,0] | -1 | 🡠 |
6 |
north | 4 | [6,0] | -1 | 🡡 |
5 |
west | null |
[6,0] | -1 | 🡠 |
0 |
nw | null |
[6,0] | -1 | 🡤 |
0 |
result | 7 | [6,0] | -1 | 🡠 |
6 |
(taken from tile [0,1] of a comparison between covid genomes LC523989 and MT304476)
In the above example, we have an element on the western edge (6 down, 0 over). The previous tile seeded an introductory score of 6 from the west (here
). This was retained in spite of the natural west value being considered (from out of bounds).
A second example where the value is not retained is relevant.
Parent | pScore |
[v,h] | mScore |
dir |
score |
---|---|---|---|---|---|
here | 7 | [6,0] | -1 | 🡠 |
6 |
north | 8 | [6,0] | -1 | 🡡 |
7 |
west | null |
[6,0] | -1 | 🡠 |
0 |
nw | null |
[6,0] | -1 | 🡤 |
0 |
result | 8 | [6,0] | -1 | 🡡 |
7 |
We can see that while a seeded chain was taken into account, it was not as significant as the chain arriving from the north. Therefore the vertical relationship is used as the final result.
Lastly, an example where no seeding has taken place (or the seeded value was zero):
Parent | pScore |
[v,h] | mScore |
dir |
score |
---|---|---|---|---|---|
here | 0 | [6,0] | -1 | - |
0 |
north | 4 | [6,0] | -1 | 🡡 |
5 |
west | null |
[6,0] | -1 | 🡠 |
0 |
nw | null |
[6,0] | -1 | 🡤 |
0 |
result | 4 | [6,0] | -1 | 🡡 |
5 |
In this case, the here
value is effectively equivalent to null
and is therefore considered in a fair comparison. It is worth noting, that in the event of a tie, priority is given to diagonal.
Direction | Enum | Binary |
---|---|---|
null | 0 | 00 |
north | 1 | 01 |
west | 2 | 10 |
north+west | 3 | 11 |
So if use the directional enumeration as a means to break ties, we prioritise the diagonal and deprioritise null (effectively discarding it).