
[IMPROVEMENT] Superior data design.

While debugging these last few weeks, I slowly came to the realization that the way we had our finfo and pinfo collections organized was wrong.

A simple explanation of what was going on up until now would be:

  * We had a global finfo array
  * We had a global pinfo array
  * Each pinfo had an internal array which held indexes pointing at the global finfo array's elements
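Roughly, a minimal sketch of that old layout looks like the following; the struct fields, the array sizes and every name other than finfo, pinfo and FILE_INFO are assumptions for illustration, not the actual code:

```c
#include <stddef.h>

#define MAX_FDS   1024
#define MAX_FILES 4096
#define MAX_PROCS 256

typedef struct {
    char path[256];    /* how the file is identified (assumed field) */
    char hash[65];     /* content hash, once computed (assumed field) */
} FILE_INFO;

typedef struct {
    int pid;
    /* fds[fd] holds an index into the global finfo array, or -1 if untracked */
    int fds[MAX_FDS];
} PROC_INFO;

static FILE_INFO finfo[MAX_FILES];   /* global array of every file ever seen */
static size_t    finfo_count;
static PROC_INFO pinfo[MAX_PROCS];   /* global array of traced processes     */
static size_t    pinfo_count;
```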

Upon encountering a file in a process pid, with the file's descriptor being fd:

  1. Search the global finfo array to see if the file was ever seen before
  2. If not, add it
  3. Have the corresponding pinfo's array element at index fd point to the index of the previously existing or newly added file in the global finfo array (see the sketch after this list)
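Continuing the sketch above, steps 1 through 3 amounted to something like this; finfo_find() and on_file_open() are illustrative names, not the real ones:

```c
#include <string.h>

/* Step 1: linear search of the global finfo array. */
static int finfo_find(const char *path)
{
    for (size_t i = 0; i < finfo_count; i++)
        if (strcmp(finfo[i].path, path) == 0)
            return (int)i;
    return -1;
}

static void on_file_open(PROC_INFO *p, int fd, const char *path)
{
    int idx = finfo_find(path);
    if (idx < 0) {                              /* Step 2: never seen, add it */
        idx = (int)finfo_count++;
        strncpy(finfo[idx].path, path, sizeof finfo[idx].path - 1);
    }
    p->fds[fd] = idx;                           /* Step 3: fd -> global index */
}
```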

Later we found out that we need to skip files opened for writing during step 1's search phase. That's easy to see if you simply imagine the following scenario:

  1. Open a file for writing with fd = n
  2. Before closing n, we open the same file for reading

Associating the read with the finfo created by the write is wrong, because, as we've discussed before, we don't care about a file's state before a write. We thus have to treat them as two separate entities.

We now realize that files in the global array that are opened for writing but not yet written are to be ignored during searches. We add a bool was_hash_printed recording that fact, in an attempt to patch the problem.
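Sketched against the structures above, the patch would look roughly like this; here I'm assuming FILE_INFO gains a was_hash_printed flag that is set once the file's hash has been produced:

```c
/* FILE_INFO is assumed to now also carry:
 *     bool was_hash_printed;   // set once the file's hash has been produced
 */
static int finfo_find_readable(const char *path)
{
    for (size_t i = 0; i < finfo_count; i++) {
        if (!finfo[i].was_hash_printed)   /* pending write: ignore in searches */
            continue;
        if (strcmp(finfo[i].path, path) == 0)
            return (int)i;
    }
    return -1;
}
```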

This worked! But later, while thinking about ways I could replace the finfo structure with a hashmap or binary tree for performance, I realized we had an extra constraint to think about, thanks to this poor design.

The new data structure had to guarantee reference validity, so that the internal pinfo data structure still had a means of pointing into it.

Furthermore, we couldn't possibly just search the global finfo array, since the to-be-written list of files had no hash associated with them. Did we have any way to tell those files apart? We absolutely did: their fd! Something we didn't store inside the FILE_INFO structure that the finfo array was composed of.

Another fact worth noting is that, in reality, we only cared about a process's list of files opened for writing or for reading and writing, and definitely not the ones opened for plain reading. On the same note as previously discussed, in step 1 we only want to search files opened for plain reading. Do you see a pattern here?

As a result of the above, I propose this change: have the pinfo's internal data structure be a map (in our case, a simple pair of arrays, one containing the fd fields and one the finfo fields) mapping fds to FILE_INFOs currently open for any sort of writing, and only add them to the global finfo array once they are close(2)d.
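A minimal sketch of the proposed layout, again with illustrative names (PROC_INFO_NEW is named that way only to distinguish it from the old sketch): each pinfo keeps its own fd -> FILE_INFO map as a pair of parallel arrays for files currently open for any kind of writing, and an entry is published to the global finfo array only when its fd is closed.

```c
typedef struct {
    int       pid;
    size_t    nopen;               /* how many write-open files are tracked */
    int       open_fds[MAX_FDS];   /* the fd fields                         */
    FILE_INFO open_files[MAX_FDS]; /* the matching finfo fields             */
} PROC_INFO_NEW;

static void on_close(PROC_INFO_NEW *p, int fd)
{
    for (size_t i = 0; i < p->nopen; i++) {
        if (p->open_fds[i] != fd)
            continue;
        finfo[finfo_count++] = p->open_files[i];  /* publish to the global array */
        /* keep the pair of arrays dense: move the last entry into the hole */
        p->open_fds[i]   = p->open_fds[--p->nopen];
        p->open_files[i] = p->open_files[p->nopen];
        return;
    }
}
```

This also removes the reference-validity constraint on the global finfo structure: while a file is open for writing it lives only in its pinfo's local map, so the global collection can be reorganized (hashmap, binary tree, or anything else) without invalidating anything the pinfos point at.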
