
Mf6 budgets

Huite Bootsma requested to merge mf6-budgets into master

This is a start on some functions for reading structured discretization (DIS) model budget flows (CBC) into structured DataArrays.

There are a couple of potential issues to be aware of:

  • modflow6 writes the budgets either with IMETH=1, which is a dense (1D) array with data for all cell-to-cell connections in the case of flow-ja-face, or for all cells in the case of storage-ss or storage-sy. These can be dumped into DataArrays directly, requiring only a reshape to (nlayer, nrow, ncol) (see the first sketch after this list).
  • Or modflow6 writes with IMETH=6, which is a sparse array / table containing the cell number (ID1), the bound number (internal to the package, ID2), the budget term, and a number of auxiliary columns. As far as I know, xarray does not really support sparse arrays yet, or support is in a very early stage of development. Anyway, this has two consequences:
  1. Unless you read all tables for all timesteps beforehand, you cannot know in which cells the boundary is active: you need the ID1 column for that. This might not be entirely true -- I think the active boundary cells are not dynamic during a simulation, but I'm not entirely sure. If they are indeed static, you can take a peek at ID1 for the first timestep and infer only the necessary layers. But perhaps this is questionable anyway, based on "IDF-thinking": shouldn't you then also do it for missing columns and rows? So maybe just always allocate the entire domain, like it does now (until we maybe get good sparse data support in xarray).
  2. In a dense format, you cannot store duplicate bounds in a single cell. modflow6 does support multiple drainage boundaries in a single cell, for example (ID1 would be the same for two entries, but ID2 would differ). I would argue that you shouldn't do this anyway, since essentially no (dense) raster or mesh data model supports it; better to define multiple "systems" -- but I'm not sure whether those end up with different TEXT labels in the CBC file?
  • flow-ja-face is converted from a 1D array (linear index = modflow6 cell number) to right, front, and lower face flows. The indices to grab the respective arrays are computed once and passed on to index an xr.DataArray via Dask. During (delayed!) computation, the indexing occurs per timestep. I think this works out fine: the dask graph references the indexing array so it won't be garbage collected, and I expect the graph contains only a pointer to the indexing array, not a copy of the array itself. So memory use shouldn't explode... (see the second sketch after this list).
  • Currently, the open_cbc function will read all the headers and convert everything in the CBC file into a dict of TEXT label -> xr.DataArray. This occurs in essentially the following steps:
  1. Read GRB file to get spatial coordinates, cell-to-cell connectivity matrix (ia, ja)
  2. Go through the entire CBC file, parse all headers, store file positions to the data (similar to open_hds)
  3. Create delayed read operations for every file position (== timestep entry); for IMETH=6 data, this includes a (delayed!) allocation of the entire domain, and (delayed!) indexing of the data into that array (using ID1 as the indexing array)
  4. Create dask array from delayed object
  5. Stack dask arrays along time axis
  6. Convert into xr.DataArray
  7. Store in a cbc_content dict of TEXT label -> xr.DataArray

This could be wasteful IF you're only interested in RIV flows, for example. However, there's still no real computation occurring. I've tested with a 2.6 GB CBC file: read_cbc_headers only takes 10 ms in this case, and open_cbc takes 1 second (so reading headers is only 1%!). An easy solution could be an additional argument variables (or something) which tells the function which entries should actually be turned into DataArrays (the third sketch below includes such an argument).
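To make the two record layouts concrete, here is a minimal numpy sketch of the reshape (IMETH=1) versus the scatter-into-a-dense-domain (IMETH=6) described above; the shapes, values, and variable names are made up for illustration and are not the exact ones in this MR:

```python
import numpy as np

nlayer, nrow, ncol = 3, 10, 10
ncells = nlayer * nrow * ncol

# IMETH=1: a dense 1D record with one value per cell; only a reshape is needed.
dense_record = np.random.rand(ncells)
storage_ss = dense_record.reshape((nlayer, nrow, ncol))

# IMETH=6: a sparse table with cell numbers (ID1, one-based) and budget values.
# Allocate the full domain and scatter the values into it via ID1.
id1 = np.array([1, 5, 5, 42])          # duplicates are possible (e.g. two drains in one cell)
budget = np.array([0.1, 0.2, 0.3, -0.4])

riv_flow = np.full(ncells, np.nan)
riv_flow[id1 - 1] = 0.0
# np.add.at accumulates duplicate entries instead of overwriting them,
# which matters when one cell holds more than one boundary entry.
np.add.at(riv_flow, id1 - 1, budget)
riv_flow = riv_flow.reshape((nlayer, nrow, ncol))
```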
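And a small sketch of the delayed indexing for flow-ja-face. The construction of the index array from the GRB ia/ja connectivity is omitted; lower_index is just a stand-in, and the placement into the output array is simplified:

```python
import numpy as np
import dask
import dask.array as da

nlayer, nrow, ncol = 3, 10, 10
shape = (nlayer, nrow, ncol)
ncells = nlayer * nrow * ncol

# Hypothetical index array: positions into the flow-ja-face record for each
# cell-to-lower-neighbor connection, computed once from the GRB ia/ja arrays.
lower_index = np.arange((nlayer - 1) * nrow * ncol)

@dask.delayed
def lower_face_flow(flowja, index, shape):
    # Grab the lower-face terms from the dense flow-ja-face record and place
    # them into a (nlayer, nrow, ncol) array; the bottom layer has no lower face.
    out = np.full(shape, np.nan)
    out.ravel()[: index.size] = flowja[index]
    return out

# One timestep's record (stand-in data). The delayed graph only holds a
# reference to lower_index, so the indices are computed once and reused
# per timestep instead of being copied into every task.
flowja = np.random.rand(4 * ncells)
lower = da.from_delayed(
    lower_face_flow(flowja, lower_index, shape), shape=shape, dtype=np.float64
)
```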
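Finally, a rough outline of the open_cbc flow for IMETH=1 records, including the proposed variables argument. The header objects and their position field are assumptions about what read_cbc_headers produces; the exact fields and signatures here are hypothetical:

```python
import numpy as np
import dask
import dask.array as da
import xarray as xr

def open_cbc_sketch(cbc_path, headers, shape, dims, coords, variables=None):
    # headers: dict of TEXT label -> list of header objects, one per timestep,
    # each carrying the byte position of its data record.

    @dask.delayed
    def read_record(path, position, shape):
        # Read a single dense (IMETH=1) record at a known byte offset.
        count = int(np.prod(shape))
        a = np.fromfile(path, dtype=np.float64, count=count, offset=position)
        return a.reshape(shape)

    cbc_content = {}
    for text, header_list in headers.items():
        # Proposed variables argument: skip budget terms nobody asked for,
        # so no DataArray is built for them at all.
        if variables is not None and text not in variables:
            continue
        timesteps = [
            da.from_delayed(
                read_record(cbc_path, h.position, shape),
                shape=shape, dtype=np.float64,
            )
            for h in header_list
        ]
        data = da.stack(timesteps, axis=0)  # stack along the time axis
        cbc_content[text] = xr.DataArray(
            data, coords=coords, dims=("time",) + tuple(dims)
        )
    return cbc_content
```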

Been testing with some model output, no CI tests yet ...

