Files may be distributed to begin with. Check the following documents and locations:
* `gemini1-2:/mnt/gemini1-4/seales_uksr/`
* `lcc:/pscratch/seales_uksr/`
# Generate/extract slices

The goal of this step is to get a 16-bit .tif image stack. For benchtop sources, this is probably already done as part of the reconstruction process. For synchrotron scans, this step may be necessary. For example, in the 2019 Diamond Light Source scans, the reconstruction output 32-bit float .hdf files from which .tif slices need to be extracted. Often, as with the fragments scanned in that session, there is a separate .hdf for each "slab". Slices should be extracted from each and then merged later.

# Determine crop
It is necessary to load the volume into a visualizer such as ImageJ to determine the crop bounding box. If the slices are already generated and on your local disk in one volume, which is likely if they came from a benchtop machine, those can be viewed directly. With the large .hdf files from Diamond, it is easiest to first extract some sample slices on LCC, transfer those to your machine, and then view them there. The sample slices can be extracted all at once from one or more .hdf files (multiple files if the object was scanned in slabs) like this:
The next step is the same for either the original full set of slices (if that is feasible and on your machine) or the subset sampled using something like the above.
Load the volume in Fiji/ImageJ and select a rectangular bounding box. Scrub through the slices, adjusting the bounding box so that it always stays outside the bounds of the object of interest. It is good to crop tightly to create smaller datasets; however, it is important to be sure the box contains the entire object. When in doubt, it's OK to make the bounding box a little bigger to be sure it includes the object.
Record the coordinates of the bounding box you have selected. You can either mouse over the edges and get close estimates from the mouse coordinates shown in the ImageJ main window, or you can use `Analyze -> Tools -> ROI Manager`, add the current ROI, and use `More -> Specify...` within the ROI Manager to see the bounding box specs.
...
### One-time setup:
For convenience, a Singularity container and skeleton SLURM script have been placed in the DRI Datasets drive directory under Resources. These will allow for easy use of the relevant scripts from Volume Cartographer on the LCC servers, and should be copied to your scratch space before you begin:
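For example (all paths below are placeholders for wherever you downloaded the Resources folder and for your own scratch directory):

```bash
# Copy the container and skeleton SLURM script into your scratch space.
# Source and destination paths are placeholders; adjust for your own setup.
cp ~/Downloads/Resources/volume-cartographer.sif /pscratch/seales_uksr/$USER/
cp ~/Downloads/Resources/extract_crop.slurm /pscratch/seales_uksr/$USER/
```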
The included SLURM script should be lightly edited to be specific to the user; in particular, the email field should be changed to the relevant address.
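For reference, the email is usually set through the standard SLURM mail directives; the exact layout of the skeleton script may differ, so treat this as an illustration only:

```bash
#SBATCH --mail-user=you@example.com   # change to your own address
#SBATCH --mail-type=BEGIN,END,FAIL    # send mail when the job starts, finishes, or fails
```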
Now use `sbatch` and the previously copied SLURM script to run extract/crop on the LCC system. The parameters passed to this script will be passed along to the hdf5_to_tif.py script included with Volume Cartographer, so they should be treated in the same way:
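As a sketch (the SLURM script filename and the trailing arguments are placeholders; supply whatever arguments hdf5_to_tif.py expects):

```bash
# Everything after the script name is forwarded to hdf5_to_tif.py inside the container.
sbatch extract_crop.slurm <arguments for hdf5_to_tif.py>
```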
...
Get the original volume or slices onto the machine you are using to process the dataset. This depends on context, but typically we use `scp`, `rclone`, etc.
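For example (the host alias and dataset paths below are placeholders; substitute the actual location of your data):

```bash
# Over SSH from LCC scratch space
scp -r lcc:/pscratch/seales_uksr/<dataset>/slices/ ./slices/

# Or from a configured rclone remote such as the shared Drive
rclone copy drive:<dataset>/slices ./slices --progress
```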
## Extract slices (HDF5 files only)
**Coming soon**
## Crop slices (optional)
Many of our datasets are too large to be processed efficiently in their native format. Cropping is the preferred method for reducing size as it maintains the spatial resolution of the scan. Scan through the slices to determine a good bounding box for the object in the scan. Test your crop using the `convert` utility provided by ImageMagick. The following command creates a 9060x1794 image starting at pixel (670,830) in the `full_slice_0000.tif` input image:
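It would look something like this (the output filename is illustrative):

```bash
# -crop takes WIDTHxHEIGHT+X_OFFSET+Y_OFFSET; +repage discards the leftover canvas geometry.
convert full_slice_0000.tif -crop 9060x1794+670+830 +repage crop_test_0000.tif
```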
* If you enabled dump-vis, a new directory called 'debugvis' will appear inside the working directory. Two directories inside, called 'mask' and 'skeleton', contain images that show what is being segmented. You can use these images as a reference to help you determine when to stop the segmentation.
To obtain a good-quality segmentation, the mask must cover the majority of the layer of interest, but it is fine if some small parts aren't covered or parts of neighboring pages get segmented too. This is an example of a good-quality segmentation: https://drive.google.com/file/d/1_qzL2L2gZpYHYUJznCZENbsW2ueUj8__/view?usp=sharing
Check the segmentation occasionally. If a lot of neighboring pages are getting segmented or if the segmentation loses the layer you are segmenting, use Ctrl-C to kill `vc_segment` **provided you ran vc_segment with `--save interval 1`**.
...

The output of this process, `canny_raw.ply`, is a dense point set and requires further processing:
1. Run `Filters/Point Set/Point Cloud Simplification` to reduce the point set to a reasonable size. If the surface is very smooth, use fewer points. On the order of 10k to 100k points typically retains enough detail while significantly speeding up later steps. Save this point set with the name: `01_simplified.ply`.
2. Manually select and delete points that are not on the desired surface.
3. Run `Filters/Selection/Select Outliers` and then delete the selected vertices. This cleans up groups of points that are not on the surface. It is recommended to enable the Preview option while tuning the selection options.
4. Run `Filters/Point Set/Compute normals for point sets` to estimate surface normals for the point set. For dense, noisy point sets, adjust the `Neighbor num` value to a larger value, typically no more than 100. Save this point set to your working directory with the name: `canny_cleaned.ply`
5. Run `Filters/Remeshing, Simplification and Reconstruction/Surface Reconstruction: Screened Poisson` to triangulate the surface. This filter uses the surface normals generated in the previous step to fit a continuous surface to the point set. Increase the Reconstruction depth to make the surface fit more closely to the original point set at the expense of more faces and a rougher surface. Typically, use a reconstruction depth in the range of 8-10. Save this mesh to your working directory with the name: `canny_poisson.ply`
6. Poisson will create faces that extend beyond the original point set. Run `Filters/Sampling/Hausdorff Distance` to add an attribute to each vertex of the new surface giving its distance to the nearest point in the original point set.
7. Run `Filters/Selection/Select by Vertex Quality` to select those vertices in the Poisson surface which have large distances from the original point set. Use the Preview option to tune the selection. Delete the selected vertices and faces.