Add a max_scanner_duration config value that would indicate the maximum duration the scanner should run. When this value is reached, it should stop cleanly, possibly writing a resume file. When started again, it would read the resume file and try to complete the remaining work, again in less than max_scanner_duration, pausing again otherwise, until the scan eventually completes.
See #163 (closed) for a justification of such an option.
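To make the proposal concrete, here is a minimal sketch of the intended behaviour; only the option name max_scanner_duration comes from this issue, everything else (resume file location and format, helper names) is hypothetical:

```python
import json
import os
import time

# Hypothetical values; in practice they would come from the configuration file.
MAX_SCANNER_DURATION = 12 * 3600      # seconds
RESUME_FILE = "scanner.resume"        # remembers which albums were already processed


def scan(album_paths):
    """Process albums until done or until max_scanner_duration is reached."""
    start = time.time()

    # Resume: skip albums recorded as done by a previous, interrupted run.
    done = set()
    if os.path.exists(RESUME_FILE):
        with open(RESUME_FILE) as f:
            done = set(json.load(f))

    for path in album_paths:
        if path in done:
            continue
        if time.time() - start >= MAX_SCANNER_DURATION:
            # Time budget exhausted: write the resume file and stop cleanly.
            with open(RESUME_FILE, "w") as f:
                json.dump(sorted(done), f)
            return False              # scan is not complete yet
        process_album(path)           # stand-in for the real per-album work
        done.add(path)

    # Everything processed: the resume file is no longer needed.
    if os.path.exists(RESUME_FILE):
        os.remove(RESUME_FILE)
    return True


def process_album(path):
    # Placeholder for walking the album, resizing pictures and collecting metadata.
    time.sleep(0.1)
```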
I've been thinking of this option again this week-end, after I upgraded to 5.3.10 to solve the ImageBomb issue and the new version started rebuilding the cache from scratch. I only added the new pil_size_for_decompression_bomb_error option and activated a password-protected album, and boom! the scanner went on indexing for a few days, because of the number of media and the limited power of my server...
From what I understand, there are mainly two phases in the indexing:
Walking the albums tree and resizing pictures, while collecting metadata.
Generating the JSON files containing the metadata.
Phase 1 is the resource-intensive one and takes most of the time (hours), while phase 2 is much faster (a few seconds).
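In other words, the overall structure looks roughly like the sketch below; this is a stdlib-only illustration with made-up function names, the real scanner of course does much more per file (resizing, EXIF extraction, etc.):

```python
import json
import os


def phase1_collect(albums_root):
    """Phase 1: walk the albums tree and collect metadata (the slow part)."""
    metadata = {}
    for dirpath, _dirnames, filenames in os.walk(albums_root):
        album = os.path.relpath(dirpath, albums_root)
        entries = []
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Stand-in for resizing/encoding and metadata extraction.
            stat = os.stat(path)
            entries.append({"name": name, "size": stat.st_size, "mtime": stat.st_mtime})
        metadata[album] = entries
    return metadata


def phase2_write(metadata, index_path="index.json"):
    """Phase 2: dump the collected metadata to JSON (the fast part)."""
    with open(index_path, "w") as f:
        json.dump(metadata, f, indent=1)


if __name__ == "__main__":
    phase2_write(phase1_collect("albums"))
```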
When someone adds one new picture to an existing albums tree, the scanner still goes through these two phases:
Walking the tree to detect whether there are changes, while collecting metadata into memory.
When the new picture is found, resizing it and adding its metadata to the Python data structures.
Generating the JSON files containing the metadata.
Now in that scenario, phase 1 still exists but is much faster, since detecting changes is cheaper than generating the resized pictures. The JSON files generated in phase 2 include the new picture and are complete: the JavaScript can use them without errors.
All this to say that the indexing is incremental and that each run builds upon the previous ones. Even if phase 1 does not scan the content of the whole albums tree, the JSON files generated by phase 2 are complete (in the sense that they are sound and valid) and can be used by the JavaScript without causing errors.
Now suppose we add a limiting parameter to the configuration, either a max_scanner_duration that limits the phase 1 run time or a max_new_media that limits the number of new pictures that phase 1 processes. When this limit is reached, we stop phase 1 and continue to phase 2, generating the JSON indexes for the content that phase 1 has seen so far: we still get valid JSON indexes, and MyPhotoShare can display the gallery even if it does not yet contain all the albums.
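Concretely, the only change with respect to the two-phase sketch above would be a budget check inside the phase 1 loop; when the budget is exhausted we simply stop collecting and still run phase 2 on the partial data (again, all names here are made up for illustration):

```python
import json
import os
import time


def iter_media(albums_root):
    """Yield (album, name, full_path) for every file under the albums tree."""
    for dirpath, _dirnames, filenames in os.walk(albums_root):
        album = os.path.relpath(dirpath, albums_root)
        for name in sorted(filenames):
            yield album, name, os.path.join(dirpath, name)


def scan(albums_root, max_scanner_duration=None, max_new_media=None):
    start = time.time()
    new_media = 0
    metadata = {}

    # Phase 1, with a budget check before each expensive step.
    for album, name, path in iter_media(albums_root):
        if max_scanner_duration is not None and time.time() - start >= max_scanner_duration:
            break          # time budget reached: stop phase 1 here
        if max_new_media is not None and new_media >= max_new_media:
            break          # media budget reached: stop phase 1 here
        # Stand-in for the expensive resizing/encoding and metadata extraction.
        metadata.setdefault(album, []).append({"name": name, "size": os.path.getsize(path)})
        new_media += 1

    # Phase 2 always runs: the index is valid JSON even when it is only partial.
    with open("index.json", "w") as f:
        json.dump(metadata, f)
```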
Justification for this
The MyPhotoShare cron job is put in /etc/cron.daily. When the scanner process lasts more than 24 hours, a new process is started and stacked on top of the first one that hasn't completed yet, and eventually another one the day after because the previous two are competing for resources, and so on. The server is drowning... if Linux does not kill the processes once they have eaten all the resources.
If we are able to slice the scanner work into smaller runs of less than 24 hours, we can prevent such a situation from occurring. As said, the gallery does not show the whole content yet, but the missing content will be added the next day or the days after.
One could object that putting the cron script into /etc/cron.weekly gives more time to the indexer, but going down that path makes the user wait longer before getting any results. In the previous scenario, the user has partial results after one day and new content every day.
Before I look into the code, am I missing something? I can't see the difficulty in implementing it.
I think that it's much more complex: scanning a photo means retrieving its data in order to update up to 4 JSON files: the album's, the date album's, the GPS album's, and the search's. The latter may be many albums, according to descriptions, tags, etc.
Since the JSON files are saved only when all the data are ready, stopping the scanner after a given time means that the JSON files aren't updated, and then at the next run the media have to be scanned again.
Perhaps I did not explain myself clearly enough. The trick is to stop phase 1 after the timeout/maximum is reached, and then let the scanner continue with phase 2: create the JSON files with the content of the data structures as they are (partially filled) at the moment phase 1 is interrupted.
That way we can simulate an incremental scan. It is not really incremental, because the scanner starts from the beginning at each run, but going over already processed media is much faster (hash comparisons against the cache vs. encoding and creating new content in the cache). In the end, each execution can add new content to the JSON files.
My reasoning only holds if the encoding part (phase 1) is much longer than the JSON part (phase 2), which is my feeling. But I'll have to look at the code, since you wrote that the search JSON file is related to many albums...
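For what it's worth, something like the following is the kind of cheap check I mean; it is only an illustration using a content hash, the real scanner may key its cache differently:

```python
import hashlib
import json
import os

CACHE_INDEX = "cache_index.json"   # hypothetical: maps media path -> known hash


def file_hash(path, chunk_size=1 << 20):
    """Hash the file contents in chunks; much cheaper than re-encoding the media."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def load_cache_index():
    if os.path.exists(CACHE_INDEX):
        with open(CACHE_INDEX) as f:
            return json.load(f)
    return {}


def needs_processing(path, cache_index):
    """True only for media that are new or have changed since the last run."""
    return cache_index.get(path) != file_hash(path)
```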
It should work because the picture processing phase is much longer than the other steps. If we configure it to work for 12 hours (picture processing) and it takes 10 more minutes to create the JSON files, the extra 10 minutes are not really important...
We could add an option (enabled by default) that generates the JSON files after every directory. It would make the scanner slower but more robust.
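Something along these lines, with a hypothetical save_json_every_album option enabled by default (names and layout are only illustrative):

```python
import json
import os

SAVE_JSON_EVERY_ALBUM = True       # hypothetical option, enabled by default


def scan(albums_root, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)
    for dirpath, _dirnames, filenames in os.walk(albums_root):
        album = os.path.relpath(dirpath, albums_root)
        # Stand-in for the real per-media work (resizing, metadata extraction).
        entries = [{"name": n, "size": os.path.getsize(os.path.join(dirpath, n))}
                   for n in sorted(filenames)]
        if SAVE_JSON_EVERY_ALBUM:
            # Flush this album's JSON immediately: more writes, hence slower,
            # but an interrupted run leaves every processed album usable.
            out_name = "root" if album == "." else album.replace(os.sep, "_")
            with open(os.path.join(cache_dir, out_name + ".json"), "w") as f:
                json.dump(entries, f)
```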
Yes, good idea! Are you going to write the issue?
I did a quick test on my demo site and the incremental scan is working fine. I'll be able to fine-tune the scanner on my server now. I want to thank you, because it was a real pain when the scanner fired a complete scan and my server became unusable for a few days, with the process taking all the resources.
If the server is also used as a desktop computer, I think a default of 12 hours is enough for 2-3 thousand pictures, so that the user can also use the computer. You need more time (and a dedicated server) when you have more than ten thousand pictures or a lot of videos.