Add a max_scanner_duration config value that would indicate the maximum duration the scanner should run. When this value is reached, it should stop cleanly, possibly writing a resume file. When started again, it would read the resume file and try to complete the remaining work, again in less than max_scanner_duration, pausing again otherwise, until the scan eventually completes.
See #163 (closed) for a justification of such an option.
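To make the proposal concrete, here is a minimal sketch of the intended behaviour; only the option name max_scanner_duration comes from this issue, everything else (resume file location and format, helper names) is hypothetical:

```python
import json
import os
import time

# Hypothetical values; in practice they would come from the configuration file.
MAX_SCANNER_DURATION = 12 * 3600      # seconds
RESUME_FILE = "scanner.resume"        # remembers which albums were already processed


def scan(album_paths):
    """Process albums until done or until max_scanner_duration is reached."""
    start = time.time()

    # Resume: skip albums recorded as done by a previous, interrupted run.
    done = set()
    if os.path.exists(RESUME_FILE):
        with open(RESUME_FILE) as f:
            done = set(json.load(f))

    for path in album_paths:
        if path in done:
            continue
        if time.time() - start >= MAX_SCANNER_DURATION:
            # Time budget exhausted: write the resume file and stop cleanly.
            with open(RESUME_FILE, "w") as f:
                json.dump(sorted(done), f)
            return False              # scan is not complete yet
        process_album(path)           # stand-in for the real per-album work
        done.add(path)

    # Everything processed: the resume file is no longer needed.
    if os.path.exists(RESUME_FILE):
        os.remove(RESUME_FILE)
    return True


def process_album(path):
    # Placeholder for walking the album, resizing pictures and collecting metadata.
    time.sleep(0.1)
```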
I've been thinking of this option again this week-end, after I upgraded to 5.3.10 to solve the ImageBomb issue and the new version started rebuilding the cache from scratch. I only added the new pil_size_for_decompression_bomb_error option and activated a password-protected album, and boom! the scanner went on indexing for a few days, because of the number of media and the limited power of my server...
From what I understand, there are mainly two phases in the indexing:
Walking the albums tree and resizing pictures, while collecting metadata.
Generating the JSON files containing the metadata.
Phase 1 is the resource-intensive one and takes most of the time (hours), while phase 2 is much faster (a few seconds).
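In other words, the overall structure looks roughly like the sketch below; this is a stdlib-only illustration with made-up function names, the real scanner of course does much more per file (resizing, EXIF extraction, etc.):

```python
import json
import os


def phase1_collect(albums_root):
    """Phase 1: walk the albums tree and collect metadata (the slow part)."""
    metadata = {}
    for dirpath, _dirnames, filenames in os.walk(albums_root):
        album = os.path.relpath(dirpath, albums_root)
        entries = []
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Stand-in for resizing/encoding and metadata extraction.
            stat = os.stat(path)
            entries.append({"name": name, "size": stat.st_size, "mtime": stat.st_mtime})
        metadata[album] = entries
    return metadata


def phase2_write(metadata, index_path="index.json"):
    """Phase 2: dump the collected metadata to JSON (the fast part)."""
    with open(index_path, "w") as f:
        json.dump(metadata, f, indent=1)


if __name__ == "__main__":
    phase2_write(phase1_collect("albums"))
```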
When someone adds one new picture to an existing albums tree, the scanner still goes through these two phases:
Walking the tree to detect whether there are changes, while collecting metadata into memory.
When the new picture is found, resizing it and adding its metadata to the Python data structures.
Generating the JSON files containing the metadata.
Now in that scenario, phase 1 still exists but is much faster, since detecting changes is cheaper than generating the resized pictures. The JSON files generated in phase 2 include the new picture and are complete: the JavaScript can use them without errors.
All this to say that the indexing is incremental and that each run builds upon the previous ones. Even if phase 1 does not scan the content of the whole albums tree, the JSON files generated by phase 2 are complete (in the sense that they are sound and valid) and can be used by the JavaScript without causing errors.
Now suppose we add a limiting parameter to the configuration, either a max_scanner_duration that limits the phase 1 run time or a max_new_media that limits the number of new pictures that phase 1 processes. When this limit is reached, we stop phase 1 and continue to phase 2, generating the JSON indexes for the content that phase 1 has seen so far: we still get valid JSON indexes, and MyPhotoShare can display the gallery even if it does not yet contain all the albums.
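Concretely, the only change with respect to the two-phase sketch above would be a budget check inside the phase 1 loop; when the budget is exhausted we simply stop collecting and still run phase 2 on the partial data (again, all names here are made up for illustration):

```python
import json
import os
import time


def iter_media(albums_root):
    """Yield (album, name, full_path) for every file under the albums tree."""
    for dirpath, _dirnames, filenames in os.walk(albums_root):
        album = os.path.relpath(dirpath, albums_root)
        for name in sorted(filenames):
            yield album, name, os.path.join(dirpath, name)


def scan(albums_root, max_scanner_duration=None, max_new_media=None):
    start = time.time()
    new_media = 0
    metadata = {}

    # Phase 1, with a budget check before each expensive step.
    for album, name, path in iter_media(albums_root):
        if max_scanner_duration is not None and time.time() - start >= max_scanner_duration:
            break          # time budget reached: stop phase 1 here
        if max_new_media is not None and new_media >= max_new_media:
            break          # media budget reached: stop phase 1 here
        # Stand-in for the expensive resizing/encoding and metadata extraction.
        metadata.setdefault(album, []).append({"name": name, "size": os.path.getsize(path)})
        new_media += 1

    # Phase 2 always runs: the index is valid JSON even when it is only partial.
    with open("index.json", "w") as f:
        json.dump(metadata, f)
```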
Justification for this
The MyPhotoShare cron job is put in /etc/cron.daily. When the scanner process lasts more than 24 hours, a new process is started and stacked on top of the first one that hasn't completed yet, and eventually another one the day after because the previous two are competing for resources, and so on. The server is drowning... if Linux does not kill the processes once they have eaten all the resources.
If we are able to slice the scanner work into smaller runs of less than 24 hours, we can prevent such a situation from occurring. As said, the gallery does not show the whole content yet, but the missing content will be added the next day or the days after.
One could object that putting the cron script into /etc/cron.weekly gives more time to the indexer, but going down that path makes the user wait longer before getting any results. In the previous scenario, the user has partial results after one day and new content every day.
Before I look into the code, am I missing something? I can't see the difficulty in implementing it.
I think that it's much more complex: scanning a photo means retrieving its data in order to update up to 4 JSON files: the album's, the date album's, the GPS album's, and the search's. The latter may be many albums, according to descriptions, tags, etc.
Since the JSON files are saved only when all the data are ready, stopping the scanner after a given time means that the JSON files aren't updated, and then at the next run the media have to be scanned again.
Perhaps I did not explain myself clearly enough. The trick is to stop phase 1 after the timeout/maximum is reached, and then let the scanner continue with phase 2: create the JSON files with the content of the data structures as they are (partially filled) at the moment phase 1 is interrupted.
That way we can simulate an incremental scan. It is not really incremental, because the scanner starts from the beginning at each run, but going over already processed media is much faster (hash comparisons against the cache vs. encoding and creating new content in the cache). In the end, each execution can add new content to the JSON files.
My reasoning only holds if the encoding part (phase 1) is much longer than the JSON part (phase 2), which is my feeling. But I'll have to look at the code, since you wrote that the search JSON file is related to many albums...
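For what it's worth, something like the following is the kind of cheap check I mean; it is only an illustration using a content hash, the real scanner may key its cache differently:

```python
import hashlib
import json
import os

CACHE_INDEX = "cache_index.json"   # hypothetical: maps media path -> known hash


def file_hash(path, chunk_size=1 << 20):
    """Hash the file contents in chunks; much cheaper than re-encoding the media."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def load_cache_index():
    if os.path.exists(CACHE_INDEX):
        with open(CACHE_INDEX) as f:
            return json.load(f)
    return {}


def needs_processing(path, cache_index):
    """True only for media that are new or have changed since the last run."""
    return cache_index.get(path) != file_hash(path)
```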
It should work because the picture processing phase is much longer than the other steps. If we configure it to work for 12 hours (picture processing) and it takes 10 more minutes to create the JSON files, the extra 10 minutes are not really important...
We could add an option (enabled by default) that generates the JSON files after every directory. It would make the scanner slower but more robust.
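Something along these lines, with a hypothetical save_json_every_album option enabled by default (names and layout are only illustrative):

```python
import json
import os

SAVE_JSON_EVERY_ALBUM = True       # hypothetical option, enabled by default


def scan(albums_root, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)
    for dirpath, _dirnames, filenames in os.walk(albums_root):
        album = os.path.relpath(dirpath, albums_root)
        # Stand-in for the real per-media work (resizing, metadata extraction).
        entries = [{"name": n, "size": os.path.getsize(os.path.join(dirpath, n))}
                   for n in sorted(filenames)]
        if SAVE_JSON_EVERY_ALBUM:
            # Flush this album's JSON immediately: more writes, hence slower,
            # but an interrupted run leaves every processed album usable.
            out_name = "root" if album == "." else album.replace(os.sep, "_")
            with open(os.path.join(cache_dir, out_name + ".json"), "w") as f:
                json.dump(entries, f)
```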
Yes, good idea! Are you going to write the issue?
I did a quick test on my demo site and the incremental scan is working fine. I'll be able to fine-tune the scanner on my server now. I want to thank you, because it was a real pain when the scanner fired a complete scan and my server became unusable for a few days, with the process taking all the resources.
If the server is also used as a desktop computer, I think a default of 12 hours is enough for 2-3 thousand pictures, so that the user can also use the computer. You need more time (and a dedicated server) when you have more than ten thousand pictures or a lot of videos.