User-changeable chain length

Chain length is an important measure of similarity significance. For example:

In an English text, concluding that two authors have a high degree of similarity based on

  1. a high number of matching 3-letter sequences (the, are) is insignificant.
  2. the entirety of Tolstoy's translated works replicated exactly is significant.

We resolve this by maintaining a constant value, "chainlength" (or "chain length significance", or something like that). Any chain below this length is immediately discarded as insignificant.
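The discard rule above can be sketched as a simple filter. This is a minimal illustration, not the tool's actual code: the `Chain` tuple layout, the function name `significant_chains`, and the sample data are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (start in text A, start in text B, length).
Chain = Tuple[int, int, int]


def significant_chains(chains: List[Chain], chain_length: int) -> List[Chain]:
    """Keep only chains whose length is at least chain_length."""
    return [c for c in chains if c[2] >= chain_length]


# Made-up sample: two 3-character matches and one very long match.
chains = [(0, 10, 3), (40, 2, 3), (100, 250, 1200)]
kept = significant_chains(chains, chain_length=8)
```

With a threshold of 8, only the long chain survives; the two short matches are dropped as insignificant.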

The significant length depends on the context of use. For example:

  • If I am looking for general social relationships in the classroom, a low significance threshold may be of interest.
  • If the institution is attempting to change a culture of copying, it may be valuable to filter out shorter chains to find the worst cases.
  • If I am engaged in a legal dispute, I probably want to focus on the most obvious and extreme cases to avoid frivolous counterarguments.
  • DNA is different from software, which is different from English. The significant length is different in each of these cases.

Therefore, users should be able to adjust the properties governing comparison chain selection to suit their needs.
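One way such an adjustable property might look is a settings object with a per-content-type default that the user can override. This is only a sketch: the class name `ComparisonSettings`, the content-type keys, and especially the default numbers are placeholders invented for illustration, not measured or recommended values.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder defaults per content type; illustrative values only.
DEFAULT_CHAIN_LENGTH = {"english": 8, "software": 5, "dna": 20}


@dataclass
class ComparisonSettings:
    content_type: str = "english"
    # None means "use the default for content_type".
    chain_length: Optional[int] = None

    def effective_chain_length(self) -> int:
        """Return the user's override if set, else the content-type default."""
        if self.chain_length is not None:
            if self.chain_length < 1:
                raise ValueError("chain_length must be at least 1")
            return self.chain_length
        return DEFAULT_CHAIN_LENGTH[self.content_type]


# A user in a legal dispute raises the threshold to see only extreme cases.
strict = ComparisonSettings(content_type="english", chain_length=200)
```

Keeping the override separate from the default means the defaults can be revised later without touching what users have explicitly chosen.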

It is probably worth gathering statistics on what values people change this to, to inform the default setting.
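As a rough sketch of how those statistics could feed a default, the values users choose might be summarised with a median and a histogram. The function name `suggest_default` and the sample numbers are hypothetical; the median is just one reasonable summary, not a decided approach.

```python
from collections import Counter
from statistics import median


def suggest_default(chosen_values):
    """Return (median, histogram) of the chain lengths users picked."""
    return median(chosen_values), Counter(chosen_values)


# Made-up sample of values users set; real data would come from usage logs.
observed = [5, 8, 8, 10, 12, 8, 20]
suggested, histogram = suggest_default(observed)
```

The median is less sensitive than the mean to a few users who pick extreme thresholds, which matters if legal-dispute users choose very high values.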