Improve speed of extracting reads belonging to whitelisted barcodes.
Related to workflow 3. Currently we do a for loop in Python, going through each line in the unfiltered FASTQ of R1, taking the sequence in the first 16 positions and checking whether it is contained in the list of whitelisted barcodes.
Staying within the Python script, we could instead loop through reads (instead of lines), for example like this:
with open("testing/FASTQ/sample_1_R1_2300000.fastq") as fastq:
    # read content of FASTQ
    fastq_lines = fastq.readlines()

# get number of lines
num_lines = len(fastq_lines)
# get number of reads (each FASTQ record spans 4 lines)
num_reads = num_lines // 4

# indices of the lines we want to keep
lines_for_filtered_fastq = []

# for every read (filtered_barcode_list is assumed to be defined above)
for i in range(1, num_reads + 1):
    # the barcode is the first 16 bases of the sequence line
    read_sequence = fastq_lines[(i - 1) * 4 + 1].rstrip("\n")[0:16]
    if read_sequence in filtered_barcode_list:
        lines_for_filtered_fastq.append((i - 1) * 4)
        lines_for_filtered_fastq.append((i - 1) * 4 + 1)
        lines_for_filtered_fastq.append((i - 1) * 4 + 2)
        lines_for_filtered_fastq.append((i - 1) * 4 + 3)
Then we just have to write the lines at the indices collected in lines_for_filtered_fastq to a file. From some initial testing this works and is more concise than how we are doing it currently, but it can probably still be improved.
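One possible refinement, sketched below under the assumption that the whitelist fits in memory (file paths and names are illustrative): store the barcodes in a set, so each membership test is O(1) instead of a scan through a list, and stream the FASTQ four lines at a time instead of reading the whole file into memory first.

```python
from itertools import islice

def filter_fastq(fastq_path, out_path, whitelist, barcode_len=16):
    """Write only the reads whose first `barcode_len` bases are whitelisted."""
    whitelist = set(whitelist)  # O(1) membership tests instead of a list scan
    with open(fastq_path) as fastq, open(out_path, "w") as out:
        while True:
            # one FASTQ record: header, sequence, separator, quality
            read = list(islice(fastq, 4))
            if len(read) < 4:
                break
            if read[1][:barcode_len] in whitelist:
                out.writelines(read)
```

This also keeps memory usage flat regardless of FASTQ size, since matching reads are written out immediately instead of collecting line indices first.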
Looking for a solution outside of Python, I'm experimenting with grep. The idea is the following. First, we generate the whitelist of barcodes. Then we add a ^ to the beginning of each line, because we are building regex patterns that anchor the barcode to the start of the line (sed -i -e 's/^/^/' whitelist.txt). Note that this command overwrites the file in place, but with a small adjustment we can write to a new file instead (sed -e 's/^/^/' whitelist.txt > new_whitelist.txt). Then, we run a grep and pass the whitelist file as a list of patterns:
LC_ALL=C \
grep -A 2 -B 1 \
-f whitelist.txt \
FASTQ_c_graft.fastq \
| sed '/--/d' \
> out.fq
- LC_ALL=C forces the C locale, so grep compares plain bytes (ASCII) instead of doing locale-aware multibyte matching, which improves speed.
- grep -A 2 -B 1 prints, for every match, the 1 preceding and 2 following lines as well, so we get the complete 4-line read.
- sed '/--/d' removes the -- separators that grep inserts between matches.
I also read that it is possible to parallelize even more using the GNU parallel
command:
cat FASTQ_c_graft.fastq \
| LC_ALL=C parallel --pipe --block 1000M grep -A 2 -B 1 \
-f whitelist.txt \
| sed '/--/d' \
> out.fq
Note that LC_ALL=C has to be set on the command where grep actually runs (here, the parallel invocation), not on cat. Also, with --pipe a block boundary can fall in the middle of a 4-line read; parallel's -L 4 option, which treats every 4 lines as one record, should prevent that, but I haven't tested it yet.
However, it seems that both still take a considerable amount of time to finish (they are not finished yet...).
I'm also afraid that with an increasing number of reads and whitelisted barcodes the runtime might grow steeply (with grep -f and regex patterns, the work is roughly proportional to the number of patterns times the size of the FASTQ), which is always bad for scaling.
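If we want to stay in Python, the same chunked-parallelism idea can be sketched with the standard library's multiprocessing module (a sketch only, not benchmarked; the function names and chunk sizes are mine, not part of the current workflow):

```python
from multiprocessing import Pool

BARCODE_LEN = 16

def _filter_chunk(args):
    """Keep the 4-line reads whose first BARCODE_LEN bases are whitelisted."""
    reads, whitelist = args
    kept = []
    for read in reads:
        if read[1][:BARCODE_LEN] in whitelist:
            kept.extend(read)
    return kept

def filter_fastq_parallel(fastq_lines, whitelist, processes=4):
    whitelist = set(whitelist)
    # group the flat line list into 4-line reads so a read is never split
    reads = [fastq_lines[i:i + 4] for i in range(0, len(fastq_lines), 4)]
    # split the reads into roughly one chunk per worker
    chunk_size = max(1, len(reads) // processes)
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    with Pool(processes) as pool:
        results = pool.map(_filter_chunk, [(c, whitelist) for c in chunks])
    # pool.map preserves chunk order, so output order matches input order
    return [line for chunk in results for line in chunk]
```

Unlike the grep pipeline, chunking by whole reads here cannot cut a record in half at a block boundary.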
The problem is also that we are dealing with such a long whitelist because those barcodes are not error-corrected yet. However, I don't really see how that can be done at this step, since the reads in the FASTQ contain errors as well. And apart from that, I still have no idea how to do the error correction in the first place. The C++ source code of STARsolo (https://github.com/alexdobin/STAR/tree/master/source) might be useful, but I currently don't have the time to dig into that :/
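On the error-correction question: one common strategy (used, as far as I know, by tools like STARsolo and CellRanger) is to rescue an observed barcode if it is within Hamming distance 1 of exactly one whitelisted barcode. A simplified sketch, ignoring the base-quality and barcode-abundance information the real tools use to break ties:

```python
def correct_barcode(barcode, whitelist, alphabet="ACGT"):
    """Return the barcode itself, its unique 1-mismatch whitelisted
    neighbour, or None if there is no unambiguous match."""
    if barcode in whitelist:
        return barcode
    hits = set()
    # try every single-base substitution and collect whitelisted candidates
    for pos in range(len(barcode)):
        for base in alphabet:
            if base == barcode[pos]:
                continue
            candidate = barcode[:pos] + base + barcode[pos + 1:]
            if candidate in whitelist:
                hits.add(candidate)
    if len(hits) == 1:
        return hits.pop()
    return None  # no whitelisted neighbour, or an ambiguous (>1) match
```

For a 16 bp barcode this is only 48 set lookups per read, so it stays cheap as long as the whitelist is held in a set.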
I just wanted to take note of these thoughts.