README.md

# STRipy-pipeline

## Description
STRipy-pipeline is non-graphical command-line version of [STRipy](https://stripy.org) that can be integrated into pipelines and analyse multiple STR or VNTR loci in parallel. STRipy allows to use BAM and CRAM files, aligned on either hs1 (T2T-CHM13), hg38 or hg19 reference genomes. STRipy can also genotype hg38/hg19 reference-missing XYLT1 locus. There is a Docker container available that contains all the required tools. Genotyping results along with the alignment visualisations in HTML format are generated for an easy review (plain JSON file can also be outputted).

STRipy is primarily designed as an initial screening tool for tandem repeat expansions, performing well with short expansions within a read. However, for longer ones, particularly those exceeding the fragment length (over a few hundred base pairs), it is important to verify its results due to limitations in short-read sequencing (e.g. by using long-read sequencing).

## Requirements and  installation instructions
STRipy-pipeline requires [Python 3](https://www.python.org/downloads/) to run, have access to [ExpansionHunter's](https://github.com/Illumina/ExpansionHunter/releases), [REViewer's](https://github.com/Illumina/REViewer/releases), [Samtools](http://www.htslib.org/download/) executables as well as reference genome you want to use. For analysing XYLT1, [BWA](https://github.com/lh3/bwa) is also required.

1. Clone the repository to create a local copy on your system by running the following command in terminal:
`git clone git@gitlab.com:andreassh/stripy-pipeline.git`

*Skip the following steps when using Docker and read how to use it in the Usage section below.*

2. Download and install required tools into your system (if not installed already):
* Python 3: [https://www.python.org/downloads/](https://www.python.org/downloads/)
* ExpansionHunter: [https://github.com/Illumina/ExpansionHunter/releases](https://github.com/Illumina/ExpansionHunter/releases)
* REViewer: [https://github.com/Illumina/REViewer/releases](https://github.com/Illumina/REViewer/releases)
* Samtools: [http://www.htslib.org/download/](http://www.htslib.org/download/)
* BWA: [https://github.com/lh3/bwa](https://github.com/lh3/bwa) _(XYLT1 only)_

3. If those tools are in $PATH (i.e. you can run them from command line by typing the tool name) then you are good to go. Otherwise, you need to edit the [config.json](./config.json) file and specify the file path for the tools.

4. To add the required Python modules for STRipy, run the following command in the root folder: `python3 -m pip install -r requirements.txt`

## Configuration
You can configure some parameters by editing the [config.json](./config.json) file. Please see the description of parameters and options in the following table:
| Parameter                    | Default value       | Description                                                                                                     |
|------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------------|
| filepath_expansionhunter     | ExpansionHunter     | Path where ExpansionHunter's executable file is stored. If you added bin/ into your $PATH then no need to edit this, otherwise use either "bin/ExpansionHunter" or "full/path/to/stripy-pipeline/bin/ExpansionHunter" |
| filepath_reviewer            | REViewer            | Path where REViewer's executable file is stored. If you added bin/ into your $PATH then no need to edit.        |
| filepath_samtools            | samtools            | Path where Samtools's executable file is stored. It *should* be in your $PATH after installation.               |
| filepath_bwa                 | bwa                 | Path where BWA's executable file is stored. It *should* be in your $PATH after installation.                    |
| num_threads                  | all                 | Number of CPU threads/processors to use (e.g. 2, 4, 8 (NB! numerical values are without quotes) or "all" to use all available cores/threads)                       |
| verbose                      | false               | Use verbose mode for error tracking? Set true to print out extra information during the analysis.                       |
| output_html                  | true                | Output HTML file? true or false. HTML file is for graphical viewing of results (recommended). If this is set to false then SVG image is not stored in the output JSON file (next parameter).                                   |
| output_json                  | false               | Output JSON file? true or false. The file contains results in JSON (text) format and can be used to analyse data with programs.                                                |
| output_tsv                   | true                | Output TSV file? true or false. The file contains the main results in TSV (text) format and can be used to open in Excel, etc.                                                |
| output_svg_flag_threshold    | 1                   | Flag threshold to display read alignment images (SVG) in HTML output (0 = show for all loci, 1 = show for unknown (all), intermediate and pathogenic, 2 = show for unknown (population outliers), intermediate and pathogenic, 3 = show only for pathogenic loci.                                                |
| log_flag_threshold           | 1                   | Flag threshold for log file (when using --logflags). Specify -1 to log all analysed loci (including errors), specify 0 to log all loci which genotype was determined (including those in normal range), specify 1 to 3 to log only loci with positive flags (1 = unknown, 2 = intermediate/unknown population outlier and 3 = pathogenic range; e.g. specifying "2" logs only those in intermediate and pathogenic range) |
| use_alt_contigs              | false               | Use alternative contigs? true or false. Turning it on can also look reads from alternative contigs (e.g. chr16_KZ559113v1_fix, HLA-..., etc.), but in that case, the same reference genome your samples are aligned to has to be used. If the alternative contig does not exist in the reference file, this will result an error. It is likely having an effect only for CNBP and XYLT1 locus.                                              |
| extended_analysis            | stripy              | Use "eh_only" to genotype a locus with off-target regions directly with ExpansionHunter. Use "stripy" to pre-process reads from off-target regions beforehand (aims to eliminate some noise) |
| ea_max_depth_offtarget       | 1000                | Maximum mean depth of an off-target region to include. Some regions (e.g. LINC00486) may have hundreds of thousands of reads which processing takes a very long time, therefore do not process those regions to speed up analysis. |
| eh_analysis_mode             | seeking             | ExpansionHunter's analysis mode - seeking or streaming. [Read more](https://github.com/Illumina/ExpansionHunter/blob/master/docs/03_Usage.md#analysis-modes)                                         |
| fregion_length_bp            | 2000                | How many nucleotides are extracted out before as well as after the STR locus.                                    |
| min_no_reads_eh              | 10                  | Filter (ExpansionHunter): if the number of all reads combined (spanning, flanking and in-repeat) is below that value, then consider the locus low coverage and filter results out. Value less than 10 can be used only if the version of ExpansionHunter in the system supports it.  |
| min_no_reads_stripy          | 0                   | Filter (STRipy): if the number of spanning and flanking reads combined with the pathogenic motif is below that value, then filter it out. STRipy counts only pathogenic motifs (unlike ExpansionHunter) and therefore using some threshold here will help to analyse Nested/Replaced type of reads as results which doesn't contain or has only few reads with pathogenic motif, will be filtered out. Change to 0 if you want to see results for loci such as RFC1 that does not contain pathogenic motifs.

## Usage
### Single sample
Run STRipy from command line by specifying the input file, reference genome, output folder, analysis method and either known pathogenic locus/loci or custom ones. For instance, by using the example file, we can genotype AFF2, ATXN3, HTT and PHOX2B loci in Sample001_hg38.bam file by running the following command:
```
python3 stri.py \
        --genome hg38 \
        --reference reference/hg38.fa \
        --output results/ \
        --locus AFF2,ATXN3,HTT,PHOX2B \
        --input examples/Sample001_hg38.bam
```

There is also Sample002_hs1.bam which is aligned on hs1 (telomere-to-telomere) reference genome, where DMPK and XYLT1 locus can be genotyped.

NB! Find the locus ID that you want to target from [STRipy's database](https://stripy.org/database) (first column).

Names and description of all parameters that can be specified:

| Parameter     |  Description                                                                                                    |
|---------------|-----------------------------------------------------------------------------------------------------------------|
| genome        |  (Optional) Reference file genome (hs1, hg19 or hg38). Default is hg38.        |
| reference     |  Reference genome FASTA file path.        |
| output        |  Path to a folder where results (HTML and/or JSON file) will be saved.        |
| locus         |  Name of locus or list of loci to target. If using a list, please separate by comma. For example, to analyse only HTT locus use `--locus HTT` and to analyse multiple, separate locus names by comma. Required if no custom loci are used (otherwise optional).        |
| input         | Path to an input file (BAM or CRAM). |
| analysis      | (Optional) Type of analysis (standard or extended). Extended can also determine expansions longer than the fragment length. However, "standard" is a recommended method (and the default option) that is significantly faster. Extended analysis could be used after standard analysis resulted a locus with long expansion (over the read length).       |
| sex           |  (Optional) Can be either male or female (for example, `--sex male`). You might want to use this while analysing loci in the X chromosome of male samples. If not specified, samples will be analysed as "females" (i.e. two alleles expected).    |
| custom        |  (Optional) Path to a BED file that contains coordinates for analysing custom loci.        |
| logflags      |  (Optional) Log results (input file name, locus and flag) into a specified tab-delimited text file (results will be appended). Flag threshold can be specified in config file. 0 = normal, 1 = unknown, 2 = intermediate, 3 = pathogenic range    |
| config        |  (Optional) File path of the config file. If left empty then the default one in the root folder will be used.    |


#### Using custom loci
Besides the known pathogenic loci, custom ones are also supported. To analyse those, firstly create a tab-delimited BED file containing at least the following four values: chromosome, start and end position of the STR locus and motif on the plus strand. Optionally, the locus name/ID can be specified as fifth value. Additionally, you can also specify disease name, inheritance, normal range and pathogenic cut-off values which are then being used to colourise results. Please see the [example BED file](./examples/vntr.bed) that contains information to genotype four VNTRs, including information about the associated disease. When running STRipy, specify the bed file, such as: `--custom examples/vntr.bed` (can be additional to `--locus` or replacing it). Now, STRipy catalogue contains VNTRs as well and those loci can be specified in `--locus`.


### Batch processing
If you want to process multiple samples at once (such as all in a folder) then this can be done through a batch.sh script. First run `chmod +x batch.sh` to give executable permissions for the script. Then, for example, to genotype all BAM files in the examples/ folder, run:
```
./batch.sh \
        -g hg38 \
        -r reference/hg38.fa \
        -o results/ \
        -l AFF2,ATXN3,HTT,PHOX2B \
        -i examples/*hg38.bam
```

NB! The input file (-i) parameter has to be the last. This also allows to specify multiple files in a row, such as `-i examples/Sample001.bam examples/SampleTwo.cram etc...`

All parameters are as follows (same as above when analysing single sample, only shortened):
```
  -g <genome>:    (Optional) Name of the reference genome (hs1, hg19 or hg38; default: hg38)
  -r <reference>: (Required) Reference genome (FASTA file)
  -o <output>:    (Required) Output folder
  -l <locus>:     (Required) Locus ID to target (or loci as CSV string)
  -a <analysis>:  (Optional) Analysis type (standard or extended; default: standard)
  -s <sex>:       (Optional) Sex of the sample
  -c <custom>:    (Optional) Custom loci file path (BED file)
  -f <logflags>:  (Optional) File path of flag log file
  -n <config>:    (Optional) File path of the config file
  -i <input>:     (Required) Input file(s) (indexed BAM or CRAM, e.g.: *.bam or: Sample001.bam Sample002.bam); must be the last parameter
```

### Using via Docker
If you have Docker installed into your system then it is very easy to analyse your samples by using `runDocker.sh` script. At the first run, the script builds the container and all the required tools for analysis (ExpansionHunter, REViewer, Samtools, BWA and Python) will be installed. You can specify the same parameters as for the batch script above. NB! When using wildcard, the path has to be between quotes, such as:
```
./runDocker.sh \
        -g hs1 \
        -r reference/hs1.fa \
        -o results/ \
        -l DMPK,XYLT1 \
        -i "examples/*hs1.bam"
```

### List of loci
Here are sets of loci you might be interested of using:
* All loci: `--locus AFF2,AR,ARX_1,ARX_2,ATN1,ATXN1,ATXN10,ATXN2,ATXN3,ATXN7,ATXN8OS,BEAN1,C9ORF72,CACNA1A,CBL,CNBP,COMP,DAB1,DIP2B,DMD,DMPK,FGF14,FMR1,FOXL2,FXN,GIPC1,GLS,HOXA13_1,HOXA13_2,HOXA13_3,HOXD13,HTT,JPH3,LRP12,MARCHF6,NIPA1,NOP56,NOTCH2NLC,NUTM2B-AS1,PABPN1,PHOX2B,PPP2R2B,PRDM12,RAPGEF2,RFC1,RILPL1,RUNX2,SAMD12,SOX3,STARD7,TBP,TBX1,TCF4,TNRC6A,XYLT1,YEATS2,ZIC2,ZIC3`
* Only loci with a neurological disease: `--locus AFF2,AR,ARX_1,ARX_2,ATN1,ATXN1,ATXN10,ATXN2,ATXN3,ATXN7,ATXN8OS,BEAN1,C9ORF72,CACNA1A,CBL,CNBP,DAB1,DIP2B,DMPK,FGF14,FMR1,FXN,GLS,HTT,JPH3,LRP12,MARCHF6,NIPA1,NOP56,NOTCH2NLC,NUTM2B-AS1,PABPN1,PHOX2B,PPP2R2B,PRDM12,RAPGEF2,RFC1,SAMD12,SOX3,STARD7,TBP,TNRC6A,YEATS2,ZIC2,ZFHX3`
* Only loci with childhood onset diseases: `--locus AFF2,ARX_1,ARX_2,ATN1,ATXN2,ATXN3,CBL,COMP,DIP2B,DMD,DMPK,FMR1,FOXL2,FXN,GLS,HOXA13_1,HOXA13_2,HOXA13_3,HOXD13,PHOX2B,PPP2R2B,PRDM12,RUNX2,SOX3,TBP,TBX1,XYLT1,ZIC2,ZIC3`
* Only loci in coding region: `--locus AR,ARX_1,ARX_2,ATN1,ATXN1,ATXN2,ATXN3,ATXN7,CACNA1A,COMP,FOXL2,HOXA13_1,HOXA13_2,HOXA13_3,HOXD13,HTT,PABPN1,PHOX2B,PRDM12,RUNX2,SOX3,TBP,TBX1,ZIC2,ZIC3,ZFHX3`
* Only loci in UTRs: `--locus AFF2,ATXN8OS,CBL,DIP2B,DMPK,FMR1,GIPC1,GLS,JPH3,LRP12,NIPA1,NOTCH2NLC,PPP2R2B,RILPL1,XYLT1`
* Only Standard type repeats: `--locus AFF2,AR,ATN1,ATXN1,ATXN10,ATXN2,ATXN3,ATXN7,ATXN8OS,C9ORF72,CACNA1A,CBL,CNBP,COMP,DIP2B,DMD,DMPK,FMR1,FXN,GIPC1,GLS,HTT,JPH3,LRP12,NIPA1,NOP56,NOTCH2NLC,NUTM2B-AS1,PABPN1,PPP2R2B,PRDM12,TBP,TCF4,XYLT1,ZIC3,ZFHX3`
* Only Imperfect GCNs repeats: `--locus ARX_1,ARX_2,FOXL2,HOXA13_1,HOXA13_2,HOXA13_3,HOXD13,PHOX2B,RUNX2,SOX3,TBX1,ZIC2`
* Only Replaced/Nested repeats: `--locus BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2`
* Only VNTRs: `--locus CSTB,EIF4A3,PRNP,VWA1`

### Is there a way to create a file that I can open in Excel?
Yes, either turn on "output_tsv" in config file, or you can convert it from created JSON output file by using linux tool 'jq'. Then, this file can be opened with Excel or other similar software. For example:

```
jq -r '
.GenotypingResults[] | to_entries[] |
  .value.TargetedLocus.LocusID + "\t" +
  .value.TargetedLocus.Coordinates + "\t" +
  .value.TargetedLocus.Motif + "\t" +
  (.value.Metadata.Coverage|tostring) + "\t" +
  (.value.Alleles[0].Repeats|tostring) + "\t" +
  (.value.Alleles[1].Repeats|tostring) + "\t" +
  (.value.Alleles[0].CI.Min|tostring) + "-" + (.value.Alleles[0].CI.Max|tostring) + "\t" +
  (.value.Alleles[1].CI.Min|tostring) + "-" + (.value.Alleles[1].CI.Max|tostring) + "\t" +
  (.value.Alleles[0].IsPopulationOutlier|tostring) + "\t" +
  (.value.Alleles[1].IsPopulationOutlier|tostring) + "\t" +
  .value.Alleles[0].Range + "\t" +
  .value.Alleles[1].Range + "\t" +
  (.value.TargetedLocus.CorrespondingDisease | to_entries | map(.value.DiseaseSymbol) | join(", "))
' YourSampleFile.json
```

Columns: `Locus ID, Coordinates, Motif, Coverage, Allele 1 repeats, Allele 2 repeats, Allele 1 CI, Allele 2 CI, Allele 1 is population outlier, Allele 2 is population outlier, Allele 1 range (normal/intermediate/pathogenic), Allele 2 range, Corrsponding disease(s)`

## Contact
If you have any suggestions, questions or issues to report then contact the author via the contact form on [this page](https://stripy.org/about).

Also, don't hesitate to let me know if you would like to swap the STRipy logo on the results page with yours ;)