Commit 9b601009 authored by San Lucas's avatar San Lucas
Browse files

recur: initial check-in.

parents
RECUR Usage FAQ
What genome builds are supported?
The default is set to hg19, but there is an option to provide your own reference file with chromosome lengths and centromere positions for other genome builds.
What types of data can I analyze?
You can use data from both snp and next-gen sequencing platforms including targeted sequencing. The number and distribution of heterozygous markers will dictate the power for the detection of bidirectional allelic imbalance.
Can I use this algorithm on other species?
Yes, you would have to provide a genome reference file with chromosome lengths and centromere boundaries.
How does the user denote missing genotype and BAF/LRR values?
The algorithm ignores genotypes that are not heterozygous.
How are BAF/RAF deviations and LRR/read-depth deviations calculated?
Set to 0.5 and 0 if no predefined AI segments are present; else they are calculated as the median BAF/RAF shift and LRR/read-depth value for markers outside AI segments.
My files have homozygous markers, can I still use RECUR?
Yes, the algorithm ignores homozygous markers.
How does the window testing option work? How many markers should I use?
This option is best for refining AI profiles in a specific genomic segment, such as a chromosome arm.
The window testing option breaks AI segments into windows of N markers (where N is supplied by the user). It then treats each window as an independent AI event. The default setting uses the sample with the highest BAF deviation at each window as the reference. There is the option to have only one reference sample (the sample with the largest median BAF deviation across all AI segments). The latter can be useful when running the window option on two or more samples with very similar BAF deviation amplitudes which can result in color flips but no refinement of bidirectional AI event boundaries.
We suggest using 20 or more markers for the window testing as a smaller number will decrease statistical power for detection of bidirectional AI. All statistically significant windows are reported, therefore, it may be best to test 500-100 marker windows first. Small window sizes and large AI segments for testing will slow down the algorithm. The window option may report a very large number of bidirectional events in cases where there are several samples and/or large event(s).
## Detection and visualization of mirrored allelic imbalance in R: `recur.R`
*Mirrored allelic imbalance (AI)* refers to a phenomenon where, over a region of the genome
exhibiting AI across 2 samples from a single individual (often tumor samples),
the over-represented and under-represented haplotypes are swapped
(see the regions of the genome in the example below where the blue and red haplotypes are
incosistent across the 3 samples).
**This application tests for the presence of and generates visualizations for
mirrored AI within multiple genotyped samples of an individual.**
Contributors: Jerry Fowler, Sasha Jakubek, Anthony San Lucas, Paul Scheet
![Sample 2 mirrored AI events](examples/test_output/sub_clone_baf_test.tumor_2.gsloh.png)
![Sample 3 mirrored AI events](examples/test_output/sub_clone_baf_test.tumor_3.gsloh.png)
![Sample 4 mirrored AI events](examples/test_output/sub_clone_baf_test.tumor_4.gsloh.png)
## Installation notes
Download the application from gitlab. The `recur.R` script is written in R and is run through the `Rscript recur.R` command. The script currently has the following package dependencies.
Install these packages.
```
install.packages("argparse")
install.packages("data.table")
install.packages("plyr")
```
## Quick start
### A simple example
Some examples are packaged together with the software. In the following example, 2 regions of a tumor were genotyped, and `recur.R` is used to identify regions of the genome between the two samples that exhibit mirrored AI.
Go into the `examples` directory and run the `test.baf.ucs_0609.sh` script, which specifies the `Rscript recur.R` command with parameters. A [report of mirrored events](examples/test_output/baf_test.ucs_0609.mirrored_ai_events.pairwise.txt) and 2 figures (1 for each test sample; see above) are generated and stored in the `examples/test_output` directory.
```
./test.baf.field.simple.sh
(1/7) Loading data for normal sample: germline.gt
(2/7) Loading data for test samples
loading test sample (1/4): tumor_1.baf
loading test sample (2/4): tumor_2.baf
loading test sample (3/4): tumor_3.baf
loading test sample (4/4): tumor_4.baf
(3/7) Loading in candidate genomic regions for AI mirror assessment events.txt
number of candidate event regions: 37
(4/7) Testing for mirroring in candidate event regions
number of sample pairs for assessing mirroring: 6
pairwise comparison (1/6): tumor_1.baf, tumor_2.baf
pairwise comparison (2/6): tumor_1.baf, tumor_3.baf
pairwise comparison (3/6): tumor_1.baf, tumor_4.baf
pairwise comparison (4/6): tumor_2.baf, tumor_3.baf
pairwise comparison (5/6): tumor_2.baf, tumor_4.baf
pairwise comparison (6/6): tumor_3.baf, tumor_4.baf
(5/7) Summarizing AI mirror events
number of putative mirrored regions = 7
number of putative pairwise mirrored events = 18
identifying reference sample for AI mirror region (1/7): 5:5397948-7802280
identifying reference sample for AI mirror region (2/7): 5:32146128-35945288
identifying reference sample for AI mirror region (3/7): 18:81092-13826521
identifying reference sample for AI mirror region (4/7): 18:14774149-46349590
identifying reference sample for AI mirror region (5/7): 18:46527222-77907003
identifying reference sample for AI mirror region (6/7): 21:14685576-48051100
identifying reference sample for AI mirror region (7/7): 6:474461-170756596
(6/7) Reporting mirror events
(7/7) Generating mirror plots
plotting mirrored AI for (1/4): tumor_1.baf
plotting mirrored AI for (2/4): tumor_2.baf
plotting mirrored AI for (3/4): tumor_3.baf
plotting mirrored AI for (4/4): tumor_4.baf
```
This is the command that was run by the example script:
```
Rscript ../recur.R/recur.R \
--normal_sample "test_data/germline.gt" \
--test_samples "test_data/tumor_1.baf" \
"test_data/tumor_2.baf" \
"test_data/tumor_3.baf" \
"test_data/tumor_4.baf" \
--regions "test_data/events.txt" \
--chromosomes "../resources/hg19.chromosome_lengths.txt" \
--out_dir "test_output" \
--out_prefix "sub_clone_baf_test_simple"
```
### Input files description
At a minimum, input for an analysis requires 3 types of files. A normal sample file that provides germline heterozygous sites, B allele frequency files from 2 or more samples, and a regions file that specifies genomic regions over which mirrored AI will be tested for.
#### Normal sample file
The normal sample file is used to identify germline het sites for the subject being analyzed. Although not optimal, a tumor sample can be used in lieu of a normal sample. A GT column is required, and the application keeps track of the coordinates of het sites. Valid values for het sites include: A|B, B|A, A/B, B/A, 0|1, 1|0, 0/1, 1/0. These values were selected to support standard genotype representations commonly found in genotype or variant call format files. Note: To make het-identification more generic, we are working on a split-and-compare approach to detect allele differences.
```
chr pos GT
1 876557 B|A
1 881755 B|A
1 887651 A|B
1 916610 A|B
...
```
#### B allele frequency files
B allele frequencies for each test sample should be specified in a separate
tab-delimited file. These tab-delimited files require 3 columns: chr, pos and BAF.
Note: in the future, we won't require the GT column as this could be specified from a "normal" (or "germline") sample.
```
chr pos BAF
1 876557 0.4272
1 881755 0.3508
1 887651 0.6367
1 916610 0.6208
...
```
#### Genomic regions file
A genomic region file is also required to specify regions of the genome for which mirrored AI will be tested. This file simply requires 3 columns: chr, begin and end.
Note: in the future if this is not provided, we will test for mirrored AI at chromosome arms.
```
chr begin end
1 2381568 32109849
1 32501215 41969327
1 44440146 211898313
...
```
### Output files description
#### Mirrored events report
#### Genomic plots
## Available parameters
Pass the `--help` or `-h` parameter to the script to see the available parameters.
```
Rscript recur.R -h
usage: recur.R [-h] -n NORMAL_SAMPLE -t TEST_SAMPLES [TEST_SAMPLES ...]
-r REGIONS [--out_dir OUT_DIR] [--out_prefix OUT_PREFIX]
[--ai_event_binom_pval_thresh AI_EVENT_BINOM_PVAL_THRESH]
[--mirror_ai_corr_pval_thresh MIRROR_AI_CORR_PVAL_THRESH]
[--column_baf COLUMN_BAF] [--column_lrr COLUMN_LRR]
[--column_chr COLUMN_CHR] [--column_pos COLUMN_POS]
[--column_begin COLUMN_BEGIN] [--column_end COLUMN_END]
[--column_switch COLUMN_SWITCH] [--haploh]
optional arguments:
-h, --help show this help message and exit
-n NORMAL_SAMPLE, --normal_sample NORMAL_SAMPLE
File of genotype data for normal sample. (default:
None)
-t TEST_SAMPLES [TEST_SAMPLES ...], --test_samples TEST_SAMPLES [TEST_SAMPLES ...]
Files of genotype data for samples where AI mirroring
will be assessed. (default: None)
-r REGIONS, --regions REGIONS
Genomic regions for which AI mirroring will be
assessed. (default: None)
--out_dir OUT_DIR Output directory. (default: .)
--out_prefix OUT_PREFIX
Output prefix. (default: )
--ai_event_binom_pval_thresh AI_EVENT_BINOM_PVAL_THRESH
Event detection binomial p-value threshold. (default:
0.0001)
--mirror_ai_corr_pval_thresh MIRROR_AI_CORR_PVAL_THRESH
Frequency-based haplotype correlation p-value at which
to call a mirroring event. (default: 0.0001)
--column_baf COLUMN_BAF
Column name for B-allele frequencies. (default: BAF)
--column_lrr COLUMN_LRR
Column name for log R ratios (field is optional).
(default: LRR)
--column_chr COLUMN_CHR
Column name from chromosomes. (default: chr)
--column_pos COLUMN_POS
Column name for the chromosomal position of a marker.
(default: pos)
--column_begin COLUMN_BEGIN
Column name for the beginning of an event. (default:
begin)
--column_end COLUMN_END
Column name for the end of an event. (default: end)
--column_switch COLUMN_SWITCH
Column name for the haploh phase consistency indicator
(field is optional). (default: SWITCH)
--haploh Flag to indicate that AI event detection (in addition
to mirror AI assessment) should be performed. In order
to perform this test, the data must contain switches.
(default: False)
```
## Notes
1. The current software is simply an R script. We could make this an R library as well.
1. The current test data available with the software is actual de-identified data. What should be the official test data that we eventually release with the software?
1. The current code needs to be cleaned up and documented much better. Some parameters still need to be exposed at the command line.
1. Right now, the normal sample is not used. Hets are identified within each sample. In the future, if a germline sample is provided, we will identify hets from that sample and then use those het sites in subsequent analyses.
1. Genotypes have to currently be represented as A and B alleles. With 0 and 1 support, we'd be able to support VCFs as well. Supporting A/C/G/T is a goal as well, but that can get complicated with more then 2 possible alleles.
1. The current mirrored event report is a "pairwise" report. Each entry describes a mirroring event between 2 samples. We will work on making this report more intuitive.
1. Output files are put into the specified output directory with an output prefix. We should output these filenames to the console so it is very easy for people to find the reports and figures when they first try the software.
1. Chromosomal lengths are currently hard-coded into the script (for hg19). We will extract this into a chromosome file and include that file in a resource directory. We can have an additional file for other genome builds and allow the specification of this file from the command line in case the user wants to use a different build.
# RECUR: R application for the bidirectional analysis and visualization of allelic imbalance profiles from multi-sample genomic data
## Authors
Yasminka A. Jakubek, F. Anthony San Lucas, Paul Scheet
## License
MIT license
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
## Description
REpeat Chromosomal changes Uncovered by Reflection (RECUR) is an R application
for the comparative analysis of AI profiles derived from SNP and next-generation
sequencing data. The algorithm accepts genotype calls and “B” or reference allele
frequencies from at least 2 samples derived from the same individual. For a predefined
set of genomic regions with AI, RECUR compares BAF shifts among samples. Each shift
has two possible directions —an increase of the maternal, relative to the paternal,
haplotype or vice versa — and each pair of samples can exhibit the same, or opposite,
shift. The latter phenomenon of bidirectional haplotype shifts has been referred to as
“mirrored subclonal allelic imbalance” and linked to clinico-pathological features of cancer.
The algorithm detects genomic segments where AI shifts have opposite directions between
samples. It also plots BAF values for all samples and uses a two-color scheme for intuitive
visualization of these bidirectional shifts.
## Availability and Implementation:
RECUR is available as an R application. Source code and documentation are available at scheet.org.
## Installation notes
Download the application from gitlab. The recur.R script is written in R and is run through
the Rscript recur.R command. The script currently has the following package dependencies.
install.packages("argparse")
install.packages("data.table")
install.packages("plyr")
## Quick start
A simple example
Some examples are packaged together with the software. In the following example, 2 regions of a tumor were genotyped, and recur.R is used to identify regions of the genome between the two samples that exhibit mirrored AI.
Go into the examples directory and run the test.baf.ucs_0609.sh script, which specifies the Rscript recur.R command with parameters. A report of mirrored events and 2 figures (1 for each test sample; see above) are generated and stored in the examples/test_output directory.
./test.baf.field.simple.sh
(1/7) Loading data for normal sample: germline.gt
(2/7) Loading data for test samples
loading test sample (1/4): tumor_1.baf
loading test sample (2/4): tumor_2.baf
loading test sample (3/4): tumor_3.baf
loading test sample (4/4): tumor_4.baf
(3/7) Loading in candidate genomic regions for AI mirror assessment events.txt
number of candidate event regions: 37
(4/7) Testing for mirroring in candidate event regions
number of sample pairs for assessing mirroring: 6
pairwise comparison (1/6): tumor_1.baf, tumor_2.baf
pairwise comparison (2/6): tumor_1.baf, tumor_3.baf
pairwise comparison (3/6): tumor_1.baf, tumor_4.baf
pairwise comparison (4/6): tumor_2.baf, tumor_3.baf
pairwise comparison (5/6): tumor_2.baf, tumor_4.baf
pairwise comparison (6/6): tumor_3.baf, tumor_4.baf
(5/7) Summarizing AI mirror events
number of putative mirrored regions = 7
number of putative pairwise mirrored events = 18
identifying reference sample for AI mirror region (1/7): 5:5397948-7802280
identifying reference sample for AI mirror region (2/7): 5:32146128-35945288
identifying reference sample for AI mirror region (3/7): 18:81092-13826521
identifying reference sample for AI mirror region (4/7): 18:14774149-46349590
identifying reference sample for AI mirror region (5/7): 18:46527222-77907003
identifying reference sample for AI mirror region (6/7): 21:14685576-48051100
identifying reference sample for AI mirror region (7/7): 6:474461-170756596
(6/7) Reporting mirror events
(7/7) Generating mirror plots
plotting mirrored AI for (1/4): tumor_1.baf
plotting mirrored AI for (2/4): tumor_2.baf
plotting mirrored AI for (3/4): tumor_3.baf
plotting mirrored AI for (4/4): tumor_4.baf
This is the command that was run by the example script:
Rscript ../recur.R/recur.R \
--normal_sample "test_data/germline.gt" \
--test_samples "test_data/tumor_1.baf" \
"test_data/tumor_2.baf" \
"test_data/tumor_3.baf" \
"test_data/tumor_4.baf" \
--regions "test_data/events.txt" \
--chromosomes "../resources/hg19.chromosome_lengths.txt" \
--out_dir "test_output" \
--out_prefix "sub_clone_baf_test_simple"
## Input files description
At a minimum, input for an analysis requires 3 types of files. A normal sample file that provides germline heterozygous sites, B allele frequency files from 2 or more samples, and a regions file that specifies genomic regions over which mirrored AI will be tested for.
### Normal sample file
The normal sample file is used to identify germline het sites for the subject being analyzed. Although not optimal, a tumor sample can be used in lieu of a normal sample. A GT column is required, and the application keeps track of the coordinates of het sites. Valid values for het sites include: A|B, B|A, A/B, B/A, 0|1, 1|0, 0/1, 1/0. These values were selected to support standard genotype representations commonly found in genotype or variant call format files. Note: To make het-identification more generic, we are working on a split-and-compare approach to detect allele differences.
chr pos GT
1 876557 B|A
1 881755 B|A
1 887651 A|B
1 916610 A|B
...
### B allele frequency files
B allele frequencies for each test sample should be specified in a separate
tab-delimited file. These tab-delimited files require 3 columns: chr, pos and BAF.
Note: in the future, we won't require the GT column as this could be specified from a "normal" (or "germline") sample.
chr pos BAF
1 876557 0.4272
1 881755 0.3508
1 887651 0.6367
1 916610 0.6208
...
### Genomic regions file
A genomic region file is also required to specify regions of the genome for which mirrored AI will be tested. This file simply requires 3 columns: chr, begin and end.
Note: in the future if this is not provided, we will test for mirrored AI at chromosome arms.
chr begin end
1 2381568 32109849
1 32501215 41969327
1 44440146 211898313
...
## Available parameters
Pass the --help or -h parameter to the script to see the available parameters.
Rscript recur.R -h
usage: recur.R [-h] -n NORMAL_SAMPLE -t TEST_SAMPLES [TEST_SAMPLES ...]
-r REGIONS [--out_dir OUT_DIR] [--out_prefix OUT_PREFIX]
[--ai_event_binom_pval_thresh AI_EVENT_BINOM_PVAL_THRESH]
[--mirror_ai_corr_pval_thresh MIRROR_AI_CORR_PVAL_THRESH]
[--column_baf COLUMN_BAF] [--column_lrr COLUMN_LRR]
[--column_chr COLUMN_CHR] [--column_pos COLUMN_POS]
[--column_begin COLUMN_BEGIN] [--column_end COLUMN_END]
[--column_switch COLUMN_SWITCH] [--haploh]
optional arguments:
-h, --help show this help message and exit
-n NORMAL_SAMPLE, --normal_sample NORMAL_SAMPLE
File of genotype data for normal sample. (default:
None)
-t TEST_SAMPLES [TEST_SAMPLES ...], --test_samples TEST_SAMPLES [TEST_SAMPLES ...]
Files of genotype data for samples where AI mirroring
will be assessed. (default: None)
-r REGIONS, --regions REGIONS
Genomic regions for which AI mirroring will be
assessed. (default: None)
--out_dir OUT_DIR Output directory. (default: .)
--out_prefix OUT_PREFIX
Output prefix. (default: )
--ai_event_binom_pval_thresh AI_EVENT_BINOM_PVAL_THRESH
Event detection binomial p-value threshold. (default:
0.0001)
--mirror_ai_corr_pval_thresh MIRROR_AI_CORR_PVAL_THRESH
Frequency-based haplotype correlation p-value at which
to call a mirroring event. (default: 0.0001)
--column_baf COLUMN_BAF
Column name for B-allele frequencies. (default: BAF)
--column_lrr COLUMN_LRR
Column name for log R ratios (field is optional).
(default: LRR)
--column_chr COLUMN_CHR
Column name from chromosomes. (default: chr)
--column_pos COLUMN_POS
Column name for the chromosomal position of a marker.
(default: pos)
--column_begin COLUMN_BEGIN
Column name for the beginning of an event. (default:
begin)
--column_end COLUMN_END
Column name for the end of an event. (default: end)
--column_switch COLUMN_SWITCH
Column name for the haploh phase consistency indicator
(field is optional). (default: SWITCH)
--haploh Flag to indicate that AI event detection (in addition
to mirror AI assessment) should be performed. In order
to perform this test, the data must contain switches.
(default: False)
install.packages("argparse")
install.packages("data.table")
install.packages("plyr")
quit()
#!/bin/bash
/usr/local/bin/Rscript ../recur/R/recur.R \
--normal_sample "test_data/germline.gsloh" \
--test_samples "test_data/tumor_1.gsloh" \
"test_data/tumor_2.gsloh" \
"test_data/tumor_3.gsloh" \
"test_data/tumor_4.gsloh" \
--regions "test_data/events.txt" \
--chromosomes "../resources/hg19.chromosome_lengths.txt" \
--out_dir "test_output" \
--out_prefix "sub_clone_baf_test"
#!/bin/bash
/usr/local/bin/Rscript ../recur/R/recur.R \
--normal_sample "test_data/germline.gt" \
--test_samples "test_data/tumor_1.baf" \
"test_data/tumor_2.baf" \
"test_data/tumor_3.baf" \
"test_data/tumor_4.baf" \
--regions "test_data/events.txt" \
--chromosomes "../resources/hg19.chromosome_lengths.txt" \
--out_dir "test_output" \
--out_prefix "sub_clone_baf_test_simple"
#!/bin/bash
/usr/local/bin/Rscript ../recur/R/recur.R \
--normal_sample "test_data/germline.gsloh" \
--test_samples "test_data/tumor_1.gsloh" \
"test_data/tumor_2.gsloh" \
"test_data/tumor_3.gsloh" \
"test_data/tumor_4.gsloh" \
--regions "test_data/events.txt" \
--window_num_hets 500 \
--color_by_reference_sample \
--out_dir "test_output" \
--out_prefix "sub_clone_baf_window_test"
#!/bin/bash
/usr/local/bin/Rscript ../recur/R/recur.R \
--normal_sample "test_data/germline.gsloh" \
--test_samples "test_data/tumor_1.gsloh" \
"test_data/tumor_2.gsloh" \
"test_data/tumor_3.gsloh" \
"test_data/tumor_4.gsloh" \
--regions "test_data/events.txt" \
--out_dir "test_output" \
--out_prefix "sub_clone_gsloh_test" \
--plot_lrr \
--haploh
chr begin end
2 81034 243035174
3 261471 79611984
3 115347476 197800769
4 346147 7582664
4 7689353 190916545
5 5397948 7802280
5 32146128 35945288
5 49464580 180579214
6 214320 421376
6 474461 170756596
7 127304 10885998
7 33053984 37592229
7 46885362 102474847
7 102506806 114644190
7 114783793 159119605
8 189068 65517204
8 65531882 146148201
9 150073 39315802
9 118287564 120691512
9 137454693 140696535
11 210605 71117393
11 82435531 131897764
11 132151515 134765165
12 76697848 133291834
13 19263050 115075559
15 22543788 102399343
16 333900 90142003
17 39210 81051338
18 81092 13826521
18 14774149 46349590
18 46527222 77907003
19 266151 59084876
20 63621 216898
20 347894 24581927
20 24760875 62901043
21 14685576 48051100
22 17978898 51182015
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment