Still under active development; do not use for production.
- Added the mutect caller following GATK best practices; added the freebayes caller; will add the vardict caller.
- `sample2json.py` can deal with single-end and paired-end fastqs and matched tumor-normal pairs. Support for tumor samples without a matching normal still needs to be added.
- Add copy number/tumor purity analysis into the pipeline by sequenza.
workflow of the pipeline
Dependencies
All the dependencies need to be put on your path.
echo $PATH
- samtools v1.4.1
- fastqc v0.10.1
- bwa v0.7.17 (since v0.7.16, bwa fixes the malformed SAM header produced when `-R` is used with a TAB; see bwa issue #84)
- gatk v>=3.6
- vcflib
- bcftools
- lancet
R packages
- TEQC Bioconductor package for quality control of whole exome/targeted sequencing.
- Sequenza R package for WES copy number analysis and tumor purity estimation. A new `sequenza-utils` python package, used to format the samtools mpileup output into the seqz input for Sequenza, is available; read https://groups.google.com/forum/#!topic/sequenza-user-group/YrT2Q61PYSk. The updated python package can handle some errors when parsing the mpileup files. Install it by `pip install sequenza-utils`.
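For reference, a hedged sketch of generating the seqz input with sequenza-utils (subcommand and flag names are from memory of `sequenza-utils --help`, and all file names are placeholders; verify against your installed version):
# GC content wiggle track from the reference (done once per genome)
sequenza-utils gc_wiggle -w 50 --fasta human_g1k_v37_decoy.fasta -o human_g1k_v37_decoy.gc50.wig.gz
# tumor/normal bams to a seqz file for the Sequenza R package
sequenza-utils bam2seqz --fasta human_g1k_v37_decoy.fasta -gc human_g1k_v37_decoy.gc50.wig.gz \
    -n patient1_leukocyte_rln_recal.bam -t patient1_pre-treatment_rln_recal.bam \
    -o patient1_pre-treatment.seqz.gz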
How to distribute workflows
read the snakemake documentation
ssh shark.mdanderson.org
# start a screen session
screen
# make a folder, name it yourself, I named it workdir
mkdir /rsch2/genomic_med/krai/workdir/
cd /rsch2/genomic_med/krai/workdir/
git clone https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline
cd snakemake_DNAseq_pipeline
## go to mutect branch
git checkout mutect
## edit the config.yaml file as needed, e.g. set human or mouse for the reference genome and which sample_type is used as the control
nano config.yaml
## skip this if on Shark; Samir has py351 set up for you. see STEPS below
conda create -n snakemake python=3 snakemake
source activate snakemake
STEPS
on Shark
There is a python3 environment set up; just do
source activate py351
create the samples.json file by feeding in a fastq folder
This folder should contain all the samples. The `sample2json.py` script can be used to generate the sample json file. It can deal with single-end and paired-end fastqs or aligned bam files.
To do: it also matters whether the tumor sample has a matched normal control or not.
An example json file:
## for paired-end fastqs
{"sample_name":
    {"normal":
        {"R1": ["normal1_R1.fastq", "normal2_R1.fastq"],
         "R2": ["normal1_R2.fastq", "normal2_R2.fastq"]},
     "tumor":
        {"R1": ["tumor1_R1.fastq", "tumor_2_R1.fastq"],
         "R2": ["tumor1_R2.fastq", "tumor_2_R2.fastq"]}}}

## for single-end fastqs
{"sample_name":
    {"normal":
        {"R1": ["normal1_R1.fastq", "normal2_R1.fastq"]},
     "tumor":
        {"R1": ["tumor1_R1.fastq", "tumor_2_R1.fastq"]}}}

## for bam files
{"sample_name":
    {"normal": ["normal1.bam"],
     "tumor": ["tumor1.bam"]}}
Please use the full path for the folder that contains your fastq (or bam) files, together with a tab-delimited txt file describing the sample information (see the example below).
`sample_name` is the name for the object (patient). `file_name` is the name you submitted for sequencing; this name should be contained in the fastq or bam file name. `sample_type` is the type of sample sequenced for the same object (patient). `sample_name`_`sample_type` will be used to construct the final output file names, e.g. patient1_leukocyte.sorted.bam and patient1_leukocyte_rln_recal.bam, so `sample_name`_`sample_type` should be unique.
In the config.yaml file, one needs to specify which sample_type is used as the control for calling mutations. In this example, leukocyte is used as the control, so the vcf files patient1_pre-treatment_vs_patient1_leukocyte_mutect.vcf and patient1_recurrence_vs_patient1_leukocyte_mutect.vcf will be generated.
Avoid using underscore `_` in the sample_type column (`-` is allowed); sample2json.py will check that.
cat sample_info.txt
sample_name file_name sample_type
patient1 B17042956985-KY290-2 leukocyte
patient1 P17042956986-KY290-2 ctDNA
patient1 F17042956987-KY290-2 pre-treatment
patient1 F17042956988-KY290-2 recurrence
patient2 B17050457295-KY290-2 leukocyte
patient2 P17050457296-KY290-2 ctDNA
patient2 F17042956989-KY290-2 pre-treatment
patient2 F17042956990-KY290-2 recurrence
patient3 B17051358726-KY290-2 leukocyte
patient3 P17051358727-KY290-2 ctDNA
patient3 F17051358728-KY290-2 pre-treatment
patient3 F17051358729-KY290-2 metastasis
patient3 F17051859286-KY290-2 pre-treatment2
patient4 B17031650918-KY290 leukocyte
patient4 P17031650914-CLN-KY290 ctDNA
patient4 F17031650907-KY290 pre-treatment
patient4 F17031650908-KY290 recurrence
patient4 F17031650909-KY290 metastasis
patient5 B17031650917-KY290 leukocyte
patient5 P17031650913-CLN-KY290 ctDNA
patient5 F17031650905-KY290 pre-treatment
patient5 F17031650906-KY290 recurrence
python3 sample2json.py --fastq_dir /rsch2/genomic_med/krai/rawfastqs/ --meta sample_info.txt
Some information will be printed out and a samples.json file will be generated.
check the file information in the json file:
less -S samples.json
dry run to test
## dry run
snakemake -np
If there are no errors, proceed as below.
DRMAA
Using DRMAA is only supported on Shark.
module load drmaa
./pyflow-drmaa-DNAseq.sh
When using drmaa, you can control + c to stop the current run.
Dependent jobs are submitted one by one; if some jobs fail, the pipeline will stop. Good for initial testing.
submit all jobs to the cluster
./pyflow-DNAseq.sh
All jobs will be submitted to the cluster queue at once. This is useful if you know most of your jobs will succeed, and having them queued helps gain priority.
job control
To kill all of your pending jobs you can use the command:
bkill `bjobs -u krai |grep PEND |cut -f1 -d" "`
bjobs -pl
Display detailed information of all pending jobs of the invoker.
bjobs -ps
Display only pending and suspended jobs.
bjobs -u all -a
Display all jobs of all users.
bjobs -d -q short -m apple -u mtang1
Display all the recently finished jobs submitted by user mtang1 to the queue short, and executed on the host apple.
rerun some of the jobs
# specify the name of a rule; all files associated with that rule will be rerun. e.g. rerun the freebayes rule:
./pyflow-DNAseq.sh -R freebayes
## rerun one sample, just specify the name of the target file
./pyflow-DNAseq.sh 08vcfanno/SKCM-M409-P011_tumor.vep.anno.vcf
## rerun only the vcfanno rule
./pyflow-DNAseq.sh -f vcfanno
## check snakemake -h
## -R, -f, -F and --until are useful
checking results after the run finishes
snakemake --summary | sort -k1,1 | less -S
# the detailed summary will give you the commands used to generate the output and which input files were used
snakemake --detailed-summary | sort -k1,1 > snakemake_run_summary.txt
clean the folders
I use echo to see what will be removed first; then you can do the actual removal (see below).
find . -maxdepth 1 -type d -name "[0-9]*" | xargs echo rm -rf
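Once the echo output looks right, drop the echo to actually remove the numbered output folders:
find . -maxdepth 1 -type d -name "[0-9]*" | xargs rm -rf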
Snakemake does not trigger re-runs if I add additional input files. What can I do?
Snakemake has a kind of "lazy" policy about added input files if their modification date is older than that of the output files. One reason is that the information about what to do cannot be inferred just from the input and output files. You need additional information about the last run to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument --list-input-changes in the following way:
snakemake -n -R `snakemake --list-input-changes`
How do I trigger re-runs for rules with updated code or parameters?
snakemake -n -R `snakemake --list-params-changes`
and
$ snakemake -n -R `snakemake --list-code-changes`
Interval files for chromosome-wise processing of GATK steps
Whole exome and target panel sequencing
For the preprocessing steps of GATK, all steps are parallelized by chromosome by specifying `--intervals`.
If your genome reference file does not have the `chr` prefix, the interval files need to have `chr` removed as well. In the human_g1k_v37_decoy.fasta file, chromosome M is denoted as MT.
from illumina nextera protocol
# download the bed file into the files folder in the repo
wget https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexterarapidcapture/nexterarapidcapture_exome_targetedregions_v1.2.bed
# create bed files by chromosome
cat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | sed 's/^chr//' | sed 's/^M/MT/' | awk '{close(f);f=$1}{print > f".bed"}'
# 1.bed, 2.bed, 3.bed ...X.bed, Y.bed, MT.bed will be created.
from IDT xGen
https://www.idtdna.com/pages/products/nextgen/target-capture/xgen-lockdown-panels/xgen-exome-panel
wget https://www.idtdna.com/pages/docs/default-source/xgen-libraries/xGen-Lockdown-Panels/xgen-exome-research-panel-targets.bed?sfvrsn=6
mv xgen-exome-research-panel-targets.bed\?sfvrsn\=6 xgen-exome-research-panel-targets.bed
## chrM is not among the captured regions; change CHR in the config.yaml file accordingly.
cat xgen-exome-research-panel-targets.bed | cut -f1-3 | sort -k1,1 -k2,2n | sed 's/^chr//' | awk '{close(f);f=$1}{print > f".bed"}'
If you are using reference files from UCSC (keep chr):
cat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | awk '{close(f);f=$1}{print > f".bed"}'
should do the work.
PLEASE place the resulting *.bed files in the `files` folder.
For freebayes:
cat xgen-exome-research-panel-targets.bed |cut -f1-3 | sort -k1,1 -k2,2n | sed 's/^chr//' | awk '{print $1":"$2"-"$3}' > xgen-exome-research-panel-targets_no_chr_freebayes.txt
For lancet:
lancet does local assembly, so a larger window size is needed to capture all the reads; see the issue here https://github.com/nygenome/lancet/issues/12
Create bed files extended by 300 bp upstream and downstream:
cat xgen-exome-research-panel-targets.bed | cut -f1-3 | sort -k1,1 -k2,2n | awk '{print$1"\t"$2-300"\t"$3+300}' | sed 's/^chr//' | awk '{close(f);f=$1}{print > f"_extend.bed"}'
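For reference, a hedged sketch of a per-chromosome lancet run on the extended bed files (flag names are from memory of the lancet README and the bam names are placeholders; check `lancet --help` for your version):
lancet --tumor patient1_pre-treatment_rln_recal.bam --normal patient1_leukocyte_rln_recal.bam \
    --ref human_g1k_v37_decoy.fasta --bed 1_extend.bed --num-threads 8 > patient1_1_lancet.vcf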
whole genome sequencing
You do not need to provide `--intervals` to GATK for WGS. However, some regions of the genome are hard to sequence and the data there are not reliable; including those regions can slow down processing and produce messy data.
see a post on the GATK forum.
A file ceph18.b37.include.2014-01-15.bed that I borrowed from speedseq is also in the `files` folder.
wget https://raw.githubusercontent.com/hall-lab/speedseq/master/annotations/ceph18.b37.include.2014-01-15.bed
cat ceph18.b37.include.2014-01-15.bed | sort -k1,1 -k2,2n | awk '{close(f);f=$1}{print > f".bed"}'
For freebayes, it accepts `--target` regions in the format 1:10468-11803:
cat ceph18.b37.include.2014-01-15.bed | cut -f1-3 | sort -k1,1 -k2,2n | awk '{print $1":"$2"-"$3}' > ceph18.b37.include.2014-01-15_for_freebayes.txt
speed up GATK
In addition to the scatter-gather pattern by chromosome mentioned above, some GATK tools can run in multiple threads.
* `-nt / --num_threads` controls the number of data threads sent to the processor
* `-nct / --num_cpu_threads_per_data_thread` controls the number of CPU threads allocated to each data thread
NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather:
Tool | Full name | Type of traversal | NT | NCT | SG |
---|---|---|---|---|---|
RTC | RealignerTargetCreator | RodWalker | + | - | - |
IR | IndelRealigner | ReadWalker | - | - | + |
BR | BaseRecalibrator | LocusWalker | - | + | + |
PR | PrintReads | ReadWalker | - | + | - |
RR | ReduceReads | ReadWalker | - | - | + |
HC | HaplotypeCaller | ActiveRegionWalker | - | (+) | + |
UG | UnifiedGenotyper | LocusWalker | + | + | + |
Recommended settings:
Tool | RTC | IR | BR | PR | RR | HC | UG |
---|---|---|---|---|---|---|---|
Available modes | NT | SG | NCT,SG | NCT | SG | NCT,SG | NT,NCT,SG |
Cluster nodes | 1 | 4 | 4 | 1 | 4 | 4 | 4 / 4 / 4 |
CPU threads (-nct) | 1 | 1 | 8 | 4-8 | 1 | 4 | 3 / 6 / 24 |
Data threads (-nt) | 24 | 1 | 1 | 1 | 1 | 1 | 8 / 4 / 1 |
Memory (Gb) | 48 | 4 | 4 | 4 | 4 | 16 | 32 / 16 / 4 |
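As an illustration of the recommended settings above, a per-chromosome BaseRecalibrator run with `-nct` might look like this sketch (GATK 3 syntax; the reference, bam, known-sites vcf and interval file names are placeholders):
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R human_g1k_v37_decoy.fasta \
    -I patient1_leukocyte.sorted.bam \
    -knownSites dbsnp_138.b37.vcf \
    -L 1.bed \
    -nct 8 \
    -o patient1_leukocyte_1_recal.table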
Parallel processing by chromosome is implemented in this pipeline.
Further readings
VQSR: for exomes, you need at least 30 samples to run VQSR.
For BQSR: it is best to have greater than 100 million bases per read group
https://gatkforums.broadinstitute.org/gatk/discussion/comment/14269#Comment_14269
"Small datasets" from the -L FAQ refers to datasets with less than 100M bases.
If you have 150M bases (sequenced reads) for a small targeted experiment, you should be fine to run BQSR. 1B is even better. Pretty much, anything above 100M bases is fine, but the bigger the dataset, the more accurate BQSR will be.
freebayes parameters
borrowed from bcbio http://bcb.io/2015/03/05/cancerval/ https://github.com/chapmanb/bcbio-nextgen/blob/8c3206f131f9b97715d33141e558e1e38a07ac12/bcbio/variation/freebayes.py#L154
freebayes -f hg19.fa --genotype-qualities --strict-vcf --ploidy 2 --targets regions.bed --min-repeat-entropy 1 --no-partial-observations --min-alternate-fraction 0.05 --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off tumor.bam normal.bam | bcftools filter -i 'ALT="<*>" || QUAL > 5'
GATK best practice
This is the most up-to-date workflow for mutect2, which can skip the indel-realignment step. For mutect1, indel realignment is still required. mutect2 can detect both SNVs and indels.
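A hedged sketch of a per-interval tumor-normal MuTect2 call (GATK >= 3.6 syntax; the recalibrated bams, dbSNP/COSMIC vcfs and interval file are placeholders following the naming convention described earlier):
java -Xmx8g -jar GenomeAnalysisTK.jar \
    -T MuTect2 \
    -R human_g1k_v37_decoy.fasta \
    -I:tumor patient1_pre-treatment_rln_recal.bam \
    -I:normal patient1_leukocyte_rln_recal.bam \
    --dbsnp dbsnp_138.b37.vcf \
    --cosmic CosmicCodingMuts.vcf \
    -L 1.bed \
    -o patient1_pre-treatment_vs_patient1_leukocyte_1_mutect.vcf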
Note on read group
read this post in the GATK forum.
example:
@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI
Meaning of the read group fields required by GATK
* ID = Read group identifier
This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model.
* PU = Platform Unit
The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but it takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.
* SM = Sample
The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.
* PL = Platform/technology used to produce the read
This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
* LB = DNA preparation library identifier
MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.
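For illustration, a minimal Picard MarkDuplicates sketch (file names are placeholders following the naming convention above); this is the step that relies on the LB field:
java -jar picard.jar MarkDuplicates \
    INPUT=patient1_leukocyte.sorted.bam \
    OUTPUT=patient1_leukocyte.markdup.bam \
    METRICS_FILE=patient1_leukocyte.markdup_metrics.txt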
If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.
Use Picard's AddOrReplaceReadGroups to appropriately label read group (@rg) fields, coordinate sort and index a BAM file. Only the five required @rg fields are included in the command shown. Consider the other optional @rg fields for better record keeping.
# be sure to change RGID from the default of 1
# SORT_ORDER and CREATE_INDEX provide the coordinate sorting and index mentioned above
java -jar picard.jar AddOrReplaceReadGroups \
    INPUT=reads.bam \
    OUTPUT=reads_addRG.bam \
    RGID=H0164.2 \
    RGLB=library1 \
    RGPL=illumina \
    RGPU=H0164ALXX140820.2 \
    RGSM=sample1 \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true
Deriving ID and PU fields from read names
H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288
Breaking down the common portion of the query names:
H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
Multi-sample and multiplexed example
Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @rg fields in the header:
Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
Add read group when aligning with bwa
Read groups can be added through `-R`; see this post: https://www.biostars.org/p/78400/
Samples from different lanes need to get a different ID in @RG.
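A hedged example of adding the read group at alignment time with `bwa mem -R` (file names are placeholders; the @RG string follows the trio example above):
bwa mem -M -t 8 \
    -R '@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tPU:FLOWCELL1.LANE1\tLB:LIB-DAD-1\tSM:DAD' \
    human_g1k_v37_decoy.fasta DAD_lane1_R1.fastq DAD_lane1_R2.fastq > DAD_lane1.sam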
Install gemini and annotation data
Gemini is a tool from Aaron Quinlan's group
conda install -c bioconda gemini
cd /scratch/genomic_med/apps/gemini
gemini update --dataonly
The annotation data were downloaded to /scratch/genomic_med/apps/gemini/gemini/data
to annotate the vcf files, use vcfanno:
configuration file: https://github.com/brentp/vcfanno/blob/master/example/gem.conf
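A hedged sketch of running vcfanno with such a configuration file (thread count and file names are placeholders; depending on the conf, a `-lua` script with custom ops may also be needed):
vcfanno -p 4 gem.conf patient1_pre-treatment_vs_patient1_leukocyte_mutect.vcf > patient1_pre-treatment_vs_patient1_leukocyte_mutect.anno.vcf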
dbSNP129 hg19 version
dbSNP129 is probably the latest version of dbSNP that is not contaminated by the recent large-scale cancer sequencing projects. It was only released for hg18, but ANNOVAR has a lifted-over hg19 version.
Download that by:
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar snp129 humandb/
## change to bed format for vcfanno
less -S hg19_snp129.txt | cut -f2-7 | sed 's/^chr//' > hg19_snp129.bed
## copy it to vcfanno_db: /scratch/genomic_med/apps/gemini/gemini/data/
cp hg19_snp129.bed /scratch/genomic_med/apps/gemini/gemini/data/
bgzip hg19_snp129.bed
tabix -p bed hg19_snp129.bed.gz
Add it to the gem.conf file:
## dbsnp129 hg19 version
[[annotation]]
file="hg19_snp129.bed"
columns=[4]
names=["dbsnp129_rs_ids"]
ops=["concat"]
filter freebayes calls
from bcbio http://bcb.io/2015/03/05/cancerval/
bcftools filter --soft-filter 'FBQualDepth' -e '(AF[0] <= 0.5 && (DP < 4 || (DP < 13 && %QUAL < 10))) || (AF[0] > 0.5 && (DP < 4 && %QUAL < 50))' -m '+'
Somatic Score (SSC) filtering from speedseq.
I checked the source code:
function somatic_filter() {
awk -v MINQUAL="$1" -v SSC_THRES="$2" -v ONLY_SOMATIC="$3" 'BEGIN {NORMAL=10; TUMOR=11; GL_IDX=0;}
{
if ($0~"^#") { print ; next; }
if (! GL_IDX) {
split($9,fmt,":")
for (i=1;i<=length(fmt);++i) { if (fmt[i]=="GL") GL_IDX=i }
}
split($NORMAL,N,":");
split(N[GL_IDX],NGL,",");
split($TUMOR,T,":");
split(T[GL_IDX],TGL,",");
LOD_NORM=NGL[1]-NGL[2];
LOD_TUMOR_HET=TGL[2]-TGL[1];
LOD_TUMOR_HOM=TGL[3]-TGL[1];
if (LOD_TUMOR_HET > LOD_TUMOR_HOM) { LOD_TUMOR=LOD_TUMOR_HET }
else { LOD_TUMOR=LOD_TUMOR_HOM }
DQUAL=LOD_TUMOR+LOD_NORM;
if (DQUAL>=SSC_THRES && $NORMAL~"^0/0") {
$7="PASS"
$8="SSC="DQUAL";"$8
print
}
else if (!ONLY_SOMATIC && $6>=MINQUAL && $10~"^0/0" && ! match($11,"^0/0")) {
$8="SSC="DQUAL";"$8
print
}
}' OFS="\t"
}
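A hedged usage sketch, after defining the function above in your shell (the MINQUAL=1, SSC_THRES=18 and ONLY_SOMATIC=0 values are only illustrative, not necessarily the speedseq defaults):
somatic_filter 1 18 0 < freebayes_calls.vcf > freebayes_calls.ssc_filtered.vcf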
Install VEP
The latest version of vep is on github http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer
It was version 89 when this was written (bioinformatics tools evolve too fast!).
check this gist as well https://gist.github.com/ckandoth/f265ea7c59a880e28b1e533a6e935697
cd /scratch/genomic_med/apps
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
git status
# the Ensembl API will be installed
perl INSTALL.pl
export VEP_DATA="/scratch/genomic_med/apps/ensembl-vep-data"
export VEP_PATH="/scratch/genomic_med/apps/ensembl-vep"
rsync -avhP rsync://ftp.ensembl.org/ensembl/pub/release-89/variation/VEP/homo_sapiens_vep_89_GRCh37.tar.gz $VEP_DATA
tar -xvzf $VEP_DATA/homo_sapiens_vep_89_GRCh37.tar.gz -C $VEP_DATA
install the reference FASTAs for GRCh37:
A fasta file $VEP_DATA/homo_sapiens/89_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz will be downloaded.
perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:
perl convert_cache.pl --species homo_sapiens --version 89_GRCh37 --dir $VEP_DATA
Annotate
vep --species homo_sapiens --assembly GRCh37 --offline --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/89_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.vcf --polyphen b --af_1kg --af_esp --regulatory
If you get an error message like:
-------------------- EXCEPTION --------------------
MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta
Most likely, your perl version is too old; see an issue here.
options:
- You can update your system's perl to the suggested version (which was not suitable in my case)
- Or you can install a local copy of the correct perl version; see here
- As the error is caused by reading a gzipped file, you can simply unzip the reference (see the sketch below)
Make sure you have write access to the folder where the fasta file resides. I am placing the fasta in our department shared folder.
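For the third option above, a minimal sketch (the path matches the cache downloaded earlier); after unzipping, point `--fasta` at the uncompressed .fa instead of the .fa.gz:
gunzip $VEP_DATA/homo_sapiens/89_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz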
from http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html
The first time you run the script with this parameter an index will be built which can take a few minutes. This is required if fetching HGVS annotations (--hgvs) or checking reference sequences (--check_ref) in offline mode (--offline).
Prioritize the tumor-only variants
from bcbio: http://bcb.io/2015/03/05/cancerval/
bcbio then extracts these to prioritize variants with high or medium predicted impact, not present in 1000 Genomes or ExAC at more than 1% in any subpopulation, or identified as pathogenic in COSMIC or ClinVar.
How to determine LOH from Sequenza output
see a post from the sequenza google group https://groups.google.com/forum/#!topic/sequenza-user-group/k-TgMaYlE6Q
LOH can generally be determined when B is equal to 0. If you are interested in copy-neutral LOH, you should also look at the A value: e.g. A == 2 and B == 0 means a genotype AA (copy-neutral LOH in a diploid chromosome, whose normal genotype would be AB); A == 1 and B == 0 means a genotype A, indicating a copy number of 1 (not copy-neutral for a diploid chromosome) and also a loss of heterozygosity in that segment.
So in short, you need to evaluate the B parameter to get all the LOH regions, and the A parameter if you want to further characterise the LOH.
To simplify the model fitting, we identify the B-allele frequency, defining the B allele as the minor allele. So the minor allele can only have a frequency between 0 and 0.5, while the major allele has a range between 0.5 and 1. The seqz file is produced by comparing the normal and tumor mpileups position by position. We define het positions based on the normal sample genotype, and annotate the A and B allele frequencies of those positions, in the tumor sample only, in the seqz file.
So the A and B frequencies in the seqz file are not to be used to define zygosity (if the sample has high cellularity, e.g. cell lines, the A/B frequencies in LOH regions would be 1/0 respectively).
Regarding the depth.ratio, the numbers in the seqz file are merely the ratio between the depths of the two pileups. The GC normalization is applied during processing in R; otherwise the average ratio would depend on the two library sizes.
The normalization step in R is very simple: in practice we calculate the mean ratio value for each GC "window", and then use the results to normalize the depth ratio based on the respective GC content. This way the average mean is 1, solving library normalization and GC normalization.
You can get the logR by taking the log2 of the normalized depth ratio (the depth.ratio column of the segments.txt file).
alternative solutions of ploidy and purity:
e.g. a sample_alternative_solutions.txt file:
"cellularity" "ploidy" "SLPP"
1 3.8 0.795405007607475
1 1.9 6.67188954220472e-12
1 5.7 3.24694972360125e-14
The default solution is cellularity = 1, ploidy = 3.8. To use one of the alternative solutions:
sequenza.results(....cellularity =1, ploidy = 1.9)