snakemake_DNAseq_pipeline

Still under active development; do not use for production.

• Added the mutect caller following GATK best practice. Added the freebayes caller. Will add the vardict caller.
• sample2json.py can deal with single-end and paired-end fastqs and matches tumor-normal pairs. Support for tumor samples without a matching normal still needs to be added.
• Added copy number/tumor purity analysis to the pipeline via sequenza.

workflow of the pipeline

    Dependencies

All the dependencies need to be on your PATH.

    echo $PATH

    R packages

    How to distribute workflows

read the docs

    ssh shark.mdanderson.org
    
    # start a screen session
    screen
    
    # make a folder, name it yourself, I named it workdir
    mkdir /rsch2/genomic_med/krai/workdir/
    
    cd /rsch2/genomic_med/krai/workdir/
    
    git clone https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline
    
    cd snakemake_DNAseq_pipeline
    
    ## go to mutect branch
    git checkout mutect
    
## edit the config.yaml file as needed, e.g. set mouse or human for the reference genome, or the number of reads you want to downsample to
    nano config.yaml
    
    ## skip this if on Shark, samir has py351 set up for you. see below STEPS
    conda create -n snakemake python=3 snakemake
    source activate snakemake

    STEPS

    on Shark

There is a python3 environment set up. Just do:

    source activate py351

Create the samples.json file by feeding the script a fastq folder. This folder should contain all the samples.

    create a json file for samples.

The sample2json.py script can be used to generate a sample json file.

    It can deal with single-end and paired-end fastqs or aligned bam files.

To do: it also matters whether the tumor sample has a matching normal control or not.

An example json file:

    
    ## for paired end fastqs
    
    "sample_name":
    {"normal":
        {"R1":["normal1_R1.fastq", "normal2_R1.fastq"],
        "R2":["normal1_R2.fastq", "normal2_R2.fastq"]},
      "tumor":
        {"R1":["tumor1_R1.fastq", "tumor_2_R1.fastq"],
        "R2":["tumor1_R2.fastq", "tumor_2_R2.fastq"] }
    }
    
    ## for single end fastqs:
    "sample_name":
    {"normal":
        {"R1":["normal1_R1.fastq", "normal2_R1.fastq"]},
      "tumor":
        {"R1":["tumor1_R1.fastq", "tumor_2_R1.fastq"]}
    }
    
    # for bam files:
    
    "sample_name":
    {"normal": ["normal1.bam"],
      "tumor":  ["tumor1.bam"]}
    }

Please use the full path for the folder that contains your fastq files, and provide a tab-delimited txt file describing the sample information.

    e.g.

    For bam files:

sample_name is the name of the subject (patient). file_name is the name you submitted for sequencing; this name should be contained in the fastq or bam file name. sample_type is the type of sample sequenced for the same subject (patient).

    sample_name_sample_type will be used to construct the final output file name. e.g. patient1_leukocyte.sorted.bam, patient1_leukocyte_rln_recal.bam.

    so sample_name_sample_type should be unique.

In the config.yaml file, one needs to specify which sample_type is used as the control for calling mutations. In this example, leukocyte is used, so the vcf files patient1_pre-treatment_vs_patient1_leukocyte_mutect.vcf and patient1_recurrence_vs_patient1_leukocyte_mutect.vcf will be generated.

Avoid using underscore _ in the sample_type column (- is allowed). sample2json.py will check that.

    cat sample_info.txt
    sample_name file_name sample_type
    patient1  B17042956985-KY290-2  leukocyte
    patient1  P17042956986-KY290-2  ctDNA
    patient1  F17042956987-KY290-2  pre-treatment
    patient1  F17042956988-KY290-2  recurrence
    patient2  B17050457295-KY290-2  leukocyte
    patient2  P17050457296-KY290-2  ctDNA
    patient2  F17042956989-KY290-2  pre-treatment
    patient2  F17042956990-KY290-2  recurrence
    patient3  B17051358726-KY290-2  leukocyte
    patient3  P17051358727-KY290-2  ctDNA
    patient3  F17051358728-KY290-2  pre-treatment
    patient3  F17051358729-KY290-2  metastasis
    patient3  F17051859286-KY290-2  pre-treatment2
    patient4  B17031650918-KY290  leukocyte
    patient4  P17031650914-CLN-KY290  ctDNA
    patient4  F17031650907-KY290  pre-treatment
    patient4  F17031650908-KY290  recurrence
    patient4  F17031650909-KY290  metastasis
    patient5  B17031650917-KY290  leukocyte
    patient5  P17031650913-CLN-KY290  ctDNA
    patient5  F17031650905-KY290  pre-treatment
    patient5  F17031650906-KY290  recurrence
    

    python3 sample2json.py --fastq_dir /rsch2/genomic_med/krai/rawfastqs/ --meta sample_info.txt

    some information will be printed out and a samples.json file will be generated.

    check the file information in the json file:

    less -S samples.json

    dry run to test

    ## dry run
    snakemake -np

If there are no errors, proceed as below.

    Using DRMAA

    job control through drmaa

    DRMAA is only supported on Shark.

module load drmaa
./pyflow-drmaa-DNAseq.sh

When using drmaa, you can press Ctrl+C to stop the current run.

Dependent jobs are submitted one by one; if some jobs fail, the pipeline will stop. Good for initial testing.

    submit all jobs to the cluster

./pyflow-DNAseq.sh

All jobs will be submitted to the cluster queue. This is useful if you expect most of your jobs to succeed and want them queued to gain priority.

    job control

    To kill all of your pending jobs you can use the command:

    bkill `bjobs -u krai |grep PEND |cut -f1 -d" "`
    bjobs -pl
     Display detailed information of all pending jobs of the invoker.
    
    bjobs -ps
     Display only pending and suspended jobs.
    
    bjobs -u all -a
     Display all jobs of all users.
    
    bjobs -d -q short -m apple -u mtang1
 Display all the recently finished jobs submitted by user mtang1 to the queue short and executed on the host apple.

    rerun some of the jobs

    
# specify the name of a rule; all files associated with that rule will be rerun. e.g. rerun the freebayes calling rule:
    ./pyflow-DNAseq.sh -R freebayes
    
    ## rerun one sample, just specify the name of the target file
    
    ./pyflow-DNAseq.sh 08vcfanno/SKCM-M409-P011_tumor.vep.anno.vcf
    
## rerun only the vcfanno rule
    ./pyflow-DNAseq.sh -f vcfanno
    
    ## check snakemake -h
## -R, -f, -F and --until are useful

checking results after the run finishes

    
    snakemake --summary | sort -k1,1 | less -S
    
# or use the detailed summary, which will give you the commands used to generate the output and what input was used
    snakemake --detailed-summary | sort -k1,1 > snakemake_run_summary.txt

    clean the folders

I use echo first to see what will be removed; then you can remove everything for real (see below).

    find . -maxdepth 1 -type d -name "[0-9]*" | xargs echo rm -rf
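Once the echoed list looks right, drop the echo to actually remove the numbered result folders:

find . -maxdepth 1 -type d -name "[0-9]*" | xargs rm -rf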

    Snakemake does not trigger re-runs if I add additional input files. What can I do?

Snakemake has a kind of "lazy" policy about added input files if their modification date is older than that of the output files. One reason is that the information about what to do cannot be inferred just from the input and output files; additional information about the last run would need to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument --list-input-changes in the following way:

    snakemake -n -R `snakemake --list-input-changes`
    

    How do I trigger re-runs for rules with updated code or parameters?

    snakemake -n -R `snakemake --list-params-changes`

    and

snakemake -n -R `snakemake --list-code-changes`

Interval files for chromosome-wise processing of GATK steps

    Whole exome and target panel sequencing

For the preprocessing steps of GATK, all steps are parallelized by chromosome by specifying --intervals.

If your genome reference does not use the chr prefix, the interval files need to have chr removed as well. In the human_g1k_v37_decoy.fasta file, chromosome M is denoted as MT.

    from illumina nextera protocol

    
    # download the bed file into the files folder in the repo
    wget https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexterarapidcapture/nexterarapidcapture_exome_targetedregions_v1.2.bed
    
    # create bed files by chromosome
    cat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | sed 's/^chr//' | sed 's/^M/MT/' | awk '{close(f);f=$1}{print > f".bed"}'
    
    # 1.bed, 2.bed, 3.bed ...X.bed, Y.bed, MT.bed will be created.
    

from xGen

    https://www.idtdna.com/pages/products/nextgen/target-capture/xgen-lockdown-panels/xgen-exome-panel

    
    wget https://www.idtdna.com/pages/docs/default-source/xgen-libraries/xGen-Lockdown-Panels/xgen-exome-research-panel-targets.bed?sfvrsn=6
    
    mv xgen-exome-research-panel-targets.bed\?sfvrsn\=6 xgen-exome-research-panel-targets.bed
    
## chrM is not in the captured regions. Change CHR in the config.yaml file accordingly.
    
    cat xgen-exome-research-panel-targets.bed | cut -f1-3 | sort -k1,1 -k2,2n | sed 's/^chr//' | awk '{close(f);f=$1}{print > f".bed"}'
    

    If you are using reference files from UCSC (keep chr):

    cat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | awk '{close(f);f=$1}{print > f".bed"}'

should do the job.

PLEASE place the resulting *.bed files in the files folder.

    For freebayes:

    cat xgen-exome-research-panel-targets.bed |cut -f1-3 | sort -k1,1 -k2,2n |  sed 's/^chr//' |  awk '{print $1":"$2"-"$3}' > xgen-exome-research-panel-targets_no_chr_freebayes.txt

    For lancet:

lancet does local assembly, so a larger window size is needed to capture all the reads; see the issue here: https://github.com/nygenome/lancet/issues/12

Create bed files extended by 300 bp upstream and downstream:

    cat xgen-exome-research-panel-targets.bed | cut -f1-3 | sort -k1,1 -k2,2n | awk '{print$1"\t"$2-300"\t"$3+300}' | sed 's/^chr//' | awk '{close(f);f=$1}{print > f"_extend.bed"}'

    whole genome sequencing

You do not need to provide --intervals to GATK for WGS. However, some regions of the genome are hard to sequence and the data there are not reliable. Including those regions can slow down the process and produce messy data.

    see a post on the GATK forum.

    A file ceph18.b37.include.2014-01-15.bed I borrowed from speedseq is also in the files folder.

    wget https://raw.githubusercontent.com/hall-lab/speedseq/master/annotations/ceph18.b37.include.2014-01-15.bed
    
    cat ceph18.b37.include.2014-01-15.bed | sort -k1,1 -k2,2n | awk '{close(f);f=$1}{print > f".bed"}'

freebayes accepts target regions in the format 1:10468-11803:

    
    cat ceph18.b37.include.2014-01-15.bed | cut -f1-3 | sort -k1,1 -k2,2n | awk '{print $1":"$2"-"$3}' > ceph18.b37.include.2014-01-15_for_freebayes.txt
    

    speed up GATK

    In addition to the scatter-gather pattern by chromosome mentioned above, some GATK tools can run in multiple threads.

    * `-nt / --num_threads`
    controls the number of data threads sent to the processor
    
    * `-nct / --num_cpu_threads_per_data_thread`
    controls the number of CPU threads allocated to each data thread

    NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather:

Tool  Full name               Type of traversal   NT  NCT  SG
RTC   RealignerTargetCreator  RodWalker           +   -    -
IR    IndelRealigner          ReadWalker          -   -    +
BR    BaseRecalibrator        LocusWalker         -   +    +
PR    PrintReads              ReadWalker          -   +    -
RR    ReduceReads             ReadWalker          -   -    +
HC    HaplotypeCaller         ActiveRegionWalker  -   (+)  +
UG    UnifiedGenotyper        LocusWalker         +   +    +

    Recommended settings:

Tool                 RTC  IR  BR      PR   RR  HC      UG
Available modes      NT   SG  NCT,SG  NCT  SG  NCT,SG  NT,NCT,SG
Cluster nodes        1    4   4       1    4   4       4 / 4 / 4
CPU threads (-nct)   1    1   8       4-8  1   4       3 / 6 / 24
Data threads (-nt)   24   1   1       1    1   1       8 / 4 / 1
Memory (Gb)          48   4   4       4    4   16      32 / 16 / 4
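As a concrete example of the settings above, a single-node BaseRecalibrator run with 8 CPU threads might look like the sketch below (file names are placeholders, not the pipeline's exact rule):

java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R human_g1k_v37_decoy.fasta \
    -I patient1_leukocyte.sorted.bam \
    -knownSites dbsnp.vcf \
    -L files/1.bed \
    -nct 8 \
    -o patient1_leukocyte_1_recal.table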

Parallel processing by chromosome is implemented in this pipeline.

    Further readings: VQSR

    https://gatkforums.broadinstitute.org/gatk/discussion/8019/which-gatk-parameters-to-consider-in-whole-exome-sequencing-data

for exomes you need at least 30 samples to run VQSR

    For BQSR: it is best to have greater than 100 million bases per read group

    https://gatkforums.broadinstitute.org/gatk/discussion/comment/14269#Comment_14269

    "Small datasets" from the -L FAQ refers to datasets with less than 100M bases.

    If you have 150M bases (sequenced reads) for a small targeted experiment, you should be fine to run BQSR. 1B is even better. Pretty much, anything above 100M bases is fine, but the bigger the dataset, the more accurate BQSR will be.

freebayes parameters

    borrowed from bcbio http://bcb.io/2015/03/05/cancerval/ https://github.com/chapmanb/bcbio-nextgen/blob/8c3206f131f9b97715d33141e558e1e38a07ac12/bcbio/variation/freebayes.py#L154

    freebayes -f hg19.fa --genotype-qualities --strict-vcf --ploidy 2 --targets regions.bed --min-repeat-entropy 1 --no-partial-observations --min-alternate-fraction 0.05 --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off tumor.bam normal.bam | bcftools filter -i 'ALT="<*>" || QUAL > 5'

    GATK best practice

This is the most up-to-date practice for mutect2, which can skip the indel-realignment step. For mutect1, indel realignment is still required.

    mutect2 can detect both SNVs and indels.
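For reference, a GATK3-style MuTect2 call on one chromosome interval might look roughly like this (a sketch; the bam, dbsnp and cosmic files are placeholders, and the pipeline's actual rule may differ):

java -Xmx8g -jar GenomeAnalysisTK.jar \
    -T MuTect2 \
    -R human_g1k_v37_decoy.fasta \
    -I:tumor patient1_pre-treatment_rln_recal.bam \
    -I:normal patient1_leukocyte_rln_recal.bam \
    --dbsnp dbsnp.vcf \
    --cosmic cosmic.vcf \
    -L files/1.bed \
    -o patient1_pre-treatment_vs_patient1_leukocyte_1_mutect.vcf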

    Note on read group

    read this post in the GATK forum.

    example:

    @RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

    Meaning of the read group fields required by GATK

    * ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model.
    
    * PU = Platform Unit
The PU holds three types of information: the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell, {LANE} indicates the lane of the flow cell, and {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but if present it takes precedence over ID for base recalibration. In the example shown earlier, the ID and PU fields appropriately differentiate the flow cell lane, marked by .2, a factor that contributes to batch effects.
    
    * SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.
    
    * PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
    
    * LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

    If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.

Use Picard's AddOrReplaceReadGroups to appropriately label read group (@RG) fields, coordinate-sort, and index a BAM file. Only the five required @RG fields are included in the command shown. Consider the other optional @RG fields for better record keeping.

# be sure to change RGID from the default of 1
java -jar picard.jar AddOrReplaceReadGroups \
    INPUT=reads.bam \
    OUTPUT=reads_addRG.bam \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true \
    RGID=H0164.2 \
    RGLB=library1 \
    RGPL=illumina \
    RGPU=H0164ALXX140820.2 \
    RGSM=sample1

    Deriving ID and PU fields from read names

    H0164ALXX140820:2:1101:10003:23460
    H0164ALXX140820:2:1101:15118:25288

    Breaking down the common portion of the query names:

    H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
    _____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
    _______________:2 #portion of @RG ID and PU fields indicating flow cell lane

    Multi-sample and multiplexed example

    Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @rg fields in the header:

    Dad's data:
    @RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
    @RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
    @RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
    @RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
    
    Mom's data:
    @RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
    @RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
    @RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
    @RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
    
    Kid's data:
    @RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
    @RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
    @RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
    @RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Add the read group when aligning with bwa via the -R option; see this post: https://www.biostars.org/p/78400/

Samples from different lanes need a different ID in the @RG tag.
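For example, lane 1 of Dad's 200 bp library from the trio above could be aligned with its read group set on the command line (fastq and reference names here are placeholders):

bwa mem -t 8 \
    -R '@RG\tID:FLOWCELL1.LANE1\tPU:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:LIB-DAD-1\tSM:DAD' \
    human_g1k_v37_decoy.fasta \
    DAD_lane1_R1.fastq DAD_lane1_R2.fastq > DAD_lane1.sam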

    Install gemini and annotation data

    Gemini is a tool from Aaron Quinlan's group

    conda install -c bioconda gemini
    
    cd /scratch/genomic_med/apps/gemini
    
    gemini update --dataonly

    The annotation data were downloaded to /scratch/genomic_med/apps/gemini/gemini/data

To annotate the vcf files, use vcfanno. An example configuration file:

    https://github.com/brentp/vcfanno/blob/master/example/gem.conf

    SNPdb129 hg19 version

dbSNP 129 is probably the latest version of dbSNP that is not contaminated by the recent large-scale cancer sequencing projects. It was only released for hg18, but annovar provides a lifted-over hg19 version.

    Download that by:

    
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar snp129  humandb/

## change to bed format for vcfanno (sort so that tabix can index it)
less -S hg19_snp129.txt | cut -f2-7 | sed 's/^chr//' | sort -k1,1 -k2,2n > hg19_snp129.bed

## compress and index for vcfanno
bgzip hg19_snp129.bed
tabix -p bed hg19_snp129.bed.gz

## copy them to the vcfanno_db folder: /scratch/genomic_med/apps/gemini/gemini/data/
cp hg19_snp129.bed.gz hg19_snp129.bed.gz.tbi /scratch/genomic_med/apps/gemini/gemini/data/

    add it to the gem.conf file:

    ## dbsnp129 hg19 version
    [[annotation]]
    file="hg19_snp129.bed"
    columns=[4]
    names=["dbsnp129_rs_ids"]
    ops=["concat"]

    filter freebayes calls

    from bcbio http://bcb.io/2015/03/05/cancerval/

    bcftools filter --soft-filter 'FBQualDepth' -e '(AF[0] <= 0.5 && (DP < 4 || (DP < 13 && %QUAL < 10))) || (AF[0] > 0.5 && (DP < 4 && %QUAL < 50))' -m '+'
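Applied to a freebayes VCF from this pipeline, the call might look like this (file names are placeholders):

bcftools filter --soft-filter 'FBQualDepth' \
    -e '(AF[0] <= 0.5 && (DP < 4 || (DP < 13 && %QUAL < 10))) || (AF[0] > 0.5 && (DP < 4 && %QUAL < 50))' \
    -m '+' -O z -o patient1_pre-treatment_vs_patient1_leukocyte_freebayes.filtered.vcf.gz \
    patient1_pre-treatment_vs_patient1_leukocyte_freebayes.vcf.gz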
    

    Somatic Score (SSC) filtering from speedseq.

    I checked the source code:

    function somatic_filter() {
        awk -v MINQUAL="$1" -v SSC_THRES="$2" -v ONLY_SOMATIC="$3" 'BEGIN {NORMAL=10; TUMOR=11; GL_IDX=0;}
        {
            if ($0~"^#") { print ; next; }
            if (! GL_IDX) {
                split($9,fmt,":")
                for (i=1;i<=length(fmt);++i) { if (fmt[i]=="GL") GL_IDX=i }
            }
            split($NORMAL,N,":");
            split(N[GL_IDX],NGL,",");
            split($TUMOR,T,":");
            split(T[GL_IDX],TGL,",");
            LOD_NORM=NGL[1]-NGL[2];
            LOD_TUMOR_HET=TGL[2]-TGL[1];
            LOD_TUMOR_HOM=TGL[3]-TGL[1];
            if (LOD_TUMOR_HET > LOD_TUMOR_HOM) { LOD_TUMOR=LOD_TUMOR_HET }
            else { LOD_TUMOR=LOD_TUMOR_HOM }
            DQUAL=LOD_TUMOR+LOD_NORM;
            if (DQUAL>=SSC_THRES && $NORMAL~"^0/0") {
                $7="PASS"
                $8="SSC="DQUAL";"$8
                print
            }
            else if (!ONLY_SOMATIC && $6>=MINQUAL && $10~"^0/0" && ! match($11,"^0/0")) {
                $8="SSC="DQUAL";"$8
                print
            }
        }' OFS="\t"
    }
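The function reads a VCF on stdin, so applying it looks roughly like this (the MINQUAL, SSC_THRES and ONLY_SOMATIC values below are only illustrative; check speedseq for the defaults it actually uses):

cat patient1_pre-treatment_vs_patient1_leukocyte_freebayes.vcf \
    | somatic_filter 1 18 0 \
    > patient1_pre-treatment_vs_patient1_leukocyte_freebayes.ssc.vcf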
    

    Install VEP

    The latest version of vep is on github http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer

It was version 89 when this was written (bioinformatics tools evolve too fast!).

    check this gist as well https://gist.github.com/ckandoth/f265ea7c59a880e28b1e533a6e935697

    cd /scratch/genomic_med/apps
    git clone https://github.com/Ensembl/ensembl-vep.git
    cd ensembl-vep
    git status
    # the Ensembl API will be installed
    perl INSTALL.pl
    export VEP_DATA="/scratch/genomic_med/apps/ensembl-vep-data"
    export VEP_PATH="/scratch/genomic_med/apps/ensembl-vep"
    
    rsync -avhP rsync://ftp.ensembl.org/ensembl/pub/release-89/variation/VEP/homo_sapiens_vep_89_GRCh37.tar.gz $VEP_DATA
    tar -xvzf $VEP_DATA/homo_sapiens_vep_89_GRCh37.tar.gz -C $VEP_DATA

    install the reference FASTAs for GRCh37:

    a fasta file $VEP_DATA/homo_sapiens/89_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz will be downloaded.

    perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
    

Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:

    perl convert_cache.pl --species homo_sapiens --version 89_GRCh37 --dir $VEP_DATA

    Annotate

    vep --species homo_sapiens --assembly GRCh37 --offline --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/89_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.vcf --polyphen b --af_1kg --af_esp --regulatory

    If you got an error message:

    -------------------- EXCEPTION --------------------
    MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta

Most likely, your perl version is too old. See an issue here.

    options:

1. You can update your system's perl to the suggested version (which was not suitable in my case)
2. Or one can install a local copy of the correct perl version. see here
3. As the error is caused by reading a gzipped file, one can simply unzip the reference (see the command after this list).
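For option 3, unzipping the reference is simply the command below; afterwards point --fasta at the uncompressed .fa file:

gunzip $VEP_DATA/homo_sapiens/89_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz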

    Make sure you have write access to the folder where the fasta file resides. I am placing the fasta in our department shared folder.

    from http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html

    The first time you run the script with this parameter an index will be built which can take a few minutes. This is required if fetching HGVS annotations (--hgvs) or checking reference sequences (--check_ref) in offline mode (--offline).

    Prioritize the tumor-only variants

    from bcbio: http://bcb.io/2015/03/05/cancerval/

bcbio then extracts these to prioritize variants with high or medium predicted impact, not present in 1000 genomes or ExAC at more than 1% in any subpopulation, or identified as pathogenic in COSMIC or ClinVar.
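With the annotated VCF loaded into a gemini database, that kind of prioritization roughly corresponds to a query like the one below (a sketch: the database name is a placeholder, and the column names and 1% cutoffs should be checked against your gemini schema and the bcbio code):

gemini query --header \
    -q "SELECT chrom, start, end, ref, alt, gene, impact, impact_severity FROM variants \
        WHERE impact_severity IN ('HIGH', 'MED') \
        AND (aaf_1kg_all IS NULL OR aaf_1kg_all < 0.01) \
        AND (aaf_exac_all IS NULL OR aaf_exac_all < 0.01)" \
    patient1_tumor_only.db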

    How to determine LOH from Sequenza output

    see a post from the sequenza google group https://groups.google.com/forum/#!topic/sequenza-user-group/k-TgMaYlE6Q

LOH can generally be determined when B is equal to 0. If you are interested in copy-neutral LOH, you should also look at the A value: e.g. A == 2 and B == 0 means a genotype of AA (copy-neutral LOH in a diploid chromosome, where the normal genotype would be AB); A == 1 and B == 0 means a genotype of A, indicating a copy number of 1 (not copy-neutral for a diploid chromosome) and also a loss of heterozygosity in that segment.

    So in short you need to evaluate the B parameter to get all the LOH regions, and the A parameter if you want to further characterise the LOH.
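For example, LOH segments (B == 0) can be pulled out of a sequenza *_segments.txt file with something like this (a sketch; it assumes a tab-delimited file with a header column named B):

awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) {h = $i; gsub(/"/, "", h); col[h] = i} print; next}
            $(col["B"]) != "NA" && $(col["B"]) == 0' patient1_segments.txt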

    Questions on BAF and LogR

To simplify the model fitting, we identify the B-allele frequency, defining the B allele as the minor allele. So the minor allele can only have a frequency between 0 and 0.5, while the major allele has a range between 0.5 and 1. The seqz file is produced by comparing, position by position, the normal and tumor mpileups. We define het positions based on the normal sample genotype, and annotate the A and B allele frequencies of those positions only from the tumor sample in the seqz file.

    So the A and B frequencies in the seqz file are not to be considered to define zygosity (if the sample is high cellularity, eg cell lines, in LOH regions the A/B frequencies would be 1/0 respectively).

Regarding the depth.ratio, the numbers in the seqz file are merely the ratio between the depths of the two pileups. The GC normalization is applied during the processing in R; otherwise the average ratio depends on the two library sizes.

The normalization step in R is very simple: in practice we calculate the mean ratio value for each GC "window", and then use the results to normalize the depth ratio based on the respective GC content. This way the average ratio is 1, solving library normalization and GC normalization.

You can obtain the logR by taking log2 of the normalized depth ratio (the depth.ratio column of the segments.txt file).
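So a per-segment logR can be computed directly from the depth.ratio column, e.g. (a sketch; it assumes a tab-delimited *_segments.txt whose first three columns are chromosome, start and end, with a header column named depth.ratio):

awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) {h = $i; gsub(/"/, "", h); col[h] = i}; next}
            $(col["depth.ratio"]) != "NA" {print $1, $2, $3, log($(col["depth.ratio"])) / log(2)}' OFS="\t" patient1_segments.txt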

    alternative solutions of ploidy and purity:

e.g. the contents of sample_alternative_solutions.txt are:

    "cellularity"	"ploidy"	"SLPP"
    1	3.8	0.795405007607475
    1	1.9	6.67188954220472e-12
    1	5.7	3.24694972360125e-14

The default solution is cellularity = 1, ploidy = 3.8. To use one of the other alternative solutions:

sequenza.results(..., cellularity = 1, ploidy = 1.9)