Commit 0b79e215 authored by Sebastian Schmeier

Merge branch 'revision' into 'master'

modifications due to reviewers' comments

See merge request !4
parents d3aa25c8 56382f6d
Outputs of the 16S rRNA metabarcoding analysis
Here we keep the `otu_table.biom` file, which contains the OTUs identified
with QIIME's `` workflow.
The `otu_table_sorted_L*.txt.gz` files are compressed outputs of
QIIME's `` workflow for taxonomic
levels L2 and L6.
The folder `raw_counts` contains compressed files of raw read counts for each sample.
Each file lists the Ensembl gene ID in the first column and the corresponding read count
in the second column.
The file `counts_summary.havana.cols.uniqgenes.gz` summarises the files
in the `raw_counts` folder. In this file, the gene IDs were converted from Ensembl to Entrez gene IDs;
genes are shown in rows and sample IDs in columns. The number
in the `x`-th row and `y`-th column is the read count for the `x`-th gene in the `y`-th sample.
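The layout described above (one two-column file per sample, combined into a genes-by-samples matrix) can be sketched as follows. This is a minimal illustration, not the script used to produce `counts_summary.havana.cols.uniqgenes.gz`: it assumes the per-sample files are tab-delimited, the gene IDs and sample names are made up, and the Ensembl-to-Entrez conversion is not covered.

```python
import csv
import gzip
import io


def read_counts(handle):
    """Parse a two-column table: gene ID, then read count (tab-delimited is assumed)."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts[row[0]] = int(row[1])
    return counts


def build_matrix(per_sample):
    """Return (gene_ids, sample_ids, rows) with genes in rows and samples in columns."""
    sample_ids = sorted(per_sample)
    gene_ids = sorted({gene for counts in per_sample.values() for gene in counts})
    rows = [[per_sample[s].get(g, 0) for s in sample_ids] for g in gene_ids]
    return gene_ids, sample_ids, rows


# In-memory stand-ins for two gzip-compressed files (hypothetical content).
files = {
    "sampleA": gzip.compress(b"ENSG01\t5\nENSG02\t0\n"),
    "sampleB": gzip.compress(b"ENSG01\t7\nENSG02\t3\n"),
}
per_sample = {}
for name, blob in files.items():
    with gzip.open(io.BytesIO(blob), "rt") as handle:
        per_sample[name] = read_counts(handle)

gene_ids, sample_ids, matrix = build_matrix(per_sample)
# matrix[x][y] is the read count of gene_ids[x] in sample_ids[y]
```

Genes missing from a sample are filled with zero here, which is a design choice of the sketch rather than something the source states.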
The file `tpm.readyForClassifier.tsv` contains the data from `counts_summary.havana.cols.uniqgenes.gz`
transformed into a form suitable as input for the CMS classifier.
In this file, the Entrez gene IDs are in columns and the sample IDs are in rows; the number in the
`x`-th row and `y`-th column is the TPM (transcripts per million) value of gene
`y` in sample `x`. For the details of the transformation see
The gene expression profiles from `tpm.readyForClassifier.tsv`
were used as input for `scripts/rnaseq-subtype-classification/classify_samples_with_CMSclassifier.R`
to classify CRC samples into CMSs.
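For orientation, the standard TPM definition can be sketched as below. This is not necessarily the exact transformation used to produce `tpm.readyForClassifier.tsv`; the gene lengths, gene IDs and counts are hypothetical inputs chosen for illustration.

```python
def tpm(counts, lengths_bp):
    """Standard TPM: length-normalised counts rescaled to sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}  # sample with no reads at all
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical sample: two genes of equal length (IDs and values are made up).
example_counts = {"1": 100, "2": 300}
example_lengths = {"1": 1000, "2": 1000}
values = tpm(example_counts, example_lengths)
```

Because TPM values are computed per sample, applying the function to each sample in turn and writing one row per sample gives the samples-in-rows orientation described above.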
Classification of the samples into CMS subtypes is available as the file
@@ -17,4 +17,12 @@ Files containing abundances of taxa in terms of percentage on all taxa within the
## Files containing information on used bacterial taxa in the Kraken database
- Supplementary_table_K1.xlsx
- Supplementary_table_K2.xlsx
All NCBI RefSeq bacterial genomes with "Complete Genomes"- or "Chromosome"-level assemblies were downloaded from the NCBI FTP
site ( based on information in the
file as of 19th January 2017.
A list of the genomes can be found in Supplementary_table_K1.xlsx.
Additional genomes known to play a role in CRC were added regardless of their genome status (see Supplementary_table_K2.xlsx).
Using the genome FASTA files, a new Kraken database was created using "kraken-build --build" with default parameters.
The resulting database had a size of 131 GB.
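The selection step described above (keeping only genomes at the two accepted assembly levels) can be sketched as a filter over a tab-separated summary table. This is an illustration only: the column names `accession` and `assembly_level`, the level labels and the example rows are assumptions, and the actual summary file and download code are not shown in this document.

```python
import csv
import io

# Assembly levels to keep; the exact labels in the real summary file are assumed.
KEEP_LEVELS = {"Complete Genome", "Chromosome"}


def select_assemblies(handle, level_column="assembly_level"):
    """Return the rows of a tab-separated table whose assembly level is accepted."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[level_column] in KEEP_LEVELS]


# Hypothetical summary table with made-up accessions.
table = io.StringIO(
    "accession\tassembly_level\n"
    "ACC_1\tComplete Genome\n"
    "ACC_2\tScaffold\n"
    "ACC_3\tChromosome\n"
)
selected = [row["accession"] for row in select_assemblies(table)]
```

Genomes added regardless of status, such as those in Supplementary_table_K2.xlsx, would be appended to the selection separately.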
# For each of the listed parameters we give its name, value,
# a script where the parameter is used, and a short description.
parameter value script description
quality 0.01 scripts/16S-metabarcoding/ minimum base quality for the sequenced bases kept after the data preprocessing
length 250 scripts/16S-metabarcoding/ minimum fragment length for the fragments kept after the data preprocessing
flashOptions -M 200 scripts/16S-metabarcoding/ maximum overlap of reads from one read pair
p 0.01 scripts/rnaseq-subtype-classification/ minimum base quality for the sequenced bases kept after the data preprocessing
l 50 scripts/rnaseq-subtype-classification/ minimum read length for the reads kept after the data preprocessing
o "--runThreadN 4 \
--limitBAMsortRAM 20000000000 \
--genomeLoad LoadAndRemove \
--outFilterMultimapNmax 20 \
--outSAMtype BAM SortedByCoordinate \
--outFilterType BySJout \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--quantMode GeneCounts \
--outReadsUnmapped Fastx \
--outFilterMatchNminOverLread 0.4 \
--outFilterScoreMinOverLread 0.4" scripts/rnaseq-subtype-classification/ parameters passed to STAR
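A multi-line value such as `o` has to be turned into individual command-line arguments before it can be passed to a program. The sketch below shows one way to do that; how the scripts in `scripts/rnaseq-subtype-classification/` actually consume the value is not shown here, and the function name is hypothetical.

```python
import shlex


def option_tokens(value):
    """Join backslash-continued lines, then split into shell-style tokens."""
    joined = value.replace("\\\n", " ")
    return shlex.split(joined)


# Shortened excerpt of the `o` value above, with its line continuations.
star_value = "--runThreadN 4 \\\n--genomeLoad LoadAndRemove \\\n--quantMode GeneCounts"
tokens = option_tokens(star_value)
```

The line continuations are removed first because `shlex.split` would otherwise keep a backslash-newline pair that sits inside a quoted string.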