README.md 18.9 KB
Newer Older
1
gplas: binning plasmid-predicted contigs
Sergio_Arredondo's avatar
Sergio_Arredondo committed
2 3
================

4 5
# Introduction

6 7 8 9
gplas is a tool to bin plasmid-predicted contigs based on sequence
composition, coverage and assembly graph information. Gplas is a new
tool that extends the possibility of accurately binning predicted
plasmid contigs into several discrete plasmid components.
Sergio_Arredondo's avatar
Sergio_Arredondo committed
10

Sergio_Arredondo's avatar
Sergio_Arredondo committed
11 12
![](figures/logo.png)<!-- -->

13
# Installation
Sergio_Arredondo's avatar
Sergio_Arredondo committed
14

Sergio_Arredondo's avatar
Sergio_Arredondo committed
15 16 17
``` bash
git clone https://gitlab.com/sirarredondo/gplas.git
cd gplas
18
./gplas.sh -i test/faecium_graph.gfa -c mlplasmids -s 'Enterococcus faecium' -n 'installation'
Sergio_Arredondo's avatar
Sergio_Arredondo committed
19
```
Sergio_Arredondo's avatar
Sergio_Arredondo committed
20

21
First-time installation can take some time depending on your internet
22
speed (~20 minutes).
23

Anita Schurch's avatar
Anita Schurch committed
24
The good news is that you do not have to install any dependencies. The
25 26
snakemake pipeline and different conda environments integrate all the
dependencies required to run gplas.
27 28

After the first-time installation, you will get the prediction of gplas
29
in a few minutes and using a single thread\!
30

Anita Schurch's avatar
Anita Schurch committed
31
Gplas first checks if the following tools are present on your system:
32

33
1.  [Conda](https://bioconda.github.io/)
Sergio_Arredondo's avatar
Sergio_Arredondo committed
34

35 36
2.  [Snakemake](https://snakemake.readthedocs.io/en/stable/) version
    5.5.4
Sergio_Arredondo's avatar
Sergio_Arredondo committed
37

38 39
After this, gplas will start the snakemake pipeline and will install
different conda environments with the following R packages:
40

41 42 43
      
[igraph](https://cran.r-project.org/web/packages/igraph/index.html)
version 1.2.4.1
44

45 46 47
      
[ggraph](https://cran.r-project.org/web/packages/ggraph/index.html)
version 1.0.2
48

49 50 51
      
[Biostrings](https://www.bioconductor.org/packages/release/bioc/html/Biostrings.html)
version 2.50.2
52

53 54 55
      
[seqinr](https://cran.r-project.org/web/packages/seqinr/index.html)
version 3.4-5
56

57
       [tidyverse](https://www.tidyverse.org/) version 1.2.1
58

59 60 61
      
[spatstat](https://cran.r-project.org/web/packages/spatstat/index.html)
version 1.59-0
62

63 64 65
      
[ggrepel](https://cran.r-project.org/web/packages/ggrepel/index.html)
version 0.8.0
66

67 68
Following this, it will install the tools that we use to predict
plasmid-derived contigs.
Sergio_Arredondo's avatar
Sergio_Arredondo committed
69

70 71
4.  [mlplasmids](https://gitlab.com/sirarredondo/mlplasmids) version
    1.0.0
72

73
5.  [plasflow](https://anaconda.org/bioconda/plasflow) version 1.1
74

75
# Usage
76

77
## Quick usage
78

79
### Running gplas with an assembly graph
80

81
Gplas only requires a single argument **‘-i’** corresponding to an
82 83 84 85
assembly graph in gfa format. Such an assembly graph can be obtained
with [SPAdes genome assembler](https://github.com/ablab/spades). You
need to specify which classifier gplas is going to use, mlplasmids or
plasflow, with the argument **‘-c’**
86

87 88 89
If you choose mlplasmids for the prediction, there is an additional
mandatory argument **‘-s’** in which you need to list any of the
following three bacterial species:
90

91 92 93
  - ‘Enterococcus faecium’
  - ‘Klebsiella pneumoniae’
  - ‘Escherichia coli’
94

95
You can use plasflow as a classifier if you have a different bacterial
96
species.
97 98

``` bash
99
./gplas.sh -i test/faecium_graph.gfa -c mlplasmids -s 'Enterococcus faecium' -n 'my_isolate'
100 101
```

102
    ## 
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
    ## 
    ## 
    ##   _______ .______    __           ___           _______.
    ##  /  _____||   _  \  |  |         /   \         /       |
    ## |  |  __  |  |_)  | |  |        /  ^  \       |   (----`
    ## |  | |_ | |   ___/  |  |       /  /_\  \       \   \    
    ## |  |__| | |  |      |  `----. /  _____  \  .----)   |   
    ##  \______| | _|      |_______|/__/     \__\ |_______/    
    ## 
    ## 
    ## ##################################################################
    ## 
    ## 
    ## This is your input graph: test/faecium_graph.gfa 
    ## 
    ## This is the bacterial species that you have indicated: Enterococcus faecium 
    ## 
120
    ## Your results will be named  my_isolate 
121
    ## 
122
    ## You did not indicate a threshold prediction. Using 0.5 because you are using mlplasmids
123
    ## 
124
    ## You did not pass the number of times to look for walks based on each plasmid seed, using 20 as default
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
    ## 
    ## ##################################################################
    ## Conda is present
    ## 
    ## Creating (only the first-time) a conda environment to install and run snakemake
    ## The flag 'directory' used in rule awk_parsing_alignment is only valid for outputs, not inputs.
    ## Building DAG of jobs...
    ## Unlocking working directory.
    ## The flag 'directory' used in rule awk_parsing_alignment is only valid for outputs, not inputs.
    ## Building DAG of jobs...
    ## Using shell: /bin/bash
    ## Provided cores: 1
    ## Rules claiming more threads will be scaled down.
    ## Job counts:
    ##  count   jobs
    ##  1   awk_links
    ##  1   awk_nodes
    ##  1   gplas_coocurr
    ##  1   gplas_coverage
    ##  1   gplas_paths
    ##  1   mlplasmids
    ##  6
    ## 
148 149
    ## [Wed Jan 22 13:56:34 2020]
    ## Job 1: Extracting the nodes from the graph test/faecium_graph.gfa
150
    ## 
151
    ## Activating conda environment: /home/sergi/gplas/.snakemake/conda/70552874
152 153
    ## [Wed Jan 22 13:56:38 2020]
    ## Finished job 1.
154 155
    ## 1 of 6 steps (17%) done
    ## 
156 157
    ## [Wed Jan 22 13:56:38 2020]
    ## Job 5: Extracting the links from the graph test/faecium_graph.gfa
158
    ## 
159
    ## Activating conda environment: /home/sergi/gplas/.snakemake/conda/70552874
160 161
    ## [Wed Jan 22 13:56:41 2020]
    ## Finished job 5.
162 163
    ## 2 of 6 steps (33%) done
    ## 
164
    ## [Wed Jan 22 13:56:41 2020]
165 166
    ## Job 3: Running mlplasmids to obtain the plasmid prediction using the nodes extracted from the graph. If this is the first time running mlplasmids, installation can take a few minutes
    ## 
167
    ## Activating conda environment: /home/sergi/gplas/.snakemake/conda/70552874
168
    ## [Wed Jan 22 13:56:54 2020]
169 170 171
    ## Finished job 3.
    ## 3 of 6 steps (50%) done
    ## 
172
    ## [Wed Jan 22 13:56:54 2020]
173 174 175
    ## Job 2: Extracting the sd k-mer coverage from the chromosome-predicted contigs
    ## 
    ## R script job uses conda environment but R_LIBS environment variable is set. This is likely not intended, as R_LIBS can interfere with R packages deployed via conda. Consider running `unset R_LIBS` or remove it entirely before executing Snakemake.
176
    ## Activating conda environment: /home/sergi/gplas/.snakemake/conda/70552874
177
    ## WARNING: ignoring environment value of R_HOME
178
    ## [Wed Jan 22 13:57:07 2020]
179 180 181
    ## Finished job 2.
    ## 4 of 6 steps (67%) done
    ## 
182
    ## [Wed Jan 22 13:57:07 2020]
183
    ## Job 4: Searching for plasmid-like walks using a greedy approach
184 185
    ## 
    ## R script job uses conda environment but R_LIBS environment variable is set. This is likely not intended, as R_LIBS can interfere with R packages deployed via conda. Consider running `unset R_LIBS` or remove it entirely before executing Snakemake.
186
    ## Activating conda environment: /home/sergi/gplas/.snakemake/conda/70552874
187
    ## WARNING: ignoring environment value of R_HOME
188
    ## [Wed Jan 22 13:57:50 2020]
189 190 191
    ## Finished job 4.
    ## 5 of 6 steps (83%) done
    ## 
192
    ## [Wed Jan 22 13:57:50 2020]
193
    ## Job 0: Generating weights for the set of new edges connecting plasmid unitigs
194 195
    ## 
    ## R script job uses conda environment but R_LIBS environment variable is set. This is likely not intended, as R_LIBS can interfere with R packages deployed via conda. Consider running `unset R_LIBS` or remove it entirely before executing Snakemake.
196
    ## Activating conda environment: /home/sergi/gplas/.snakemake/conda/70552874
197
    ## WARNING: ignoring environment value of R_HOME
198
    ## NULL
199 200 201 202
    ##       Algorithm Modularity Original_component Decision
    ## 1      Walktrap       0.13                  1 No_split
    ## 2 Leading-eigen       0.13                  1 No_split
    ## 3       Louvain       0.13                  1 No_split
203 204
    ## null device 
    ##           1 
205
    ## [Wed Jan 22 13:58:04 2020]
206 207
    ## Finished job 0.
    ## 6 of 6 steps (100%) done
208
    ## Complete log: /home/sergi/gplas/.snakemake/log/2020-01-22T135633.934474.snakemake.log
209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
    ##   _______ .______    __           ___           _______.
    ##  /  _____||   _  \  |  |         /   \         /       |
    ## |  |  __  |  |_)  | |  |        /  ^  \       |   (----`
    ## |  | |_ | |   ___/  |  |       /  /_\  \       \   \    
    ## |  |__| | |  |      |  `----. /  _____  \  .----)   |   
    ##  \______| | _|      |_______|/__/     \__\ |_______/    
    ## 
    ## 
    ## Congratulations! Prediction succesfully done.
    ## 
    ## Input graph: test/faecium_graph.gfa 
    ## 
    ## Bacterial species:  'Enterococcus faecium' 
    ## 
    ## Classifier: mlplasmids 
    ## 
    ## Threshold for predicting plasmid-derived contigs: 0.5
    ## 
227
    ## Number of plasmid walks created per node: 20
228 229 230
    ## 
    ## Threshold of gplas scores: 0.1
    ## 
231 232
    ## Minimum frequency to consider an edge: 0.1
    ## 
233 234
    ## Modularity threshold used to partition the network: 0.2
    ## 
235 236
    ## 
    ## 
237
    ## Your results are in results/ and walks/
238
    ## 
239
    ## We hope it helps your research, thanks for using gplas!
240 241 242 243 244 245 246 247 248
    ## 
    ## If you have used plasflow as a classifier please cite:
    ##   Pawel S Krawczyk et al. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures, Nucleic Acids Research, doi: 10.1093/nar/gkx1321
    ## 
    ## 
    ## If you have used mlplasmids as a classifier please cite:
    ##   Arredondo-Alonso et al. mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species, Microbial Genomics, doi: 10.1099/mgen.0.000224
    ## 
    ## 
249
    ## gplas version 0.6.1 - Preprint of gplas: https://www.biorxiv.org/content/10.1101/835900v1
250

251
## Main output files
252 253

Gplas will create a folder called ‘results’ with the following files:
254

255
``` bash
256
ls results/my_isolate*
257 258
```

259 260 261 262
    ## results/my_isolate_bin_1.fasta
    ## results/my_isolate_bins.tab
    ## results/my_isolate_plasmidome_network.png
    ## results/my_isolate_results.tab
263

264
### results/\*results.tab
265 266

Tab delimited file containing the prediction given by mlplasmids or
267 268
plasflow together with the bin prediction by gplas. The file contains
the following information: contig number, probability of being
269
chromosome-derived, probability of being plasmid-derived, class
270
prediction, contig name, k-mer coverage, length, bin
271
assigned.
272

273 274 275 276 277 278 279 280 281 282 283 284
| number | Contig\_name                             | Prob\_Chromosome | Prob\_Plasmid | Prediction | length | coverage | Bin |
| -----: | :--------------------------------------- | ---------------: | ------------: | :--------- | -----: | -------: | --: |
|     18 | S18\_LN:i:54155\_dp:f:1.0514645940835776 |             0.01 |          0.99 | Plasmid    |  54155 |     1.05 |   1 |
|     31 | S31\_LN:i:21202\_dp:f:1.194722937126809  |             0.15 |          0.85 | Plasmid    |  21202 |     1.19 |   1 |
|     33 | S33\_LN:i:18202\_dp:f:1.1628830074648842 |             0.40 |          0.60 | Plasmid    |  18202 |     1.16 |   1 |
|     46 | S46\_LN:i:8487\_dp:f:1.2210058174026983  |             0.03 |          0.97 | Plasmid    |   8487 |     1.22 |   1 |
|     47 | S47\_LN:i:8177\_dp:f:0.9996798934685464  |             0.04 |          0.96 | Plasmid    |   8177 |     1.00 |   1 |
|     50 | S50\_LN:i:4993\_dp:f:1.1698997426343487  |             0.02 |          0.98 | Plasmid    |   4993 |     1.17 |   1 |
|     52 | S52\_LN:i:4014\_dp:f:0.9783821389091624  |             0.03 |          0.97 | Plasmid    |   4014 |     0.98 |   1 |
|     54 | S54\_LN:i:3077\_dp:f:1.1553028848000615  |             0.08 |          0.92 | Plasmid    |   3077 |     1.16 |   1 |
|     57 | S57\_LN:i:2626\_dp:f:0.9929149754371588  |             0.03 |          0.97 | Plasmid    |   2626 |     0.99 |   1 |
|     60 | S60\_LN:i:1589\_dp:f:1.0577429501871556  |             0.00 |          1.00 | Plasmid    |   1589 |     1.06 |   1 |
285

286
### results/\*components.tab
287 288

Tab delimited file containing the bin prediction reported by gplas with
289 290 291 292 293 294 295
the following information: contig number, bin assignation

| number | Bin |
| -----: | --: |
|     18 |   1 |
|     33 |   1 |
|     31 |   1 |
296 297
|     47 |   1 |
|     46 |   1 |
298 299 300 301
|     50 |   1 |
|     52 |   1 |
|     57 |   1 |
|     54 |   1 |
302
|     60 |   1 |
303

304
### results/\*plasmidome\_network.png
305 306

Png file of the plasmidome network generated by gplas after creating an
307 308
undirected graph from edges between plasmid unitigs co-existing in the
walks created by gplas.
309

310
![](results/my_isolate_plasmidome_network.png)<!-- -->
Sergio_Arredondo's avatar
Sergio_Arredondo committed
311

312
### results/\*components.fasta
313

314
Fasta files with the nodes belonging to each predicted component.
315 316

``` bash
317
grep '>' results/my_isolate*.fasta
318 319
```

320 321 322 323 324 325 326 327 328 329
    ## >S18_LN:i:54155_dp:f:1.0514645940835776
    ## >S31_LN:i:21202_dp:f:1.194722937126809
    ## >S33_LN:i:18202_dp:f:1.1628830074648842
    ## >S46_LN:i:8487_dp:f:1.2210058174026983
    ## >S47_LN:i:8177_dp:f:0.9996798934685464
    ## >S50_LN:i:4993_dp:f:1.1698997426343487
    ## >S52_LN:i:4014_dp:f:0.9783821389091624
    ## >S54_LN:i:3077_dp:f:1.1553028848000615
    ## >S57_LN:i:2626_dp:f:0.9929149754371588
    ## >S60_LN:i:1589_dp:f:1.0577429501871556
330

331
### walks/\*solutions.csv
332

333 334 335 336 337
gplas generates plasmid-like walks per each plasmid starting node. These
paths are used later to generate the edges from the plasmidome network
but they can also be useful to observe all the different walks starting
from a single node (plasmid unitig). These walk can be directly given to
Bandage to visualize and manually inspect a walk.
338

339 340
In this case, we find different possible plasmid walks starting from the
node 18+. These paths may contain inversions and rearrangements since
341 342 343
repeats units such as transposases which can be present several times in
the same plasmid sequence. In these cases, gplas can traverse the
sequence in different ways generating different plasmid-like
344
    paths.
345 346

``` bash
347
head -n 10 walks/my_isolate_solutions.csv
348 349
```

350
    ## 18+,76-,102+,33+,76-,102+,92+,47+,115-,64+,31-,79+,60-,70-,50+,64-,116+,61-,88-,89+,69-,96-,119+,64-,116+,61-,88-,90+,69-,100+,119+,64-,116+,63+,115-,64+,119-,100-,69+
351
    ## 18+,76-,52+,94+,57-,77+,18+
352
    ## 18+,76-,52+,94+,57-,77+,87-,65+,54-,94+
353
    ## 18+,76-,52+,94+,57-,77+,87-,65+,54-,94+
354 355 356 357
    ## 18+,76-,52+,94+,57-,77+,87-,65+,54-,94+
    ## 18+,76-,102+,33+,76-,52+,94+,57-,77+,18+
    ## 18+,76-,52+,94+,57-,77+,18+
    ## 18+,76-,102+,92+,47+,115-,64+,31-,79+,60-,70-,50+,64-,113+
358
    ## 18+,76-,52+,94+,57-,77+,18+
359
    ## 18+,76-,102+,92+,47+,115-,64+,31-,79+,46-,79+,60-,70-,50+,64-,114+
360 361 362 363

For example, we can inspect in Bandage the path:
18+,76-,52+,94+,57-,77+,18+

364 365 366
This path forms a circular sequence since there is overlap between the
initial and end node of the path.

367
![](figures/bandage_path.jpg)<!-- -->
368

369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394
## Complete usage

Gplas can take the following arguments:

Mandatory arguments:

  - **-i**: Path to the graph file in \*.gfa format used to extract
    nodes and links. Gfa file format
  - **-c**: Classifier used to predict the contigs extracted from the
    input graph. String value: ‘plasflow’ or ‘mlplasmids’
  - **-s**: Only applicable if mlplasmids is chosen. Bacterial species
    from the graph file. If you have specified mlplasmids as classifier
    you need to provide one of the following three bacterial species:
    ‘Enterococcus faecium’,‘Klebsiella pneumoniae’ or ‘Escherichia
    coli’

Optional arguments:

  - **-n**: Project name given to gplas. Default: ‘unnamed’
  - **-t**: Threshold to predict plasmid-derived sequences. Integer
    value ranging from 0 to 1. Default mlplasmids threshold: 0.5 Default
    plasflow threshold: 0.7
  - **-x**: Number of times gplas finds plasmid paths per each plasmid
    starting node. Integer value ranging from 1 to infinite. Default: 20
  - **-f**: Gplas filtering threshold score to reject possible outcoming
    edges. Integer value ranging from 0 to 1. Default: 0.1
395 396
  - **-q**: Modularity threshold to split components present in the
    plasmidome network. Integer value ranging from 0 to 1. Default: 0.2
397 398 399 400 401

For benchmarking purposes you can pass a complete genome to gplas and
will generate a precision and completeness. Using this you can assess
the performance of gplas on a small set of genomes in which perhaps you
have generated long-reads.
402 403

  - **-r**: Path to the complete reference genome corresponding to the
404 405 406 407
    graph given. For optimal results using this benchmarking flag,
    please name the reference genomes using the Unicycler scheme: e.g.
    ‘1 length=4123456’ ‘2 length=10000’ ‘3 length=2000’ for your
    chromosome and plasmids. Fasta file format Fasta file format
408 409 410 411 412 413 414 415 416 417 418 419 420

# Help page

``` bash
./gplas.sh -h
```

    ##   _______ .______    __           ___           _______.
    ##  /  _____||   _  \  |  |         /   \         /       |
    ## |  |  __  |  |_)  | |  |        /  ^  \       |   (----`
    ## |  | |_ | |   ___/  |  |       /  /_\  \       \   \    
    ## |  |__| | |  |      |  `----. /  _____  \  .----)   |   
    ##  \______| | _|      |_______|/__/     \__\ |_______/    
421
    ## Welcome to the user guide of gplas (version 0.5.0):
422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444
    ## 
    ## Basic usage example: ./gplas.sh -i mygraph.gfa -c mlplasmids -s 'Enterococcus faecium'
    ## 
    ## Input:
    ##       -i      Mandatory: Path to the graph file in *.gfa format used to extract nodes and links. Gfa file format
    ## 
    ## Classifier:
    ##       -c      Mandatory: Classifier used to predict the contigs extracted from the input graph. String value: 'plasflow' or 'mlplasmids'
    ## 
    ## Bacterial species: 
    ##       -s      Mandatory (if mlplasmids is chosen): Bacterial species from the graph file. If you have specified mlplasmids as classifier
    ##                  you need to provide one of the following three bacterial species:
    ## 
    ##                 'Enterococcus faecium','Klebsiella pneumoniae' or 'Escherichia coli'
    ## 
    ## Output name:
    ##       -n      Optional: Output name used in the files generated by gplas. Default: 'unnamed'
    ## 
    ## Settings:
    ##       -t      Optional: Threshold to predict plasmid-derived sequences. Integer value ranging from 0 to 1.
    ##                  Default mlplasmids threshold: 0.5
    ##                  Default plasflow threshold: 0.7
    ## 
445
    ##   -x      Optional: Number of times gplas finds plasmid walks per each plasmid starting node. Integer value ranging from 1 to infinite.
446 447 448 449 450
    ##                  Default: 20
    ## 
    ##   -f      Optional: Gplas filtering threshold score to reject possible outcoming edges. Integer value ranging from 0 to 1.
    ##                  Default: 0.1
    ## 
451 452 453
    ##   -q      Optional: Modularity threshold to split components present in the plasmidome network. Integer value ranging from 0 to 1
    ##                  Default: 0.2
    ## 
454
    ## Benchmarking purposes: 
455 456 457
    ##       -r      Optional: Path to the complete reference genome corresponding to the graph given. For optimal results using this
    ##                  benchmarking flag, please name the reference genomes using the Unicycler scheme: e.g. '1 length=4123456' '2 length=10000' '3 length=2000'
    ##                  for your chromosome and plasmids. Fasta file format
458

459 460 461
# Issues/Bugs

You can report any issues or bugs that you find while installing/running
462 463
gplas using the [issue
tracker](https://gitlab.com/sirarredondo/gplas/issues)