Commit fb9a4895 authored by Dmitry Mozzherin

Close #94, close #96 new noparse names

Do not parse names that have "bacterium" as "epithet".
Do not parse names that start with "Candidatus".
parent 8ffcd3e8
Pipeline #229135807 passed with stages in 3 minutes and 21 seconds
......@@ -2,7 +2,11 @@
## Unreleased
## [v.0.14.2]
- Add [#96]: Do not parse names starting with "Candidatus".
- Add [#95]: Remove make dependency on gRPC tooling.
- Add [#94]: Do not parse names with "bacterium" as "epithet".
- Add [#90]: Allow `ß` in names.
- Add [#89]: Support `subspec.` as a rank.
- Add [#82]: Support authors with prefix `zu`.
......@@ -136,6 +140,7 @@ array of names instead of a stream.
This document follows [changelog guidelines]
[v0.14.3]: https://gitlab.com/gogna/gnparser/compare/v0.14.2...v0.14.3
[v0.14.2]: https://gitlab.com/gogna/gnparser/compare/v0.14.1...v0.14.2
[v0.14.1]: https://gitlab.com/gogna/gnparser/compare/v0.14.0...v0.14.1
[v0.14.0]: https://gitlab.com/gogna/gnparser/compare/v0.13.1...v0.14.0
......@@ -155,6 +160,14 @@ This document follows [changelog guidelines]
[v0.6.0]: https://gitlab.com/gogna/gnparser/compare/v0.5.1...v0.6.0
[v0.5.1]: https://gitlab.com/gogna/gnparser/tree/v0.5.1
[#100]: https://gitlab.com/gogna/gnparser/issues/100
[#99]: https://gitlab.com/gogna/gnparser/issues/99
[#98]: https://gitlab.com/gogna/gnparser/issues/98
[#97]: https://gitlab.com/gogna/gnparser/issues/97
[#96]: https://gitlab.com/gogna/gnparser/issues/96
[#95]: https://gitlab.com/gogna/gnparser/issues/95
[#94]: https://gitlab.com/gogna/gnparser/issues/94
[#93]: https://gitlab.com/gogna/gnparser/issues/93
[#92]: https://gitlab.com/gogna/gnparser/issues/92
[#91]: https://gitlab.com/gogna/gnparser/issues/91
[#90]: https://gitlab.com/gogna/gnparser/issues/90
......
......@@ -71,7 +71,6 @@ gnparser -h
<!-- vim-markdown-toc -->
## Introduction
Global Names Parser or ``gnparser`` is a program written in Go for breaking up
......@@ -112,19 +111,19 @@ more efficient JSON conversion.
## Features
- Fastest parser ever.
- Very easy to install, just placing executable somewhere in the PATH is
* Fastest parser ever.
* Very easy to install, just placing executable somewhere in the PATH is
sufficient.
- Extracts all elements from a name, not only canonical forms.
- Works with very complex scientific names, including hybrid formulas.
- Includes gRPC server that can be used as if a native method call from C++,
C#, Java, Python, Ruby, PHP, JavaScript, Objective C, Dart.
- Use as a native library from Go projects.
- Can run as a command line application.
- Can be scaled to many CPUs and computers (if 300 million names an
hour is not enough).
- Calculates a stable UUID version 5 ID from the content of a string.
- Provides C-binding to incorporate parser into other languages.
* Extracts all elements from a name, not only canonical forms.
* Works with very complex scientific names, including hybrid formulas.
* Includes gRPC server that can be used as if a native method call from C++,
* C#, Java, Python, Ruby, PHP, JavaScript, Objective C, Dart.
* Use as a native library from Go projects.
* Can run as a command line application.
* Can be scaled to many CPUs and computers (if 300 million names an
hour is not enough).
* Calculates a stable UUID version 5 ID from the content of a string.
* Provides C-binding to incorporate parser into other languages.
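
The stable-ID feature in the list above can be sketched with the standard library alone. This is a minimal illustration of the UUID version 5 algorithm (SHA-1 over a namespace plus the name), not gnparser's own code. The helper `uuidV5` is hypothetical, and the DNS namespace from RFC 4122 is an assumption used as a stand-in, because the namespace gnparser uses is not shown here.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"strings"
)

// nameSpaceDNS is the DNS namespace UUID from RFC 4122. It is a stand-in:
// the namespace that gnparser itself uses is an unknown here.
const nameSpaceDNS = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

// uuidV5 is a hypothetical helper that derives a version 5 UUID from a
// namespace UUID and a name-string.
func uuidV5(namespace, name string) (string, error) {
	ns, err := hex.DecodeString(strings.ReplaceAll(namespace, "-", ""))
	if err != nil {
		return "", err
	}
	if len(ns) != 16 {
		return "", fmt.Errorf("namespace must be 16 bytes, got %d", len(ns))
	}
	// Hash the namespace bytes followed by the name bytes.
	h := sha1.New()
	h.Write(ns)
	h.Write([]byte(name))
	u := h.Sum(nil)[:16]
	u[6] = (u[6] & 0x0f) | 0x50 // set version 5
	u[8] = (u[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		u[0:4], u[4:6], u[6:8], u[8:10], u[10:16]), nil
}

func main() {
	id, err := uuidV5(nameSpaceDNS, "Homo sapiens")
	if err != nil {
		fmt.Println(err)
		return
	}
	// The same input always produces the same ID.
	fmt.Println(id)
}
```

Because the ID depends only on the namespace and the bytes of the string, two systems that agree on both would produce the same identifier without coordinating.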
## Use Cases
......@@ -213,12 +212,12 @@ If there are problems with parsing a name, parser generates ``qualityWarnings``
messages and lowers parsing ``quality`` of the name. Quality values mean the
following:
- ``"quality": 1`` - No problems were detected
- ``"quality": 2`` - There were small problems, normalized result
* ``"quality": 1`` - No problems were detected
* ``"quality": 2`` - There were small problems, normalized result
should still be good
- ``"quality": 3`` - There were serious problems with the name, and the
* ``"quality": 3`` - There were serious problems with the name, and the
final result is rather doubtful
- ``"quality": 0`` - A string could not be recognized as a scientific
* ``"quality": 0`` - A string could not be recognized as a scientific
name and parsing fails
### Creating stable GUIDs for name-strings
......@@ -435,8 +434,8 @@ to parser. API calls would be accessible on ``http://0.0.0.0:9000/api``.
Make sure to CGI-escape name-strings for GET requests. An '&' character
needs to be converted to '%26'
- ``GET /api?q=Aus+bus|Aus+bus+D.+%26+M.,+1870``
- ``POST /api`` with request body of JSON array of strings
* ``GET /api?q=Aus+bus|Aus+bus+D.+%26+M.,+1870``
* ``POST /api`` with request body of JSON array of strings
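
The escaping rule for GET requests can be sketched in Go with `net/url`. The helper `buildQuery` is hypothetical. It escapes each name-string separately and joins the results with `|`, so the separator stays literal. Note that `url.QueryEscape` also encodes the comma as `%2C`, which is a stricter encoding of the same request shown above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildQuery is a hypothetical helper that escapes every name-string and
// joins them with "|" into the q parameter of a GET request.
func buildQuery(base string, names []string) string {
	escaped := make([]string, len(names))
	for i, n := range names {
		// QueryEscape turns spaces into "+" and "&" into "%26".
		escaped[i] = url.QueryEscape(n)
	}
	return base + "?q=" + strings.Join(escaped, "|")
}

func main() {
	names := []string{"Aus bus", "Aus bus D. & M., 1870"}
	fmt.Println(buildQuery("http://0.0.0.0:9000/api", names))
}
```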
```ruby
require 'json'
......@@ -478,22 +477,22 @@ docker run gnames/gognparser "Amaurorhinus bewichianus (Wollaston,1860) (s.str.)
package main

import (
	"fmt"

	"gitlab.com/gogna/gnparser"
)

func main() {
	opts := []gnparser.Option{
		gnparser.Format("csv"),
		gnparser.WorkersNum(100),
	}
	gnp := gnparser.NewGNparser(opts...)
	res, err := gnp.ParseAndFormat("Bubo bubo")
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println(res)
}
```
......@@ -507,9 +506,9 @@ o := gnp.ParseToObject("Homo sapiens")
fmt.Println(o.Canonical.Simple)
switch d := o.Details.(type) {
case *pb.Parsed_Species:
	fmt.Println(d.Species.Genus)
case *pb.Parsed_Uninomial:
	fmt.Println(d.Uninomial.Value)
...
}
```
......@@ -554,13 +553,12 @@ provide a warning "Possible ICN author instead of subgenus".
## Authors
- [Dmitry Mozzherin]
* [Dmitry Mozzherin]
## Contributors
- [Geoff Ower]
- [Hernan Lucas Pereira]
* [Geoff Ower]
* [Hernan Lucas Pereira]
If you want to submit a bug or add a feature read
[CONTRIBUTING] file.
......@@ -568,10 +566,11 @@ If you want to submit a bug or add a feature read
## References
Rees, T. (compiler) (2019). The Interim Register of Marine and Nonmarine
Genera. Available from http://www.irmng.org at VLIZ.
Genera. Available from `http://www.irmng.org` at VLIZ.
Accessed 2019-04-10
## License
Released under [MIT license]
[releases]: https://gitlab.com/gogna/gnparser/-/releases
......
......@@ -21,7 +21,7 @@ func NoParse(data []byte) bool {
 noparse1 = ("Not" | "None" | "Un" ("n"? "amed" | "identified"));
 noparse2 = any* [Ii] "nc" ("." | "ertae") space* [Ss] "ed" ("." | "is");
-noparse3 = any* ("phytoplasma" | "plasmid" "s"? | [^A-Z] "RNA" [^A-Z]*);
+noparse3 = any* ("phytoplasma" | space "bacterium"| "plasmid" "s"? | [^A-Z] "RNA" [^A-Z]*);
 main := (noparse1 | noparse2 | noparse3) %/setMatch
......
......@@ -63,6 +63,9 @@ func Preprocess(bs []byte) *Preprocessor {
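		pr.NoParse = true
		return pr
	}
	// Do not parse names that start with "Candidatus" (#96).
	// strings.HasPrefix takes the string first and the prefix second.
	if name == "Candidatus" || strings.HasPrefix(name, "Candidatus ") {
		pr.NoParse = true
		return pr
	}
	pr.NoParse = NoParse(bs[0:i])
	if pr.NoParse {
		return pr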
......
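
The two rules from this commit can be illustrated outside the parser with plain string checks. This is a rough sketch, not the Ragel machine or the `Preprocess` function: the helper `skipParsing` is hypothetical, and a substring test only approximates the `space "bacterium"` pattern (for example, it would also match a longer word that begins with "bacterium").

```go
package main

import (
	"fmt"
	"strings"
)

// skipParsing is a hypothetical helper, not part of gnparser. It reports
// whether a name-string falls under either of the two new no-parse rules.
func skipParsing(name string) bool {
	// #96: the bare word "Candidatus" or any name that starts with it.
	if name == "Candidatus" || strings.HasPrefix(name, "Candidatus ") {
		return true
	}
	// #94: "bacterium" in the epithet position, that is, preceded by a space.
	// "Acidobacterium" does not match because no space precedes "bacterium".
	return strings.Contains(name, " bacterium")
}

func main() {
	names := []string{
		"Acidobacteria bacterium",
		"Acidimicrobiales bacterium JGI 01_E13",
		"Acidobacterium ailaaui Myers & King, 2016",
		"Candidatus Amesbacteria bacterium GW2011_GWC1_46_24",
		"Candidatus",
	}
	for _, n := range names {
		fmt.Printf("%t\t%s\n", skipParsing(n), n)
	}
}
```

The sample names are taken from the test data added in this commit. Only the genus name "Acidobacterium" is expected to remain parseable.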
......@@ -3065,6 +3065,33 @@ noparse
3bf556bb-ea7c-536e-8b62-93ba329c559d,Uropodoideaincertaesedis,0,,,,,,0
#>
#SECTION: No parsing -- bacterium, Candidatus<
Acidobacteria bacterium
noparse
{"parsed":false,"quality":0,"verbatim":"Acidobacteria bacterium","cardinality":0,"surrogate":false,"virus":false,"hybrid":false,"bacteria":false,"nameStringId":"c982b4fd-c41a-5987-bcc8-989c4164b9ec","parserVersion":"test_version"}
c982b4fd-c41a-5987-bcc8-989c4164b9ec,Acidobacteria bacterium,0,,,,,,0
Acidimicrobiales bacterium JGI 01_E13
noparse
{"parsed":false,"quality":0,"verbatim":"Acidimicrobiales bacterium JGI 01_E13","cardinality":0,"surrogate":false,"virus":false,"hybrid":false,"bacteria":false,"nameStringId":"8b71a29b-4271-5a83-8a92-5dab1d9dc4c3","parserVersion":"test_version"}
8b71a29b-4271-5a83-8a92-5dab1d9dc4c3,Acidimicrobiales bacterium JGI 01_E13,0,,,,,,0
Acidobacterium ailaaui Myers & King, 2016
Acidobacterium ailaaui Myers & King, 2016
{"parsed":true,"quality":1,"verbatim":"Acidobacterium ailaaui Myers \u0026 King, 2016","normalized":"Acidobacterium ailaaui Myers \u0026 King 2016","cardinality":2,"canonicalName":{"full":"Acidobacterium ailaaui","simple":"Acidobacterium ailaaui","stem":"Acidobacterium ailaau"},"authorship":"Myers \u0026 King 2016","details":[{"genus":{"value":"Acidobacterium"},"specificEpithet":{"value":"ailaaui","authorship":{"value":"Myers \u0026 King 2016","basionymAuthorship":{"authors":["Myers","King"],"year":{"value":"2016"}}}}}],"positions":[["genus",0,14],["specificEpithet",15,22],["authorWord",23,28],["authorWord",31,35],["year",37,41]],"surrogate":false,"virus":false,"hybrid":false,"bacteria":true,"nameStringId":"b9f4555f-d2e0-5d40-acde-2b546a28a7fc","parserVersion":"test_version"}
b9f4555f-d2e0-5d40-acde-2b546a28a7fc,"Acidobacterium ailaaui Myers & King, 2016",2,Acidobacterium ailaaui,Acidobacterium ailaaui,Acidobacterium ailaau,Myers & King 2016,2016,1
Candidatus Amesbacteria bacterium GW2011_GWC1_46_24
noparse
{"parsed":false,"quality":0,"verbatim":"Candidatus Amesbacteria bacterium GW2011_GWC1_46_24","cardinality":0,"surrogate":false,"virus":false,"hybrid":false,"bacteria":false,"nameStringId":"83382178-94bf-5bf3-a8c8-fdbca4af927c","parserVersion":"test_version"}
83382178-94bf-5bf3-a8c8-fdbca4af927c,Candidatus Amesbacteria bacterium GW2011_GWC1_46_24,0,,,,,,0
Candidatus
noparse
{"parsed":false,"quality":0,"verbatim":"Candidatus","cardinality":0,"surrogate":false,"virus":false,"hybrid":false,"bacteria":false,"nameStringId":"fb9138ac-ae7a-58c9-a912-d31d0a4eeed3","parserVersion":"test_version"}
fb9138ac-ae7a-58c9-a912-d31d0a4eeed3,Candidatus,0,,,,,,0
#>
#SECTION: No parsing -- 'Not', 'None', 'Unidentified' phrases<
None recorded
......