Commit 07ab04d3 authored by Jamie A. Jennings

Revising examples for README.md

parent effe4171
Pipeline #51143465 passed with stage
in 1 minute and 11 seconds
......@@ -6,10 +6,9 @@
RPL is a variant of modern Regular Expressions (regex) that is designed
to scale to big data, many developers, and large collections of patterns.
Rosie/RPL provides a CLI (like Unix `grep`, but better) and a library,
`librosie`, that can be used from programs written in Python, Go, Haskell, C,
and other languages. (See [Using Rosie from programs](Using Rosie from
programs) below.)
The Rosie project provides a CLI (like Unix `grep`) and a library, `librosie`,
that can be used within programs written in Python, Go, Haskell, C, and other
languages. (See [Using Rosie from programs](#using-rosie-from-programs) below.)
Using RPL is like being able to name regex and reuse them to build larger expressions:
......@@ -57,14 +56,10 @@ pattern, `net.url_strict` matches a URL according to
In that last screenshot, we used the output option `-o subs` to tell Rosie to
print only the sub-matches (similar to `grep -o`).
## Why use RPL instead of regex?
* Looks like a programming language, and plays well with development tools (like `diff`)
* Comes with a library of dozens of useful patterns (timestamps, network addresses, and more)
* Pattern development tools: tracing, REPL, color-coded match output
* Produces JSON output (and other formats)
* Implements Parsing Expression Grammars (PEG), which are more powerful than
regular expressions
Below is a gif showing the `rosie match` command with the pattern `all.things`
(a choice among times and dates (in dozens of formats), network patterns,
identifiers, words, and numbers). The default output format for `rosie match` is
`color`, and there is a customizable mapping from pattern types to colors.
<blockquote>
<table>
......@@ -78,18 +73,51 @@ identifier; Yellow: word
</table>
</blockquote>
## Why RPL?
* More *readable* and *maintainable* than regex
* RPL looks like a programming language, with whitespace, comments, identifiers
* Built-in unit test framework (useful as regression tests when modifying patterns)
* Plays well with development tools (like `diff`)
* Patterns can be (optionally) put into namespaces, e.g. `net` for network patterns
* Better *development experience* than regex
* Rosie ships with a library of dozens of useful patterns (timestamps, network addresses, and more)
* Want to see how a match succeeds or fails? Use the `rosie trace` command.
* Pattern development tools: tracing, REPL, color-coded match output
* The `rosie test` command compiles patterns and runs their unit tests. Use
this command during your project's build to avoid run-time errors.
* Match output can be in various formats including:
* Color, for human-readable use at the command line
* Plain text (full text `-o data` or list of sub-matches `-o subs`) for scripting
* JSON for post-processing by other programs
* Native data structures in Python, Haskell, Go, etc.
* RPL is a Parsing Expression Grammar (PEG) language
* A superset of regular expressions, i.e. more powerful
* This formalism is a coherent and even elegant alternative to the ad hoc
extensions to regular expressions found in most "regex" libraries.
* Supports linear run-time (in the input size) for common use cases
## NEWS
* v1.1.0 released (March 2019)
* Rosie has moved from github to gitlab!
* Rosie has moved from
[GitHub](https://github.com/jamiejennings/rosie-pattern-language) to
[GitLab](https://gitlab.com/rosie-pattern-language/rosie)
* [Brew installer for Mac OS X](https://gitlab.com/rosie-community/packages/homebrew-rosie)
* Pip installer for Python interface to librosie -- `pip install rosie`
## Features
- Small: the Rosie compiler/runtime/libraries take up less than 600KB of disk
- Good performance: faster than
[Grok](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html),
slower than [grep](https://en.wikipedia.org/wiki/Grep), does more than both of them
- Reasonably small:
* The Rosie compiler/runtime/libraries take up less than 2MB on disk
<!-- du -ch /usr/local/lib/librosie.a /usr/local/bin/rosie /usr/local/lib/rosie/lib /usr/local/lib/rosie/rpl -->
* Memory usage by the CLI is currently excessive, but this [will improve](#memory-consumption) soon
- Good performance:
* faster than [Grok](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html)
(by a factor of 4 or more)
* slower than [grep](https://en.wikipedia.org/wiki/Grep)
* and Rosie does more than both of them
- Extensible pattern library
- Rosie is fluent in Unicode (UTF-8), ASCII, and the
[binary language of moisture vaporators](http://www.starwars.com/databank/moisture-vaporator)
......@@ -97,7 +125,55 @@ identifier; Yellow: word
### Using Rosie from programs
**TO BE WRITTEN**
Starting with version 1.1.0, the interface libraries for `librosie` have their
own repositories in the
[clients group](https://gitlab.com/rosie-community/clients) of the Rosie
Community group on GitLab. Roughly speaking, they support using Rosie in a way
similar to regex: you compile an expression and then use it for matching.
This example counts source lines of code in files, given the comment start
string on the command line, e.g. `python sloc.py "#" *.py`.
```python
# sloc.py
import sys, rosie
comment_start = sys.argv[1]
engine = rosie.engine()
source_line = engine.compile('!{[:space:]* "' + comment_start + '"/$}')
def count(f):
    count = 0
    for line in f:
        if line and source_line.match(line): count += 1
    return count
for f in sys.argv[2:]:
    label = f + ": " if f else ""
    print(label.ljust(36) + str(count(open(f, 'r'))).rjust(4) + " non-comment, non-blank lines")
```
The pattern `source_line` in the code above is defined as "not matching
optional-whitespace followed by either a comment or the end-of-line" -- in other
words, source lines.
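To make the logic of that pattern concrete without a Rosie installation, here is a plain-Python sketch of the same predicate. It is only an approximation of what the RPL expression does, it does not use Rosie at all, and `is_source_line` is a hypothetical helper name introduced for illustration.

```python
def is_source_line(line, comment_start):
    """Return True for lines that are neither blank nor comment-only.

    Stand-in for the RPL pattern: skip optional leading whitespace, then
    reject the line if what follows is the comment start or the end of line.
    """
    stripped = line.lstrip()  # plays the role of [:space:]*
    if stripped == "":
        return False  # nothing but whitespace before the end of the line
    return not stripped.startswith(comment_start)


if __name__ == "__main__":
    sample = ["x = 1\n", "   # a comment\n", "\n", "y = 2  # trailing comment\n"]
    print(sum(is_source_line(line, "#") for line in sample))
```

A line with code followed by a trailing comment still counts as a source line, because only the text right after the leading whitespace is examined.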
Perhaps a more common usage of Rosie would be to leverage libraries of RPL
patterns, such as the ones that ship with Rosie. The Python code below searches
files for IPv4 addresses, and is invoked as, e.g. `python ipv4.py test/resolv.conf`.
```python
# ipv4.py
import sys, rosie
engine = rosie.engine()
engine.import_package('net')
ipv4 = engine.compile('net.ipv4')
for f in sys.argv[1:]:
    matches = ipv4.search(open(f, 'r').read())
    for m in matches:
        print(m)
```
### The Rosie CLI
......@@ -114,6 +190,52 @@ identifier; Yellow: word
### Speed
A measurement we use in regression testing for performance is how long it takes
to process 1 million syslog entries using the pattern `syslog` as defined here:
```
import date, time, net, word, num
~ = [:space:]+
message = .*
datetime = { date.rfc3339 "T" time.rfc3339 }
syslog = datetime net.ipv4 {word.any "["num.int"]:"} message
```
The current release, v1.1.0, performs as follows:
Number of lines | Output format | Seconds (user) | Lines/sec | Machine
--------------- | ------------- | -------------- | ----------------- | -------
1,000,000 | binary | 5.8 | 172,000 (approx.) | MacBook Pro 2.9 GHz Intel Core i5
1,000,000 | JSON | 11.7 | 85,500 (approx.) | MacBook Pro 2.9 GHz Intel Core i5
Notes:
- The binary output format encodes the pattern type and start/end positions of
the match.
- The JSON output format includes the pattern type, start/end positions, and a
copy of the data (input) that matched.
### Memory consumption
The current release, v1.1.0, consumes memory egregiously (in the range 64MB --
250MB). Our roadmap includes reducing this number considerably. A proof of
concept is the program
[match.c](https://gitlab.com/rosie-pattern-language/rosie/blob/dev/src/rpeg/test/match.c),
which requires less than 1MB. This small command-line program is linked only
with the Rosie run-time. It needs an RPL pattern that has been compiled in
advance, which is a prototype feature that has not yet been released.
This proof of concept needs around 9.0 seconds (user time) to execute the syslog
test mentioned above, generating JSON output. This rate, about 110k lines/sec,
suggests that the current Rosie CLI has significant unnecessary overhead.
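The lines/sec figures quoted in this section follow directly from dividing the 1,000,000 input lines by the user time. A minimal sketch of that arithmetic, using only the timings stated above (the labels and the `lines_per_second` helper are illustrative names, not part of Rosie):

```python
# Throughput for the 1,000,000-line syslog test, from the quoted user times.
LINES = 1_000_000

timings = {
    "CLI, binary output": 5.8,
    "CLI, JSON output": 11.7,
    "match.c prototype, JSON output": 9.0,
}


def lines_per_second(seconds, lines=LINES):
    """Lines processed per second of user time."""
    return lines / seconds


for label, seconds in timings.items():
    print(f"{label}: {lines_per_second(seconds):,.0f} lines/sec")
```

For JSON output, the ratio 11.7 / 9.0 puts the CLI at roughly 1.3 times the user time of the prototype.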
We hypothesize that much of the memory and speed overhead of the current CLI
derives from the file processing loop, which is written in Lua, a language that
interns strings. The `match.c` prototype does the same work, but is written in pure C.
## Installing
- [x] Build from source (see [local installation](#local-installation) below)
......
# Find all the ipv4 addresses in files given on the command line
#
# EXAMPLE: python ipv4.py ../../test/resolv.conf
import sys, rosie
engine = rosie.engine()
engine.import_package('net')
ipv4 = engine.compile('net.ipv4')
for f in sys.argv[1:]:
    matches = ipv4.search(open(f, 'r').read())
    for m in matches:
        print(m)
# -*- coding: utf-8; -*-
# -*- Mode: Python; -*-
#
# generic_sloc.py
# sloc.py
#
# © Copyright Jamie A. Jennings 2018.
# © Copyright Jamie A. Jennings 2018, 2019.
# LICENSE: MIT License (https://opensource.org/licenses/mit-license.html)
# AUTHOR: Jamie A. Jennings
#
# DEPENDENCIES:
# - A Rosie installation in /usr/local or other standard place
# - pip install rosie
#
# USAGE:
# python sloc.py "--" ../../src/lua/*.lua
from __future__ import unicode_literals, print_function
import os, sys
import rosie
# pip install rosie
# python generic_sloc.py "--" ../../src/lua/*.lua
def print_usage():
    print("Usage: " + sys.argv[0] + " <comment_start> [files ...]")
import sys
if len(sys.argv) < 2:
    print("Usage: " + sys.argv[0] + " <comment_start> [files ...]")
    print_usage();
    sys.exit(-1)
comment_start = sys.argv[1]
import rosie
engine = rosie.engine()
source_line, errs = engine.compile(bytes('!{[:space:]* "' + comment_start + '"/$}'))
if errs:
    print(str(errs))
if os.path.isfile(comment_start) or len(comment_start) > 2:
    print('Error: first argument should be the comment character, e.g. "#" or "--"')
    print_usage()
    sys.exit(-1)
def is_source(line):
    if not line: return False
    match, leftover, abend, t0, t1 = engine.match(source_line, bytes(line), 1, b"bool")
    return match and True or False
# The pattern source_line is defined below as "not looking at
# optional-whitespace followed by either the start of a comment or the
# end of the line". So it will match non-comment, non-blank lines.
engine = rosie.engine()
source_line = engine.compile('!{[:space:]* "' + comment_start + '"/$}')
def count(f):
    count = 0
    for line in f:
        if is_source(line): count += 1
        if line and source_line.match(line): count += 1
    return count
description = " non-comment, non-blank lines"
if len(sys.argv) == 2:
    print(str(count(sys.stdin)) + description)
else:
    for f in sys.argv[2:]:
        label = f + ": " if f else ""
        print(label + str(count(open(f, 'r'))).rjust(4) + description)
for f in sys.argv[2:]:
    label = f + ": " if f else ""
    print(label.ljust(36) + str(count(open(f, 'r'))).rjust(4) + " non-comment, non-blank lines")
......