
# Documentation for librosie

## Overview

* pthreads
* convention about return value indicating an API error
* define engine
* input data is sequence of bytes
* output encoder defn, produces linearized output as a sequence of bytes (except bool)

## Guide to librosie.h

### Types

Librosie uses fixed-width integer types (e.g. `uint32_t`) in all key data
structures.  This ensures that we can do the following in platform-independent
ways: 

- publish accurate limits on things like input size and number of capture names
  in a pattern
- read and write data to disk in a single format

An `Engine` struct holds a pointer to the (opaque) engine state, and a lock to
restrict access to one thread at a time.  In a multi-threaded program, an Engine
should be created for each thread that will use librosie.  (The state of an
Engine is reasonably small.  Example programs in C and Go have spawned 1,000+
threads, each with their own Engine.)

	typedef struct rosie_engine {
		 void *L;
		 pthread_mutex_t lock;
	} Engine;

Rosie strings have a length and pointer to data.  They are not null terminated,
and may contain nulls.  Input data from the caller must be passed to librosie in
this form.

Librosie does not modify the input data, making it possible to pass to librosie
a "native" pointer to the data if the client language provides one.  For
example, the Python `cffi` binding to [libffi](https://sourceware.org/libffi/)
lets the Rosie Python client pass a pointer to the input data, which is a Python
`bytes` object. This avoids copying of the input data, saving time and memory.
Of course, the input data must be a contiguous sequence of bytes.

	typedef struct rosie_string {
		 uint32_t len;
		 byte_ptr ptr;
	} str;
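
To make the no-copy idea concrete, here is a minimal sketch of wrapping an existing buffer in a Rosie string.  The typedefs are local stand-ins so the fragment compiles on its own (a real client would include **librosie.h**; the definition of `byte_ptr` as `uint8_t *` is an assumption), and `str_view` is a hypothetical helper, not part of librosie.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for declarations that normally come from librosie.h.
   Assumption: byte_ptr is a pointer to unsigned 8-bit bytes. */
typedef uint8_t *byte_ptr;
typedef struct rosie_string {
    uint32_t len;
    byte_ptr ptr;
} str;

/* Hypothetical helper (not a librosie function): describe an existing
   buffer as a Rosie string without copying it.  Returns 0 on success,
   or -1 if the buffer is too long for the 32-bit length field. */
static int str_view(const void *buf, size_t len, str *out) {
    if (len > UINT32_MAX) return -1;
    out->len = (uint32_t) len;
    /* The cast drops const; the text above says librosie does not
       modify input data, so the view is only ever read. */
    out->ptr = (byte_ptr) buf;
    return 0;
}
```

Because the length is carried separately, the buffer may contain null bytes and needs no terminator.  The buffer must stay alive and unchanged for as long as the view is in use.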

The `rosie_match` API returns a structure describing a match result.  The fields
are:
- `data`, a string encoding of the results (**important:** see
  [Interpreting match results](#interpreting-match-results))
- `leftover`, an integer number of bytes left unmatched (when the match
  succeeded)
- `abend`, 0 when the match ended normally, 1 when it ended abnormally by encountering an
  RPL `error` pattern
- `ttotal`, an integer number of microseconds spent in the call, 
  subject to the platform's clock resolution (see **clock()** in **time.h**)
- `tmatch`, an integer number of microseconds spent actually doing the matching
  (whereas `ttotal` includes time spent encoding the results to produce `data`)

	typedef struct rosie_matchresult {
		 str data;
		 int leftover;
		 int abend;
		 int ttotal;
		 int tmatch;
	} match;

### Tuning parameters

#### INITIAL_RPLX_SLOTS 32

Compiled patterns are assigned a positive integer handle, which is returned to
the client.  This many slots are allocated when an engine is created.  More
are allocated on demand.

#### MIN_ALLOC_LIMIT_MB 8192

See [**rosie_alloc_limit**](#rosie_alloc_limit).  Do not lower this value.
Increasing it simply raises the minimum allocation limit that can be set through
**rosie_alloc_limit**.
 
#### MAX_ENCODER_NAME_LENGTH 64

Each of Rosie's output encoders has a name, e.g. `color`, `byte`.  The encoders
implemented in Lua are declared in **init.lua**, and those implemented in C are
named in `encoder_table` in **common.lua**, which maps from names to the numbers
used by the C code.

The `MAX_ENCODER_NAME_LENGTH` must be at least 1 more than the length of the
longest of these encoder names. Do not change this value.

### Uniform status codes returned

Each librosie API returns a status code:

- `SUCCESS` will always be defined as 0
- `ERR_OUT_OF_MEMORY`, when an allocation request fails
- `ERR_SYSCALL_FAILED`, when a system call fails
- `ERR_ENGINE_CALL_FAILED`, when a Rosie API call fails

The status `ERR_ENGINE_CALL_FAILED` indicates a bug in librosie.  That is, it is
safe to print a message suggesting that this be reported as an issue when
encountering this return value.

### Interpreting match results

In a `rosie_matchresult`, the `data` field is a `rosie_string` containing a
length and a pointer.  When the pointer is non-null, it points to a string (byte
sequence) with the given length (**not** null terminated).  This is the data
returned by the output encoder.

But when the `data` pointer is null, the `data` length field indicates the
actual result:

- `NO_MATCH` will always be defined as 0
- `MATCH_WITHOUT_DATA` will always be defined as 1, and is returned when the
output encoder produced no output data
- `ERR_NO_ENCODER`, when the output encoder or trace style passed to librosie is
invalid 
- `ERR_NO_PATTERN`, when the pattern handle passed to librosie is invalid
- `ERR_NO_FILE`, when a filename passed to librosie cannot be found (`rosie_matchfile` only)
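
The rules above can be sketched as a small classification function.  The typedefs are local stand-ins for those in **librosie.h**, and `result_kind` and `classify_result` are hypothetical names introduced only for this illustration.  Only the two values that this document fixes (`NO_MATCH` as 0 and `MATCH_WITHOUT_DATA` as 1) are spelled out; every other length paired with a null pointer is treated as one of the error codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for declarations that normally come from librosie.h. */
typedef uint8_t *byte_ptr;
typedef struct rosie_string { uint32_t len; byte_ptr ptr; } str;
typedef struct rosie_matchresult {
    str data;
    int leftover;
    int abend;
    int ttotal;
    int tmatch;
} match;

/* These two values are fixed by the documentation. */
#define NO_MATCH 0
#define MATCH_WITHOUT_DATA 1

/* Hypothetical result categories, for illustration only. */
typedef enum {
    RESULT_NO_MATCH,
    RESULT_MATCH_NO_DATA,
    RESULT_MATCH_WITH_DATA,
    RESULT_ERROR
} result_kind;

/* Decide what a match result means by looking at data.ptr first and,
   only when it is null, at data.len. */
static result_kind classify_result(const match *m) {
    if (m->data.ptr != NULL) return RESULT_MATCH_WITH_DATA;
    switch (m->data.len) {
    case NO_MATCH:           return RESULT_NO_MATCH;
    case MATCH_WITHOUT_DATA: return RESULT_MATCH_NO_DATA;
    default:                 return RESULT_ERROR; /* an ERR_* code */
    }
}
```

When `RESULT_ERROR` is returned, the client can compare `data.len` against the `ERR_*` constants from **librosie.h** to report which error occurred.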

## API

### Engine management

**Engine \*rosie_new(str \*messages)**

**void rosie_finalize(Engine \*e)**

**int rosie_libpath(Engine \*e, str \*newpath)**

**int rosie_alloc_limit(Engine \*e, int \*newlimit, int \*usage)**

The front-end of the RPL compiler, the CLI, and some of the output encoders
(such as `color` and `jsonpp`) are written in Lua, a language that has garbage
collection. The **rosie_alloc_limit** API allows the client program to set and
query a "soft limit" on the size of the Lua heap.

The functions **rosie_match**, **rosie_trace**, and **rosie_matchfile** check to
see if the Lua heap has grown beyond the current limit, and if so, invoke the
garbage collector.

When called with **newlimit** of 0, the limit is removed, and garbage collection
reverts to Lua's default settings.

When called with **newlimit** of -1, the call is a query.  On return,
**newlimit** will be set to the current limit, and **usage** to the current Lua
heap usage.

The units of **newlimit** and **usage** are Kb (1024 bytes).

### Loading RPL into an engine

Strings containing RPL blocks are processed by an engine using **rosie_load**.
A block may contain a single statement (e.g. `d=[:digit:]`) or many statements.
A block may also contain comments, an RPL language version declaration, a
package declaration, and import statements.

**int rosie_load(Engine \*e, int \*ok, str \*src, str \*pkgname, str \*messages)**

The string **src** is read, compiled, and the resulting bindings are stored in
the engine's environment.  If **ok** is non-zero on return, no errors occurred.
There may still be **messages** (e.g. warnings).

If **ok** is zero, an error occurred, and **messages** will contain a
JSON-encoded error structure.

_TODO: Document the JSON violation structure._

The client is responsible for freeing **messages** with **rosie_free_string_ptr**.

If **src** contained a package declaration, the package name will be returned in
**pkgname**. 

The client is responsible for freeing **pkgname** with **rosie_free_string_ptr**.


**int rosie_loadfile(Engine \*e, int \*ok, str \*fn, str \*pkgname, str \*messages)**

Same functionality as **rosie_load**, except **fn** is a filename and librosie
reads and processes the contents of that file.


**int rosie_import(Engine \*e, int \*ok, str \*pkgname, str \*as, str \*actual_pkgname, str \*messages)**

Calling **rosie_import** with package `<pkgname>` causes the same actions as
calling **rosie_load** with the string `import <pkgname>`, with one exception:
**rosie_import** will always find and load the RPL package `<pkgname>` in the
filesystem.  By contrast, when **rosie_load** encounters `import <pkgname>`, the
package may have already been loaded into the engine.

Including a (string) value for the **as** parameter behaves like `import
<pkgname> as <as>` with the same caveats.


### Compiling an RPL expression

An RPL expression must be compiled before it can be used to match (or trace)
with an input string.

**int rosie_compile(Engine \*e, str \*expression, int \*pat, str \*messages)**

The string **expression** is compiled into an _rplx_ object and an integer
handle to that object is returned.  The object will be available until
explicitly freed, or until the engine **e** is freed with **rosie_finalize**.

If **pat** is non-zero upon return, it is the _rplx handle_, which behaves
somewhat like a Unix file descriptor in that (1) it remains valid until
explicitly freed (with **rosie_free_rplx**) and (2) the same integer value may
be reused by the engine afterwards.

Regardless of error status, **messages** may contain errors, warnings, or other
information. 

The client is responsible for freeing **messages** with **rosie_free_string_ptr**.


**int rosie_free_rplx(Engine \*e, int pat)**

Call **rosie_free_rplx** to allow the engine to reclaim the compiled pattern **pat**.


### Matching and tracing

**int rosie_match(Engine \*e, int pat, int start, char \*encoder, str \*input, match \*match)**

Using engine **e** and its pattern **pat**, match the pattern against **input**
and produce match data (a string) using output encoder **encoder**.  Note that
**encoder** is a null-terminated C-style string.

The **match** argument is a pointer to a **rosie_matchresult** structure that is
_allocated by the client program,_ into which the match results will be written.
A single struct may be used across repeated calls to **rosie_match**, and indeed
this is recommended.

As noted in the earlier section on [librosie types](#types), a
**rosie_matchresult** contains one dynamically allocated object, its **data**
field.  The client program does not need to and _should not_ manage the storage
for **data** because librosie will automatically reuse it, making it larger as
needed (using **realloc**).

IMPORTANT: Because librosie reuses the match results **data** field (a string),
the client program must make a copy of that string, if necessary, before calling
**rosie_match** again.
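
A minimal sketch of making such a copy follows.  The typedefs are local stand-ins for those in **librosie.h**, and `copy_match_data` is a hypothetical helper, not a librosie function.  The copy is owned by the client and is released with the standard `free`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for declarations that normally come from librosie.h. */
typedef uint8_t *byte_ptr;
typedef struct rosie_string { uint32_t len; byte_ptr ptr; } str;
typedef struct rosie_matchresult {
    str data;
    int leftover;
    int abend;
    int ttotal;
    int tmatch;
} match;

/* Hypothetical helper: return a malloc'ed copy of m->data, or NULL if
   there is no data (null pointer) or the allocation fails.  The length
   of the copy is stored in *len_out when len_out is non-null.  A
   trailing '\0' is appended for convenience, but the data itself may
   contain null bytes, so the stored length is what callers should use. */
static char *copy_match_data(const match *m, size_t *len_out) {
    if (m->data.ptr == NULL) return NULL;
    size_t n = m->data.len;
    char *copy = malloc(n + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, m->data.ptr, n);
    copy[n] = '\0';
    if (len_out != NULL) *len_out = n;
    return copy;
}
```

A client that wants to keep the results of several matches would call this after each **rosie_match** and before the next one, then `free` each copy when finished with it.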


**int rosie_matchfile(Engine \*e, int pat, char \*encoder, int wholefileflag,
		    char \*infilename, char \*outfilename, char \*errfilename,
		    int \*cin, int \*cout, int \*cerr,
		    str \*err)**

This is a convenience function, useful if you are writing a new CLI.  With
the same meanings of **e**, **pat**, and **encoder** as above, this function
reads **infilename** line by line, unless **wholefileflag** is non-zero, in
which case the entire file is read at once.  Match output, produced by
**encoder**, is written to **outfilename**, and input lines that did not match
are written to **errfilename**.

An empty string passed in for a filename argument defaults to the standard
input, output, and error channels, respectively.  To ignore one of the outputs,
set its filename to "/dev/null" or the equivalent on your platform.

When the value returned in **cin** is 0 or more, **rosie_matchfile** executed
successfully.  In that case, **cin**, **cout**, and **cerr** contain the number
of lines read from the input, written to **outfilename**, and written to
**errfilename**, respectively.

If the value returned in **cin** is -1, then **cout** will contain an error
code such as `ERR_NO_FILE` and **err** will hold a human-readable explanation.

The client is responsible for freeing **err** with **rosie_free_string_ptr**.
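
The way **cin** doubles as a success indicator can be sketched as follows.  The `matchfile_summary` type and the `summarize_matchfile` function are hypothetical, introduced only to illustrate the convention described above.  Only a **cin** of -1 is documented as a failure; treating every negative value as a failure is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical summary of the three counters returned by
   rosie_matchfile.  Not part of librosie. */
typedef struct matchfile_summary {
    bool ok;          /* true when cin was 0 or more */
    int lines_read;   /* lines read from the input */
    int lines_out;    /* lines written to outfilename */
    int lines_err;    /* lines written to errfilename */
    int error_code;   /* taken from cout when the call failed */
} matchfile_summary;

/* Interpret the values that rosie_matchfile stored in cin, cout, cerr. */
static matchfile_summary summarize_matchfile(int cin, int cout, int cerr) {
    matchfile_summary s = { false, 0, 0, 0, 0 };
    if (cin >= 0) {
        s.ok = true;
        s.lines_read = cin;
        s.lines_out = cout;
        s.lines_err = cerr;
    } else {
        /* Documented case: cin == -1 and cout holds a code such as
           ERR_NO_FILE.  Other negative values are an assumption. */
        s.error_code = cout;
    }
    return s;
}
```

On failure, the human-readable explanation is in **err**, which still has to be freed by the client.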


**int rosie_trace(Engine \*e, int pat, int start, char \*trace_style, str \*input, int \*matched, str \*trace)**

Like **rosie_match**, but executes the trace operation where **trace_style** is
a null-terminated C string argument analogous to **encoder**.

Return values are the boolean **matched** (0 for false, 1 for true) and the
string **trace** (which holds the trace output as a string).  As with the
**data** field in a match result, a null pointer field in **trace** requires
checking the length field to determine whether an error occurred.

When the **trace** pointer is null and its length is zero, no trace data was
returned.  (Currently, all trace styles produce some data, so this outcome is
not possible.)  A non-zero length with a null pointer indicates one of the
errors listed above in the
[interpreting match results](#interpreting-match-results) section.

The client is responsible for freeing **trace** with **rosie_free_string_ptr**.


### Configuration

**int rosie_config(Engine \*e, str \*retvals)**

The **rosie_config** API provides a way to read the configuration of an engine
and of the Rosie installation that created it.  The string returned in
**retvals** is a JSON-encoded list of 3 configuration tables:

(1) The first table describes the engine-independent Rosie installation
configuration.
(2) The second table describes the engine configuration.
(3) The third table is a set of configuration parameters that is passed to every
output encoder.  (An encoder may use any, all, or none of these.)

Each of the tables is a list of items.  Each item has the following structure,
where all JSON values are strings:

- `name`: a unique name for this item of the configuration
- `set_by`: **distribution** if this aspect of the configuration is set by the
  Rosie distribution that was installed; **build** if set at build-time;
  **default** if it is a run-time default that can be customized; **rcfile** if
  set in the Rosie init file **.rosierc**; **CLI** if set on the command line
  (CLI only); other values, including the empty string, are possible
- `value`: the current value for this item
- `description`: a human-readable description of the item
- Additional (undocumented) keys may be present.


### Init file processing

These functions have two intended uses: writing a new CLI, and using the Rosie
init file format to customize an application that uses Rosie.

**int rosie_read_rcfile(Engine \*e, str \*filename, int \*file_exists, str \*options, str \*messages)**

Given a **filename**, return whether **file_exists** (1) or not (0), the
**options** declared in the file, and any processing **messages**.  The
**options** string is returned as a JSON-encoded list of items, where each item
is a structure containing a single key/value pair.  The key is the name of a
configuration parameter set in **filename**, e.g. `libpath`.  The value is a
JSON string containing the value set in the init file.

Important notes:
- The init file is allowed to contain keys that are not recognized by Rosie,
  though using these as custom keys runs the risk of a name collision in the
  future, should Rosie start using that key name.  (When it becomes necessary,
  we can easily mitigate this with namespaces.)
- A configuration key is allowed to be repeated in an init file.  When this
  occurs, the item list returned by **rosie_config** lists the settings in the
  order they appeared.
- Certain Rosie configuration keys, like `libpath` and `colors`, are treated by
  Rosie as additive when used multiple times. When this happens, Rosie coalesces
  the multiple values into a single value string. (E.g. multiple `colors`
  settings are appended into a single colon-separated string.)

A better interface would be to accept a string instead of a filename, and let
the client program read the init file and slurp the contents into a string.  A
future **rosie_read_configuration** API may replace **rosie_read_rcfile**.

The client is responsible for freeing **options** and **messages** with **rosie_free_string_ptr**.


**int rosie_execute_rcfile(Engine \*e, str \*filename, int \*file_exists, int \*no_errors, str \*messages)**

Processes **filename** the same way the Rosie CLI does.  Returns two boolean
flags, **file_exists** and **no_errors**, and possibly also **messages**.

Because of the race condition that can occur between reading an init file with
**rosie_read_rcfile** and executing it, this API will very likely be replaced by
one that accepts a JSON-encoded configuration as input.  The usage pattern will
then become:

(1) Read a configuration file or string, returning a JSON-encoded structure.
(2) Analyze it, making changes as needed, for example processing and then
removing custom settings.
(3) Execute the resulting configuration (via a future
**rosie_execute_configuration**). 

The client is responsible for freeing **messages** with **rosie_free_string_ptr**.


### String management

Rosie strings are contiguous sequences of bytes, represented by a pointer to the
start and a length.

**str rosie_new_string(byte_ptr msg, size_t len)**

Copies **len** bytes at the pointer **msg** into newly allocated space.  Returns
a new string structure initialized such that it refers to the copy.

**str \*rosie_new_string_ptr(byte_ptr msg, size_t len)**

Copies **len** bytes at the pointer **msg** into newly allocated space, and
allocates a new string structure initialized such that it refers to the copy.
Returns a pointer to the structure.

**str rosie_string_from(byte_ptr msg, size_t len)**



**str \*rosie_string_ptr_from(byte_ptr msg, size_t len)**




**void rosie_free_string(str s)**

**void rosie_free_string_ptr(str \*s)**



**int rosie_expression_refs(Engine \*e, str \*input, str \*refs, str \*messages)**

**int rosie_block_refs(Engine \*e, str \*input, str \*refs, str \*messages)**

**int rosie_expression_deps(Engine \*e, str \*input, str \*deps, str \*messages)**

**int rosie_block_deps(Engine \*e, str \*input, str \*deps, str \*messages)**

**int rosie_parse_expression(Engine \*e, str \*input, str \*parsetree, str \*messages)**

**int rosie_parse_block(Engine \*e, str \*input, str \*parsetree, str \*messages)**

/\*

Administrative:
+  status:int, engine:void\* = new(const char \*name)
+  status:int = finalize(void \*engine)
+  status:int, desc:string = config(void \*engine)
\*  status:int = setlibpath(void \*engine, const char \*libpath)
+  set soft memory limit to m MB, with optional logging of when it is hit
  logging level (to stderr)?
  clone an engine?  (to avoid setup cost; but cloned engine must be in new Lua state)


RPL:
+  status:int, pkgname:string, errors:strings = load(void \*engine, const char \*rpl)
+  status:int, pkgname:string, errors:strings = import(packageref, localname)
  status:int = undefine(id)
  test(rpl)?
  testfile(filename)?

Match/trace:
+  status:int, pat:int, errors:strings = compile(void \*engine, const char \*expression)
+  status:int = free_rplx(void \*engine, int pat)
+  status:int = match(void \*engine, int pat, int start, str \*encoder,
		str \*input, match \*match);
+  status:int, tracestring:\*buffer = trace(void \*engine, int pat, buffer \*input, int start, int encoder, int tracestyle)

  status:int, cin:int, cout:int, cerr:int, errors:strings =
    matchfile(void \*engine, int pat, 
       const char \*infilename, const char \*outfilename, const char \*errfilename, 
       int start, int encoder, int wholefile)

  status:int, cin:int, cout:int, cerr:int, errors:strings =
    tracefile(void \*engine, void pat, 
       const char \*infilename, const char \*outfilename, const char \*errfilename, 
       int start, int encoder, int readmethod, int tracestyle)

Debugging:
  status:int, desc:string = lookup(void \*engine, const char \*id)
  status:int, expr:string, errors:strings = expand(void \*engine, const char \*expr)
  status:int, descs:strings = list(void \*engine, const char \*localnamefilter, const char \*packagenamefilter)

\*/

#endif