Commit 3d9078ad authored by gerd

Finished pxp-pp.


git-svn-id: https://godirepo.camlcity.org/svn/lib-pxp/trunk@697 dbe99aee-44db-0310-b2b3-d33182c8eb97
parent 00518123
......@@ -7,16 +7,18 @@ with_lex=1
with_wlex=1
with_wlex_compat=1
with_ulex=1
with_pp=1
lexlist="utf8,iso88591,iso88592,iso88593,iso88594,iso88595,iso88596,iso88597,iso88598,iso88599,iso885910,iso885913,iso885914,iso885915,iso885916"
version="1.1.95test1"
version="1.1.95test2"
exec_suffix=""
help_lex="Enable/disable ocamllex-based lexical analyzer for the -lexlist encodings"
help_wlex="Enable/disable wlex-based lexical analyzer for UTF-8"
help_wlex_compat="Enable/disable wlex-style compatibility package for UTF-8 and ISO-8859-1"
help_ulex="Enable/disable ulex-based lexical analyzer for UTF-8"
help_pp="Enable/disable the build of the preprocessor (pxp-pp)"
options="lex wlex wlex_compat ulex"
options="lex wlex wlex_compat ulex pp"
lexlist_options="utf8 usascii iso88591 iso88592 iso88593 iso88594 iso88595 iso88596 iso88597 iso88598 iso88599 iso885910 iso885913 iso885914 iso885915 iso885916 koi8r windows1250 windows1251 windows1252 windows1253 windows1254 windows1255 windows1256 windows1257 windows1258 cp437 cp737 cp775 cp850 cp852 cp855 cp856 cp857 cp860 cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp1006 macroman"
print_options () {
......@@ -189,6 +191,7 @@ if [ $with_wlex -gt 0 ]; then
echo "not found"
echo "wlex support is disabled"
with_wlex=0
with_wlex_compat=0
fi
fi
......@@ -206,6 +209,12 @@ if [ $with_ulex -gt 0 ]; then
fi
fi
# If ulex not found/disabled, also disable pxp-pp:
if [ $with_ulex -eq 0 ]; then
with_pp=0
fi
######################################################################
# Check Lexing.lexbuf type
......@@ -229,6 +238,25 @@ fi
rm -f tmp.*
######################################################################
# Check type of camlp4 locations
printf "%s" "Checking type of camlp4 location... "
cat <<EOF >tmp.ml
open Stdpp;;
raise_with_loc (0,0) Not_found;;
EOF
if ocamlc -c -I +camlp4 tmp.ml >/dev/null 2>/dev/null; then
echo "old style"
camlp4_loc=""
else
echo "new style"
camlp4_loc="-DOCAML_NEW_LOC"
fi
rm -f tmp.*
######################################################################
# Pregenerated wlex lexers
......@@ -286,7 +314,10 @@ print_options
echo
pkglist="pxp pxp-engine"
# Currently pkglist is constant
if [ $with_pp -gt 0 ]; then
pkglist="$pkglist pxp-pp"
fi
genpkglist=""
# Generated packages
......@@ -405,6 +436,7 @@ ALLGENPKGLIST = $allgenpkglist
EXEC_SUFFIX = $exec_suffix
LEXBUF_307 = $lexbuf_307
LEX_OPT = $lex_opt
CAMLP4_LOC = $camlp4_loc
_EOF_
######################################################################
......
......@@ -85,8 +85,8 @@ The following text is valid XML:
The first element has the expanded name (namespace1,a) while the second element
has the expanded name (namespace2,a); so the elements have different types. As
already pointed out, PXP does not support the expanded names directly (there is
some support for them in elements, but not in attributes). Alternatively, the
already pointed out, PXP does not support the expanded names directly.
Alternatively, the
XML text is transformed while it is being parsed such that the prefixes become
unique. In this example, the transformed text would read:
......@@ -124,6 +124,52 @@ because PXP normalizes any prefixes for namespace1 or namespace2 to the
preferred prefixes "x" and "y".
</p>
<p>Since PXP-1.1.95, the namespace support has been extended. In
addition to prefix normalization, the parser now also stores the
scoping structure of the namespaces (in the namespace_scope
objects). More or less, this means that the parser remembers
which elements have which "xmlns" attributes. There are two
important applications of this feature:</p>
<p>First, it is now possible to look up the namespace URI when
only the original, non-normalized namespace prefix is known.
A number of XML standards, e.g. XML Schema, use namespace prefixes
within data nodes. Of course, these prefixes are not normalized
by PXP, but simply remain as they are when the XML text is
parsed. To get the URI of such a prefix p in the context of node
n, just call
<code>
n # namespace_scope # uri_of_display_prefix p
</code>
In PXP terminology, the non-normalized prefixes are now called
"display prefixes".</p>
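<p>As an illustration, a prefixed name found in a data node (such as
"xs:string") could be expanded with a small helper like the following.
This is only a sketch: the function "expand_qname" is hypothetical and
not part of PXP. It relies solely on the "namespace_scope" and
"uri_of_display_prefix" methods mentioned above, and it leaves names
without a prefix unresolved:
<code><![CDATA[
(* Hypothetical helper: split "prefix:local" at the first colon and
 * look up the URI of the display prefix in the scope of node n.
 * Names without a colon are returned with None as URI.
 *)
let expand_qname n qname =
  match String.index_opt qname ':' with
  | None -> (None, qname)
  | Some i ->
      let prefix = String.sub qname 0 i in
      let local = String.sub qname (i + 1) (String.length qname - i - 1) in
      (Some (n # namespace_scope # uri_of_display_prefix prefix), local)
;;
]]></code>
For a node n whose scope binds "xs" to some URI u, the call
expand_qname n "xs:string" yields (Some u, "string").</p>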
<p>The other application is that it is now even possible to
retrieve the original "display" prefix of node names, e.g.
<code>
n # display_prefix
</code>
returns it. However, the display prefix is only guessed in the
sense that when there are several prefixes bound to the same
URI, one of the prefixes may be taken. For instance, in
<code><![CDATA[
<x:a xmlns:x="sample" xmlns:y="sample"/>
]]></code>
both "x" and "y" are bound to the same URI "sample", and
the display_prefix method now selects one of the prefixes
at random.</p>
<p>It is now even possible to output the parsed XML text
with original namespace structure: The "display" method
outputs XML text where the namespaces are declared as in the
original XML text.</p>
<p>Regarding the "xmlns" attributes, PXP treats them in a very special
way. Not only may their declarations be omitted from the DTD; such declarations
would not even be applied to the actual "xmlns" attributes. For example,
......
......@@ -58,6 +58,17 @@ the runtime part of wlex, and not the "wlex" command itself.</p>
<p>-with-wlex-compat</p>
<p>Creates a compatibility package pxp-wlex that includes lexers
for UTF8 and ISO-8859-1 (may be required to build old software)</p>
</li>
<li>
<p>-with-ulex</p>
<p>Enables the lexical analyzer that works for UTF-8 as internal encoding, and that is based on Alain Frisch's ulex tool. It
is relatively small, but a bit slower than the ocamllex-based lexers.
ulex will supersede wlex soon.</p>
</li>
<li>
<p>-with-pp</p>
<p>Enables the PXP preprocessor (installed as package pxp-pp).
See the file PREPROCESSOR for details. The preprocessor also requires ulex.</p>
</li>
<li>
<p>-lexlist &lt;list-of-encodings&gt;</p>
......
......@@ -23,7 +23,7 @@ installrel = $$HOME/homepage/ocaml-programming.de/packages/documentation/pxp/ind
.PHONY: all
all: README INSTALL ABOUT-FINDLIB SPEC EXTENSIONS
all: README INSTALL ABOUT-FINDLIB SPEC EXTENSIONS PREPROCESSOR
README: README.xml common.xml config.xml readme.dtd
$(readme) -text README.xml >README
......@@ -40,6 +40,9 @@ SPEC: SPEC.xml common.xml config.xml readme.dtd
EXTENSIONS: EXTENSIONS.xml common.xml config.xml readme.dtd
$(readme) -text EXTENSIONS.xml >EXTENSIONS
PREPROCESSOR: PREPROCESSOR.xml common.xml config.xml readme.dtd
$(readme) -text PREPROCESSOR.xml >PREPROCESSOR
DEV: DEV.xml common.xml config.xml readme.dtd
$(readme) -text DEV.xml >DEV
#$(readme) -html DEV.xml >$(installdev)
......
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE readme SYSTEM "readme.dtd" [
<!ENTITY % common SYSTEM "common.xml">
%common;
<!ENTITY m "<em>PXP</em>">
]>
<readme title="The Preprocessor for PXP">
<sect1>
<title>The Preprocessor for PXP</title>
<p>Since PXP-1.1.95, there is a preprocessor as part of the PXP
distribution. It allows you to compose XML trees and event lists
dynamically, which is very handy to write XML transformations.</p>
<p>To enable the preprocessor, compile your source files as in:
<code>
ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...
</code>
The package pxp-pp contains the preprocessor. The -syntax option
enables camlp4, on which the preprocessor is based. It is also
possible to use it together with the revised syntax; use
"-syntax camlp4r" in this case.</p>
<p>In the toploop, type
<code>
ocaml
# #use "topfind";;
# #camlp4o;;
# #require "pxp-pp";;
# #require "pxp";;
</code>
</p>
<p>The preprocessor defines the following new syntax notations,
explained below in detail:
<code><![CDATA[
<:pxp_charset< CHARSET_DECL >>
<:pxp_tree< EXPR >>
<:pxp_vtree< EXPR >>
<:pxp_evlist< EXPR >>
<:pxp_evpull< EXPR >>
<:pxp_text< TEXT >>
]]></code>
The basic notation is "pxp_tree" which creates a tree of PXP document
nodes as described in EXPR. "pxp_vtree" is the variant where the tree
is immediately validated. "pxp_evlist" creates a list of PXP events
instead of nodes, useful together with the event-based parser.
"pxp_evpull" is a variation of the latter: Instead of an event list
an event generator is created that works like a pull parser.</p>
<p>The "pxp_charset" notation only configures the character sets to
assume. Finally, "pxp_text" is a notation for string literals.</p>
<sect2>
<title>Creating constant XML</title>
<p>The following examples are all written for "pxp_tree". You can
also use one of the other XML composers instead, but see the notes
below.</p>
<p>In order to use "pxp_tree", you must define two variables in
the environment: "spec" and "dtd":
<code>
let spec = Pxp_tree_parser.default_spec;;
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
</code>
These variables occur in the code generated by the preprocessor.
The "dtd" variable is the DTD object. Note that you need it even
in well-formedness mode (validation turned off). The "spec" variable
controls which classes are instantiated as node representation
(see PXP manual).</p>
<p>Now you can create XML trees like in
<code><![CDATA[
let book =
<:pxp_tree<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
As you can see, the syntax is somewhat XML-related but not really XML.
(Many ideas are borrowed from CDUCE, by the way.) In particular,
there are start tags like &lt;title&gt; but no end tags. Instead,
we are using square brackets to denote the children of an XML
element. Furthermore, character data must be put into double
quotes.</p>
<p>You may ask why the well-known XML syntax has been modified for
this preprocessor. There are many reasons, and they will become
clearer in the following explanations. For now, you can see the advantage
that the syntax is less verbose, as you need not repeat the
element names in end tags. Furthermore, you can exactly control
which characters are part of the data nodes without having to make
compromises with indentation.</p>
<p>Attributes are written as in XML:
<code><![CDATA[
let book =
<:pxp_tree<
<book id="BOOK_001">
[ <title lang="en">[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
</p>
<p>An element without children can be written
<code><![CDATA[
<element>[]
]]></code>
or slightly shorter:
<code><![CDATA[
<element/>
]]></code>
</p>
<p>You can also create processing instructions and comment nodes:
<code><![CDATA[
let list =
<:pxp_tree<
<list>
[ <!>"Now the list of books follows!"
<?>"formatter_directive" "one book per page"
book
]
>>
]]></code>
The notation "&lt;!>" creates a comment node with the following string
as contents. The notation "&lt;?>" needs two strings, first the target,
then the value (here, this results in
"&lt;?formatter_directive one book per page?>").</p>
<p>Look again at the last example: The O'Caml variable "book" occurs,
and it inserts its tree into the list of books. Identifiers
without "decoration" just refer to O'Caml variables. We will see
more examples below.</p>
<p>The preprocessor syntax knows a number of shortcuts and variations.
First, you can omit the square brackets when an element has exactly
one child:
<code><![CDATA[
<element><child>"Data inside child"
]]></code>
This is the same as
<code><![CDATA[
<element>[ <child>[ "Data inside child" ] ]
]]></code>
Second, you are already used to a common abbreviation: Strings are
automatically converted to data nodes. The "expanded" syntax is
<code><![CDATA[
<*>"Data string"
]]></code>
where "&lt;*>" denotes a data node, and the following string is
used as contents. Usually, you can omit "&lt;*>". However, there
are a few occasions where this notation is still useful, see below.</p>
<p>In strings, the usual entity references can be used:
"Double quotes: &amp;quot;". For a newline character,
write &amp;#10;.</p>
<p>The preprocessor knows two operators: "^" concatenates strings,
and "@" concatenates lists. Examples:
<code><![CDATA[
<element>[ "Word1" ^ "Word2" ]
<element>([ <a/> ] @ [ <b/> ])
]]></code></p>
<p>Parentheses can be used to clarify precedence. For example:
<code><![CDATA[
<element>(l1 @ l2)
]]></code>
Here, the concatenation operator "@" could also be parsed as
<code><![CDATA[
(<element> l1) @ l2
]]></code>
Parentheses may be used in every expression.</p>
<p>Rarely used, there is also a notation for the
"super root" nodes (see the PXP manual for their meaning):
<code><![CDATA[
<^>[ <element> ... ]
]]></code>
</p>
</sect2>
<sect2>
<title>Dynamic XML</title>
<p>Let us begin with an example. The task is to convert
O'Caml values of type
<code><![CDATA[
type book =
{ title : string;
author : string;
isbn : string;
}
]]></code>
to XML trees like
<code><![CDATA[
<book id="BOOK_'isbn'">
<title>'title'</title>
<author>'author'</author>
</book>
]]></code>
(conventional syntax). If b is a value of type book, the solution is
<code><![CDATA[
let book =
let title = b.title
and author = b.author
and isbn = b.isbn in
<:pxp_tree<
<book id=("BOOK_" ^ isbn)>
[ <title><*>title
<author><*>author
]
>>
]]></code>
First, we bind the simple O'Caml variables "title", "author", and
"isbn". The reason is that the preprocessor syntax does not allow
expressions like "b.title" directly in the XML tree (but see below
for a better workaround).</p>
<p>The XML tree contains the O'Caml variables. The "id" attribute
is a concatenation of the fixed prefix "BOOK_" and the contents of
"isbn". The "title" and "author" elements contain a data node
whose contents are the O'Caml strings "title", and "author",
respectively.</p>
<p>Why "&lt;*>"? If we just wrote "&lt;title>title", the
generated code would assume that the "title" variable is an XML node,
and not a string. From this point of view, "&lt;*>" works like
a type annotation, as it specialises the type of the following
expression.</p>
<p>Here is an alternate solution:
<code><![CDATA[
let book =
<:pxp_tree<
<book id=("BOOK_" ^ (: b.isbn :))>
[ <title><*>(: b.title :)
<author><*>(: b.author :)
]
>>
]]></code>
The notation "(: ... :)" allows you to include arbitrary O'Caml
expressions into the tree. In this solution it is no longer necessary
to create artificial O'Caml variables for the only purpose of
injecting values into trees.
</p>
<p>It is possible to create XML elements with dynamic names:
Just put parentheses around the expression. Example:
<code><![CDATA[
let name = "book" in
<:pxp_tree< <(name)> ... >>
]]></code>
With the same notation, one can also set attribute names dynamically:
<code><![CDATA[
let att_name = "id" in
<:pxp_tree< <book (att_name)=...> ... >>
]]></code>
Finally, it is also possible to include complete attribute lists
dynamically:
<code><![CDATA[
let att_list = [ "id", ("BOOK_" ^ b.isbn) ] in
<:pxp_tree< <book (: att_list :) > ... >>
]]></code>
</p>
<p>Typing: Depending on where a variable or O'Caml expression occurs,
different types are assumed. Compare the following examples:
<code><![CDATA[
<:pxp_tree< <element>x1 >>
<:pxp_tree< <element>[x2] >>
<:pxp_tree< <element><*>x3 >>
]]></code>
As a rule of thumb, the most general type is assumed that would make
sense at a certain location. As x1 could be replaced by a list
of children, its type is assumed to be a node list. As x2 could
be replaced by a single node, its type is assumed to be a node.
And x3 is a string, as in the case discussed above.
</p>
</sect2>
<sect2>
<title>Character Encodings</title>
<p>As the preprocessor generates code that builds XML trees, it
must know two character encodings:</p>
<ul>
<li><p>Which encoding is used in the source code (in the .ml file)
</p></li>
<li><p>Which encoding is used in the XML representation, i.e.
in the O'Caml values representing the XML trees</p></li>
</ul>
<p>Both encodings can be set independently. The syntax is:
<code><![CDATA[
<:pxp_charset< source="ENC" representation="ENC" >>
]]></code>
The default is ISO-8859-1 for both encodings. For example, to set
the representation encoding to UTF-8, use:
<code><![CDATA[
<:pxp_charset< representation="UTF-8" >>
]]></code>
The "pxp_charset" notation is a constant expression that always
evaluates to "()". (A requirement by camlp4 that looks artificial.)
</p>
<p>When you set the representation encoding, it is required that the
encoding stored in the DTD object is the same. Remember that we
need a DTD object like
<code>
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
</code>
Of course, we must change this to the representation encoding, too,
in our example:
<code>
let dtd = Pxp_dtd.create_dtd `Enc_utf8;;
</code>
The preprocessor cannot check this at compile time, and for performance
reasons, a runtime check is not generated. So it is up to the programmer
to ensure that the character encodings are used in a consistent way.
</p>
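<p>If such a check is wanted nevertheless, it can be written by hand.
The following sketch assumes that the DTD object reports its encoding
via an "encoding" method; the helper "check_representation" is
hypothetical and not part of PXP:
<code><![CDATA[
(* Hypothetical helper: fail early when the DTD object does not use the
 * representation encoding that was announced with pxp_charset.
 * Assumption: "dtd # encoding" returns the encoding of the DTD object.
 *)
let check_representation dtd expected =
  if dtd # encoding <> expected then
    failwith "DTD encoding differs from the representation encoding"
;;

(* Intended use, once after creating the DTD:
 *   check_representation dtd `Enc_utf8
 *)
]]></code>
Calling it once after creating the DTD costs almost nothing, whereas a
check in every generated node constructor would.</p>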
</sect2>
<sect2>
<title>Validated Trees</title>
<p>In order to validate trees, you need a filled DTD object.
In principle, you can create this object by a number of methods.
For example, you can parse an external file:
<code><![CDATA[
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_file "sample.dtd")
]]></code>
It is, however, often more convenient to include the DTD literally
into the program. This works by
<code><![CDATA[
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string "...")
]]></code>
As the double quotes are often used inside DTDs, O'Caml string
literals are a bit impractical, as they are also delimited by
double quotes, and one needs to add backslashes as escape characters.
The "pxp_text" notation is often more readable here:
&lt;:pxp_text&lt;STRING>> is just another way of writing
"STRING". In our DTD, we have
<code><![CDATA[
let dtd_text =
<:pxp_text<
<!ELEMENT book (title,author)>
<!ATTLIST book id CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ATTLIST title lang CDATA "en">
<!ELEMENT author (#PCDATA)>
>>;;
let config = default_config;;
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;
]]></code>
Note that "pxp_text" is not restricted to DTDs, as it can be used
for any kind of string.</p>
<p>After we have the DTD, we can validate the trees. One
option is to call the "validate" function:
<code><![CDATA[
let book =
<:pxp_tree<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>;;
Pxp_document.validate book;;
]]></code>
(This example is invalid, as the "id" attribute is missing.)</p>
<p>Note that it is a misunderstanding that "pxp_tree" builds XML trees in
well-formed mode. You can create any tree with it, and the fact is that
"pxp_tree" just does not invoke the validator. So if the DTD enforces
validation, the tree is validated when the "validate" function is
called. If the DTD is in well-formedness mode, the tree is effectively
not validated, even when the "validate" function is invoked. By the way,
the following statements would create a DTD in well-formedness mode:
<code>
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # allow_arbitrary;;
</code>
As an alternative to calling the "validate" function, one can also
use "pxp_vtree" instead. It immediately validates every XML element it
creates. However, "injected" subtrees are not validated, i.e. validation
does not proceed recursively to subnodes as it does with the "validate"
function.</p>
</sect2>
<sect2>
<title>Generating Events</title>
<p>As PXP also has an event model to represent XML, the preprocessor
can also produce such events. In particular, there are two modes: The
"pxp_evlist" notation outputs lists of events (type "event list")
representing the XML expression. The "pxp_evpull" notation creates
an automaton from which one can "pull" events (like from a pull
parser).</p>
<p>These two notations work very much like "pxp_tree". For example,
<code><![CDATA[
let book =
<:pxp_evlist<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
generates
<code><![CDATA[
[ E_start_tag ("book", [], None, <obj>);
E_start_tag ("title", [], None, <obj>);
E_char_data "The Lord of The Rings";
E_end_tag ("title", <obj>);
E_start_tag ("author", [], None, <obj>);
E_char_data "J.R.R. Tolkien";
E_end_tag ("author", <obj>);
E_end_tag ("book", <obj>)
]
]]></code>
Note that you need neither a "dtd" variable nor a "spec" variable.
There is one important difference, however: Both nodes and lists
of nodes are represented by the same type, "event list". That
has the consequence that in the following example x1 and x2
have the same type "event list":
<code><![CDATA[
<:pxp_evlist< <element>x1 >>
<:pxp_evlist< <element>[x2] >>
<:pxp_evlist< <element><*>x3 >>
]]></code>
In principle, it could be checked at runtime whether x1 and x2
have the right structure. However, this is not done for
performance reasons.</p>
<p>As mentioned, "pxp_evpull" works like a pull parser.
After defining
<code><![CDATA[
let book =
<:pxp_evpull<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
"book" is a function 'a->event option. One can call it to get the events
one after the other:
<code><![CDATA[
let e1 = book();; (* = Some(E_start_tag ("book", [], None, <obj>)) *)
let e2 = book();; (* = Some(E_start_tag ("title", [], None, <obj>)) *)
...
]]></code>
After the last event, "book" returns None to indicate the end of the
event stream.</p>
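<p>The pull protocol itself needs nothing more than a function that
returns optional values. The following sketch does not use PXP at all;
"of_list" and "drain" are hypothetical helpers that only illustrate the
calling convention:
<code><![CDATA[
(* A pull function over a fixed list: every call returns the next
 * element, and None once the list is exhausted.
 *)
let of_list l =
  let rest = ref l in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl -> rest := tl; Some x
;;

(* Call a pull function until it returns None, collecting the results *)
let drain pull =
  let rec loop acc =
    match pull () with
    | None -> List.rev acc
    | Some e -> loop (e :: acc)
  in
  loop []
;;
]]></code>
Applied to the "book" automaton from above, "drain book" should return
the same events that "pxp_evlist" generates for this expression.</p>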
<p>As for "pxp_evlist", it is not possible to distinguish between
nodes and node lists. In this example, both x1 and x2 are assumed
to have type 'a->event option:
<code><![CDATA[
<:pxp_evpull< <element>x1 >>
<:pxp_evpull< <element>[x2] >>
<:pxp_evpull< <element><*>x3 >>
]]></code>
Note that "&lt;element>x1" actually means to build a new pull automaton
around the existing pull automaton x1: The children of "element" are
retrieved by pulling events from x1 until "None" is returned.</p>
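<p>To make this wrapping more concrete, here is a simplified model in
plain O'Caml. It is not the code generated by the preprocessor: the
types "ev" and "phase" and the function "wrap" are hypothetical
stand-ins for the real PXP events and automata:
<code><![CDATA[
(* Simplified stand-in for the PXP event type *)
type ev = Start of string | Data of string | End of string;;
type phase = Before | Inside | After;;

(* Build a new pull function around the pull function "inner": first
 * the start tag, then all events of "inner", then the end tag.
 *)
let wrap name inner =
  let phase = ref Before in
  fun () ->
    match !phase with
    | Before -> phase := Inside; Some (Start name)
    | Inside ->
        ( match inner () with
          | Some e -> Some e
          | None -> phase := After; Some (End name)
        )
    | After -> None
;;

(* x1 delivers exactly one data event *)
let x1 =
  let sent = ref false in
  fun () -> if !sent then None else (sent := true; Some (Data "text"))
;;

let element = wrap "element" x1;;
(* Successive calls of element() return
 * Some (Start "element"), Some (Data "text"), Some (End "element"), None
 *)
]]></code>
Note that "inner" is not called before the start tag has been
delivered, so the children are only computed on demand.</p>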
<p>A consequence of the pull semantics is that once an event
is obtained from an automaton, the state of the automaton is modified
such that it is not possible to get the same event again. If you need
an automaton that can be reset to the beginning, just wrap the
"pxp_evpull" notation into a functional abstraction:
<code><![CDATA[
let book_maker() =
<:pxp_evpull< <book ...> ... >>;;
let book1 = book_maker();;
let book2 = book_maker();;
]]></code>
This way, "book1" and "book2" are independent event streams.</p>
<p>There is another implication of the nature of the
automatons: Subexpressions are lazily evaluated. For example,
in
<code><![CDATA[
<:pxp_evpull< <element>[ <*> (: get_data_contents() :) ] >>
]]></code>
the call of get_data_contents is performed just before the event
for the data node is constructed.</p>
</sect2>
<sect2>
<title>Namespaces</title>
<p>By default, the preprocessor does not generate nodes or