Commit 2eaff263 authored by gerd

fixes for release


git-svn-id: https://godirepo.camlcity.org/svn/lib-pxp/trunk@748 dbe99aee-44db-0310-b2b3-d33182c8eb97
parent 182301ce
Copyright 1999 by Gerd Stolpmann
Copyright 1999-2009 by Gerd Stolpmann
The package PXP is copyright by Gerd Stolpmann.
......
......@@ -9,7 +9,7 @@ with_wlex_compat=1
with_ulex=1
with_pp=1
lexlist="utf8,iso88591,iso88592,iso88593,iso88594,iso88595,iso88596,iso88597,iso88598,iso88599,iso885910,iso885913,iso885914,iso885915,iso885916"
version="1.2.0test2"
version="1.2.1"
exec_suffix=""
help_lex="Enable/disable ocamllex-based lexical analyzer for the -lexlist encodings"
......
......@@ -50,6 +50,8 @@ for PXP; if you are looking for the stable distribution, please go
<sect1>
<title>Version History</title>
<ul>
<li><p>There is currently no development version.</p></li>
<--
<li>
<p><em>1.2.1:</em> Revised documentation</p>
<p>Addition: Pxp_event.unwrap_document</p>
......@@ -199,7 +201,7 @@ instruction, only misc* element misc* or whole documents are possible).
When an external entity A opens an external entity B, and B opens C,
relative paths of C have been interpreted wrong.</p>
</li>
-->
<!--
<li><p><em>1.1:</em> This is the new stable release!</p></li>
<li><p><em>1.0.99:</em></p>
......
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE readme SYSTEM "readme.dtd" [
<!ENTITY % common SYSTEM "common.xml">
%common;
<!-- Special HTML config: -->
<!ENTITY % readme:html:up '<a href="../..">up</a>'>
<!ENTITY % config SYSTEM "config.xml">
%config;
]>
<readme title="Extensions of the XML specification">
<sect1>
<title>This document</title>
<p>This parser has some options that extend the XML specification. This
document explains them.
</p>
</sect1>
<sect1>
<title>Optional declarations instead of mandatory declarations</title>
<p>The XML spec demands that elements, notations, and attributes must be
declared. However, there are sometimes situations where a different rule would
be better: <em>If</em> there is a declaration, the actual instance of the
element type, notation reference or attribute must match the pattern of the
declaration; but if the declaration is missing, a reasonable default declaration
should be assumed.</p>
<p>I have an example that seems to be typical: The inclusion of HTML into a
meta language. Imagine you have defined some type of "generator" or other tool
working with HTML fragments, and your document contains two types of elements:
The generating elements (with a name like "gen:xxx"), and the object elements
which are HTML. As HTML is still evolving, you do not want to declare the HTML
elements; the HTML fragments should be treated as well-formed XML fragments. In
contrast to this, the elements of the generator should be declared and
validated because you can more easily detect errors.</p>
<p>The following two processing instructions can be included in the DTD:</p>
<ul>
<li><p><code><![CDATA[<?pxp:dtd optional-element-and-notation-declarations?>]]></code>
References to unknown element types and notations no longer cause an
error. The element may contain anything, but it must still be
well-formed. It may have arbitrary attributes, and every attribute is
treated as an #IMPLIED CDATA attribute.</p>
</li>
<li><p><code><![CDATA[<?pxp:dtd optional-attribute-declarations elements="x y ..."?>]]></code>
References to unknown attributes inside one of the enumerated elements
no longer cause an error. Such an attribute is treated as an #IMPLIED
CDATA attribute.
</p>
<p>If there are several "optional-attribute-declarations" PIs, they are all
interpreted (implicitly merged).</p>
</li>
</ul>
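<p>The defaulting rule for attributes can be pictured with a small O'Caml
sketch. It is only a toy model: the record type and the function
"lookup_attribute" are invented for this illustration and are not part of
the PXP API.
<code><![CDATA[
(* Toy model of an attribute declaration; not a PXP type. *)
type att_decl = { att_type : string; att_default : string }

(* [declared] maps (element, attribute) pairs to their declarations.
   [optional_elements] are the elements enumerated in the
   optional-attribute-declarations PIs (hypothetical representation). *)
let lookup_attribute ~declared ~optional_elements element attribute =
  match List.assoc_opt (element, attribute) declared with
  | Some d -> Ok d                     (* a declaration exists: it applies *)
  | None when List.mem element optional_elements ->
      (* no declaration, but the element is enumerated: assume the default *)
      Ok { att_type = "CDATA"; att_default = "#IMPLIED" }
  | None ->
      Error (Printf.sprintf "undeclared attribute %s of element %s"
               attribute element)
]]></code>
In this model, an undeclared attribute of an enumerated element gets the
default declaration, while an undeclared attribute of any other element
is still an error.</p>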
</sect1>
<sect1>
<title>Normalized namespace prefixes</title>
<p>
The XML standard refers to names within namespaces as <em>expanded
names</em>. This is simply the pair (namespace_uri, localname); the namespace
prefix is not included in the expanded name.</p>
<p>
PXP does not support expanded names, but it does support namespaces. However,
it uses a model that is slightly different from the usual representation of
names in namespaces: Instead of removing the namespace prefixes and converting
the names into expanded names, PXP normalizes the namespace
prefixes used in a document, i.e. the prefixes are transformed such that they
refer uniquely to namespaces.</p>
<p>
The following text is valid XML:
<code><![CDATA[
<x:a xmlns:x="namespace1">
<x:a xmlns:x="namespace2">
</x:a>
</x:a>
]]></code>
The first element has the expanded name (namespace1,a) while the second element
has the expanded name (namespace2,a); so the elements have different types. As
already pointed out, PXP does not support the expanded names directly.
Instead, the
XML text is transformed while it is being parsed such that the prefixes become
unique. In this example, the transformed text would read:
<code><![CDATA[
<x:a xmlns:x="namespace1">
<x1:a xmlns:x1="namespace2">
</x1:a>
</x:a>
]]></code>
From a programmer's point of view, this transformation has the advantage that
you need not deal with pairs when comparing names, as all names are still
simple strings: here, "x:a", and "x1:a". However, the transformation seems to
be a bit random. Why not "y:a" instead of "x1:a"? The answer is that PXP allows
the programmer to control the transformation: You can simply demand that
namespace1 must use the normalized prefix "x", and namespace2 must use "y". The
normalized prefix to use can be set programmatically (by setting the
namespace_manager object), or it can be declared in the DTD:
<code><![CDATA[
<?pxp:dtd namespace prefix="x" uri="namespace1"?>
<?pxp:dtd namespace prefix="y" uri="namespace2"?>
]]></code>
There is another advantage of using normalized prefixes: You can safely refer
to them in DTDs. For example, you could declare the two elements as
<code><![CDATA[
<!ELEMENT x:a (y:a)>
<!ELEMENT y:a ANY>
]]></code>
These declarations are applicable even if the XML text uses different prefixes,
because PXP normalizes any prefixes for namespace1 or namespace2 to the
preferred prefixes "x" and "y".
</p>
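<p>The idea of prefix normalization can be sketched in a few lines of plain
O'Caml. This is a toy model, not the implementation used by PXP: the
association lists "manager" and "scope" and the function "normalize" are
assumptions made up for this illustration.
<code><![CDATA[
(* Hypothetical stand-in for a namespace manager: it maps namespace URIs
   to the normalized prefix that should be used for them. *)
let manager = [ ("namespace1", "x"); ("namespace2", "y") ]

(* [scope] maps the prefixes as written in the document to namespace URIs,
   innermost declaration first. Raises Not_found for unknown prefixes
   or URIs. *)
let normalize ~scope name =
  match String.index_opt name ':' with
  | None -> name                          (* no prefix: nothing to do *)
  | Some i ->
      let prefix = String.sub name 0 i in
      let local = String.sub name (i + 1) (String.length name - i - 1) in
      let uri = List.assoc prefix scope in  (* innermost binding wins *)
      List.assoc uri manager ^ ":" ^ local
]]></code>
With the outer scope [("x", "namespace1")] the name "x:a" stays "x:a". Inside
the inner element, where "x" is rebound to "namespace2", the same name is
rewritten to "y:a", so both names can be compared as simple strings.</p>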
<p>Since PXP-1.1.95, the namespace support has been extended. In
addition to prefix normalization, the parser now also stores the
scoping structure of the namespaces (in the namespace_scope
objects). More or less, this means that the parser remembers
which elements have which "xmlns" attributes. There are two
important applications of this feature:</p>
<p>First, it is now possible to look up the namespace URI when
only the original, non-normalized namespace prefix is known.
A number of XML standards, e.g. XML Schema, use namespace prefixes
within data nodes. Of course, these prefixes are not normalized
by PXP, but simply remain as they are when the XML text is
parsed. To get the URI of such a prefix p in the context of node
n, just call
<code>
n # namespace_scope # uri_of_display_prefix p
</code>
In PXP terminology, the non-normalized prefixes are now called
"display prefixes".</p>
<p>The other application is that it is now even possible to
retrieve the original "display" prefix of node names, e.g.
<code>
n # display_prefix
</code>
returns it. However, the display prefix is only guessed in the
sense that when there are several prefixes bound to the same
URI, one of the prefixes may be taken. For instance, in
<code><![CDATA[
<x:a xmlns:x="sample" xmlns:y="sample"/>
]]></code>
both "x" and "y" are bound to the same URI "sample", and
the display_prefix method picks one of the two prefixes
arbitrarily.</p>
<p>It is now even possible to output the parsed XML text
with the original namespace structure: The "display" method
outputs XML text where the namespaces are declared as in the
original XML text.</p>
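<p>Why the display prefix can only be guessed is easy to see in a small
model of a namespace scope. The following sketch uses a plain association
list; it is not the namespace_scope class of PXP, and the function names
"uri_of_prefix" and "prefix_of_uri" are invented for this illustration.
<code><![CDATA[
(* The declarations in effect at a node, innermost first (toy model). *)
let scope = [ ("y", "sample"); ("x", "sample") ]

(* From a prefix to its URI: the result is unambiguous. *)
let uri_of_prefix scope p = List.assoc_opt p scope

(* From a URI back to a prefix: several prefixes may match, and this
   sketch simply returns the first one it finds. *)
let prefix_of_uri scope uri =
  match List.find_opt (fun (_, u) -> u = uri) scope with
  | Some (p, _) -> Some p
  | None -> None
]]></code>
The forward lookup always has a single answer. The reverse lookup has to
choose between "x" and "y", and which one it returns depends only on the
order in which the bindings happen to be stored.</p>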
<p>Regarding the "xmlns" attributes, PXP treats them in a very special
way. Not only is it allowed to leave them undeclared in the DTD; such
declarations would not even be applied to the actual "xmlns" attributes. For example,
it is not possible to declare a default value for "xmlns:x", as in
<code><![CDATA[
<!ATTLIST ... xmlns:x CDATA "mynamespaceuri">
]]></code>
The default value would be ignored. Furthermore, it is not possible to
declare "xmlns" attributes as being required - validation will always
fail even if the "xmlns" attribute is present.</p>
<p>The model behind this treatment is defined by the "XML information
set" standard. There are two kinds of attributes: normal attributes,
and namespace attributes. PXP validates only normal attributes.</p>
</sect1>
</readme>
......@@ -25,7 +25,7 @@ installrel = $(H)/homepage/ocaml-programming.de/packages/documentation/pxp/index
.PHONY: all
all: README INSTALL ABOUT-FINDLIB SPEC EXTENSIONS PREPROCESSOR
all: README INSTALL ABOUT-FINDLIB SPEC
README: README.xml common.xml config.xml readme.dtd
$(readme) -text README.xml >README
......
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE readme SYSTEM "readme.dtd" [
<!ENTITY % common SYSTEM "common.xml">
%common;
<!ENTITY m "<em>PXP</em>">
]>
<readme title="The Preprocessor for PXP">
<sect1>
<title>The Preprocessor for PXP</title>
<p>Since PXP-1.1.95, there is a preprocessor as part of the PXP
distribution. It allows you to compose XML trees and event lists
dynamically, which is very handy for writing XML transformations.</p>
<p>To enable the preprocessor, compile your source files as in:
<code>
ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...
</code>
The package pxp-pp contains the preprocessor. The -syntax option
enables camlp4, on which the preprocessor is based. It is also
possible to use it together with the revised syntax; use
"-syntax camlp4r" in this case.</p>
<p><em>Important:</em> Up to version 1.0.4, findlib (ocamlfind)
has a problem with the definition for pxp-pp. There is an easy
workaround: Use "-syntax camlp4o,byte".</p>
<p>In the toploop, type
<code>
ocaml
# #use "topfind";;
# #camlp4o;;
# #require "pxp-pp";;
# #require "pxp";;
</code>
</p>
<p>The preprocessor defines the following new syntax notations,
explained below in detail:
<code><![CDATA[
<:pxp_charset< CHARSET_DECL >>
<:pxp_tree< EXPR >>
<:pxp_vtree< EXPR >>
<:pxp_evlist< EXPR >>
<:pxp_evpull< EXPR >>
<:pxp_text< TEXT >>
]]></code>
The basic notation is "pxp_tree" which creates a tree of PXP document
nodes as described in EXPR. "pxp_vtree" is the variant where the tree
is immediately validated. "pxp_evlist" creates a list of PXP events
instead of nodes, useful together with the event-based parser.
"pxp_evpull" is a variation of the latter: Instead of an event list
an event generator is created that works like a pull parser.</p>
<p>The "pxp_charset" notation only configures the character sets to
assume. Finally, "pxp_text" is a notation for string literals.</p>
<sect2>
<title>Creating constant XML</title>
<p>The following examples are all written for "pxp_tree". You can
also use one of the other XML composers instead, but see the notes
below.</p>
<p>In order to use "pxp_tree", you must define two variables in
the environment: "spec" and "dtd":
<code>
let spec = Pxp_tree_parser.default_spec;;
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
</code>
These variables occur in the code generated by the preprocessor.
The "dtd" variable is the DTD object. Note that you need it even
in well-formedness mode (validation turned off). The "spec" variable
controls which classes are instantiated as node representation
(see PXP manual).</p>
<p>Now you can create XML trees as in
<code><![CDATA[
let book =
<:pxp_tree<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
As you can see, the syntax is somewhat XML-related but not really XML.
(Many ideas are borrowed from CDuce, by the way.) In particular,
there are start tags like &lt;title&gt; but no end tags. Instead,
we are using square brackets to denote the children of an XML
element. Furthermore, character data must be put into double
quotes.</p>
<p>You may ask why the well-known XML syntax has been modified for
this preprocessor. There are many reasons, and they will become
clearer in the following explanations. For now, you can see the advantage
that the syntax is less verbose, as you need not repeat the
element names in end tags. Furthermore, you can control exactly
which characters are part of the data nodes without having to make
compromises with indentation.</p>
<p>Attributes are written as in XML:
<code><![CDATA[
let book =
<:pxp_tree<
<book id="BOOK_001">
[ <title lang="en">[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
</p>
<p>An element without children can be written
<code><![CDATA[
<element>[]
]]></code>
or slightly shorter:
<code><![CDATA[
<element/>
]]></code>
</p>
<p>You can also create processing instructions and comment nodes:
<code><![CDATA[
let list =
<:pxp_tree<
<list>
[ <!>"Now the list of books follows!"
<?>"formatter_directive" "one book per page"
book
]
>>
]]></code>
The notation "&lt;!>" creates a comment node with the following string
as contents. The notation "&lt;?>" needs two strings, first the target,
then the value (here, this results in
"&lt;?formatter_directive one book per page?>").</p>
<p>Look again at the last example: The O'Caml variable "book" occurs,
and it inserts its tree into the list of books. Identifiers
without "decoration" just refer to O'Caml variables. We will see
more examples below.</p>
<p>The preprocessor syntax knows a number of shortcuts and variations.
First, you can omit the square brackets when an element has exactly
one child:
<code><![CDATA[
<element><child>"Data inside child"
]]></code>
This is the same as
<code><![CDATA[
<element>[ <child>[ "Data inside child" ] ]
]]></code>
Second, you are already used to a common abbreviation: Strings are
automatically converted to data nodes. The "expanded" syntax is
<code><![CDATA[
<*>"Data string"
]]></code>
where "&lt;*>" denotes a data node, and the following string is
used as contents. Usually, you can omit "&lt;*>". However, there
are a few occasions where this notation is still useful, see below.</p>
<p>In strings, the usual entity references can be used:
"Double quotes: &amp;quot;". For a newline character,
write &amp;#10;.</p>
<p>The preprocessor knows two operators: "^" concatenates strings,
and "@" concatenates lists. Examples:
<code><![CDATA[
<element>[ "Word1" ^ "Word2" ]
<element>([ <a/> ] @ [ <b/> ])
]]></code></p>
<p>Parentheses can be used to clarify precedence. For example:
<code><![CDATA[
<element>(l1 @ l2)
]]></code>
Here, the concatenation operator "@" could also be parsed as
<code><![CDATA[
(<element> l1) @ l2
]]></code>
Parentheses may be used in every expression.</p>
<p>Rarely used, there is also a notation for the
"super root" nodes (see the PXP manual for their meaning):
<code><![CDATA[
<^>[ <element> ... ]
]]></code>
</p>
</sect2>
<sect2>
<title>Dynamic XML</title>
<p>Let us begin with an example. The task is to convert
O'Caml values of type
<code><![CDATA[
type book =
{ title : string;
author : string;
isbn : string;
}
]]></code>
to XML trees like
<code><![CDATA[
<book id="BOOK_'isbn'">
<title>'title'</title>
<author>'author'</author>
</book>
]]></code>
(conventional syntax). When b is the book variable, the solution is
<code><![CDATA[
let book =
let title = b.title
and author = b.author
and isbn = b.isbn in
<:pxp_tree<
<book id=("BOOK_" ^ isbn)>
[ <title><*>title
<author><*>author
]
>>
]]></code>
First, we bind the simple O'Caml variables "title", "author", and
"isbn". The reason is that the preprocessor syntax does not allow
expressions like "b.title" directly in the XML tree (but see below
for a better workaround).</p>
<p>The XML tree contains the O'Caml variables. The "id" attribute
is a concatenation of the fixed prefix "BOOK_" and the contents of
"isbn". The "title" and "author" elements contain a data node
whose contents are the O'Caml strings "title", and "author",
respectively.</p>
<p>Why "&lt;*>"? If we just wrote "&lt;title>title", the
generated code would assume that the "title" variable is an XML node,
and not a string. From this point of view, "&lt;*>" works like
a type annotation, as it specialises the type of the following
expression.</p>
<p>Here is an alternative solution:
<code><![CDATA[
let book =
<:pxp_tree<
<book id=("BOOK_" ^ (: b.isbn :))>
[ <title><*>(: b.title :)
<author><*>(: b.author :)
]
>>
]]></code>
The notation "(: ... :)" allows you to include arbitrary O'Caml
expressions in the tree. In this solution it is no longer necessary
to create artificial O'Caml variables for the sole purpose of
injecting values into trees.
</p>
<p>It is possible to create XML elements with dynamic names:
Just put parentheses around the expression. Example:
<code><![CDATA[
let name = "book" in
<:pxp_tree< <(name)> ... >>
]]></code>
With the same notation, one can also set attribute names dynamically:
<code><![CDATA[
let att_name = "id" in
<:pxp_tree< <book (att_name)=...> ... >>
]]></code>
Finally, it is also possible to include complete attribute lists
dynamically:
<code><![CDATA[
let att_list = [ "id", ("BOOK_" ^ b.isbn) ] in
<:pxp_tree< <book (: att_list :) > ... >>
]]></code>
</p>
<p>Typing: Depending on where a variable or O'Caml expression occurs,
different types are assumed. Compare the following examples:
<code><![CDATA[
<:pxp_tree< <element>x1 >>
<:pxp_tree< <element>[x2] >>
<:pxp_tree< <element><*>x3 >>
]]></code>
As a rule of thumb, the most general type is assumed that would make
sense at a certain location. As x1 could be replaced by a list
of children, its type is assumed to be a node list. As x2 could
be replaced by a single node, its type is assumed to be a node.
And x3 is a string; we have seen this case already.
</p>
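<p>The three cases can be spelled out with ordinary O'Caml functions. The
"node" type below is a simplified stand-in chosen for this illustration;
the real PXP nodes are objects, and the code generated by the preprocessor
looks different.
<code><![CDATA[
(* Simplified stand-in for XML nodes; not the PXP node type. *)
type node =
  | Element of string * node list
  | Data of string

(* <element>x1 : x1 takes the place of the whole list of children *)
let with_children (x1 : node list) = Element ("element", x1)

(* <element>[x2] : x2 is one member of the list, so a single node *)
let with_child (x2 : node) = Element ("element", [ x2 ])

(* <element><*>x3 : x3 is the string contents of a data node *)
let with_data (x3 : string) = Element ("element", [ Data x3 ])
]]></code>
In this model, with_data "a", with_child (Data "a") and
with_children [Data "a"] all build the same tree; only the type expected
for the variable differs.</p>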
</sect2>
<sect2>
<title>Character Encodings</title>
<p>As the preprocessor generates code that builds XML trees, it
must know two character encodings:</p>
<ul>
<li><p>Which encoding is used in the source code (in the .ml file)
</p></li>
<li><p>Which encoding is used in the XML representation, i.e.
in the O'Caml values representing the XML trees</p></li>
</ul>
<p>Both encodings can be set independently. The syntax is:
<code><![CDATA[
<:pxp_charset< source="ENC" representation="ENC" >>
]]></code>
The default is ISO-8859-1 for both encodings. For example, to set
the representation encoding to UTF-8, use:
<code><![CDATA[
<:pxp_charset< representation="UTF-8" >>
]]></code>
The "pxp_charset" notation is a constant expression that always
evaluates to "()". (A requirement by camlp4 that looks artificial.)
</p>
<p>When you set the representation encoding, it is required that the
encoding stored in the DTD object is the same. Remember that we
need a DTD object like
<code>
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
</code>
Of course, we must change this to the representation encoding, too,
in our example:
<code>
let dtd = Pxp_dtd.create_dtd `Enc_utf8;;
</code>
The preprocessor cannot check this at compile time, and for performance
reasons, a runtime check is not generated. So it is up to the programmer
to ensure that the character encodings are used in a consistent way.
</p>
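<p>To see why the two encodings must not be mixed up, consider how the same
text looks in ISO-8859-1 and in UTF-8. The following function is a small
stand-alone sketch using only the O'Caml standard library (it needs
O'Caml 4.06 or later); the name "latin1_to_utf8" is made up here, and the
function is not part of PXP.
<code><![CDATA[
(* Re-encode an ISO-8859-1 string as UTF-8. Every ISO-8859-1 byte value is
   also the Unicode code point of the character it denotes. *)
let latin1_to_utf8 s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar b (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents b
]]></code>
For example, the four bytes "caf\xE9" in ISO-8859-1 become the five bytes
"caf\xC3\xA9" in UTF-8. A string literal in one encoding that is stored in a
tree whose DTD declares the other encoding is therefore interpreted
wrongly as soon as it contains non-ASCII characters.</p>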
</sect2>
<sect2>
<title>Validated Trees</title>
<p>In order to validate trees, you need a filled DTD object.
In principle, you can create this object by a number of methods.
For example, you can parse an external file:
<code><![CDATA[
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_file "sample.dtd")
]]></code>
It is, however, often more convenient to include the DTD literally
in the program. This can be done with
<code><![CDATA[
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string "...")
]]></code>
As double quotes are often used inside DTDs, O'Caml string
literals are a bit impractical: they are also delimited by
double quotes, so one needs to add backslashes as escape characters.
The "pxp_text" notation is often more readable here:
&lt;:pxp_text&lt;STRING>> is just another way of writing
"STRING". For our example DTD, we have
<code><![CDATA[
let dtd_text =
<:pxp_text<
<!ELEMENT book (title,author)>
<!ATTLIST book id CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ATTLIST title lang CDATA "en">
<!ELEMENT author (#PCDATA)>
>>;;
let config = default_config;;
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;
]]></code>
Note that "pxp_text" is not restricted to DTDs, as it can be used
for any kind of string.</p>
<p>After we have the DTD, we can validate the trees. One
option is to call the "validate" function:
<code><![CDATA[
let book =
<:pxp_tree<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>;;
Pxp_document.validate book;;
]]></code>
(This example is invalid, as the "id" attribute is missing.)</p>
<p>Note that it is a misunderstanding that "pxp_tree" builds XML trees in
well-formedness mode. You can create any tree with it; "pxp_tree" simply
does not invoke the validator. So if the DTD enforces
validation, the tree is validated when the "validate" function is
called. If the DTD is in well-formedness mode, the tree is effectively
not validated, even when the "validate" function is invoked. By the way,
the following statements would create a DTD in well-formedness mode:
<code>
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # allow_arbitrary;;
</code>
As an alternative to calling the "validate" function, one can
use "pxp_vtree" instead. It immediately validates every XML element it
creates. However, "injected" subtrees are not validated, i.e. validation
does not proceed recursively to subnodes as the "validate" function
does.</p>
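<p>The difference between validating only the element just created and
validating a whole tree can be shown with a toy model. The types and the
single rule below (every "book" element needs an "id" attribute) are
assumptions for this illustration; they only mimic the sample DTD and do
not use the PXP validator.
<code><![CDATA[
(* Simplified nodes with attributes; not the PXP node type. *)
type node =
  | Element of string * (string * string) list * node list
  | Data of string

(* Toy rule standing in for a DTD: a book element must have an id. *)
let locally_valid = function
  | Element ("book", atts, _) -> List.mem_assoc "id" atts
  | _ -> true

(* Checks the node itself and then descends into all subnodes. *)
let rec deeply_valid n =
  locally_valid n
  && (match n with
      | Element (_, _, children) -> List.for_all deeply_valid children
      | Data _ -> true)
]]></code>
If a "book" element without "id" is injected into a "list" element, the
local check of the "list" element succeeds, and only the recursive check
finds the problem.</p>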
</sect2>
<sect2>
<title>Generating Events</title>
<p>As PXP also has an event model to represent XML, the preprocessor
can produce such events as well. In particular, there are two modes: The
"pxp_evlist" notation outputs lists of events (type "event list")
representing the XML expression. The "pxp_evpull" notation creates
an automaton from which one can "pull" events (like from a pull
parser).</p>
<p>These two notations work very much like "pxp_tree". For example,
<code><![CDATA[
let book =
<:pxp_evlist<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
generates
<code><![CDATA[
[ E_start_tag ("book", [], None, <obj>);
E_start_tag ("title", [], None, <obj>);
E_char_data "The Lord of The Rings";
E_end_tag ("title", <obj>);
E_start_tag ("author", [], None, <obj>);
E_char_data "J.R.R. Tolkien";
E_end_tag ("author", <obj>);
E_end_tag ("book", <obj>)
]
]]></code>
Note that you need neither a "dtd" variable nor a "spec" variable.
There is one important difference, however: Both nodes and lists
of nodes are represented by the same type, "event list". That
has the consequence that in the following example x1 and x2
have the same type "event list":
<code><![CDATA[
<:pxp_evlist< <element>x1 >>
<:pxp_evlist< <element>[x2] >>
<:pxp_evlist< <element><*>x3 >>
]]></code>
In principle, it could be checked at runtime whether x1 and x2
have the right structure. However, this is not done for
performance reasons.</p>
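<p>How a tree corresponds to a flat list of events, and how such a list can
be turned into a pull-style generator, can be sketched with simplified
types. The "node" and "event" types below are invented for this
illustration (the real PXP events carry more information), and the two
functions are not part of PXP.
<code><![CDATA[
(* Simplified stand-ins for PXP nodes and events. *)
type node =
  | Element of string * node list
  | Data of string

type event =
  | Start_tag of string
  | Char_data of string
  | End_tag of string

(* Flatten a tree into the event list that represents it. *)
let rec events_of_node = function
  | Data s -> [ Char_data s ]
  | Element (name, children) ->
      (Start_tag name :: List.concat (List.map events_of_node children))
      @ [ End_tag name ]

(* Turn an event list into a generator: every call returns the next
   event, and None once the list is exhausted (a choice of this sketch). *)
let pull_of_list evs =
  let rest = ref evs in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl ->
        rest := tl;
        Some e
]]></code>
Note that in the flat representation a single node and a list of nodes are
both just event lists, which is why the preprocessor cannot distinguish
them by type.</p>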
<p>As mentioned, "pxp_evpull" works like a pull parser.
After defining
<code><![CDATA[
let book =
<:pxp_evpull<
<book>
[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]
]
>>
]]></code>
"book" is a function 'a->event option. One can call it to get the events
one after the other:
<code><![CDATA[
let e1 = book();; (* = Some(E_start_tag ("book", [], None, <obj>)) *)
let e2 = book();; (* = Some(E_start_tag ("title", [], None, <obj>)) *)
...