diff --git a/configure b/configure
index 743d05bbd4117b4eb3c51a96c137c30d4381edd9..452d0d24632a1fc15c313f525c1ad2ac69cf6c6d 100755
--- a/configure
+++ b/configure
@@ -7,16 +7,18 @@ with_lex=1
with_wlex=1
with_wlex_compat=1
with_ulex=1
+with_pp=1
lexlist="utf8,iso88591,iso88592,iso88593,iso88594,iso88595,iso88596,iso88597,iso88598,iso88599,iso885910,iso885913,iso885914,iso885915,iso885916"
-version="1.1.95test1"
+version="1.1.95test2"
exec_suffix=""
help_lex="Enable/disable ocamllex-based lexical analyzer for the -lexlist encodings"
help_wlex="Enable/disable wlex-based lexical analyzer for UTF-8"
help_wlex_compat="Enable/disable wlex-style compatibility package for UTF-8 and ISO-8859-1"
help_ulex="Enable/disable ulex-based lexical analyzer for UTF-8"
+help_pp="Enable/disable the build of the preprocessor (pxp-pp)"
-options="lex wlex wlex_compat ulex"
+options="lex wlex wlex_compat ulex pp"
lexlist_options="utf8 usascii iso88591 iso88592 iso88593 iso88594 iso88595 iso88596 iso88597 iso88598 iso88599 iso885910 iso885913 iso885914 iso885915 iso885916 koi8r windows1250 windows1251 windows1252 windows1253 windows1254 windows1255 windows1256 windows1257 windows1258 cp437 cp737 cp775 cp850 cp852 cp855 cp856 cp857 cp860 cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp1006 macroman"
print_options () {
@@ -189,6 +191,7 @@ if [ $with_wlex -gt 0 ]; then
echo "not found"
echo "wlex support is disabled"
with_wlex=0
+ with_wlex_compat=0
fi
fi
@@ -206,6 +209,12 @@ if [ $with_ulex -gt 0 ]; then
fi
fi
+# If ulex not found/disabled, also disable pxp-pp:
+
+if [ $with_ulex -eq 0 ]; then
+ with_pp=0
+fi
+
######################################################################
# Check Lexing.lexbuf type
@@ -229,6 +238,25 @@ fi
rm -f tmp.*
+######################################################################
+# Check type of camlp4 locations
+
+printf "%s" "Checking type of camlp4 location... "
+cat <
Since PXP-1.1.95, the namespace support has been extended. In
+addition to prefix normalization, the parser now also stores the
+scoping structure of the namespaces (in the namespace_scope
+objects). More or less, this means that the parser remembers
+which elements have which "xmlns" attributes. There are two
+important applications of this feature:
+
+First, it is now possible to look up the namespace URI when
+only the original, non-normalized namespace prefix is known.
+A number of XML standards, e.g. XSchema, use namespace prefixes
+within data nodes. Of course, these prefixes are not normalized
+by PXP, but simply remain as they are when the XML text is
+parsed. To get the URI of such a prefix p in the context of node
+n, just call
+
+
+n # namespace_scope # uri_of_display_prefix p
+
+
+In PXP terminology, the non-normalized prefixes are now called
+"display prefixes".
The other application is that it is now even possible to
+retrieve the original "display" prefix of node names, e.g.
+
+
+n # display_prefix
+
+
+returns it. However, the display prefix is only guessed in the
+sense that when there are several prefixes bound to the same
+URI, one of the prefixes may be taken. For instance, in
+
+
+]]>
+
+both "x" and "y" are bound to the same URI "sample", and
+the display_prefix method now selects one of the prefixes
+at random.
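The two lookups described above can be illustrated with a small stand-alone model. This is only a sketch and not PXP's implementation: the record type `scope`, the `option` return values, and the function `display_prefix_of_uri` are assumptions made for illustration (only the name `uri_of_display_prefix` comes from the text), and the model ignores prefix shadowing.

```ocaml
(* Toy model, not PXP code: a scope holds the xmlns declarations of one
   element plus a link to the scope of the enclosing element. *)
type scope = {
  decls : (string * string) list;  (* display prefix -> namespace URI *)
  parent : scope option;
}

(* Look up a display prefix, innermost declaration first. *)
let rec uri_of_display_prefix scope prefix =
  match List.assoc_opt prefix scope.decls with
  | Some uri -> Some uri
  | None ->
      (match scope.parent with
       | Some p -> uri_of_display_prefix p prefix
       | None -> None)

(* Reverse lookup (hypothetical helper): return some prefix bound to the
   URI. With several candidates the first one found wins, which is why
   the result is only a guess. *)
let rec display_prefix_of_uri scope uri =
  match List.find_opt (fun (_, u) -> u = uri) scope.decls with
  | Some (prefix, _) -> Some prefix
  | None ->
      (match scope.parent with
       | Some p -> display_prefix_of_uri p uri
       | None -> None)

(* Both "x" and "y" are bound to the URI "sample". *)
let sample_scope = { decls = [ ("x", "sample"); ("y", "sample") ]; parent = None }
let nested_scope = { decls = []; parent = Some sample_scope }
```

In this model the forward lookup is unambiguous for both prefixes, while the reverse lookup depends on the order of the declarations.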
It is now even possible to output the parsed XML text
+with the original namespace structure: The "display" method
+outputs XML text where the namespaces are declared as in the
+original XML text.
+Regarding the "xmlns" attributes, PXP treats them in a very special way. Not only may they be left undeclared in the DTD; such declarations would not even be applied to the actual "xmlns" attributes. For example,
diff --git a/doc/INSTALL.xml b/doc/INSTALL.xml
index e83fdc3f7b1039df709f58f2398d84814adbb965..8a7c1a45b86c96d482be76a2e17b36298a35e3e6 100644
--- a/doc/INSTALL.xml
+++ b/doc/INSTALL.xml
@@ -58,6 +58,17 @@ the runtime part of wlex, and not the "wlex" command itself.
-with-wlex-compat
Creates a compatibility package pxp-wlex that includes lexers for UTF8 and ISO-8859-1 (may be required to build old software)
+
+-with-ulex
+Enables the lexical analyzer that works for UTF-8 as internal encoding, and that is based on Alain Frisch's ulex tool. It
+is relatively small, but a bit slower than the ocamllex-based lexers.
+ulex will supersede wlex soon.
+-with-pp
+Enables the PXP preprocessor (installed as package pxp-pp).
+See the file PREPROCESSOR for details. The preprocessor also requires ulex.
-lexlist <list-of-encodings>
diff --git a/doc/Makefile b/doc/Makefile
index 0422237e94a225f9011bdc1b16bd4dcac517696d..80a18f6feef7c30ba0512b6c0e663c2214a4f0c8 100644
--- a/doc/Makefile
+++ b/doc/Makefile
@@ -23,7 +23,7 @@ installrel = $$HOME/homepage/ocaml-programming.de/packages/documentation/pxp/ind
.PHONY: all
-all: README INSTALL ABOUT-FINDLIB SPEC EXTENSIONS
+all: README INSTALL ABOUT-FINDLIB SPEC EXTENSIONS PREPROCESSOR
README: README.xml common.xml config.xml readme.dtd
$(readme) -text README.xml >README
@@ -40,6 +40,9 @@ SPEC: SPEC.xml common.xml config.xml readme.dtd
EXTENSIONS: EXTENSIONS.xml common.xml config.xml readme.dtd
$(readme) -text EXTENSIONS.xml >EXTENSIONS
+PREPROCESSOR: PREPROCESSOR.xml common.xml config.xml readme.dtd
+ $(readme) -text PREPROCESSOR.xml >PREPROCESSOR
+
DEV: DEV.xml common.xml config.xml readme.dtd
$(readme) -text DEV.xml >DEV
#$(readme) -html DEV.xml >$(installdev)
diff --git a/doc/PREPROCESSOR.xml b/doc/PREPROCESSOR.xml
new file mode 100644
index 0000000000000000000000000000000000000000..aab3bdf941b3034e90973949acd7470bdeb01724
--- /dev/null
+++ b/doc/PREPROCESSOR.xml
@@ -0,0 +1,745 @@
+
+
+%common;
+
+PXP">
+
+]>
+
+Since PXP-1.1.95, there is a preprocessor as part of the PXP
+distribution. It allows you to compose XML trees and event lists
+dynamically, which is very handy to write XML transformations.
+
+To enable the preprocessor, compile your source files as in:
+
+
+ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...
+
+
+The package pxp-pp contains the preprocessor. The -syntax option
+enables camlp4, on which the preprocessor is based. It is also
+possible to use it together with the revised syntax; use
+"-syntax camlp4r" in this case.
In the toploop, type
+
+
+ocaml
+# #use "topfind";;
+# #camlp4o;;
+# #require "pxp-pp";;
+# #require "pxp";;
+
+
The preprocessor defines the following new syntax notations,
+explained below in detail:
+
+<:pxp_charset< CHARSET_DECL >>
+<:pxp_tree< EXPR >>
+<:pxp_vtree< EXPR >>
+<:pxp_evlist< EXPR >>
+<:pxp_evpull< EXPR >>
+<:pxp_text< TEXT >>
+]]>
+
+The basic notation is "pxp_tree" which creates a tree of PXP document
+nodes as described in EXPR. "pxp_vtree" is the variant where the tree
+is immediately validated. "pxp_evlist" creates a list of PXP events
+instead of nodes, useful together with the event-based parser.
+"pxp_evpull" is a variation of the latter: Instead of an event list
+an event generator is created that works like a pull parser.
The "pxp_charset" notation only configures the character sets to
+assume. Finally, "pxp_text" is a notation for string literals.
+
+The following examples are all written for "pxp_tree". You can
+also use one of the other XML composers instead, but see the notes
+below.
+
+In order to use "pxp_tree", you must define two variables in
+the environment: "spec" and "dtd":
+
+
+let spec = Pxp_tree_parser.default_spec;;
+let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
+
+
+These variables occur in the code generated by the preprocessor.
+The "dtd" variable is the DTD object. Note that you need it even
+in well-formedness mode (validation turned off). The "spec" variable
+controls which classes are instantiated as node representation
+(see PXP manual).
Now you can create XML trees like in
+
+
+ [
+
+As you can see, the syntax is somewhat XML-related but not really XML.
+(Many ideas are borrowed from CDUCE, by the way.) In particular,
+there are start tags like <title> but no end tags. Instead,
+we are using square brackets to denote the children of an XML
+element. Furthermore, character data must be put into double
+quotes.
You may ask why the well-known XML syntax has been modified for
+this preprocessor. There are many reasons, and they will become
+clearer in the following explanations. For now, you can see the advantage
+that the syntax is less verbose, as you need not repeat the
+element names in end tags. Furthermore, you can exactly control
+which characters are part of the data nodes without having to make
+compromises with indentation.
+
+Attributes are written as in XML:
+
+
+ [
+
An element without children can be written
+
+[]
+]]>
+
+or slightly shorter:
+
+
+]]>
+
You can also create processing instructions and comment nodes:
+
+
+ [ "Now the list of books follows!"
+ >"formatter_directive" "one book per page"
+ book
+ ]
+ >>
+]]>
+
+The notation "<!>" creates a comment node with the following string
+as contents. The notation "<?>" needs two strings, first the target,
+then the value (here, this results in
+"<?formatter_directive one book per page?>").
Look again at the last example: The O'Caml variable "book" occurs,
+and it inserts its tree into the list of books. Identifiers
+without "decoration" just refer to O'Caml variables. We will see
+more examples below.
+
+The preprocessor syntax knows a number of shortcuts and variations.
+First, you can omit the square brackets when an element has exactly
+one child:
+
+
+
+This is the same as
+
+[
+
+Second, you are already used to a common abbreviation: Strings are
+automatically converted to data nodes. The "expanded" syntax is
+
+"Data string"
+]]>
+
+where "<*>" denotes a data node, and the following string is
+used as contents. Usually, you can omit "<*>". However, there
+are a few occasions where this notation is still useful, see below.
In strings, the usual entity references can be used:
+"Double quotes: &quot;". For a newline character,
+write &#10;.
+
+The preprocessor knows two operators: "^" concatenates strings,
+and "@" concatenates lists. Examples:
+
+[ "Word1" ^ "Word2" ]
+
Parentheses can be used to clarify precedence. For example:
+
+(l1 @ l2)
+]]>
+
+Here, the concatenation operator "@" could also be parsed as
+
+ l1) @ l2
+]]>
+
+Parentheses may be used in every expression.
Rarely used, there is also a notation for the
+"super root" nodes (see the PXP manual for their meaning):
+
+[
+
Let us begin with an example. The task is to convert
+O'Caml values of type
+
+
+
+to XML trees like
+
+
+
+
+(conventional syntax). When b is the book variable, the solution is
+
+
+ [
+
+First, we bind the simple O'Caml variables "title", "author", and
+"isbn". The reason is that the preprocessor syntax does not allow
+expressions like "b.title" directly in the XML tree (but see below
+for a better workaround).
The XML tree contains the O'Caml variables. The "id" attribute
+is a concatenation of the fixed prefix "BOOK_" and the contents of
+"isbn". The "title" and "author" elements contain a data node
+whose contents are the O'Caml strings "title" and "author",
+respectively.
+
+Why "<*>"? If we just wrote "<title>title", the
+generated code would assume that the "title" variable is an XML node,
+and not a string. From this point of view, "<*>" works like
+a type annotation, as it specialises the type of the following
+expression.
+
+Here is an alternate solution:
+
+
+ [
+
+The notation "(: ... :)" allows you to include arbitrary O'Caml
+expressions into the tree. In this solution it is no longer necessary
+to create artificial O'Caml variables for the only purpose of
+injecting values into trees.
+
It is possible to create XML elements with dynamic names:
+Just put parentheses around the expression. Example:
+
+ ... >>
+]]>
+
+With the same notation, one can also set attribute names dynamically:
+
+ ... >>
+]]>
+
+Finally, it is also possible to include complete attribute lists
+dynamically:
+
+ ... >>
+]]>
+
Typing: Depending on where a variable or O'Caml expression occurs,
+different types are assumed. Compare the following examples:
+
+x1 >>
+<:pxp_tree<
+
+As a rule of thumb, the most general type is assumed that would make
+sense at a certain location. As x1 could be replaced by a list
+of children, its type is assumed to be a node list. As x2 could
+be replaced by a single node, its type is assumed to be a node.
+And x3 is a string; we have already seen this case.
+
As the preprocessor generates code that builds XML trees, it
+must know two character encodings:
+
+Which encoding is used in the source code (in the .ml file)
+
Which encoding is used in the XML representation, i.e.
+in the O'Caml values representing the XML trees
Both encodings can be set independently. The syntax is:
+
+>
+]]>
+
+The default is ISO-8859-1 for both encodings. For example, to set
+the representation encoding to UTF-8, use:
+
+>
+]]>
+
+The "pxp_charset" notation is a constant expression that always
+evaluates to "()". (A requirement by camlp4 that looks artificial.)
+
When you set the representation encoding, it is required that the
+encoding stored in the DTD object is the same. Remember that we
+need a DTD object like
+
+
+let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
+
+
+Of course, we must change this to the representation encoding, too,
+in our example:
+
+
+let dtd = Pxp_dtd.create_dtd `Enc_utf8;;
+
+
+The preprocessor cannot check this at compile time, and for performance
+reasons, a runtime check is not generated. So it is up to the programmer
+to ensure that the character encodings are used in a consistent way.
+
In order to validate trees, you need a filled DTD object.
+In principle, you can create this object by a number of methods.
+For example, you can parse an external file:
+
+
+
+It is, however, often more convenient to include the DTD literally
+into the program. This works by
+
+
+
+As the double quotes are often used inside DTDs, O'Caml string
+literals are a bit impractical, as they are also delimited by
+double quotes, and one needs to add backslashes as escape characters.
+The "pxp_text" notation is often more readable here:
+<:pxp_text<STRING>> is just another way of writing
+"STRING". In our DTD, we have
+
+
+
+
+
+
+ >>;;
+let config = default_config;;
+let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;
+]]>
+
+Note that "pxp_text" is not restricted to DTDs, as it can be used
+for any kind of string.
After we have the DTD, we can validate the trees. One
+option is to call the "validate" function:
+
+
+ [
+
+(This example is invalid, as the "id" attribute is missing.)
Note that it is a misunderstanding that "pxp_tree" builds XML trees in
+well-formed mode. You can create any tree with it, and the fact is that
+"pxp_tree" just does not invoke the validator. So if the DTD enforces
+validation, the tree is validated when the "validate" function is
+called. If the DTD is in well-formedness mode, the tree is effectively
+not validated, even when the "validate" function is invoked. Btw,
+the following statements would create a DTD in well-formedness mode:
+
+
+let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
+dtd # allow_arbitrary;
+
+
+As an alternative to calling the "validate" function, one can also
+use "pxp_vtree" instead. It immediately validates every XML element it
+creates. However, "injected" subtrees are not validated, i.e. validation
+does not proceed recursively to subnodes as the "validate" function
+does.
As PXP also has an event model to represent XML, the preprocessor
+can also produce such events. In particular, there are two modes: The
+"pxp_evlist" notation outputs lists of events (type "event list")
+representing the XML expression. The "pxp_evpull" notation creates
+an automaton from which one can "pull" events (like from a pull
+parser).
+
+These two notations work very much like "pxp_tree". For example,
+
+
+ [
+
+generates
+
+);
+ E_start_tag ("title", [], None,
+
+Note that you neither need a "dtd" variable nor a "spec" variable.
+There is one important difference, however: Both nodes and lists
+of nodes are represented by the same type, "event list". That
+has the consequence that in the following example x1 and x2
+have the same type "event list":
+
+x1 >>
+<:pxp_evlist<
+
+In principle, it could be checked at runtime whether x1 and x2
+have the right structure. However, this is not done for
+performance reasons.
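Why a single node and a list of nodes end up with the same type can be seen in a small stand-alone model. This is a sketch under assumptions: the `event` type below is a simplified stand-in for PXP's event type, and the helper `element` is hypothetical, not code generated by the preprocessor.

```ocaml
(* Simplified stand-in for PXP's event type (the real one has more
   constructors and arguments). *)
type event =
  | Start_tag of string
  | Data of string
  | End_tag of string

(* Hypothetical helper: an element is its start tag, the events of all
   children, and its end tag. Each child is itself an event list, so a
   single node and a sequence of nodes cannot be told apart by type. *)
let element (name : string) (children : event list list) : event list =
  (Start_tag name :: List.concat children) @ [ End_tag name ]

let title : event list = element "title" [ [ Data "T" ] ]
let author : event list = element "author" [ [ Data "A" ] ]

(* One node (title) and two nodes in sequence (title @ author) are both
   accepted as a child, because both are plain event lists. *)
let book1 = element "book" [ title ]
let book2 = element "book" [ title @ author ]
```

Since the type checker cannot distinguish the two cases, only a runtime inspection of the list could detect a structurally wrong argument.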
As mentioned, "pxp_evpull" works like a pull parser.
+After defining
+
+
+ [
+
+"book" is a function 'a->event option. One can call it to get the events
+one after the other:
+
+)) *)
+let e2 = book();; (* = Some(E_start_tag ("title", [], None,
+
+After the last event, "book" returns None to indicate the end of the
+event stream.
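The pull protocol can be modelled without PXP. The following is a minimal sketch under assumptions: the `event` type is a simplified stand-in, `pull_of_list` is a hypothetical helper, and the automaton takes `unit` rather than an arbitrary argument.

```ocaml
(* Simplified stand-in for PXP's event type. *)
type event =
  | Start_tag of string
  | Data of string
  | End_tag of string

(* Hypothetical helper: turn an event list into a pull function. Each
   call returns the next event, and None once the list is exhausted. *)
let pull_of_list (events : event list) : unit -> event option =
  let remaining = ref events in
  fun () ->
    match !remaining with
    | [] -> None
    | e :: rest ->
        remaining := rest;
        Some e

let book = pull_of_list [ Start_tag "title"; Data "Some title"; End_tag "title" ]

let e1 = book ()  (* first event *)
let e2 = book ()  (* second event *)
let e3 = book ()  (* third event *)
let e4 = book ()  (* stream is exhausted *)
```

Every further call after the end of the stream keeps returning None in this model.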
As for "pxp_evlist", it is not possible to distinguish between
+nodes and node lists. In this example, both x1 and x2 are assumed
+to have type 'a->event option:
+
+x1 >>
+<:pxp_evlist<
+
+Note that "<element>x1" actually means to build a new pull automaton
+around the existing pull automaton x1: The children of "element" are
+retrieved by pulling events from x1 until "None" is returned.
A consequence of the pull semantics is that once an event
+is obtained from an automaton, the state of the automaton is modified
+such that it is not possible to get the same event again. If you need
+an automaton that can be reset to the beginning, just wrap the
+"pxp_evlist" notation into a functional abstraction:
+
+ ... >>;;
+let book1 = book_maker();;
+let book2 = book_maker();;
+]]>
+
+This way, "book1" and "book2" are independent event streams.
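The effect of the functional abstraction can be reproduced in a stand-alone model. This is a sketch, assuming events are represented as plain strings and using a hypothetical helper `pull_of_list`; only the names `book_maker`, `book1`, and `book2` are taken from the text.

```ocaml
(* Hypothetical helper: a pull function over a list with private state. *)
let pull_of_list items =
  let remaining = ref items in
  fun () ->
    match !remaining with
    | [] -> None
    | x :: rest ->
        remaining := rest;
        Some x

(* Each call of book_maker allocates fresh state, so the resulting
   automatons do not influence each other. *)
let book_maker () = pull_of_list [ "start:book"; "end:book" ]

let book1 = book_maker ()
let book2 = book_maker ()

(* Consume book1 completely; book2 still starts at the beginning. *)
let a1 = book1 ()
let a2 = book1 ()
let a3 = book1 ()
let b1 = book2 ()
```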
There is another implication of the nature of the
+automatons: Subexpressions are lazily evaluated. For example,
+in
+
+[ <*> (: get_data_contents() :) ] >>
+]]>
+
+the call of get_data_contents is performed just before the event
+for the data node is constructed.
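The lazy evaluation can be made visible with a counter in a stand-alone model. This is a sketch under assumptions: events are plain strings, `pull_of_thunks` is a hypothetical helper, and `get_data_contents` is a dummy standing in for the function of the same name in the text.

```ocaml
(* Counts how often get_data_contents has been called. *)
let calls = ref 0

(* Dummy stand-in for the function used in the text. *)
let get_data_contents () =
  incr calls;
  "contents"

(* Hypothetical helper: a pull function over suspended computations.
   Each thunk is forced only when its event is actually pulled. *)
let pull_of_thunks (thunks : (unit -> string) list) : unit -> string option =
  let remaining = ref thunks in
  fun () ->
    match !remaining with
    | [] -> None
    | t :: rest ->
        remaining := rest;
        Some (t ())

let next =
  pull_of_thunks
    [ (fun () -> "start:element");
      (fun () -> "data:" ^ get_data_contents ());
      (fun () -> "end:element") ]

let calls_before = !calls      (* nothing has been evaluated yet *)
let ev1 = next ()              (* start tag; data not yet computed *)
let calls_after_start = !calls
let ev2 = next ()              (* now get_data_contents runs *)
let calls_after_data = !calls
```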
By default, the preprocessor does not generate nodes or
+events that support namespaces. It can, however, be configured
+to create namespace-aware XML aggregations.
+
+
+In any case, you need a namespace manager. This is an object
+that tracks the usage of namespace prefixes in XML nodes. For example,
+we can create a namespace manager that knows the "html" prefix:
+
+
+
+Here, we declare that we want to use the "html" prefix for the
+internal representation of the XML nodes. This kind of prefix is
+called normalized prefix, or normprefix for short. It is possible to configure
+different prefixes for the external representation, i.e. when the
+XML tree is printed to a file. This other kind of prefix is called
+display prefix. We will have a look at them later.
Next, we must tell the DTD object that we have a namespace manager:
+
+
+
For "pxp_evlist" and "pxp_evpull" we are now prepared (note that
+we now need a "dtd" variable, as the DTD object knows the namespace
+manager). For "pxp_tree" and "pxp_vtree", it is required to use
+a namespace-aware specification:
+
+
+
+(Normal specifications do not work; you would get "Namespace method
+not applicable" errors if you tried to use them.)
The special notation "<:autoscope>" enables namespace mode in
+this example:
+
+
+
+
+In particular, "<:autoscope>" defines a new O'Caml variable for
+its subexpression: "scope". This variable contains the namespace
+scope object, which contains the namespace declarations for the
+subexpression. "<:autoscope>" initialises this variable from the
+namespace manager so that it now contains a declaration for the
+"html" prefix.
In general, the namespace scope object contains the prefixes to use for the
+external representation. For this simple example, we have chosen
+to use the same prefixes as for the internal representation,
+and "<:autoscope>" performs the right initialisations for this.
+
+Print the tree by
+
+
+
+The point is to call the "display" method and not the "write" method.
+The latter would not respect the display prefixes.
+
Alternatively, we can also create the "scope" variable manually:
+
+
+
+
+Note that we now use "<:scope>". In this simple form, this
+construct just enables namespace mode, and takes the "scope"
+variable from the environment.
Furthermore, the namespace scope now contains a different
+namespace declaration: The display prefix "" is used for HTML. The
+empty prefix just means to declare a default prefix
+(by xmlns="URI"). The effect can be seen when the XML tree
+is printed by calling the "display" method.
+
+Here is a third variant of the same example:
+
+
+
+
+The "scope" is now initially empty. The "<:scope>" notation is
+used to extend the scope for the time the subexpression is
+evaluated.
There is also a notation "<:emptyscope>" that creates
+an empty scope object, so one could even write
+
+
+ <:scope ("")="http://www.w3.org/1999/xhtml">
+
+
It is recommended to create the "scope" variable manually with
+a reasonable initial declaration, and to use "<:scope>" to
+enable namespace processing, and to extend the scope when necessary.
+The advantage of this approach is that the same scope object can be
+shared by many XML nodes, so you need less memory.
+
+One tip: To get a namespace scope that is initialised with all
+prefixes of the namespace manager (as <:autoscope> does), define
+
+
+let scope = create_namespace_scope ~decl: mng#as_declaration mng
+
+
For event-based processing of XML, the namespace mode works in
+the same way as described here; there is no difference.
+