<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE readme SYSTEM "readme.dtd" [

<!ENTITY % common SYSTEM "common.xml">

<!-- Special HTML config: -->
<!ENTITY % readme:html:up '<a href="../..">up</a>'>

<!ENTITY % config SYSTEM "config.xml">

]>

<readme title="Extensions of the XML specification">

    <title>This document</title>
    <p>This parser has some options that extend the XML specification. This
document explains them.</p>

    <title>Optional declarations instead of mandatory declarations</title>

<p>For a document to be valid, the XML specification requires that elements,
notations, and attributes be declared. However, in some situations a different
rule would be better: <em>If</em> there is a declaration, the actual instance of
the element type, notation reference or attribute must match the pattern of the
declaration; but if the declaration is missing, a reasonable default declaration
should be assumed.</p> 

<p>A typical example is the inclusion of HTML in a
meta language. Imagine you have defined some type of "generator" or other tool
working with HTML fragments, and your document contains two types of elements:
the generating elements (with a name like "gen:xxx"), and the object elements,
which are HTML. As HTML is still evolving, you do not want to declare the HTML
elements; the HTML fragments should be treated as well-formed XML fragments. In
contrast, the elements of the generator should be declared and
validated so that errors can be detected more easily.</p> 

<p>The following two processing instructions can be included in the DTD:</p>
    <ul>
      <li><p><code><![CDATA[<?pxp:dtd optional-element-and-notation-declarations?>]]></code>
	References to unknown element types and notations no longer cause an
	error. Such an element may contain anything, but it must still be
	well-formed. It may have arbitrary attributes, and every attribute is
	treated as an #IMPLIED CDATA attribute.</p></li>
      <li><p><code><![CDATA[<?pxp:dtd optional-attribute-declarations elements="x y ..."?>]]></code>
        References to unknown attributes inside one of the enumerated elements
        no longer cause an error. Such an attribute is treated as an #IMPLIED
        CDATA attribute.</p></li>
    </ul>

<p>If there are several "optional-attribute-declarations" PIs, they are all
interpreted (implicitly merged).</p>
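
<p>As a minimal sketch of the generator scenario described above, a DTD might
combine both processing instructions as follows. The element names gen:page and
gen:include and the attribute href are hypothetical and serve only as an
illustration; only the two processing instructions are taken from this
document. It is also assumed here that ANY content admits the undeclared HTML
elements.</p>

```xml
<!-- Hypothetical DTD fragment: only the generator elements are declared. -->
<?pxp:dtd optional-element-and-notation-declarations?>
<?pxp:dtd optional-attribute-declarations elements="gen:include"?>

<!-- Declared generator elements are validated as usual. -->
<!ELEMENT gen:page ANY>
<!ELEMENT gen:include EMPTY>
<!ATTLIST gen:include href CDATA #REQUIRED>

<!-- There are no declarations for HTML elements such as p or em.
     With the first processing instruction they only need to be well-formed. -->
```

<p>With such a DTD, an undeclared attribute on gen:include would be treated as
an #IMPLIED CDATA attribute, while a missing href attribute would still be
reported as a validation error.</p>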

    <title>Normalized namespace prefixes</title>
<p>The XML standard refers to names within namespaces as <em>expanded
names</em>. An expanded name is simply the pair (namespace_uri, localname); the
namespace prefix is not included in the expanded name.</p>
<p>PXP does not support expanded names, but it does support namespaces. However,
it uses a model that differs slightly from the usual representation of
names in namespaces: instead of removing the namespace prefixes and converting
the names into expanded names, PXP normalizes the namespace
prefixes used in a document, i.e. the prefixes are transformed such that each
prefix refers uniquely to one namespace.</p>
The following text is valid XML:

<x:a xmlns:x="namespace1">
  <x:a xmlns:x="namespace2">

The first element has the expanded name (namespace1,a) while the second element
has the expanded name (namespace2,a), so the elements have different types. As
already pointed out, PXP does not support expanded names directly. Instead, the
XML text is transformed while it is being parsed such that the prefixes become
unique. In this example, the transformed text would read:

<x:a xmlns:x="namespace1">
  <x1:a xmlns:x1="namespace2">

From a programmer's point of view, this transformation has the advantage that
you need not deal with pairs when comparing names, as all names are still
simple strings: here, "x:a" and "x1:a". However, the transformation may seem
a bit arbitrary. Why not "y:a" instead of "x1:a"? The answer is that PXP allows
the programmer to control the transformation: you can simply demand that
namespace1 must use the normalized prefix "x", and namespace2 must use "y". The
normalized prefix to use can be declared programmatically (by setting the
namespace_manager object), or it can be declared in the DTD:

<?pxp:dtd namespace prefix="x" uri="namespace1"?>
<?pxp:dtd namespace prefix="y" uri="namespace2"?>

There is another advantage of using normalized prefixes: You can safely refer
to them in DTDs. For example, you could declare the two elements as

<!ELEMENT x:a (y:a)>

These declarations are applicable even if the XML text uses different prefixes,
because PXP normalizes any prefixes for namespace1 or namespace2 to the
preferred prefixes "x" and "y".
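
To illustrate, assume the two namespace processing instructions above are in
effect. A document that uses other prefixes (the prefixes "p" and "q" are made
up for this sketch) would then be normalized roughly as follows:

```xml
<!-- Input text: uses the hypothetical prefixes p and q -->
<p:a xmlns:p="namespace1">
  <q:a xmlns:q="namespace2"/>
</p:a>

<!-- Text as seen after prefix normalization: p becomes x, q becomes y -->
<x:a xmlns:x="namespace1">
  <y:a xmlns:y="namespace2"/>
</x:a>
```

The normalized names "x:a" and "y:a" are the ones that are matched against the
element declarations in the DTD.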

<p>Since PXP-1.1.95, the namespace support has been extended. In
addition to prefix normalization, the parser now also stores the
scoping structure of the namespaces (in the namespace_scope
objects). Roughly speaking, the parser remembers
which elements carry which "xmlns" attributes. There are two
important applications of this feature:</p>

<p>First, it is now possible to look up the namespace URI when
only the original, non-normalized namespace prefix is known.
A number of XML standards, e.g. XML Schema, use namespace prefixes
within data nodes. Of course, these prefixes are not normalized
by PXP, but simply remain as they are when the XML text is
parsed. To get the URI of such a prefix p in the context of node
n, just call

n # namespace_scope # uri_of_display_prefix p

In PXP terminology, the non-normalized prefixes are now called
"display prefixes".</p>
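
<p>For illustration, the following fragment (a made-up example, not taken from
the PXP distribution) shows a prefix occurring inside a data node. The prefix
"xs" in the value of the type attribute is not part of an element or attribute
name, so it is left untouched when the text is parsed and has to be resolved as
a display prefix:</p>

```xml
<!-- The "xs" inside type="xs:string" is data, not part of a node name. -->
<xs:element xmlns:xs="http://www.w3.org/2001/XMLSchema"
            name="title" type="xs:string"/>
```

<p>In the context of this node, looking up the display prefix "xs" would be
expected to yield the URI "http://www.w3.org/2001/XMLSchema".</p>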

<p>The other application is that it is now even possible to
retrieve the original "display" prefix of node names, e.g.

n # display_prefix

returns it. However, the display prefix is only a guess: when several
prefixes are bound to the same URI, any one of them may be
returned. For instance, in

<x:a xmlns:x="sample" xmlns:y="sample"/>

both "x" and "y" are bound to the same URI "sample", and
the display_prefix method now selects one of the prefixes
at random.</p>

<p>It is now even possible to output the parsed XML text
with its original namespace structure: The "display" method
outputs XML text where the namespaces are declared as in the
original XML text.</p>

<p>PXP treats the "xmlns" attributes in a special
way. Not only may they be left undeclared in the DTD; any such declarations
would not even be applied to the actual "xmlns" attributes. For example,
it is not possible to declare a default value for "xmlns:x", as in

<!ATTLIST ... xmlns:x CDATA "mynamespaceuri">

The default value would be ignored. Furthermore, it is not possible to
declare "xmlns" attributes as being required: validation will always
fail even if the "xmlns" attribute is present.</p>

<p>The model behind this treatment is defined by the "XML information
set" standard. There are two kinds of attributes: normal attributes,
and namespace attributes. PXP validates only normal attributes.</p>
