# Oga

**NOTE:** my spare time is limited, which means I am unable to dedicate a lot of
time to Oga. If you're interested in contributing to FOSS, please take a look at
the open issues and submit a pull request to address them where possible.

Oga is an XML/HTML parser written in Ruby. It provides an easy to use API for
parsing, modifying and querying documents (using XPath expressions). Oga does
not require system libraries such as libxml, making it easier and faster to
install on various platforms. To achieve better performance Oga uses a small,
native extension (C for MRI/Rubinius, Java for JRuby).

Oga provides an API that allows you to safely parse and query documents in a
multi-threaded environment, without having to worry about your applications
blowing up.

From [Wikipedia][oga-wikipedia]:

> Oga: A large two-person saw used for ripping large boards in the days before
> power saws. One person stood on a raised platform, with the board below him,
> and the other person stood underneath them.

The name is a pun on [Nokogiri][nokogiri].

## Versioning Policy

Oga uses the version format `MAJOR.MINOR` (e.g. `2.1`). An increase of the MAJOR
version indicates backwards incompatible changes were introduced. The MINOR
version is _only_ increased when changes are backwards compatible, regardless of
whether those changes are bugfixes or new features. Up until version 1.0 the
code should be considered unstable, meaning it can change (and break) at any
given moment.

APIs explicitly tagged as private (e.g. using Ruby's `private` keyword or YARD's
`@api private` tag) are not covered by these rules.
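
For applications that depend on Oga, this policy maps onto RubyGems' pessimistic version constraint. The following is only a sketch using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems; the helper name `compatible_versions` and the version numbers are made up for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: returns the versions that satisfy a constraint.
def compatible_versions(constraint, versions)
  requirement = Gem::Requirement.new(constraint)

  versions.select do |version|
    requirement.satisfied_by?(Gem::Version.new(version))
  end
end

# "~> 2.1" accepts any 2.x release from 2.1 onwards but rejects 3.0, which
# under this policy may contain backwards incompatible changes.
puts compatible_versions('~> 2.1', %w[2.0 2.1 2.5 3.0]).inspect
# => ["2.1", "2.5"]
```

Pinning to the MAJOR version this way allows MINOR updates (bugfixes and new features) without opting in to breaking changes.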

## Examples

Parsing a simple string of XML:

    Oga.parse_xml('<people><person>Alice</person></people>')

Parsing XML using strict mode (disables automatic tag insertion):

    Oga.parse_xml('<people>foo</people>', :strict => true) # works fine
    Oga.parse_xml('<people>foo', :strict => true)          # raises an error

Parsing a simple string of HTML:

    Oga.parse_html('<link rel="stylesheet" href="foo.css">')

Parsing an IO handle pointing to XML (this also works when using
`Oga.parse_html`):

    handle = File.open('path/to/file.xml')

    Oga.parse_xml(handle)

Parsing an IO handle using the pull parser:

    handle = File.open('path/to/file.xml')
    parser = Oga::XML::PullParser.new(handle)

    parser.parse do |node|
      parser.on(:text) do
        puts node.text
      end
    end

Using an Enumerator to download and parse an XML document on the fly:

    enum = Enumerator.new do |yielder|
      HTTPClient.get('http://some-website.com/some-big-file.xml') do |chunk|
        yielder << chunk
      end
    end

    document = Oga.parse_xml(enum)
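
The same pattern applies to any source that produces data in pieces, not just HTTP responses. As a minimal sketch that avoids the network, the hypothetical helper `chunk_enumerator` below wraps an IO object in an Enumerator that yields fixed-size chunks. The helper name and the chunk size are assumptions made for this example; the resulting object would be handed to `Oga.parse_xml` in the same way as `enum` above.

```ruby
require 'stringio'

# Hypothetical helper: lazily yields the contents of an IO object in chunks of
# at most `chunk_size` bytes. IO#read(length) returns nil at end of input,
# which ends the loop.
def chunk_enumerator(io, chunk_size = 16)
  Enumerator.new do |yielder|
    while (chunk = io.read(chunk_size))
      yielder << chunk
    end
  end
end

xml  = '<people><person>Alice</person></people>'
enum = chunk_enumerator(StringIO.new(xml))

# Nothing is read until the Enumerator is consumed.
enum.each { |chunk| puts chunk }
```

Because the chunks are produced on demand, the whole input never has to be held in memory as a single string.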

Parse a string of XML using the SAX parser:

    class ElementNames
      attr_reader :names

      def initialize
        @names = []
      end

      def on_element(namespace, name, attrs = {})
        @names << name
      end
    end

    handler = ElementNames.new

    Oga.sax_parse_xml(handler, '<foo><bar></bar></foo>')

    handler.names # => ["foo", "bar"]

Querying a document using XPath:

    document = Oga.parse_xml <<-EOF
    <people>
      <person id="1">
        <name>Alice</name>
        <age>28</age>
      </person>
    </people>
    EOF

    # The "xpath" method returns an enumerable (Oga::XML::NodeSet) that you can
    # iterate over.
    document.xpath('people/person').each do |person|
      puts person.get('id') # => "1"

      # The "at_xpath" method returns a single node from a set, it's the same as
      # person.xpath('name').first.
      puts person.at_xpath('name').text # => "Alice"
    end

Querying the same document using CSS:

    document = Oga.parse_xml <<-EOF
    <people>
      <person id="1">
        <name>Alice</name>
        <age>28</age>
      </person>
    </people>
    EOF

    # The "css" method returns an enumerable (Oga::XML::NodeSet) that you can
    # iterate over.
    document.css('people person').each do |person|
      puts person.get('id') # => "1"

      # The "at_css" method returns a single node from a set, it's the same as
      # person.css('name').first.
      puts person.at_css('name').text # => "Alice"
    end

Modifying a document and serializing it back to XML:

    document = Oga.parse_xml('<people><person>Alice</person></people>')
    name     = document.at_xpath('people/person[1]/text()')

    name.text = 'Bob'

    document.to_xml # => "<people><person>Bob</person></people>"

Querying a document using a namespace:

    document = Oga.parse_xml('<root xmlns:x="foo"><x:div></x:div></root>')
    div      = document.xpath('root/x:div').first

    div.namespace # => Namespace(name: "x" uri: "foo")

## Features

* Support for parsing XML and HTML(5)
  * DOM parsing
  * Stream/pull parsing
  * SAX parsing
* Low memory footprint
* High performance (taking into account most work happens in Ruby)
* Support for XPath 1.0
* CSS3 selector support
* XML namespace support (registering, querying, etc)
* Windows support

## Requirements

| Ruby     | Required      | Recommended |
|:---------|:--------------|:------------|
| MRI      | >= 2.3.0      | >= 2.6.0    |
| JRuby    | >= 1.7        | >= 1.7.12   |
| Rubinius | Not supported |             |
| Maglev   | Not supported |             |
| Topaz    | Not supported |             |
| mruby    | Not supported |             |

Maglev and Topaz are not supported due to the lack of a C API (that I know of)
and the lack of active development of these Ruby implementations. mruby is not
supported because it's a very different implementation altogether.

To install Oga on MRI or Rubinius you'll need to have a working compiler such as
gcc or clang. Oga's C extension can be compiled with both. JRuby does not
require a compiler as the native extension is compiled during the Gem building
process and bundled inside the Gem itself.
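
To check an MRI version against the table above, something along these lines could be used. This is only a sketch: `mri_support_level` is a hypothetical helper, not part of Oga, and it only encodes the MRI row of the table.

```ruby
require 'rubygems'

# Hypothetical helper that classifies an MRI version string using the
# "Required" and "Recommended" columns of the requirements table.
def mri_support_level(version)
  version = Gem::Version.new(version)

  if version >= Gem::Version.new('2.6.0')
    :recommended
  elsif version >= Gem::Version.new('2.3.0')
    :supported
  else
    :unsupported
  end
end

# RUBY_ENGINE is "ruby" when running on MRI.
puts mri_support_level(RUBY_VERSION) if RUBY_ENGINE == 'ruby'
```

`Gem::Version` is used instead of plain string comparison so that, for example, `2.10.0` sorts after `2.6.0`.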

## Thread Safety

Oga does not use any unsynchronized global mutable state. As a result you can
parse and create documents concurrently without any problems. Modifying a
document concurrently can lead to bugs, as these operations are not
synchronized.

Some querying operations will cache data in instance variables, without
synchronization. An example is `Oga::XML::Element#namespace` which will cache an
element's namespace after the first call.

In general it's recommended to _not_ use the same document in multiple threads
at the same time.
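
If a document really has to be shared between threads, one option is to guard every access with a `Mutex`. The sketch below is not something Oga provides: `SynchronizedResource` is a hypothetical wrapper, and a plain Array stands in for the document so the example runs on its own.

```ruby
# Hypothetical wrapper that serializes all access to a shared object. In
# practice `resource` would be a parsed document; an Array stands in for it
# here so the sketch runs without Oga.
class SynchronizedResource
  def initialize(resource)
    @resource = resource
    @mutex    = Mutex.new
  end

  # Yields the wrapped object while holding the lock and returns the block's
  # result.
  def with_lock
    @mutex.synchronize { yield @resource }
  end
end

shared  = SynchronizedResource.new([])
threads = 4.times.map do |index|
  Thread.new do
    100.times { shared.with_lock { |resource| resource << index } }
  end
end

threads.each(&:join)

puts shared.with_lock(&:size) # => 400
```

Since some read-only queries also cache data without synchronization, reads would need to go through the lock as well, not just modifications.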

## Namespace Support

Oga fully supports parsing/registering XML namespaces as well as querying them
using XPath. For example, take the following XML:

    <root xmlns="http://example.com">
        <bar>bar</bar>
    </root>

If one were to try and query the `bar` element (e.g. using XPath `root/bar`)
they'd end up with an empty node set. This is due to `<root>` defining an
alternative default namespace. Instead you can query this element using the
following XPath:

    *[local-name() = "root"]/*[local-name() = "bar"]

Alternatively, if you don't really care where the `<bar>` element is located you
can use the following:

    descendant::*[local-name() = "bar"]

And if you want to specify an explicit namespace URI, you can use this:

    descendant::*[local-name() = "bar" and namespace-uri() = "http://example.com"]

Like Nokogiri, Oga provides a way to create "dynamic" namespaces.
That is, Oga allows one to query the above document as follows:

    document = Oga.parse_xml('<root xmlns="http://example.com"><bar>bar</bar></root>')

    document.xpath('x:root/x:bar', namespaces: {'x' => 'http://example.com'})

Moreover, because Oga assigns the name "xmlns" to default namespaces you can use
this in your XPath queries:

    document = Oga.parse_xml('<root xmlns="http://example.com"><bar>bar</bar></root>')

    document.xpath('xmlns:root/xmlns:bar')

When using this you can still restrict the query to the correct namespace URI:

    document.xpath('xmlns:root[namespace-uri() = "http://example.com"]/xmlns:bar')

## HTML5 Support

Oga fully supports HTML5 including the omission of certain tags. For example,
the following is parsed just fine:

    <li>Hello
    <li>World

This is effectively parsed into:

    <li>Hello</li>
    <li>World</li>

One exception Oga makes is that it does _not_ automatically insert `html`,
`head` and `body` tags. Automatically inserting these tags requires a
distinction between documents and fragments as a user might not always want
these tags to be inserted if left out. This complicates the user facing API as
well as complicating the parsing internals of Oga. As a result I have decided
that Oga _does not_ insert these tags when left out.

A more in-depth explanation can be found here:
<https://gitlab.com/yorickpeterse/oga/issues/98#note_45443992>

## Documentation

The documentation is best viewed [on the documentation website][doc-website].

* {file:CONTRIBUTING Contributing}
* {file:changelog Changelog}
* {file:migrating\_from\_nokogiri Migrating From Nokogiri}
* {Oga::XML::Parser XML Parser}
* {Oga::XML::SaxParser XML SAX Parser}
* {file:xml\_namespaces XML Namespaces}

## Why Another HTML/XML parser?

Currently there are a few existing parsers out there, the most famous one being
[Nokogiri][nokogiri]. Another parser that's becoming more popular these days is
[Ox][ox]. Ruby's standard library also comes with REXML.

The sad truth is that these existing libraries are problematic in their own
ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works
because of the non-concurrent nature of MRI; on JRuby it works because it's
implemented in Java. Nokogiri also uses libxml2, which is a massive beast of a
library, is not thread-safe and is problematic to install on certain platforms
(apparently). I don't want to compile libxml2 every time I install Nokogiri
either.

To give an example about the issues with Nokogiri on Rubinius (or any other
Ruby implementation that is not MRI or JRuby), take a look at these issues:

* <https://github.com/rubinius/rubinius/issues/2957>
* <https://github.com/rubinius/rubinius/issues/2908>
* <https://github.com/rubinius/rubinius/issues/2462>
* <https://github.com/sparklemotion/nokogiri/issues/1047>
* <https://github.com/sparklemotion/nokogiri/issues/939>

Some of these have been fixed, some have not. The core problem remains:
Nokogiri acts in a way that there can be a large number of places where it
*might* break due to throwing around void pointers and whatnot and expecting
that things magically work. Note that I have nothing against the people running
these projects, I just heavily, *heavily* dislike the resulting codebase one
has to deal with today.

Ox looks very promising but it lacks a rather crucial feature: parsing HTML
(without using a SAX API). It's also again a C extension making debugging more
of a pain (at least for me).

I just want an XML/HTML parser that I can rely on stability wise and that is
written in Ruby so I can actually debug it. In theory it should also make it
easier for other Ruby developers to contribute.

## License

All source code in this repository is subject to the terms of the Mozilla Public
License, version 2.0 unless stated otherwise. A copy of this license can be
found in the file "LICENSE" or at <https://www.mozilla.org/MPL/2.0/>.

[nokogiri]: https://github.com/sparklemotion/nokogiri
[oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws
[ox]: https://github.com/ohler55/ox
[doc-website]: http://code.yorickpeterse.com/oga/latest/