New process plugin to extract a given xpath from a DOMDocument

Issue #3112559 on drupal.org by marvil07.

Problem/Motivation

In the context of the DOM process plugins, the general idea is to manipulate HTML before storing it in Drupal.
One possible use case is to extract information from a specific part of the markup and ignore the rest.

Last year at NEDCamp I suggested writing something to do this when someone described his use case to me.
Lately I have found myself needing something like it.

A related note: the current DOM migrate process plugins in Migrate Plus are more generic than HTML processing. They operate on a \DOMDocument object, which can also be generated from other sources such as XML, so they are useful beyond HTML.

Proposed resolution

Create a new migrate process plugin that receives a \DOMDocument (e.g. one generated with the dom plugin) and, given some options, extracts a subset of the data and returns it as a simple string.

Examples

A couple of examples.

process:
  'body/value':
    -
      plugin: dom
      method: import
      source: 'body/0/value'
    -
      plugin: dom_extract
      xpath: '/div/div'

This will extract the contents of the <div> element nested directly inside the top-level <div>.

process:
  field_data_type:
    -
      plugin: dom
      method: import
      source: 'body/0/value'
    -
      plugin: dom_extract
      xpath: '/h1'
      target: 'attribute:data-type'

This will extract the contents of the data-type attribute on the top-level h1 element.
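
To make the intended behaviour of the two examples above concrete, here is a minimal Python sketch. It is not the plugin's code (that would be PHP working on a \DOMDocument); it only illustrates the idea. The function name, the "text" default for target, the empty-string result when nothing matches, and the use of text content as "the contents" are all assumptions of this sketch. It also only handles well-formed markup and simple absolute child paths, because Python's ElementTree supports a limited XPath subset.

```python
import xml.etree.ElementTree as ET


def dom_extract(markup, xpath, target="text"):
    """Return one string taken from the first node matching `xpath`.

    Hypothetical stand-in for the proposed plugin, not its real code.
    """
    # Wrap the fragment in a synthetic root so that an absolute path such
    # as '/div/div' can be evaluated as './div/div' relative to that root.
    wrapper = ET.fromstring("<root>" + markup + "</root>")
    node = wrapper.find("." + xpath)
    if node is None:
        # Assumed behaviour when nothing matches.
        return ""
    if target.startswith("attribute:"):
        # 'attribute:data-type' -> value of the data-type attribute.
        return node.get(target[len("attribute:"):], "")
    # Assumed default: the concatenated text content of the element.
    return "".join(node.itertext())


# First example: the <div> directly inside the top-level <div>.
print(dom_extract("<div><div>Inner text</div></div>", "/div/div"))

# Second example: the data-type attribute of the top-level <h1>.
print(dom_extract('<h1 data-type="news">Title</h1>', "/h1", "attribute:data-type"))
```

Whether the default target should return text content or the inner markup is left open here; the sketch picks text content only to keep the example short.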

Remaining tasks

Provide patch.
Review.

User interface changes

N.A.

API changes

New dom_extract migrate process plugin.

Data model changes

N.A.

Release notes snippet

New dom_extract migrate process plugin to retrieve a subset of a given DOM as a string.