Simple XML broken with UTF-16LE
Issue #3051858 on drupal.org by sonnykt.
All my migrations worked with XML files encoded in UTF-16LE, but they suddenly broke after upgrading to Migrate Plus 4.2, failing with this error:
Drupal\migrate\MigrateException: Fatal Error 73: expected '>'
Line: 542
Column: 20
File: in Drupal\migrate_plus\Plugin\migrate_plus\data_parser\SimpleXml->openSourceUrl() (line 51 of modules/contrib/migrate_plus/src/Plugin/migrate_plus/data_parser/SimpleXml.php).
It turns out that issue #3046753, "Make XML parser more resilient", introduced a trim() call before simplexml_load_string():
protected function openSourceUrl($url) {
  // Clear XML error buffer. Other Drupal code that executed during the
  // migration may have polluted the error buffer and could create false
  // positives in our error check below. We are only concerned with errors
  // that occur from attempting to load the XML string into an object here.
  libxml_clear_errors();
  $xml_data = $this->getDataFetcherPlugin()->getResponseContent($url);
  $xml = simplexml_load_string(trim($xml_data));
  foreach (libxml_get_errors() as $error) {
    $error_string = self::parseLibXmlError($error);
    throw new MigrateException($error_string);
  }
  $this->registerNamespaces($xml);
  $xpath = $this->configuration['item_selector'];
  $this->matches = $xml->xpath($xpath);
  return TRUE;
}
The function trim() is not safe on strings in a multibyte encoding such as UTF-16LE, whereas SimpleXML handles such data perfectly well. trim() works byte by byte, and its default character list includes the NUL byte, so on UTF-16LE data it strips the second byte of the final '>' and leaves a truncated document. I don't think it is necessary to call trim() before simplexml_load_string(). If your XML has an empty line before the opening tag, it is not well-formed and requires special treatment. Adding trim() to the generic parser prevents it from working properly with UTF-16 data.
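To make the byte-level effect concrete, here is a minimal sketch that is not taken from Migrate Plus. The helper to_utf16le() and the sample document are hypothetical and exist only for illustration; the helper handles ASCII input only. It shows that trim() leaves an odd number of bytes, which cannot be valid UTF-16.

```php
<?php

/**
 * Hypothetical helper: encodes an ASCII-only string as UTF-16LE by
 * appending a NUL byte after every character, with a BOM in front.
 */
function to_utf16le(string $ascii): string {
  return "\xFF\xFE" . chunk_split($ascii, 1, "\0");
}

// Made-up sample document ending with a newline, as most files do.
$xml_data = to_utf16le("<a>x</a>\n");

// trim() is byte-oriented. From the right it removes NUL, "\n" and the
// NUL that belongs to the closing '>', stopping at the '>' byte itself.
$trimmed = trim($xml_data);

echo strlen($xml_data), "\n";   // 20 bytes: BOM + 9 two-byte characters
echo strlen($trimmed), "\n";    // 17 bytes: the last '>' lost its NUL byte
echo bin2hex(substr($trimmed, -1)), "\n";  // 3e, a lone half of '>'
```

Because the last code unit is cut in half, a UTF-16 parser can no longer read the closing '>', which would be consistent with the "expected '>'" error reported above.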