I would like to speed up the process of validating a batch of XML files against the same single XML schema (XSD). The only restriction is that I am in a PHP environment.
My current problem is that the schema I would like to validate against includes the fairly complex xhtml schema of 2755 lines (http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd).
Even for very simple data this takes a long time (around 30 seconds per validation).
As I have thousands of XML files in my batch, this doesn’t really scale well.
For validating the XML file I use both of these methods, from the standard php-xml libraries.
- DOMDocument::schemaValidate
- DOMDocument::schemaValidateSource
I am thinking that the PHP implementation fetches the XHTML schema via HTTP and builds some internal representation (possibly a DOMDocument) and that this is thrown away when the validation is completed. I was thinking that some option for the XML-libs might change this behaviour to cache something in this process for reuse.
I’ve built a simple test setup which illustrates my problem:
<xs:schema attributeFormDefault="unqualified"
           elementFormDefault="qualified"
           targetNamespace="http://myschema.example.com/"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:myschema="http://myschema.example.com/"
           xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xs:import
        schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
        namespace="http://www.w3.org/1999/xhtml">
    </xs:import>
    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="MyHTMLElement">
                    <xs:complexType>
                        <xs:complexContent>
                            <xs:extension base="xhtml:Flow"></xs:extension>
                        </xs:complexContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://myschema.example.com/"
      xmlns:xhtml="http://www.w3.org/1999/xhtml"
      xmlns:xml="http://www.w3.org/XML/1998/namespace"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://myschema.example.com/ test-schema.xsd ">
    <MyHTMLElement>
        <xhtml:p>This is an XHTML paragraph!</xhtml:p>
    </MyHTMLElement>
</Root>
<?php
$data_dom = new DOMDocument();
$data_dom->load('test-data.xml');

// Multiple validations using the schemaValidate method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidate: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidate('test-schema.xsd')) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end - $start) . " seconds.\n";
}

// Loading schema into a string.
$schema_source = file_get_contents('test-schema.xsd');

// Multiple validations using the schemaValidateSource method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidateSource: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidateSource($schema_source)) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end - $start) . " seconds.\n";
}
Running this schematest.php file produces the following output:
schemaValidate: Attempt #1 returns Valid! in 30 seconds.
schemaValidate: Attempt #2 returns Valid! in 30 seconds.
schemaValidate: Attempt #3 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 32 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 30 seconds.
Any help or suggestions on how to solve this issue are very welcome!
You can safely subtract 30 seconds from the timing values as overhead.
Remote requests to W3C servers are being delayed because most libraries do not cache the documents, even though the HTTP headers suggest doing so. But read for yourself:
W3.org tries to keep requests low. That is understandable. PHP’s DOMDocument is based on libxml, and libxml allows you to set an external entity loader. The whole Catalog support section is interesting in this case.
To solve the issue in question, set up a catalog.xml file:
Save a copy of the two .xsd files with the names given in that catalog file next to the catalog (relative as well as absolute paths, file:///..., do work if you prefer a different directory).
Then ensure your system’s environment variable XML_CATALOG_FILES is set to the filename of the catalog.xml file. When everything is set up, the validation just runs through:
If it still takes long, it’s just a sign that the environment variable is not set to the right location. I have handled the variable as well as some edge cases in a blog post:
It should take care of diverse edge cases, like filenames containing spaces.
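As a rough sketch of the catalog setup described above (not the original files from this answer), the following PHP builds a catalog.xml that maps the two remote schema URLs to local copies and then points XML_CATALOG_FILES at it. The helper name buildCatalog, the temp-directory location, and the assumption that the second file is http://www.w3.org/2001/xml.xsd (imported by the XHTML schema) are mine. Whether a putenv() call at runtime is picked up depends on when libxml initializes its catalogs, so setting the variable in the environment before PHP starts is the safer route.

```php
<?php
// Hypothetical helper: builds an OASIS XML catalog that maps remote schema
// URLs to local file names (resolved relative to the catalog file itself).
function buildCatalog(array $urlToFile): string
{
    $ns  = 'urn:oasis:names:tc:entity:xmlns:xml:catalog';
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $catalog = $doc->appendChild($doc->createElementNS($ns, 'catalog'));
    foreach ($urlToFile as $url => $file) {
        $system = $catalog->appendChild($doc->createElementNS($ns, 'system'));
        $system->setAttribute('systemId', $url);
        $system->setAttribute('uri', $file);
    }
    return $doc->saveXML();
}

// Assumption: local copies named like this sit next to catalog.xml.
$catalogXml = buildCatalog([
    'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd' => 'xhtml1-transitional.xsd',
    'http://www.w3.org/2001/xml.xsd'                          => 'xml.xsd',
]);

// Assumption: the temp directory is used here only for illustration.
$catalogFile = sys_get_temp_dir() . '/catalog.xml';
file_put_contents($catalogFile, $catalogXml);

// Must happen before libxml loads its catalogs; see the caveat above.
putenv('XML_CATALOG_FILES=' . $catalogFile);
```

Since XML_CATALOG_FILES is a space-separated list, a path containing spaces would need extra care, which is one of the edge cases mentioned above.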
Alternatively it is possible to create a simple external entity loader callback function that uses a URL => file mapping for the local file-system in form of an array:
As this shows, I’ve placed a verbatim copy of these two XSD files into a subdirectory called schema. The next step is to make use of libxml_set_external_entity_loader to activate the callback function with the mapping. Files that already exist on disk are preferred and loaded directly. If the routine encounters a non-file that has no mapping, a RuntimeException will be thrown with a detailed message:
After setting this external entity loader, the delay for the remote requests is gone.
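A minimal sketch of such a loader, assuming the two XSD copies live in a schema subdirectory next to the script, could look like this. The variable names $mapping and $resolve and the exact message format are illustrative and not taken from the original code; the second URL assumes the XHTML schema imports http://www.w3.org/2001/xml.xsd.

```php
<?php
// Assumed layout: verbatim copies of both XSD files in ./schema
$mapping = [
    'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd' => 'schema/xhtml1-transitional.xsd',
    'http://www.w3.org/2001/xml.xsd'                          => 'schema/xml.xsd',
];

// Callback for libxml: returns a local path for every entity it may load.
$resolve = function ($public, $system, $context) use ($mapping) {
    // Files that already exist on disk are loaded directly.
    if (is_string($system) && is_file($system)) {
        return $system;
    }
    // Known remote URLs are redirected to their local copies.
    if (is_string($system) && isset($mapping[$system])) {
        return __DIR__ . '/' . $mapping[$system];
    }
    // Anything else is refused instead of being fetched over the network.
    throw new RuntimeException(sprintf(
        'Failed to load external entity: Public: %s; System: %s',
        var_export($public, true),
        var_export($system, true)
    ));
};

libxml_set_external_entity_loader($resolve);
```

With the loader registered before calling schemaValidate, an unmapped URL fails loudly rather than stalling on a remote request.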
And that’s it. See Gist. Take care: This external entity loader has been written for loading the XML file to validate from disk and to "resolve" the XSD URIs to local filenames. Other kinds of operations (e.g. DTD-based validation) might need some code changes or extensions. The XML catalog is the preferable option, as it also works with other tools.