I need to verify that a “simplified” DTD is really a subset of a larger DTD, I.e. that documents that are valid according to the “simplified” DTD will also always be valid according to the larger (or “master”) DTD.
The simplified DTD is being written now — it’s derived from the master DTD (were it the other way around, one could simply include the smaller DTD into the larger one).
How would I be able to determine if the simplified DTD is derived from the master DTD?
DTDs are really just context-free grammars in disguise. A grammar G represents the set of possible legal strings that comprise the unstated language L(G) that grammar represents.
What you are asking is tantamount to determining if you have G1 and G2, whether L(G1) is a subset of L(G2). My language theory is getting rusty and I don’t remember if this is computable in general or not, but my guess this is really hard, because you have to demonstrate that an arbitrary derivation in G1 always has a derivation in G2.
You might be able to answer the question of whether G1 is structured in such a way that you can demonstrate that L(G1) is a subset of L(G2) by demonstrating that each element of G1 is compatible with each element of G2, essentialy by showing that each grammar rule
in G1 has a corresponding rule in G2 with elements dropped. Your idea of diffing DTDs seems to be along this line, with the proviso that if the the diffs are large you are stuck with the general problem rather than the simpler one. At least the way you’ve characterized the problem (G2 is derived from the master DTD) I think you have chance.
The purpose of the diff would be to identify compatible rules by finding the least differences.
If you have grammar rule g2 = A ; and another g1 = A that you’d claim are related and
that you’d like to check,
you’d first have to demonstrate that the string tokens that A derived in G1 is a superset
of the string of tokens A derived in G2. This looks just like the original unconstrained problem of comparing two langauges; we’re now just comparing the sublanguages for the two rules g1 and g2.
So now I think you have to insist that each subrule reachable by g1 is compatible structurally with a corresponding subrule in g2 to make this practical.
I think you can probably write a recursive procedure to check this. What this procedure mostly needs as help is all the set operators (FirstOf, ..) that you tend to find in an LALR parser generator.
On a different front, my company makes Smart Differencer tools, that compute deltas over language constructs in terms of langauge elements and editing operations on those elements. It is parameterized by language definitions. The SmartDifference presently works for a variety of conventional languages (C, C++, C#, COBOL, Java, PHP, Python, ….). XML (and DTDs) are a language, too, for which we have a language definition, and we’ve built an experimental XML Smart Differencer tools. It ought to work on DTDs just fine. Contact me offline (see bio) if you have further direct interest.
EDIT: Just for grins, I tried the following two DTDs, one derived from the other:
orderform.xml:
and orderform2.xml:
[See if you can spot the differences yourself, first 🙂
And ran the XML SmartDifferencer:
Yep, that’s what I did to get the derived one. (The notation N.M means “line N, column M”).