Same inputs via the XML Catalog
- Date: 20 October 2022
- Time: 6:30 to 7:30pm (UK Time)
Jonathan Fine: I will show work-in-progress that uses an XML Catalog and Resolver to solve the problem (in XML) of ensuring identical inputs. The back-end which stores the inputs will be a Git object store, and the internal names will be Git object identifiers. I want TeX to be able to use the same method. No change would be needed in source documents and style files.
Background: style files
Style files are used when rendering documents. This is true for TeX, for XML to HTML, and for CSS in the browser. Ensuring identical inputs is crucial to ensure uniform outputs. In the old days it was the printing press that ensured uniform outputs (except the misprints beloved of stamp collectors).
Famously, Don Knuth’s TeX generates identical outputs from identical inputs. It is also as a program remarkable stable over time. This makes TeX something of an an archival format and typesetting program. He wanted TeX to give identical results not forever but for at least 100 years.
This enduring nature is very important in STEM publishing, which represents the accumulation of knowledge. In such areas XML documents and tools are highly valued for being consistent and standards-based. Publishers like them.
The XML catalog
We need identical inputs, but without repeatedly fetching the resource over the network. One way of thinking is to hold a locally cached copy.
To support this identical inputs need, XML built on the Catalog provided by SGML, and also introduced the XML Resolver. I’ll explain the Catalog now, and the Resolver later.
The catalog (both XML and SGML) both provide a mapping from names to external entities. An external entity can be a file, but it doesn’t need to be. It could be a URL.
This allows XML documents to refer to external resources (such as stylesheets and schema definition) without have to specify where they are to be found. The XML Catalog is bit like an old-fashioned phone book. You look up a name and get not a phone number but a local name for a file.
An example XML Catalog
Most Linux installations have an XML catalog. Here’s mine:
$ head -5 /etc/xml/catalog
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<system systemId="http://www.w3.org/2001/xml.xsd" uri="file:///usr/share/xml/xml.xsd"/>
<system systemId="http://www.w3.org/2009/01/xml.xsd" uri="file:///usr/share/xml/xml.xsd"/>
The XML Resolver
The Catalog maps resource names in an XML document to file names (or other Uniform Resource Identifiers). The Resolver can do all this and more. For example, it can find a resource by looking up in a database. Or it can even do a computation on the fly. This is a way to get today’s weather into an XML document.
Resolvers in Python
The widely used Python module lxml
is a wrapper around the standard
libxml2
library. In lxml
a Resolver can return any of:
- a parsable string
- a filename
- an open file-like object that has a read() method
Work in progress
In lxml
you can, using Python code, easily write your own
resolver. In particular you can use as a Catalog a mapping from names
to Git object ids. You can then in the Resolver fetch the
corresponding objects from a Git object store. This is what I’ll
present at the TeX Hour.
You can see this work at:
For preliminary reading I suggest:
URLs
- https://xmlresolver.org
- https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home
- https://lxml.de/resolvers.html
For more information about the TeX Hour, including Zoom URL, see the About page.