TeX Hour

A weekly video meeting

Unlatex, texbox and the Access Tree

Date: 9 March 2023
Time: 6:30 to 7:30pm (UK Time)

Summary

Jonathan Fine latex is a document compiler. From a source file it produces PDF. However, unlatex goes the other way. Hence the name.

To produce a page of PDF, latex ships out an internal TeX box. The starting point of unlatex is a text representation of that box. This TeX Hour will demonstrate software to produce such files. It can process any source file that latex can process.

The resulting texbox files provide a very good starting point for reconstructing the original source file, and also the style files that created the PDF. The texbox files are also a good starting point for creating an Accessibility Tree. The access tree is the object the screen reader interacts with, when reading an accessible document.

For more information about the TeX Hour, including Zoom URL, see the About page.

The Access Tree

Most blind people prefer HTML read in the web browser to PDF read in a PDF reader. Here are four reasons for this.

They are accustomed to using the web browser.
The web browser provides better support for their preferred screen reader.
As a document standard, HTML supports the Access Tree better than PDF does.
As a reading tool, the web browser supports the Access Tree better than PDF readers do.

latex ships out a texbox to generate a page of PDF. The texbox contains important information, valuable for creating the Access Tree, that the PDF does not contain.

The big question is this: How successful will be the texbox to Access Tree transformation? Experiment, experience, examples and testing are needed to properly answer this question.

Two simple `texbox` files

Here are two examples. They are enough to learn from, but not enough to answer the big question. A future steps is to gather more small examples, and to create statistics on a wide range of important real world documents.

The first example is focussed on reconstructing the input. The input Hello world! produces:

\hbox(6.94444+0.0)x52.80565
.\OT1/cmr/m/n/10 H
.\OT1/cmr/m/n/10 e
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 o
.\glue 3.33333 plus 1.66666 minus 1.11111
.\OT1/cmr/m/n/10 w
.\kern-0.27779
.\OT1/cmr/m/n/10 o
.\OT1/cmr/m/n/10 r
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 d
.\OT1/cmr/m/n/10 !

The second example is focussed on reconstructing the style file macros. The input LaTeX produces:

\hbox(6.83331+2.15277)x25.66368
.\OT1/cmr/m/n/10 L
.\kern -3.6
.\vbox(6.83331+0.0)x5.90282, glue set 2.04997fil
..\hbox(4.78334+0.0)x5.90282
...\OT1/cmr/m/n/7 A
..\glue 0.0 plus 1.0fil minus 1.0fil
.\kern -1.49994
.\OT1/cmr/m/n/10 T
.\kern -1.66702
.\hbox(6.83331+0.0)x6.80557, shifted 2.15277
..\OT1/cmr/m/n/10 E
.\kern -1.25
.\OT1/cmr/m/n/10 X

Each reader is welcome to form their own opinion. I am optimistic, if only because the highly successful Lwarp converter from Brian Dunn uses a broadly similar architecture The big difference are:

I use the texbox files rather than the resulting dvi files.
Lwarp uses custom style files, which has benefits and costs.

It may be that each is better than the other on a favoured group of documents. And perhaps the two approaches can be unified, at least in the production of the resulting Access Tree or HTML. This would benefit users and developers.

Thanks

Dan Miner previously told me about the Access Tree. At last week’s TeX Hour together we came up with the key idea of going directly from a texbox to and Access Tree. Thank you, Dan.

URLs

Home About Accessibility