Unlatex, texbox and the Access Tree
- Date: 9 March 2023
- Time: 6:30 to 7:30pm (UK Time)
Summary
Jonathan Fine latex is a document compiler. From a source file it
produces PDF. However, unlatex goes the other way. Hence the name.
To produce a page of PDF, latex ships out an internal TeX box. The
starting point of unlatex is a text representation of that box. This
TeX Hour will demonstrate software to produce such files. It can
process any source file that latex can process.
The resulting texbox files provide a very good starting point for
reconstructing the original source file, and also the style files that
created the PDF. The texbox files are also a good starting point
for creating an Accessibility Tree. The access tree is the object the
screen reader interacts with, when reading an accessible document.
For more information about the TeX Hour, including Zoom URL, see the About page.
The Access Tree
Most blind people prefer HTML read in the web browser to PDF read in a PDF reader. Here are four reasons for this.
- They are accustomed to using the web browser.
- The web browser provides better support for their preferred screen reader.
- As a document standard, HTML supports the Access Tree better than PDF does.
- As a reading tool, the web browser supports the Access Tree better than PDF readers do.
latex ships out a texbox to generate a page of PDF. The texbox
contains important information, valuable for creating the Access Tree,
that the PDF does not contain.
The big question is this: How successful will be the texbox to Access
Tree transformation? Experiment, experience, examples and testing are
needed to properly answer this question.
Two simple texbox files
Here are two examples. They are enough to learn from, but not enough to answer the big question. A future steps is to gather more small examples, and to create statistics on a wide range of important real world documents.
The first example is focussed on reconstructing the input. The input
Hello world! produces:
\hbox(6.94444+0.0)x52.80565
.\OT1/cmr/m/n/10 H
.\OT1/cmr/m/n/10 e
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 o
.\glue 3.33333 plus 1.66666 minus 1.11111
.\OT1/cmr/m/n/10 w
.\kern-0.27779
.\OT1/cmr/m/n/10 o
.\OT1/cmr/m/n/10 r
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 d
.\OT1/cmr/m/n/10 !
The second example is focussed on reconstructing the style file
macros. The input LaTeX produces:
\hbox(6.83331+2.15277)x25.66368
.\OT1/cmr/m/n/10 L
.\kern -3.6
.\vbox(6.83331+0.0)x5.90282, glue set 2.04997fil
..\hbox(4.78334+0.0)x5.90282
...\OT1/cmr/m/n/7 A
..\glue 0.0 plus 1.0fil minus 1.0fil
.\kern -1.49994
.\OT1/cmr/m/n/10 T
.\kern -1.66702
.\hbox(6.83331+0.0)x6.80557, shifted 2.15277
..\OT1/cmr/m/n/10 E
.\kern -1.25
.\OT1/cmr/m/n/10 X
Each reader is welcome to form their own opinion. I am optimistic, if only because the highly successful Lwarp converter from Brian Dunn uses a broadly similar architecture The big difference are:
- I use the
texboxfiles rather than the resultingdvifiles. - Lwarp uses custom style files, which has benefits and costs.
It may be that each is better than the other on a favoured group of documents. And perhaps the two approaches can be unified, at least in the production of the resulting Access Tree or HTML. This would benefit users and developers.
Thanks
Dan Miner previously told me about the Access Tree. At last week’s
TeX Hour together we came up with
the key idea of going directly from a texbox to and Access
Tree. Thank you, Dan.