Unlatex, texbox and the Access Tree
- Date: 9 March 2023
- Time: 6:30 to 7:30pm (UK Time)
Summary
Jonathan Fine latex
is a document compiler. From a source file it
produces PDF. However, unlatex
goes the other way. Hence the name.
To produce a page of PDF, latex
ships out an internal TeX box. The
starting point of unlatex
is a text representation of that box. This
TeX Hour will demonstrate software to produce such files. It can
process any source file that latex
can process.
The resulting texbox
files provide a very good starting point for
reconstructing the original source file, and also the style files that
created the PDF. The texbox
files are also a good starting point
for creating an Accessibility Tree. The access tree is the object the
screen reader interacts with, when reading an accessible document.
For more information about the TeX Hour, including Zoom URL, see the About page.
The Access Tree
Most blind people prefer HTML read in the web browser to PDF read in a PDF reader. Here are four reasons for this.
- They are accustomed to using the web browser.
- The web browser provides better support for their preferred screen reader.
- As a document standard, HTML supports the Access Tree better than PDF does.
- As a reading tool, the web browser supports the Access Tree better than PDF readers do.
latex
ships out a texbox
to generate a page of PDF. The texbox
contains important information, valuable for creating the Access Tree,
that the PDF does not contain.
The big question is this: How successful will be the texbox
to Access
Tree transformation? Experiment, experience, examples and testing are
needed to properly answer this question.
Two simple texbox
files
Here are two examples. They are enough to learn from, but not enough to answer the big question. A future steps is to gather more small examples, and to create statistics on a wide range of important real world documents.
The first example is focussed on reconstructing the input. The input
Hello world!
produces:
\hbox(6.94444+0.0)x52.80565
.\OT1/cmr/m/n/10 H
.\OT1/cmr/m/n/10 e
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 o
.\glue 3.33333 plus 1.66666 minus 1.11111
.\OT1/cmr/m/n/10 w
.\kern-0.27779
.\OT1/cmr/m/n/10 o
.\OT1/cmr/m/n/10 r
.\OT1/cmr/m/n/10 l
.\OT1/cmr/m/n/10 d
.\OT1/cmr/m/n/10 !
The second example is focussed on reconstructing the style file
macros. The input LaTeX
produces:
\hbox(6.83331+2.15277)x25.66368
.\OT1/cmr/m/n/10 L
.\kern -3.6
.\vbox(6.83331+0.0)x5.90282, glue set 2.04997fil
..\hbox(4.78334+0.0)x5.90282
...\OT1/cmr/m/n/7 A
..\glue 0.0 plus 1.0fil minus 1.0fil
.\kern -1.49994
.\OT1/cmr/m/n/10 T
.\kern -1.66702
.\hbox(6.83331+0.0)x6.80557, shifted 2.15277
..\OT1/cmr/m/n/10 E
.\kern -1.25
.\OT1/cmr/m/n/10 X
Each reader is welcome to form their own opinion. I am optimistic, if only because the highly successful Lwarp converter from Brian Dunn uses a broadly similar architecture The big difference are:
- I use the
texbox
files rather than the resultingdvi
files. - Lwarp uses custom style files, which has benefits and costs.
It may be that each is better than the other on a favoured group of documents. And perhaps the two approaches can be unified, at least in the production of the resulting Access Tree or HTML. This would benefit users and developers.
Thanks
Dan Miner previously told me about the Access Tree. At last week’s
TeX Hour together we came up with
the key idea of going directly from a texbox
to and Access
Tree. Thank you, Dan.