Ideas on word processing

I have been thinking about these ideas for a while, and I discussed them on lobste.rs 10 months ago.

Reuse of old control characters

UTF-8 is the most commonly used character encoding today, and it is a superset of the US-ASCII character set.

The US-ASCII specification includes 33 non-printing control codes originating from teletype machines.

NUL SOH STX ETX EOT ENQ ACK BEL
BS  HT  LF  VT  FF  CR  SO  SI
DLE DC1 DC2 DC3 DC4 NAK SYN ETB
CAN EM  SUB ESC FS  GS  RS  US
SP                          DEL

Some of them are still recognized by Linux terminals, but the majority of them are no longer used. Unless you happen to own an old teletype machine that still runs like it used to, you are using no more than 6 of these codes. The rest of them are not produced or used by any computer system that processes plaintext.

They could be used for something else. For instance, they could be used with word processing as unique markers that are guaranteed to not appear anywhere else within the text.

I have been concerned about typing in symbol characters with markdown because they always trigger some formatting setting that then smears the whole file when I preview it. When the file gets long enough the syntax coloring in Vim glitches out, so even syntax coloring doesn't help there. If the formatting were behind control characters, there would be no chance that I keep typing them in by accident.

None of the normal programming languages use control characters. This means that formatting characters could be overlaid on program source, as long as the language treats them either as comments or as part of literals.
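To check that claim against your own files, here is a minimal Python sketch that counts the control codes a plaintext file would not normally contain. The `IN_USE` set (tab, line feed, carriage return) is an assumption about what ordinary text still relies on, and the function name is made up for illustration.

```python
# US-ASCII control codes: 0x00-0x1F plus DEL (0x7F), 33 in total.
CONTROL_CODES = set(range(0x20)) | {0x7F}

# Assumption: the codes that ordinary plaintext still relies on.
# Which codes belong here is a judgment call.
IN_USE = {0x09, 0x0A, 0x0D}  # HT, LF, CR


def reserved_codes_present(data: bytes) -> dict[int, int]:
    """Count control codes that are not in IN_USE.

    Scanning bytes is safe for UTF-8 input, because every byte of a
    multi-byte UTF-8 sequence is 0x80 or above.
    """
    counts: dict[int, int] = {}
    for byte in data:
        if byte in CONTROL_CODES and byte not in IN_USE:
            counts[byte] = counts.get(byte, 0) + 1
    return counts


source_code = b"def f(x):\n\treturn x * 2  # *not* emphasis\r\n"
marked_up = "\x05\x11em\x12h\u00e9llo\x06\n".encode("utf-8")

print(reserved_codes_present(source_code))
print(reserved_codes_present(marked_up))
```

An empty result for a file means the codes are free to be used as markers in it.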

"Reveal codes" and compatibility with plaintext

A file format for formatted text does not have to be inaccessible to plain text editors.

If you've used a really old editor from the CP/M or DOS era, you'll have noticed that they often don't stand far from plain text editing in terms of the structure they provide.

WordPerfect 6 had a mode to reveal the markup codes to the user. I suppose the mode was provided because the editor would sometimes leave in garbage markup that you have to clean out from your file.

If you make a file format from control characters, and it's relatively context-free, you can still read and modify it with a plain text editor.

With the control code ETX, Vim shows the character '^C', and you can insert those yourself. As long as the format stays human-readable enough, you can edit it with a text editor without issues.

Technically, a text editor could provide something similar to WordPerfect's "reveal codes" feature. Here's a script you can run in Vim to display control codes as symbols.

" This trick uses concealing in vim editor
" to represent few control characters for formatting.
set conceallevel=1
set concealcursor=ni

call matchadd('Conceal',nr2char(1),10,-1,{'conceal': ';'})
call matchadd('Conceal',nr2char(2),10,-1,{'conceal': '~'})
call matchadd('Conceal',nr2char(3),10,-1,{'conceal': '|'})
call matchadd('Conceal',nr2char(5),10,-1,{'conceal': '['})
call matchadd('Conceal',nr2char(6),10,-1,{'conceal': ']'})
call matchadd('Conceal',nr2char(0x11),10,-1,{'conceal': '('})
call matchadd('Conceal',nr2char(0x12),10,-1,{'conceal': ')'})
call matchadd('Conceal',nr2char(0x13),10,-1,{'conceal': '{'})
call matchadd('Conceal',nr2char(0x14),10,-1,{'conceal': '}'})
call matchadd('Conceal',nr2char(0x16),10,-1,{'conceal': ':'})

hi! link Conceal Special

If you'd like to try editing with control characters, you can always insert them directly, e.g. the Control-V Control-A combination gives you SOH. You can also try the following script, which adds keybindings under Ctrl+Space in insert mode:

" Insert-mode bindings under Ctrl+Space for the control codes concealed above.
" Most symbols also have a mnemonic alias (c, t, <Tab>, <Space>).
inoremap <C-Space>; <C-v><C-a>
inoremap <C-Space>c <C-v><C-a>
inoremap <C-Space>~ <C-v><C-b>
inoremap <C-Space>t <C-v><C-b>
inoremap <C-Space>\| <C-v><C-c>
inoremap <C-Space><Tab> <C-v><C-c>
" '[' is ENQ (0x05), which is Control-E.
inoremap <C-Space>[ <C-v><C-e>
inoremap <C-Space>] <C-v><C-f>
inoremap <C-Space>( <C-v><C-q>
inoremap <C-Space>) <C-v><C-r>
inoremap <C-Space>: <C-v><C-v>
inoremap <C-Space><Space> <C-v><C-v>

Given that we still base the editing on something that resembles plain text, the files would remain backwards compatible with plain text.
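The same idea works outside Vim. Here is a small Python sketch that applies the symbol table from the conceal script earlier in this section to a string. The function name and the sample string are made up for illustration; only the code-to-symbol table comes from the script.

```python
# Same code-to-symbol table as the Vim conceal script.
REVEAL = {
    0x01: ";", 0x02: "~", 0x03: "|",
    0x05: "[", 0x06: "]",
    0x11: "(", 0x12: ")", 0x13: "{", 0x14: "}",
    0x16: ":",
}


def reveal_codes(text: str) -> str:
    """Show control codes as visible symbols.

    This is for display only: afterwards a markup '[' and a literal '['
    look the same, although they are different characters in the file.
    """
    return text.translate(REVEAL)


# ENQ, DC1, DC2 and ACK act as markup; the brackets in "array[0]" are
# ordinary characters and cannot be mistaken for markup in the file.
stored = "\x05\x11em\x12array[0] stays literal\x06"
print(reveal_codes(stored))
```

The stored form keeps markup and literal symbols apart, which is the property markdown lacks.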

Relational model from attributes

In HTML there's a beginning tag, then attributes within that tag. Then comes the text and ending tag.

<tag attr1="value1" attr2="value2">text</tag>

Why make it so involved? We could just as well have two symbols for beginning and ending an element, plus a way to attach attributes to that element.

This way the simplest element we'd have would be the following, containing only text:

[text]

In this case the group would obtain its meaning from the context. If you want to tag it, you would write (tag) somewhere inside it. The structure corresponding to the earlier HTML element would be:

[(tag)(attr1:value1)(attr2:value2)text]

The attributes would carry the same meaning independent of where they appear within the element:

[(tag)(attr1:value1)text(attr2:value2)]
[(tag)te(attr1:value1)xt(attr2:value2)]
[text(tag)(attr1:value1)(attr2:value2)]

Attributes would behave like propositions that claim something about the element where they appear.
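A rough Python sketch of a parser for this notation shows that the position of an attribute does not matter. It uses the visible stand-in symbols rather than control codes, and the `Element` class and `parse` function are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    attrs: set = field(default_factory=set)       # e.g. {("tag",), ("attr1", "value1")}
    children: list = field(default_factory=list)  # str or nested Element

    def text(self) -> str:
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


def parse(src: str) -> Element:
    """Parse one well-formed '[...]' element written with visible symbols."""
    pos = 0

    def element() -> Element:
        nonlocal pos
        assert src[pos] == "[", "an element must start with '['"
        pos += 1
        node, buf = Element(), []
        while src[pos] != "]":
            ch = src[pos]
            if ch == "(":                       # attribute: (name:arg:...)
                end = src.index(")", pos)
                node.attrs.add(tuple(src[pos + 1:end].split(":")))
                pos = end + 1                   # text around it stays joined
            elif ch == "[":                     # nested element
                if buf:
                    node.children.append("".join(buf))
                    buf = []
                node.children.append(element())
            else:                               # ordinary text
                buf.append(ch)
                pos += 1
        if buf:
            node.children.append("".join(buf))
        pos += 1                                # skip the closing ']'
        return node

    return element()


variants = [
    "[(tag)(attr1:value1)(attr2:value2)text]",
    "[(tag)(attr1:value1)text(attr2:value2)]",
    "[(tag)te(attr1:value1)xt(attr2:value2)]",
    "[text(tag)(attr1:value1)(attr2:value2)]",
]
parsed = [parse(v) for v in variants]
print(all(p == parsed[0] for p in parsed))
```

Because attributes are collected into a set and never split the surrounding text, all four spellings produce the same element.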

It's perhaps interesting because it makes more sense to treat hypertext links, text blocks, images and such as propositional structures.

[(href:link)
 (img:image)
 [alt-text for an image]]

[(pre)(language:python3)
[print("this probably makes some sense")
print(5+int(input("give a number: ")))]]

In the previous example, note that I still assume these relations and elements would use their own characters reserved from the control codes.

If you're willing to treat parts of the document as a program, you can find ways to make the document shorter. For example, suppose you have many links with emphasis, like the following:

[(em)(href:link1)Link1]
[(em)(href:link2)Link2]
[(em)(href:link3)Link3]

Since the attributes have propositional shape, they can form a logic program. This means you can create a macro definition:

[(=:em-href:Text)
 (em)
 (href:Text)]

The macro declares a new proposition. If the element has this proposition, then the element is an emphasized hypertext reference.

The links can be rewritten as:

[(em-href:link1)Link1]
[(em-href:link2)Link2]
[(em-href:link3)Link3]
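As a rough illustration, the expansion of such a macro can be sketched in Python as plain substitution. This is only a one-step rewrite, not a full logic program, and the data layout (attributes as tuples, a `macros` dictionary keyed by name) is an assumption made for the example.

```python
# A macro maps a proposition name to (parameters, body propositions).
# This mirrors [(=:em-href:Text) (em) (href:Text)].
macros = {
    "em-href": (("Text",), [("em",), ("href", "Text")]),
}


def expand(attrs, macros):
    """Replace each attribute that names a macro with the macro's body.

    Parameters in the body are substituted with the attribute's
    arguments. Expansion is not recursive in this sketch.
    """
    result = set()
    for attr in attrs:
        name, args = attr[0], attr[1:]
        if name in macros and len(macros[name][0]) == len(args):
            params, body = macros[name]
            binding = dict(zip(params, args))
            for prop in body:
                result.add(tuple(binding.get(term, term) for term in prop))
        else:
            result.add(attr)
    return result


link = {("em-href", "link1")}
print(expand(link, macros) == {("em",), ("href", "link1")})
```

After expansion the short form and the long form carry the same set of propositions, so a renderer would only need to understand (em) and (href).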

If you are given some control over what different attributes mean in the file, the document format becomes versatile. For example, if you could create your own attributes, this would be a practical way to represent the initial chess board configuration:

[(board)
 (white:rook:1:1) (white:knight:2:1) (white:bishop:3:1)
 (white:queen:4:1) (white:king:5:1) (white:bishop:6:1)
 (white:knight:7:1) (white:rook:8:1) (white:pawn:*:2)
 (black:rook:1:8) (black:knight:2:8) (black:bishop:3:8)
 (black:queen:4:8) (black:king:5:8) (black:bishop:6:8)
 (black:knight:7:8) (black:rook:8:8) (black:pawn:*:7)]

To display this, you would use the (board) attribute to draw a chess board instead of text. For each piece you would draw an image on the board.
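A minimal Python sketch of how a renderer might turn these attributes into piece positions. It assumes that `*` stands for every file from 1 to 8, which the notation above implies but does not spell out, and `place_pieces` is a made-up name.

```python
BOARD_ATTRS = """
white:rook:1:1 white:knight:2:1 white:bishop:3:1
white:queen:4:1 white:king:5:1 white:bishop:6:1
white:knight:7:1 white:rook:8:1 white:pawn:*:2
black:rook:1:8 black:knight:2:8 black:bishop:3:8
black:queen:4:8 black:king:5:8 black:bishop:6:8
black:knight:7:8 black:rook:8:8 black:pawn:*:7
""".split()


def place_pieces(attrs):
    """Map (file, rank) to (color, piece) for every attribute."""
    board = {}
    for attr in attrs:
        color, piece, file_, rank = attr.split(":")
        # Assumption: '*' is a wildcard for all eight files or ranks.
        files = range(1, 9) if file_ == "*" else [int(file_)]
        ranks = range(1, 9) if rank == "*" else [int(rank)]
        for f in files:
            for r in ranks:
                board[(f, r)] = (color, piece)
    return board


board = place_pieces(BOARD_ATTRS)
print(len(board), board[(5, 1)], board[(3, 7)])
```

The 18 attributes expand to 32 placed pieces, and the drawing step would then only need to look up an image for each (color, piece) pair.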

The point of SGML, the predecessor of HTML and XML, was to provide structure to information. It was argued that text documents that consist of drawing or typesetting commands do not convey what information they hold. At least this seemed to be the point if you read into Springer's "Practical SGML" a bit.

The intent was people would build their own document formats and describe them with document type definitions (DTD).

Nobody liked writing DTDs or using their own markups. For instance I don't have a DTD for writing blogposts. Instead I'm using markdown to type this document.

I guess I don't like to <p> every paragraph I write.

Problems with large files

There's a problem that I think SGML files didn't solve quite well: what should happen when the document grows large and ends up with millions of lines of text in it. Ideally the document format would provide some support for this.

We would need some sort of running head that describes the structure of the file up to that point, in much the same way as books have them.

Chapter 1, Section 2

I guess it'd be sufficient to repeat the attributes and maybe provide some shorthand text. Well, all the information you need in order to present the content.

I'm just not sure yet how this should shape up.

Sample document

What would a file format look like that follows the ideas I just presented?

Here's a presentation. I'm using parentheses, colons and special symbols instead of control characters.

(dtd:document)
(language:en)
[(title)Small sample document]

[The topmost level of the document
would be treated as if it was a plaintext layer.
Writing to topmost would appear as plaintext within any editor.
Every formatted block of text should be placed within an element block.

Sequences of blank lines mark paragraphs.
Also the editor would have to assume
that paragraphs are broken into multiple lines
that each are shorter than 80 columns.

In a well-formed document there would be 
a running head not farther than 72 lines (3*24) apart.
I'm not sure what a running head should look like yet,
ideally it should be easy to spot.]

The (dtd:document) would describe that this file uses markers such as language and title, with topmost elements being sequences of paragraphs or text blocks separated by empty lines.

The (language:en) would describe that the author thinks he's writing English. (title) would state that the group of text is a title of the document.

Editor around the format

When designing a format I think it'd be important to consider how it's being used. It'd start with tools that produce the format.

To design an editor for the above format I'd start with a plain text editor. I'd implement it with a piece table.

A piece table is a data structure that keeps an intact copy of the original document. The modifications are represented with a table that takes pieces from the original document and mixes them with new text.

Editing commands and modifications would be directed at the data structure.
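Here is a minimal piece table sketch in Python, just to make the description concrete. It keeps the pieces in a plain list, which is simple but slow for large documents, and the class and method names are my own choice rather than an established API.

```python
class PieceTable:
    def __init__(self, original: str):
        self.original = original   # intact copy, never modified
        self.added = ""            # append-only buffer for new text
        # Each piece is (buffer name, start offset, length).
        self.pieces = [("original", 0, len(original))] if original else []

    def _buffer(self, name: str) -> str:
        return self.original if name == "original" else self.added

    def text(self) -> str:
        return "".join(self._buffer(b)[s:s + n] for b, s, n in self.pieces)

    def _split(self, offset: int) -> int:
        """Make sure a piece boundary exists at offset; return its index."""
        if offset < 0:
            raise IndexError("negative offset")
        pos = 0
        for i, (buf, start, length) in enumerate(self.pieces):
            if offset == pos:
                return i
            if offset < pos + length:
                left = offset - pos
                self.pieces[i:i + 1] = [(buf, start, left),
                                        (buf, start + left, length - left)]
                return i + 1
            pos += length
        if offset == pos:
            return len(self.pieces)
        raise IndexError("offset past end of document")

    def insert(self, offset: int, new_text: str) -> None:
        if not new_text:
            return
        index = self._split(offset)
        self.pieces.insert(index, ("added", len(self.added), len(new_text)))
        self.added += new_text

    def delete(self, offset: int, length: int) -> None:
        first = self._split(offset)
        last = self._split(offset + length)
        del self.pieces[first:last]


doc = PieceTable("Hello world")
doc.insert(5, ",")
doc.delete(0, 5)
doc.insert(0, "Goodbye")
print(doc.text())
print(doc.original)
```

Since neither buffer is ever rewritten, undo could be implemented by keeping old copies of the piece list.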

The program would maintain a "view" into the document. This is an offset with some metadata in it collected from the document.

If the editor is in "reveal codes" mode, the "view" would be rendered as plaintext exactly like it appears in the text file. Otherwise it'd be fed into an incremental parser.

There seems to be documentation in Berkeley's Harmonia project for building an incremental parser on top of LR parsing tables.

Tim A. Wagner and Susan L. Graham. Efficient and flexible incremental parsing. ACM Transactions on Programming Languages and Systems, 20(5):980-1013, September 1998.

The parser would produce a tree structured view from the document.

Rendering of the document

So once you have that tree structured view of your document, how do you render it? Well I'd guess it would go like this:

  1. Start from the root of the document.
  2. Look at the attributes in the current text group.
  3. If the context and all attributes in the group agree with some presentation mode, select that presentation mode for the group. Otherwise present some sort of failure mode that is appropriate to the context.
  4. The rendering mode would produce a structure to be displayed, based on the currently expanded attributes.
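Steps 2 and 3 could be sketched in Python roughly like this. The mode names, the contexts and the split into required and optional attributes are all assumptions made for the example; the text above only says that the context and all attributes have to agree with a mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    contexts: frozenset                 # contexts where the mode may appear
    required: frozenset                 # attribute names that must be present
    optional: frozenset = frozenset()   # attribute names the mode also accepts

    def agrees(self, context: str, attr_names: frozenset) -> bool:
        # Every required attribute is present, and no attribute is
        # left over that the mode would not know how to handle.
        return (context in self.contexts
                and self.required <= attr_names
                and attr_names <= self.required | self.optional)


# Hypothetical presentation modes.
MODES = [
    Mode("code-block", frozenset({"document"}),
         frozenset({"pre"}), frozenset({"language"})),
    Mode("link", frozenset({"document", "paragraph"}),
         frozenset({"href"}), frozenset({"em"})),
    Mode("paragraph", frozenset({"document"}), frozenset()),
]


def select_mode(context: str, attrs) -> str:
    names = frozenset(attr[0] for attr in attrs)
    for mode in MODES:
        if mode.agrees(context, names):
            return mode.name
    return "error:" + context   # failure mode appropriate to the context


print(select_mode("document", {("pre",), ("language", "python3")}))
print(select_mode("paragraph", {("board",)}))
```

An unknown attribute leads to the failure mode instead of being silently ignored, so mistakes in the markup stay visible.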

The display of the document would start with a box model similar to that in the popular TeX digital typesetting system.

We'd have boxes, glue and frames. Frames would be boxes that contain other boxes. Each of these elements would have a 'paint' field that describes which renderer draws into the box.

Drawing the page to the screen would happen in three passes through the rendering tree consisting of frames.

  1. The first pass would go top-down, build the rendering tree and decide what is being rendered and where, deciding every element's relative location on the page.
  2. The second pass would go bottom-up and pack the elements horizontally, vertically or however they need to be packed into frames. A frame's dimensions would not change after this step.
  3. The third pass would go top-down and feed the rendering tree to the compositor. It would also build a query structure to display and provide the user interaction elements and mark other active areas in the view.
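A rough Python sketch of the second and third pass, under several simplifying assumptions: the rendering tree is built by hand instead of by a first pass, relative offsets are assigned during packing for brevity, glue is treated as a fixed-size unpainted box without stretch or shrink, and the query structure for interaction is left out. The class and function names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    width: float = 0.0
    height: float = 0.0
    paint: str = "none"   # name of the renderer that draws into the box
    x: float = 0.0        # position relative to the parent frame
    y: float = 0.0


@dataclass
class Frame(Box):
    direction: str = "h"  # "h" packs left to right, "v" top to bottom
    children: list = field(default_factory=list)


def pack(node) -> None:
    """Pass 2 (bottom-up): size each frame from its packed children."""
    if not isinstance(node, Frame):
        return
    offset, cross = 0.0, 0.0
    for child in node.children:
        pack(child)                        # children first
        if node.direction == "h":
            child.x, child.y = offset, 0.0
            offset += child.width
            cross = max(cross, child.height)
        else:
            child.x, child.y = 0.0, offset
            offset += child.height
            cross = max(cross, child.width)
    if node.direction == "h":
        node.width, node.height = offset, cross
    else:
        node.width, node.height = cross, offset


def compose(node, origin_x=0.0, origin_y=0.0, out=None) -> list:
    """Pass 3 (top-down): list (paint, x, y, width, height) commands."""
    if out is None:
        out = []
    x, y = origin_x + node.x, origin_y + node.y
    if node.paint != "none":
        out.append((node.paint, x, y, node.width, node.height))
    if isinstance(node, Frame):
        for child in node.children:
            compose(child, x, y, out)
    return out


line = Frame(direction="h", children=[
    Box(30, 10, "text"), Box(5, 0), Box(20, 12, "text"),  # middle box is glue
])
page = Frame(direction="v", paint="background",
             children=[line, Box(40, 40, "image")])
pack(page)
commands = compose(page)
print(page.width, page.height)
```

The compositor would receive `commands` as a flat list with absolute positions, so it would not need to know about frames at all.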

For the overall input scheme, I'd likely copy from Kakoune's editing model. I tried this editor for 5 minutes, and I think the ideas there work quite well. The visual movement commands were very enjoyable to use.

Ideas welcome

Not everything here has been thought through.

Also, if you'd like to try this kind of editor, you can go and star the wordprocessing GitHub repository. It's a nearly empty repository with barely anything in it. Though if I get to build an editor for this kind of format, I'll build it there.