Exploring language design

I was designing a new syntax for annotating context-free grammars. Before this I had tried twice and was not satisfied with the results. This time I got it right, and I understand why.

The problem I was solving here is that it's really hard to readably describe what kind of mapping there should be from the syntax to the semantics.

I've found that it's really hard to create languages that are readable, easy to explain, and easy to understand. Even if you follow rules you may need a second or even a third try, but the rules below may be a good rule of thumb for a starting point.

Expectations

Start from what the reader expects to see. This depends on what kind of notation is traditionally used to describe the concept you are describing. You can find this out by searching and by asking people.

In my case, context-free grammars are usually described with a left-hand side and a right-hand side, and they may be written on paper as lhs: rhs or lhs -> rhs. There's one symbol on the left-hand side and many on the right.

Additionally, | is usually used for alternation, although it is mostly a shorthand. It is generally understood that there are multiple production rules if you write them like this:

multiplication:
    multiplication '*' addition
    addition

addition:
    addition '+' term
    term

term:
    int
    symbol
    '(' multiplication ')'

By the way, this describes the usual algebraic syntax that most people should be familiar with. Here's a visual explanation of how these production rules describe the language:

|-------------| addition
|-------------| addition '+' term
1 + 2 + (3 * 5)

|---|           addition '+' term
-               term
1 + 2
^   ^           int, int

        |-----| '(' multiplication ')'
        (3 * 5)

         |---|  multiplication '*' addition
         -      addition
         -      term
             -  term
         3 * 5
         ^   ^  int, int
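As a concrete illustration, the production rules above can be turned directly into a small hand-written parser. This is just a sketch in Python, not any particular tool; the left-recursive rules are rewritten as loops, which preserves their left associativity.

```python
import re

def tokenize(src):
    # ints, symbols, and the keywords '*', '+', '(' and ')'
    return re.findall(r"\d+|[A-Za-z_]\w*|[*+()]", src)

def parse(src):
    tokens, pos = tokenize(src), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def multiplication():    # multiplication '*' addition | addition
        node = addition()
        while peek() == '*':
            take()
            node = ('mul', node, addition())
        return node

    def addition():          # addition '+' term | term
        node = term()
        while peek() == '+':
            take()
            node = ('add', node, term())
        return node

    def term():              # int | symbol | '(' multiplication ')'
        if peek() == '(':
            take()
            node = multiplication()
            assert take() == ')'
            return node
        tok = take()
        return ('int', int(tok)) if tok.isdigit() else ('symbol', tok)

    tree = multiplication()
    assert peek() is None, "trailing input"
    return tree
```

For the example input, parse("1 + 2 + (3 * 5)") yields the same structure as the trace above: ('add', ('add', ('int', 1), ('int', 2)), ('mul', ('int', 3), ('int', 5))).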

The needs

The above way to notate grammars is very clean, but there are a few things it doesn't do.

One thing is that in your usual grammar there are repeating patterns such as:

exprs0:
    ()
    exprs1
    exprs1 ","

exprs1:
    expr
    exprs1 "," expr

They aren't very convenient to name, and they start to take a toll on readability after a while. To limit the effect I provided a way to quickly produce rules based on a template. I provided some predefined templates, such as:

exprs0: sep(expr, ',', ',')

expr: int+
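To illustrate, here's a hypothetical sketch in Python of how a sep template might expand into the plain exprs0/exprs1 rules shown earlier. The function and the rule encoding are my own illustration, not Lever's actual expansion machinery; I've assumed the third argument is the optional trailing separator.

```python
def sep_template(base, item, separator, trailing):
    # Expand sep(item, separator, trailing) into two plain rules:
    # base0 accepts the empty list and an optional trailing separator,
    # base1 accepts one or more separated items.
    zero, one = base + "0", base + "1"
    return {
        zero: [[], [one], [one, trailing]],
        one:  [[item], [one, separator, item]],
    }
```

Calling sep_template("exprs", "expr", '","', '","') reproduces the pair of rules written out by hand above.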

Another thing is that this kind of grammar does not describe what the language means.

Another problem arises when I attempt to use the grammars. If you use a grammar as-is, you have to embed the semantics into the code that interprets the parse tree.

Also, if you cannot compose the meaning, you need a separate mechanism for describing syntactic sugar.

To solve this I've been using an annotation that maps each production rule to an action in an interpreter or a compiler.

Deciding how to annotate the details is difficult. I got it wrong twice. The first time, I annotated it before the clause:

multiplication -> mul(1, 3): multiplication '*' addition

The indices as arguments were mandatory, and this was quite unsatisfying to use. I phased it out in favor of a different way to annotate:

multiplication -> mul { multiplication '*' addition }

This also turned out to be bad. The braces were tedious to insert, and I would occasionally confuse a label with a symbol.

The current approach I've taken looks like this:

multiplication: multiplication '*' addition / mul

This is better than the earlier ones because of its resemblance to logic notation. The slash is associated with division, and division with the horizontal line. The horizontal line in logic stands for deduction. Because an occurrence of the production rule leads to mul, this notation makes sense.

Some measures are taken to avoid positional arguments. By default, keywords such as '*' are discarded. They can be grabbed by grouping with [] or by retrieving them by index.
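As a sketch of what this buys: when a rule matches, the keyword terminals can be filtered out before the action is called, so the action only sees the meaningful children. The encoding and names below are illustrative, not Lever's actual API.

```python
def reduce_rule(rhs, action, children):
    # Drop children that correspond to quoted keyword terminals such as '*',
    # then pass the rest to the annotated action as positional arguments.
    args = [child for sym, child in zip(rhs, children)
            if not sym.startswith("'")]
    return action(*args)

# multiplication: multiplication '*' addition / mul
rhs = ["multiplication", "'*'", "addition"]
mul = lambda a, b: a * b
reduce_rule(rhs, mul, [6, "*", 7])   # the '*' child is discarded; mul(6, 7)
```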

Picking good defaults that the language provides is important for readability.

Meaning of parentheses

The ASCII character set commonly used in computer languages has only three sets of parentheses: (), [], {}. When you design your language, you should carefully consider the meaning of these primitives.

These parenthesis symbols are your best tool for grouping items together and specifying precedence in an expression. But the things you'd like to do with them often conflict with each other, which can get quite unfortunate at times.

Because it made sense to use parentheses for both grouping and as an element to macro calls, without care the following clause would become ambiguous:

expr (cmp_op expr)+

Given the rules so far, the reader has no way to decide whether the above clause is a template expansion with expr and then +, or the other way around. Also, there's no way to tell whether there's an expr expansion in the first place.

To resolve this, I favored ()+ and made it bind tighter than an ordinary template expansion. I also replaced the ordinary use of parentheses for grouping with [].

The reason for this choice was that I found grouping to be quite rare outside annotation expressions, whereas template invocations were really common.

What does compose together?

You can get really beautiful results if you find the correct 'algebra' for what you are displaying. Careful consideration should be given to how the structure gets grouped.

I found that if I let the right-hand side group with the annotation, the result would be clean and powerful. For example:

statement: 'import' symbol 'from' sep(string / source, ',', ',')
    / import_from

Because annotations inside expressions are possible, we can add annotations inside groups. It also made sense in the program that interprets these additional rules in the grammar.

Power or simplicity?

I used to prefer simple designs over complicated ones. I still think it's a good idea to ensure there's nothing left to remove from your design when you're done. But the question of what you can remove in the first place seems to be hard.

Simplicity at the cost of everything else can leave you penny-wise and pound-foolish. In programming I've seen a lot of software that could be described that way.

There is a charm in things that are reduced to their bare minimum. What I think people should do more often is take things to their bare minimum, then reintroduce concepts in order of what brings the most power.

An example of this would be the utilities I provided in the language: the use directive and template declarations.

The use directive describes how the parser is constructed:

use indentation(indent, dedent, newline)
    slip = [")", "]", "}"]

This allows you to state that the language uses significant layout. The expression describes three new terminals as a structure. These terminals are treated like keywords and are not gathered implicitly from the input.
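To make the idea concrete, here is a rough Python sketch of what such a layout pass might do: compare each line's leading whitespace against a stack of indentation levels and emit indent, dedent, and newline tokens. This is my illustration of the general technique, not Lever's implementation, and it ignores the slip list entirely.

```python
def layout(lines):
    # Turn leading whitespace into explicit indent/dedent/newline tokens.
    tokens, stack = [], [0]
    for line in lines:
        if not line.strip():
            continue                          # blank lines produce no tokens
        width = len(line) - len(line.lstrip())
        if width > stack[-1]:
            stack.append(width)
            tokens.append("indent")
        while width < stack[-1]:
            stack.pop()
            tokens.append("dedent")
        tokens.append(line.strip())
        tokens.append("newline")
    while len(stack) > 1:                     # close any open blocks at EOF
        stack.pop()
        tokens.append("dedent")
    return tokens
```

For example, layout(["a:", "  b", "c"]) produces ["a:", "newline", "indent", "b", "newline", "dedent", "c", "newline"].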

You can also define new templates:

alternating(a, b):
    / []
    a alternating(b, a) / (. append .)
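To unpack what this template means: alternating(a, b) matches a possibly empty sequence a b a b ..., collecting the matches into a list. A small Python sketch of the same recursion, with the recursive call turned into a loop that swaps the matchers, and made greedy where the grammar itself is nondeterministic, would be:

```python
def alternating(match_a, match_b, tokens):
    # alternating(a, b): empty, or one a followed by alternating(b, a).
    # Each successful match swaps the roles of a and b, just as the
    # recursive template call swaps its arguments.
    out, i = [], 0
    while i < len(tokens) and match_a(tokens[i]):
        out.append(tokens[i])
        i += 1
        match_a, match_b = match_b, match_a
    return out, tokens[i:]
```

For instance, alternating(str.isdigit, str.isalpha, list("1a2b3")) consumes the whole input, while alternating(str.isdigit, str.isalpha, list("ab")) matches the empty alternative and consumes nothing.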

It is expensive to check that the template recursion does not end up being infinite. At the same time it fits well into the language and solves a hard problem.

We need powerful tools that we can use.

You will eventually find further details of this language in Lever's documentation. I haven't yet decided whether to keep calling it grammar_language like the two earlier ones.
