Studying Axiom with a Context-Free Grammar
Axiom is implemented in Common Lisp and parsed with a modified Pratt parser. A preparsing step inserts parentheses and semicolons to implement indentation sensitivity.
The parser works, but it is difficult to figure out how the language ought to be parsed from the contents of a hand-written parser.
I wrote a context-free grammar for Axiom. Ultimately, I am attempting to interpret this language and learn some details about its innards.
The properties of the grammar
The language described by this grammar is not equivalent to the one parsed by the original implementation. I did ensure that it unambiguously parses the source code of the packages that come with Axiom.
I am not certain that the whole parse is correct because there are small details that were difficult to understand.
Preliminary
I assume that you understand what context-free grammars are, but a short example may help with the notation used here. The following grammar accepts the string "little mary had a lamb":
sentence: who "had" what
who:
"little" "mary"
"duke" "nukem"
what:
"a" "lamb"
A grammar's purpose is to reveal the structure of the strings that are valid in the language it represents. The production rules are generative, so a string may be ambiguous, i.e. it may parse in multiple ways.
Acquisition
Obtaining a grammar for a hand-written parser is difficult. I tried several approaches to recovering a grammar from the Pratt parser until I found one that worked.
The method that worked was to start with the following grammar:
expr:
atom+
atom:
symbol
string
int
all_special_characters
"(" expr? ")"
"[" expr? "]"
"{" expr? "}"
Next, you search out the topmost elements that exist in the language and place them in their respective positions in the grammar. Eventually I figured out how to put together everything that appears below.
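For instance, under this bootstrap grammar a SPAD line such as the following (an invented example, not a quote from the sources) parses as nothing more than a flat sequence of atoms, with only the bracketing recovered:
double(n : Integer) : Integer == n + n
Here double, n and Integer are symbol atoms, the argument list falls under the parenthesis production, and :, == and + fall under all_special_characters.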
Semicolon separators
Semicolons divide the file into statements; commas do the same within parentheses.
stmt_0:
stmt_0 ";" stmt_1
stmt_1
expr_0:
expr_1 ["for" symbol "in" for_loop_head, "while" expr_100]+
expr_1
expr_1:
expr_1 ";" expr_10
expr_10
expr_10:
expr_10 "," statement
statement
The parentheses may also enclose generator expressions. They are told apart by a for or while without a repeat inside them.
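For example, an invented collection such as the following contains a for with no repeat, so it is read as a generator expression rather than as a grouping:
[i*i for i in 1..10]     -- a list built from a generator expression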
Where clause
Most of the toplevel statements in the .spad files have a where clause. It divides the statement into two parts.
stmt_1:
statement "where" statement
statement
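A typical skeleton of such a toplevel statement looks roughly like this (a schematic example in the spirit of the SPAD sources, not a quote from them):
Foo(R : Ring) : Exports == Implementation where
  Exports ==> with
    double : R -> R
  Implementation ==> add
    double(x) == x + x
The part before where is the definition being made, and the part after it supplies the local definitions it refers to.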
Later I changed the grammar further to make use of the differences at this point.
Statements - Dangling else
Statements present several difficulties for the grammar. The worst of them is the dangling else.
The s_prefix macro implements the numerous assignment operators this language comes with.
statement:
s_prefix(if_then)
s_prefix(if_then_else)
if_then_else:
"if" expr_100 "then" s_prefix(if_then_else) "else" s_prefix(if_then_else)
iterator* "repeat" statement
expr_100
"return" expr_100?
"leave" expr_100?
if_then:
"if" expr_100 "then" statement
"if" expr_100 "then" s_prefix(if_then_else) "else" s_prefix(if_then)
The above grammar ensures that the else expressions end up being associated with the innermost conditional in a statement. The if_then_else rule coalesces the remaining statement forms, because those other statements are found within these ones. Without the dangling else problem this would be a much simpler grammar.
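For instance, consider this invented nested conditional:
if a then if b then x else y
Under the grammar above, the else can only be derived through the inner if_then_else; it cannot attach to the outer if.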
Statements - Loops
The syntax of Axiom presents the idea that generator clauses, streams, and loops are related constructs. This is a nice viewpoint I will likely incorporate into my own language's syntax eventually.
The same idea is present in Common Lisp's loop macro, perhaps in a slightly stronger form.
I am not sure how these iterators stack within each other.
The repeat clause builds a loop block out of them. The repeat may also appear alone, and in that case it is the same as "while true repeat ...".
iterator:
"for" symbol "in" for_loop_head
"while" expr_100
"until" expr_100
for_loop_head:
expr_100 (["|", "suchthat"] expr_100)?
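Illustrative loops of the kind these rules are meant to cover (invented lines, not quotes from the sources):
for i in 1..10 | odd? i repeat s := s + i   -- for with a suchthat filter
while x > 0 repeat x := x - 1               -- while iterator driving a repeat
repeat x := x + 1                           -- bare repeat, an endless loop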
Statements - Assignment
The assignment ends up awkwardly inside the dangling else problem. The result is two sub-grammars, s_prefix(if_then_else) and s_prefix(if_then), that have a similar shape:
s_prefix(stmt):
expr_100 "==>" s_prefix(stmt)
expr_100 "==" s_prefix(stmt)
expr_100 "with" s_prefix(stmt)
expr_100 "add" s_prefix(stmt)
expr_100 "==>" "with" s_prefix(stmt)
expr_100 "==>" "add" s_prefix(stmt)
expr_100 "==" "with" s_prefix(stmt)
expr_100 "==" "add" s_prefix(stmt)
expr_100 ":=" s_prefix(stmt)
expr_100 ":-" s_prefix(stmt)
expr_100 "=>" s_prefix(stmt)
expr_100 "+->" s_prefix(stmt) # likely not correct place for this.
":" s_prefix(stmt)
stmt
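For reference, these rules are meant to cover definition and assignment lines such as the following invented examples:
x := 3                        -- assignment
square(n) == n * n            -- definition
NNI ==> NonNegativeInteger    -- macro definition
x < 0 => error "negative x"   -- early exit from the enclosing block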
The with and add are weird but important parts of this language. with is used to provide properties for domains, whereas add provides implementations for operations in domains.
The category and domain constructors are allowed to contain conditionals and loops. This hints that domains and categories are not parametric in themselves; instead they are always constructed by a function that has parameters.
Expressions
The "or" and "and" here are left-recursive. It may be necessary to see in practice which precedence rules the Axiom used.
expr_100:
expr_100 "or" expr_120
expr_100 "\\/" expr_120
expr_120
expr_120:
expr_120 "and" expr_130
expr_120 "/\\" expr_130
expr_130
The original parser likely accepts ^, not, and ~ as stacked prefixes, but this did not appear in practice, so I left recursion out of these groups.
expr_130:
"^" expr_150
"not" expr_150
"~" expr_150
expr_150
expr_150: # 399 400
"has" expr_200
"=" opsuffix? expr_200
expr_200 "<" opsuffix? expr_200
expr_200 ">" opsuffix? expr_200
expr_200 "<=" opsuffix? expr_200
expr_200 ">=" opsuffix? expr_200
expr_200 "=" opsuffix? expr_200
expr_200 "^=" opsuffix? expr_200
expr_200 "~=" opsuffix? expr_200
expr_200 "in" expr_200
expr_200 "case" expr_200
expr_200 "has" expr_200
expr_200 "is" expr_200
expr_200 "isnt" expr_200
expr_200
The .. forms a generator expression. I bet it only appears in some places, but I am not sure which; in any case the original parser appears to accept it just about anywhere.
expr_200:
expr_200 ".."
expr_200 ".." expr_300
expr_300
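Illustrative uses (invented lines):
s := 1..10                                -- a bounded segment as a value
for i in 1.. while i < 5 repeat k := i    -- an open-ended segment bounded by a while iterator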
There is an optional opsuffix spread across the grammar; what it is will be explained later.
The precedence rules of Axiom seem to group neatly even if there are a lot of them.
expr_300:
expr_300 "+" opsuffix? expr_400
expr_300 "-" opsuffix? expr_400
expr_400
expr_400:
"-" opsuffix? expr_400
"+" opsuffix? expr_400
expr_500
expr_500:
expr_500 "*" opsuffix? expr_800
expr_500 "rem" opsuffix? expr_800
expr_500 "mod" opsuffix? expr_800
expr_500 "quo" opsuffix? expr_800
expr_500 "div" opsuffix? expr_800
expr_500 "/" opsuffix? expr_800
expr_500 "exquo" opsuffix? expr_800
expr_800
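Under these rules an invented expression such as the following groups in the usual way:
a - b * c ** 2      -- parsed as a - (b * (c ** 2))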
Reducers
In some places I saw expressions such as +/[1,2,3]. I suppose it evaluates to 1 + (2 + 3), so this would be a syntax for reducing sequences. I saw it used with only a few kinds of operators, but you can likely use any operator you want.
reducer:
"+"
"-"
"*"
"or"
"and"
expr_800:
reducer "/" expr_900
expr_800 "**" opsuffix? expr_900
expr_800 "^" expr_900
expr_900
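My reading of such reductions, which I have not verified against the implementation:
+/[1,2,3]        -- presumably 1 + 2 + 3
*/[1,2,3,4]      -- presumably 1 * 2 * 3 * 4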
Various tight binding operators
It took a lot of effort to figure out the correct order of the following operators, and I think I did not get it entirely right. For example, sometimes the opsuffix appears in a place that hints it does not structure the way I believe it does.
expr_900:
expr_900 ":" expr_910
expr_900 "::" expr_910
expr_900 "@" expr_910
expr_910
expr_910:
expr_910 "pretend" expr_920
expr_910 "." expr_920
expr_920
expr_920:
expr_930
expr_930:
"->" expr_950
expr_930 "->" expr_950
expr_930 "<-" expr_950
expr_950
expr_950:
"!" expr_955
expr_955 "!!" expr_955
expr_955
expr_955:
"#" opsuffix? expr_955
"'" expr_955
expr_960
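Typical uses of these tight-binding operators look roughly like this (invented lines, with my own reading of what each one does):
x :: Integer         -- coerce the value to Integer
x @ Integer          -- state the target type of the expression
r pretend Integer    -- reinterpret without conversion
#l                   -- size of an aggregate
m.i                  -- element access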
Trouble with the dollar sign
The interaction of the dollar sign is troublesome. It is there to select the source package for an operation when it would otherwise be ambiguous. In practice it also seems to be used to denote that an expression should be evaluated from the Common Lisp environment, and it means something when it appears alone or at the end of an expression.
The nud/led naming comes from the Pratt parser. It refers to whether an expression appears as a prefix or as a subsequent item.
Additionally, you can apply items without parentheses: a b is the same as a(b). This easily causes ambiguities, because parentheses are also used for grouping items, so the production rules for parentheses have to be carefully crafted.
expr_960:
expr_1000 "$" expr_960
expr_1000
expr_1000:
"$"
atom_nud expr_1000_led
atom_nud
expr_1000_led:
"#" expr_1000_led
atom_led expr_1000_led
atom_led
"$"
atom_nud:
atom
atom_nud "(" expr_0? ")"
"(" expr_0? ")"
atom_led:
atom_led "(" expr_0? ")"
atom
atom:
symbol
string
int
float
"%"
"[" expr_0? "]"
"{" expr_0? "}"
Opsuffix
Opsuffix is one of those things I really don't like at all.
opsuffix:
"$" "%"
"$" symbol
"$" "(" expr_0? ")"
Unused keywords
The parser source declares several keywords that do not appear in the above grammar.
keywords:
"exit"
"from"
"iterate"
"yield"
"otherwise"
"when"
"unless"
I left them out because they are not used anywhere in the source code, but I wrote them down so that it will be noticed when they do appear.
Final thoughts
Axiom is a well-preserved programming language that is more than 40 years old.
I did not say much about the tokenizer, but the tokenizer in Axiom does not look very interesting. The only odd thing it does is use the underscore as an escape character.
I keep asking myself why I found this, and why only now. I think the question that brought me here is:
What does A + B mean when A and B are not same specific type, or not identified as specific types of something?
I keep going until I find answers.