Studying Axiom with a Context-Free Grammar
Axiom is implemented in Common Lisp and parsed with a modified Pratt parser. A preparsing step inserts parentheses and semicolons to implement indentation sensitivity.
The parser works, but it is difficult to figure out how the language ought to be parsed from the contents of a hand-written parser.
I wrote a context-free grammar for Axiom. Ultimately, I am attempting to interpret this language and learn some details about its innards.
The properties of the grammar
The language described by this grammar is not equivalent to the one parsed by the original implementation. I did ensure that it unambiguously parses the source code of the packages that come with Axiom.
I am not certain that the whole parse is correct because there are small details that were difficult to understand.
Preliminary
I assume that you understand what context-free grammars are, but a short example may help with the notation used here. The following grammar accepts the string "little mary had a lamb":
sentence: who "had" what
who:
"little" "mary"
"duke" "nukem"
what:
"a" "lamb"
A grammar's purpose is to reveal the structure of the strings that are valid in the language it represents. The production rules are generative, so a string may be ambiguous, i.e. it may parse in multiple ways.
Acquisition
Obtaining a grammar for a hand-written parser is difficult. I tried several approaches to recovering a grammar from the Pratt parser until I found one that worked.
The method that worked was to start with the following grammar:
expr:
atom+
atom:
symbol
string
int
all_special_characters
"(" expr? ")"
"[" expr? "]"
"{" expr? "}"
Next, you search out the topmost elements that exist in the language and place them in their respective positions in the grammar. Eventually I figured out how to put together everything that appears below.
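For instance, under this bootstrap grammar a SPAD line such as the following (an invented example, not a quote from the sources) parses as nothing more than a flat sequence of atoms, with only the bracketing recovered:
double(n : Integer) : Integer == n + n
Here double, n and Integer are symbol atoms, the argument list falls under the parenthesis production, and :, == and + fall under all_special_characters.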
Semicolon separators
Semicolons divide the file into statements; commas do the same within parentheses.
stmt_0:
stmt_0 ";" stmt_1
stmt_1
expr_0:
expr_1 ["for" symbol "in" for_loop_head, "while" expr_100]+
expr_1
expr_1:
expr_1 ";" expr_10
expr_10
expr_10:
expr_10 "," statement
statement
The parentheses may also enclose generator expressions. They are told apart by a for or while without a repeat inside them.
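For example, an invented collection such as the following contains a for with no repeat, so it is read as a generator expression rather than as a grouping:
[i*i for i in 1..10]     -- a list built from a generator expression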
Where clause
Most of the toplevel statements in the .spad files have a where clause. It divides the statement into two parts.
stmt_1:
statement "where" statement
statement
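A typical skeleton of such a toplevel statement looks roughly like this (a schematic example in the spirit of the SPAD sources, not a quote from them):
Foo(R : Ring) : Exports == Implementation where
  Exports ==> with
    double : R -> R
  Implementation ==> add
    double(x) == x + x
The part before where is the definition being made, and the part after it supplies the local definitions it refers to.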
Later I changed the grammar further to make use of the differences at this point.
Statements - Dangling else
Statements present several difficulties for the grammar. The worst of them is the dangling else.
The s_prefix macro implements the numerous assignment operators this language comes with.
statement:
s_prefix(if_then)
s_prefix(if_then_else)
if_then_else:
"if" expr_100 "then" s_prefix(if_then_else) "else" s_prefix(if_then_else)
iterator* "repeat" statement
expr_100
"return" expr_100?
"leave" expr_100?
if_then:
"if" expr_100 "then" statement
"if" expr_100 "then" s_prefix(if_then_else) "else" s_prefix(if_then)
The above grammar ensures that the else expressions end up being associated with the innermost conditional in a statement. The if_then_else rule coalesces the remaining statement forms, because those other statements are found within these ones. Without the dangling else problem this would be a much simpler grammar.
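For instance, consider this invented nested conditional:
if a then if b then x else y
Under the grammar above, the else can only be derived through the inner if_then_else; it cannot attach to the outer if.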
Statements - Loops
The syntax of Axiom presents the idea that generator clauses, streams, and loops are related constructs. This is a nice viewpoint I will likely incorporate into my own language's syntax eventually.
The same idea is present in Common Lisp's loop macro, perhaps in a slightly stronger form.
I am not sure how these iterators stack within each other.
The repeat clause builds a loop block out of them. The repeat may also appear alone, and in that case it is the same as "while true repeat ...".
iterator:
"for" symbol "in" for_loop_head
"while" expr_100
"until" expr_100
for_loop_head:
expr_100 (["|", "suchthat"] expr_100)?
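Illustrative loops of the kind these rules are meant to cover (invented lines, not quotes from the sources):
for i in 1..10 | odd? i repeat s := s + i   -- for with a suchthat filter
while x > 0 repeat x := x - 1               -- while iterator driving a repeat
repeat x := x + 1                           -- bare repeat, an endless loop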
Statements - Assignment
The assignment ends up awkwardly inside the dangling else problem. The result is two sub-grammars, s_prefix(if_then_else) and s_prefix(if_then), that have a similar shape:
s_prefix(stmt):
expr_100 "==>" s_prefix(stmt)
expr_100 "==" s_prefix(stmt)
expr_100 "with" s_prefix(stmt)
expr_100 "add" s_prefix(stmt)
expr_100 "==>" "with" s_prefix(stmt)
expr_100 "==>" "add" s_prefix(stmt)
expr_100 "==" "with" s_prefix(stmt)
expr_100 "==" "add" s_prefix(stmt)
expr_100 ":=" s_prefix(stmt)
expr_100 ":-" s_prefix(stmt)
expr_100 "=>" s_prefix(stmt)
expr_100 "+->" s_prefix(stmt) # likely not correct place for this.
":" s_prefix(stmt)
stmt
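For reference, these rules are meant to cover definition and assignment lines such as the following invented examples:
x := 3                        -- assignment
square(n) == n * n            -- definition
NNI ==> NonNegativeInteger    -- macro definition
x < 0 => error "negative x"   -- early exit from the enclosing block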
The with and add are weird but important parts of this language. with is used to provide properties for domains, whereas add provides implementations for operations in domains.
The category and domain constructors are allowed to contain conditionals and loops. This hints that domains and categories are not parametric in themselves; instead they are always constructed by a function that has parameters.
Expressions
The "or" and "and" here are left-recursive. It may be necessary to see in practice which precedence rules the Axiom used.
expr_100:
expr_100 "or" expr_120
expr_100 "\\/" expr_120
expr_120
expr_120:
expr_120 "and" expr_130
expr_120 "/\\" expr_130
expr_130
The original parser likely accepts ^, not, and ~ as stacked prefixes, but this did not appear in practice, so I left recursion out of these groups.
expr_130:
"^" expr_150
"not" expr_150
"~" expr_150
expr_150
expr_150: # 399 400
"has" expr_200
"=" opsuffix? expr_200
expr_200 "<" opsuffix? expr_200
expr_200 ">" opsuffix? expr_200
expr_200 "<=" opsuffix? expr_200
expr_200 ">=" opsuffix? expr_200
expr_200 "=" opsuffix? expr_200
expr_200 "^=" opsuffix? expr_200
expr_200 "~=" opsuffix? expr_200
expr_200 "in" expr_200
expr_200 "case" expr_200
expr_200 "has" expr_200
expr_200 "is" expr_200
expr_200 "isnt" expr_200
expr_200
The .. forms a generator expression. I bet it only appears in some places, but I am not sure which; in any case the original parser appears to accept it just about anywhere.
expr_200:
expr_200 ".."
expr_200 ".." expr_300
expr_300
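Illustrative uses (invented lines):
s := 1..10                                -- a bounded segment as a value
for i in 1.. while i < 5 repeat k := i    -- an open-ended segment bounded by a while iterator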
There is an optional opsuffix spread across the grammar; what it is will be explained later.
The precedence rules of Axiom seem to group neatly even if there are a lot of them.
expr_300:
expr_300 "+" opsuffix? expr_400
expr_300 "-" opsuffix? expr_400
expr_400
expr_400:
"-" opsuffix? expr_400
"+" opsuffix? expr_400
expr_500
expr_500:
expr_500 "*" opsuffix? expr_800
expr_500 "rem" opsuffix? expr_800
expr_500 "mod" opsuffix? expr_800
expr_500 "quo" opsuffix? expr_800
expr_500 "div" opsuffix? expr_800
expr_500 "/" opsuffix? expr_800
expr_500 "exquo" opsuffix? expr_800
expr_800
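Under these rules an invented expression such as the following groups in the usual way:
a - b * c ** 2      -- parsed as a - (b * (c ** 2))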
Reducers
In some places I saw expressions such as +/[1,2,3]. I suppose it evaluates to 1 + (2 + 3), so this would be a syntax for reducing sequences. I saw it used with only a few kinds of operators, but you can likely use any operator you want.
reducer:
"+"
"-"
"*"
"or"
"and"
expr_800:
reducer "/" expr_900
expr_800 "**" opsuffix? expr_900
expr_800 "^" expr_900
expr_900
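My reading of such reductions, which I have not verified against the implementation:
+/[1,2,3]        -- presumably 1 + 2 + 3
*/[1,2,3,4]      -- presumably 1 * 2 * 3 * 4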
Various tight binding operators
It took a lot of effort to figure out the correct order of the following operators, and I think I did not get it entirely right. For example, sometimes the opsuffix appears in a place that hints it does not structure the way I believe it does.
expr_900:
expr_900 ":" expr_910
expr_900 "::" expr_910
expr_900 "@" expr_910
expr_910
expr_910:
expr_910 "pretend" expr_920
expr_910 "." expr_920
expr_920
expr_920:
expr_930
expr_930:
"->" expr_950
expr_930 "->" expr_950
expr_930 "<-" expr_950
expr_950
expr_950:
"!" expr_955
expr_955 "!!" expr_955
expr_955
expr_955:
"#" opsuffix? expr_955
"'" expr_955
expr_960
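Typical uses of these tight-binding operators look roughly like this (invented lines, with my own reading of what each one does):
x :: Integer         -- coerce the value to Integer
x @ Integer          -- state the target type of the expression
r pretend Integer    -- reinterpret without conversion
#l                   -- size of an aggregate
m.i                  -- element access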
Trouble with the dollar sign
The interaction of the dollar sign is troublesome. It is there to select the source package for an operation when it would otherwise be ambiguous. In practice it also seems to be used to denote that an expression should be evaluated from the Common Lisp environment, and it means something when it appears alone or at the end of an expression.
The nud/led naming comes from the Pratt parser. It refers to whether an expression appears as a prefix or as a subsequent item.
Additionally, you can apply items without parentheses: a b is the same as a(b). This easily causes ambiguities, because parentheses are also used for grouping items, so the production rules for parentheses have to be carefully crafted.
expr_960:
expr_1000 "$" expr_960
expr_1000
expr_1000:
"$"
atom_nud expr_1000_led
atom_nud
expr_1000_led:
"#" expr_1000_led
atom_led expr_1000_led
atom_led
"$"
atom_nud:
atom
atom_nud "(" expr_0? ")"
"(" expr_0? ")"
atom_led:
atom_led "(" expr_0? ")"
atom
atom:
symbol
string
int
float
"%"
"[" expr_0? "]"
"{" expr_0? "}"
Opsuffix
Opsuffix is one of those things I really don't like at all.
opsuffix:
"$" "%"
"$" symbol
"$" "(" expr_0? ")"
Unused keywords
The parser source declares several keywords that do not appear in the above grammar.
keywords:
"exit"
"from"
"iterate"
"yield"
"otherwise"
"when"
"unless"
I left them out because they are not used anywhere in the source code, but I wrote them down so that it will be noticed when they do appear.
Final thoughts
Axiom is a well-preserved programming language that is more than 40 years old.
I did not say much about the tokenizer, but the tokenizer in Axiom does not look very interesting. The only odd thing it does is use the underscore as an escape character.
I keep asking myself why I found this, and why only now. I think the question that brought me here is:
What does A + B mean when A and B are not same specific type, or not identified as specific types of something?
I keep going until I find answers.