Syntax is the last thing you should design
If anyone listens to me about language design, here is a valuable lesson I'd like to pass along: avoid focusing on syntax or grammar when you design a new programming language.
The temptation to create the syntax first is great. It's common to find programming languages that exist only on paper, as syntactic rules written by a novice author. This usually reflects a failure to understand that syntax is a function of the design.
How the language appears to the programmer should follow from the qualities of the programming language as a whole. Syntax is analogous to appearance, and it's the most visible part of the language. Although it is critical to get the details of the syntax right, settling it prematurely will skew the whole language.
Syntax is a difficult and laborious part of a programming language project because many of the decisions you make demand changes to the syntax of the language.
It's tempting to design the syntax up front. After all, parsing is the first task a compiler or interpreter performs after reading a file. Few people have proceeded beyond this point.
Many expeditions are doomed and officially terminated by misunderstandings related to syntax. But there are several foolproof ways to avoid the fate of the millions of former prospective language designers.
The general solution is to considerably decrease the time it takes to change the syntax. If you are designing a new language rather than reimplementing an existing one, you can discard handwritten parsers as a choice entirely.
Towards the right direction
Handwritten parsers can be an extremely bad choice for a new programming language designer. Even slight changes to the syntax can require large changes to the parser. The workload itself discourages you from making improvements that would lead to changes in the syntax of the language.
The worst problem with a handwritten parser for a new language is that it may contain subtle errors. You end up asking yourself whether a behavior the parser exhibits was intended or is an error.
A better choice would be to use an LR or LALR parser generator. It makes it easier to ensure the correctness of the language, but it comes with its own problems.
An LR parser generator attempts to convert a context-free grammar into a pushdown automaton. A parser implemented this way has the nice property that it can work through the input in a single pass and perform a reduction as soon as the input is recognized.
Conversion from a context-free grammar into a pushdown automaton is a very strict and computation-intensive operation: not every context-free grammar can be converted as written. The effort of resolving shift/reduce and reduce/reduce conflicts easily becomes a frustrating problem in itself.
Solution 0. Forth/concatenative syntax
Forth-like parsers retrieve words separated by whitespace and interpret the words themselves as commands.
The virtue of a Forth parser is that it simplifies parsing a lot and lets you focus on the difficult parts of your language.
The syntax achieved this way doesn't end up being attractive to many people but it also has a reduced capability to cripple your progress from the start.
A whole Forth environment can be so simple that there have been Forth-based operating systems that consume only a few hundred bytes of space.
The following code shows how a Forth parser is implemented.
    def parse():
        # Read words until the input is exhausted and act on each one.
        word = get_word()
        while word != "":
            if word in commands:
                commands[word]()
            elif word.isdigit():
                stack.append(int(word))
            else:
                raise Syn(word)
            word = get_word()

    def get_word():
        # Skip leading whitespace, then collect characters up to the next whitespace.
        ch = getch()
        while ch != EOF and ch.isspace():
            ch = getch()
        string = ""
        while ch != EOF and not ch.isspace():
            string += ch
            ch = getch()
        return string
getch - gets a character from the stream. If there are no characters remaining, it returns the EOF value.
commands - associates words with functions.
stack - the 'data stack' of the interpreter, used to pass values between the commands.
Syn - syntax error exception. Raised on bad syntax.
Input parsed and interpreted this way can be thought of as a parse tree visited in postorder.
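To make the postorder remark concrete, here is a minimal sketch. It assumes a hypothetical tuple-based parse tree of the form (operator, left, right), which is not part of the parser above, and shows that visiting the children before the operator produces exactly the word order a Forth interpreter expects.

```python
# Hypothetical parse tree: (operator, left, right), or a bare number for a leaf.
def postorder(node):
    """Yield Forth words by visiting both children before the operator."""
    if isinstance(node, tuple):
        op, left, right = node
        yield from postorder(left)
        yield from postorder(right)
        yield op
    else:
        yield str(node)

tree = ("*", ("+", 1, 2), 3)          # the infix expression (1 + 2) * 3
forth_source = " ".join(postorder(tree))
print(forth_source)                    # prints: 1 2 + 3 *
```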
I haven't provided complete examples, in order to keep the code simple and easy to study. In a complete parser you would have to track the line, column and/or current word in the input. You may also want to keep track of stack effects and raise an error on stack underflow.
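As a rough idea of what that extra bookkeeping could look like, the sketch below wraps the same loop in a class. The class name Forth, the word_pos attribute, the choice of an empty string for EOF, and the three sample commands are all assumptions made for illustration. It records where each word starts and reports stack underflow as a Syn error with that position.

```python
import io

EOF = ""  # assumption: getch returns an empty string at end of input

class Syn(Exception):
    """Syntax error that remembers where the offending word started."""
    def __init__(self, message, line, column):
        super().__init__(f"{line}:{column}: {message}")
        self.line, self.column = line, column

class Forth:
    def __init__(self, source):
        self.stream = io.StringIO(source)
        self.line, self.column = 1, 0
        self.word_pos = (1, 0)
        self.stack = []
        self.commands = {
            "+": lambda: self.binary(lambda a, b: a + b),
            "*": lambda: self.binary(lambda a, b: a * b),
            "dup": self.dup,
        }

    def getch(self):
        # Read one character and keep the line/column counters up to date.
        ch = self.stream.read(1)
        if ch == "\n":
            self.line, self.column = self.line + 1, 0
        elif ch != EOF:
            self.column += 1
        return ch

    def get_word(self):
        ch = self.getch()
        while ch != EOF and ch.isspace():
            ch = self.getch()
        self.word_pos = (self.line, self.column)  # position of the word's first character
        word = ""
        while ch != EOF and not ch.isspace():
            word += ch
            ch = self.getch()
        return word

    def pop(self):
        if not self.stack:
            raise Syn("stack underflow", *self.word_pos)
        return self.stack.pop()

    def binary(self, fn):
        b, a = self.pop(), self.pop()
        self.stack.append(fn(a, b))

    def dup(self):
        a = self.pop()
        self.stack += [a, a]

    def run(self):
        word = self.get_word()
        while word != "":
            if word in self.commands:
                self.commands[word]()
            elif word.isdigit():
                self.stack.append(int(word))
            else:
                raise Syn(f"unknown word {word!r}", *self.word_pos)
            word = self.get_word()
        return self.stack

print(Forth("1 2 +\n3 *").run())   # prints [9]
```

Running Forth("1 +").run() raises Syn with the message "1:3: stack underflow", because the "+" that found too few values starts at line 1, column 3.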
Solution 1. Lisp syntax
Lisp parsers parse the input into lists and strings, which are themselves basic constructs in the Lisp language. The approach is similar to Forth and restricts your language to strict forms.
Parsing Lisp conveniently requires you to look ahead one character during parsing. Here follows a short implementation of a Lisp parser:
    def main():
        for expr in parse():
            interpret(expr)

    def parse():
        get_expr = LispParser(getch)
        out = []
        expr = get_expr()
        while expr != "":
            out.append(expr)
            expr = get_expr()
        if get_expr.ch != EOF:
            # An unmatched ")" also stops the loop; report it instead of ignoring it.
            raise Syn("unexpected ')'")
        return out

    class LispParser:
        def __init__(self, getch):
            self.getch = getch
            self.ch = getch()  # one character of lookahead

        def __call__(self):
            # Returns the next expression, or "" at EOF or in front of a ")".
            while self.ch != EOF and self.ch.isspace():
                self.ch = self.getch()
            if self.ch == "(":
                self.ch = self.getch()
                out = []
                word = self()
                while word != "":
                    out.append(word)
                    word = self()
                if self.ch != ")":
                    raise Syn("expected ')'")
                self.ch = self.getch()
                return out
            string = ""
            while (self.ch != EOF
                   and not self.ch.isspace()
                   and self.ch not in "()"):
                string += self.ch
                self.ch = self.getch()
            return string
interpret - takes an expression and evaluates it. There is no explicit data stack because the lists can contain their own parameters. The remaining parts are the same as in the Forth sample.
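To show the pieces working together, here is a self-contained sketch that feeds a string through a LispParser like the one above and evaluates the result. The helper read_all, the OPS table, the choice of an empty string for EOF, and this particular interpret are assumptions for illustration only; a real language would define its own evaluation rules.

```python
import io
import math

EOF = ""  # assumption: getch returns an empty string at end of input

class Syn(Exception):
    """Raised on bad syntax."""

class LispParser:
    def __init__(self, getch):
        self.getch = getch
        self.ch = getch()  # one character of lookahead

    def __call__(self):
        while self.ch != EOF and self.ch.isspace():
            self.ch = self.getch()
        if self.ch == "(":
            self.ch = self.getch()
            out = []
            word = self()
            while word != "":
                out.append(word)
                word = self()
            if self.ch != ")":
                raise Syn("expected ')'")
            self.ch = self.getch()
            return out
        string = ""
        while self.ch != EOF and not self.ch.isspace() and self.ch not in "()":
            string += self.ch
            self.ch = self.getch()
        return string

def read_all(source):
    """Parse every top-level expression in `source` into nested lists."""
    stream = io.StringIO(source)
    get_expr = LispParser(lambda: stream.read(1))
    out = []
    expr = get_expr()
    while expr != "":
        out.append(expr)
        expr = get_expr()
    if get_expr.ch != EOF:
        raise Syn("unexpected ')'")
    return out

# Hypothetical built-ins for the demo.
OPS = {"+": lambda *xs: sum(xs), "*": lambda *xs: math.prod(xs)}

def interpret(expr):
    if isinstance(expr, list):          # a list is a call: (function args...)
        if not expr:
            raise Syn("empty application")
        fn, *args = [interpret(item) for item in expr]
        return fn(*args)
    if expr.isdigit():
        return int(expr)
    if expr in OPS:
        return OPS[expr]
    raise Syn(f"unknown symbol {expr!r}")

print(read_all("(a (b c) ())"))                               # [['a', ['b', 'c'], []]]
print([interpret(e) for e in read_all("(+ 1 (* 2 3)) 7")])    # [7, 7]
```

Note how interpret never touches a data stack: the arguments of each call travel inside the list itself.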
Solution 0-1b. Pattern matching
The above approaches rely on the idea of splitting parsing into two stages. The first stage is a simple algorithm for converting text input into structures. The second stage is achieved by pattern matching on those structures.
In a Forth environment, some of your commands grab additional characters or even words from the stream and give them an alternative interpretation.
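A minimal sketch of such commands follows, assuming a small self-contained interpreter (the function run and its helpers are hypothetical). The "(" command grabs raw characters up to the closing ")" and treats them as a comment, while ":" grabs whole words: the first names a new command and the rest, up to ";", become its body.

```python
import io

class Syn(Exception):
    """Raised on bad syntax."""

def run(source):
    stream = io.StringIO(source)
    stack = []

    def get_word():
        ch = stream.read(1)
        while ch and ch.isspace():
            ch = stream.read(1)
        word = ""
        while ch and not ch.isspace():
            word += ch
            ch = stream.read(1)
        return word

    def execute(word):
        if word in commands:
            commands[word]()
        elif word.isdigit():
            stack.append(int(word))
        else:
            raise Syn(f"unknown word {word!r}")

    def comment():
        # Grabs characters: skip everything up to the closing ")".
        ch = stream.read(1)
        while ch and ch != ")":
            ch = stream.read(1)

    def define():
        # Grabs words: the first names the new command, the rest form its body.
        name = get_word()
        body = []
        word = get_word()
        while word != ";":
            if word == "":
                raise Syn("expected ';'")
            if word == "(":
                comment()           # comments take effect immediately, even inside a body
            else:
                body.append(word)
            word = get_word()

        def call():
            for w in body:
                execute(w)
        commands[name] = call

    def dup():
        stack.append(stack[-1])

    def mul():
        b, a = stack.pop(), stack.pop()
        stack.append(a * b)

    commands = {"(": comment, ":": define, "dup": dup, "*": mul}

    word = get_word()
    while word != "":
        execute(word)
        word = get_word()
    return stack

print(run(": square ( n -- n*n ) dup * ; 7 square"))   # prints [49]
```

The main loop stays exactly as simple as before; all of the "syntax" for comments and definitions lives in two ordinary commands.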
In a lisp environment the meaning of a list doesn't have to depend on items that come before or after it. Lisp expressions themselves form graphs you can pattern match on.
A list pattern matcher is in essence a variation of a regular expression matcher. Whereas regex matches on strings, a list pattern matches on lists.
Here's an illustration of how a list pattern matcher might be implemented:
    def interpret_macro(expr):
        # Dispatch on the head of the list; returns None when no macro applies.
        if is_list(expr) and expr and not is_list(expr[0]) and expr[0] in macros:
            pattern, fn = macros[expr[0]]
            return fn(*match(pattern, expr))
        return None

    def match(pattern, obj):
        # Returns a list of captures, or raises Mismatch.
        if pattern == "." and obj:
            return [obj]
        if is_list(pattern) and is_list(obj):
            res = []
            i = 0
            L = len(obj)
            for flavor, subpat in pattern:
                if flavor == "1":  # one
                    if i >= L:
                        raise Mismatch(pattern, obj)
                    res.extend(match(subpat, obj[i]))
                    i += 1
                elif flavor == "*":  # many
                    xr = []
                    while i < L:
                        try:
                            xr.extend(match(subpat, obj[i]))
                        except Mismatch:
                            break
                        i += 1
                    res.append(xr)
                elif flavor == "?":  # zero or one
                    r = []
                    if i < L:
                        try:
                            r = match(subpat, obj[i])
                            i += 1
                        except Mismatch:
                            pass  # nothing matched: zero occurrences
                    res.append(r)
            if i == L:
                return [res]
        elif pattern == obj:
            return []
        raise Mismatch(pattern, obj)
macros - a table that maps a head word to a pattern and the command to run when it matches. When the macros are run depends on your language.
Mismatch - an exception, like Syn before. The match function either successfully matches the input or raises this error.
is_list - a function which must recognize the lists in your language.
flavor - in the pattern, refers to how many items the subpattern is supposed to match.
A pattern matcher such as this allows you to describe the syntax for your language in terms of lists:
(if . .)
(cond (. .)* (else .)?)
(return .?)
(let . .)
(lambda (.*) .*)
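One way to get from the notation above to something a matcher can consume is sketched below. The helpers read and compile_pattern and the PATTERNS table are assumptions made for this example: read turns the text into nested lists, and compile_pattern converts "*" and "?" suffixes into the (flavor, subpattern) pairs that a matcher along the lines of the one above expects.

```python
import re

class Mismatch(Exception):
    pass

def read(text):
    """Read one balanced parenthesized form into nested lists of strings."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    def walk(i):
        if tokens[i] != "(":
            return tokens[i], i + 1
        items, i = [], i + 1
        while tokens[i] != ")":
            item, i = walk(i)
            items.append(item)
        return items, i + 1
    return walk(0)[0]

def compile_pattern(form):
    """Turn a form such as (return .?) into (flavor, subpattern) pairs."""
    if not isinstance(form, list):
        return form
    pairs = []
    for item in form:
        if item in ("*", "?") and pairs:            # suffix after a list: (. .)*
            pairs[-1] = (item, pairs[-1][1])
        elif isinstance(item, str) and len(item) > 1 and item[-1] in "*?":
            pairs.append((item[-1], item[:-1]))     # suffix glued to an atom: .*
        else:
            pairs.append(("1", compile_pattern(item)))
    return pairs

def match(pattern, obj):
    """Return a list of captures, or raise Mismatch."""
    if pattern == "." and obj:
        return [obj]
    if isinstance(pattern, list) and isinstance(obj, list):
        res, i = [], 0
        for flavor, subpat in pattern:
            if flavor == "1":                       # exactly one
                if i >= len(obj):
                    raise Mismatch(pattern, obj)
                res.extend(match(subpat, obj[i]))
                i += 1
            elif flavor == "*":                     # zero or more
                xr = []
                while i < len(obj):
                    try:
                        xr.extend(match(subpat, obj[i]))
                    except Mismatch:
                        break
                    i += 1
                res.append(xr)
            elif flavor == "?":                     # zero or one
                r = []
                if i < len(obj):
                    try:
                        r = match(subpat, obj[i])
                        i += 1
                    except Mismatch:
                        pass
                res.append(r)
        if i == len(obj):
            return [res]
    elif pattern == obj:
        return []
    raise Mismatch(pattern, obj)

PATTERNS = {}
for text in ["(if . .)", "(return .?)", "(let . .)", "(lambda (.*) .*)"]:
    form = read(text)
    PATTERNS[form[0]] = compile_pattern(form)

captures, = match(PATTERNS["let"], read("(let x (+ 1 2))"))
print(captures)   # prints ['x', ['+', '1', '2']]
```

This matcher is greedy and never backtracks, so with the cond pattern a subpattern such as `(. .)*` would also swallow a trailing `(else .)` clause. A language that needs the distinction would have to order or refine its patterns accordingly.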
Solution 2. Chart parsing
The three earlier solutions are useful starting points and useful research tools. They give you a kickstart past the parsing problem and let you tackle the core challenges in designing your language. In exchange, they limit your syntax to fixed forms.
There is yet another approach that takes a little more work but allows a grammar and syntax that are less restricted than in many modern programming languages, while still dropping the labor of maintaining the syntax.
The approach is to use a chart parser. This kind of algorithm lets you represent the whole language as a context-free grammar and use that grammar directly to parse the input.
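As an illustration of the idea, here is a minimal sketch of an Earley recognizer, which is one kind of chart parser. It is a sketch under stated assumptions rather than a complete parser: it only answers whether the input matches the grammar (it builds no parse tree), it assumes the grammar has no empty right-hand sides, and the toy expression grammar and all names are made up for the example.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples).
    Any symbol that is not a key of `grammar` is treated as a terminal.
    Assumption: no right-hand side is empty.
    """
    # chart[i] holds items (lhs, rhs, dot, origin) that are valid after i tokens.
    chart = [[] for _ in range(len(tokens) + 1)]

    def add(i, item):
        if item not in chart[i]:
            chart[i].append(item)

    for rhs in grammar[start]:
        add(0, (start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        for lhs, rhs, dot, origin in chart[i]:      # chart[i] may grow while we iterate
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                  # predict: expand the nonterminal
                    for alt in grammar[sym]:
                        add(i, (sym, alt, 0, i))
                elif i < len(tokens) and tokens[i] == sym:
                    add(i + 1, (lhs, rhs, dot + 1, origin))   # scan: consume a token
            else:                                   # complete: advance the waiting items
                for l2, r2, d2, o2 in chart[origin]:
                    if d2 < len(r2) and r2[d2] == lhs:
                        add(i, (l2, r2, d2 + 1, o2))

    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[-1])

# A toy grammar, written down directly, left recursion included.
GRAMMAR = {
    "expr": [("expr", "+", "term"), ("term",)],
    "term": [("term", "*", "atom"), ("atom",)],
    "atom": [("n",), ("(", "expr", ")")],
}

print(earley_recognize(GRAMMAR, "expr", "n + n * n".split()))   # prints True
print(earley_recognize(GRAMMAR, "expr", "n + * n".split()))     # prints False
```

Changing the syntax here means editing the GRAMMAR table; there are no conflicts to resolve and no parser code to rewrite.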