Mona Compiler Development, part 3: AST construction and scope management

My previous MCD post dealt with the first part of syntax analysis, specifically, processing a stream of tokens according to rules of some formal grammar. This post deals with the output of that phase: the construction of abstract symbol trees and scope management. Also, I've been asked to complete the programming language checklist for Mona and it can be found at the end of this post!

To summarize the first section of the previous post, the intent of the syntax analysis phase (or parsing), is translating programmer-written (and tokenized) code into a more structured format called an abstract syntax tree (AST), which the compiler can process more easily. Additionally, while parsing, the compiler also collects information about identifiers in the code (names of functions, variables, types, and other constructs) and their relation to different locations in the code -- called scoping.

Monac AST model

Abstract syntax trees are the first layer of internal representation of the inputted code in the compiler (not counting the token stream). Their function is to capture the implied grammatical hierarchy of the token stream into an explicit, language-native form. The motivation for this is obvious: it is necessary in order to be able to process the code as anything more than a plain token stream.

ASTs are a data structure consisting of nodes -- individual members of the tree growing into subtrees -- which describe some part of the inputted code. Each node in the tree has a kind, which captures a specific code construct or part of syntax. For example, there are nodes which represent:

  1. Function calls
  2. Lists of statements to be executed sequentially (block expressions)
  3. Constants and identifiers
  4. Statements, including definitions and expression-statements
  5. etc.

In monac, ASTs are modeled as hierarchical case classes, with the top of the hierarchy being a simple construct called ASTNode which is the parent of all other kinds of AST nodes.

AST visualization

  • visualization of monac AST output

There are also syntax forms which are intermediate, a concept similar to (but wider than) syntax sugar. if expressions are an example of that: although there exists a specific ASTNode case class which captures this form, later it is converted into a more fundamental combination (using lambda functions). There are other, purely syntactical kinds of ASTNodes, such as a list of a function's arguments, which are processed right away during parsing.

Modelling scope

Scoping broadly refers to the availability of identifiers in certain locations in the code. For example, identifiers for bindings local to some function are only available inside the definition of that function, i.e. its scope.

Since parsing processes grammatical structures in the code, it is intuitive that it also determines the scope of the identifiers it encounters. It does so by recording them in a structure called a symbol table (symbol refers to identifiers). There are various ways to actually implement this process, but in monac, each scope (a file, function, or a block expression) is associated with its own table, and the tables exist in a hierarchy as do the actual scopes. In the following example, three symbol tables are created to model three layers of scope:

someFunction a = /*scope A*/ {  
    /*scope B*/

    {
        /*scope C*/
    }
}

Scope A is the scope created by the function expression (the piece of code after the = symbol), and it is created automatically by all function definitions. Generally the only identifiers recorded in definition scopes are the function's arguments, as it is difficult (due to the syntax) to introduce new identifiers to them. However, it serves as a parent scope to all subsequent scopes created inside the function's definition.

In functions with simple defining expressions (such as f x = 2*x, i.e. without block expressions), the definition scope is usually the only one created by the function (one could create a function which defines a single binding using the let keyword, but that wouldn't have a real purpose and I might remove the option).

Scope B is created by the first block expression, and scope C by the one inside it, and the two more clearly illustrate the hierarchical relation between scopes.

In the code, scopes are implemented as simple maps of type String -> Symbol, residing in a tree structure denoting their hierarchy. The Scope class is rather simple and can be seen here: Scope.scala. Other than having a single link to their parent, scopes are also embedded in relevant AST nodes.

Code fragments

As the parser processes the stream of tokens fed to it by the lexer, it tries to match production rules in the grammar to it, in order to determine the structures which the code is describing. For example, there could be a rule which specified that functions are defined by a name, followed by a string of parameters, then an = symbol, and finally an expression. Thus, if the parser expected a function definition to follow, it would match the concrete tokens to the elements defined by the production rule.

Each production rule is associated with a specific code fragment (implemented as a Scala function), which defines some side effect of applying the said production rule. Without code fragments, the parser merely recognizes the grammatical structures in the code by rewriting the nonterminal and terminal symbols specified by the formal language grammar (according to the input) -- code fragments allow for the creation of ASTs and the populating of symbol tables (scopes).

The fragments are referenced inside the grammar specification, here's a code example:

StatementSequenceNT -> List(StatementNT, StatementSequencePrimeNT) -> Fragments.statementSequence,  

Other examples can be found in the MonaGrammar.scala file.

The production rule is structured using the -> Scala syntax for creating tuples, and follows the form of NONTERMINAL -> LIST OF NONTERMINALS AND TERMINALS -> FRAGMENT FOR RULE. This is explained in more detail in the previous article in the section Specification of the Mona grammar. In the above example, StatementSequenceNT is the object denoting the nonterminal symbol for a sequence of statements (used in block expressions).

Once the parser selects the required production rule, it recursively descends and parses the elements specified in the rule. The fragment functions create and populate relevant scopes, and generate ASTs for the elements, which are then used during the parsing of the original nonterminal. The parse function accepts a single nonterminal argument, which it then expands by selecting the relevant production rule. Its return value is the AST of the parsed nonterminal (and its side effect the recording of identifiers).

What's next?

The frontend of the compiler is now mostly done. What remains of the frontend work revolves around the expansion of existing features, like the grammar specification and optimizations/reductions of the ASTs.

However, instead of sticking to that, I am moving to the backend, i.e. emitting runnable code from an intermediate representation created by the frontend. The reason for that is the fact that expansion is probably more work, but it's less complicated, and since I probably may realize some changes need to be made to the frontend, or I may currently hold some misconceptions about the backend, it makes more sense to just go as deep in the implementation as possible.

Programming language checklist

As I've mentioned, here's the programming language checklist for Mona :) Selections are bold and [my comments are in square brackets like this]

You appear to be advocating a new:

  • functional
  • imperative
  • object-oriented
  • procedural
  • stack-based
  • "multi-paradigm"
  • [by-choice] lazy
  • eager
  • statically-typed
  • dynamically-typed
  • [partially] pure
  • impure
  • non-hygienic
  • visual
  • beginner-friendly
  • non-programmer-friendly
  • completely incomprehensible

programming language. Your language will not work. Here is why it will not work.

You appear to believe that:

  • Syntax is what makes programming difficult
  • Garbage collection is free
  • Computers have infinite memory
  • Nobody really needs:
    • concurrency
    • a REPL
    • debugger support
    • IDE support
    • I/O [jk]
    • to interact with code not written in your language
  • The entire world speaks 7-bit ASCII
  • Scaling up to large software projects will be easy
  • Convincing programmers to adopt a new language will be easy
  • Convincing programmers to adopt a language-specific IDE will be easy
  • Programmers love writing lots of boilerplate
  • Specifying behaviors as "undefined" means that programmers won't rely on them
  • "Spooky action at a distance" makes programming more fun

Unfortunately, your language (has/lacks):

  • comprehensible syntax
  • [optional] semicolons
  • significant whitespace
  • macros
  • implicit type conversion
  • explicit casting
  • type inference
  • goto
  • exceptions
  • closures
  • tail recursion
  • coroutines
  • reflection
  • subtyping [ehh it uses a Haskell-like type system so this is hard to specify, type class instances can use class constraints on type variables which is sort-of like inheriting interfaces]
  • multiple inheritance [same as above but I dare not select it]
  • operator overloading [although not a free-for-all like in C++, but cleanly managed through type classes]
  • algebraic datatypes
  • recursive types
  • polymorphic types
  • covariant array typing [not exactly applicable again]
  • monads
  • dependent types [I'm still researching this but no]
  • infix operators [infix functions]
  • nested comments
  • multi-line strings
  • regexes [not first-class if that's what this means]
  • call-by-value
  • call-by-name
  • call-by-reference
  • call-cc

The following philosophical objections apply:

  • Programmers should not need to understand category theory to write "Hello, World!" [lol not really]
  • Programmers should not develop RSI from writing "Hello, World!"
  • The most significant program written in your language is its own compiler
  • The most significant program written in your language isn't even its own compiler
  • No language spec [nothing very formal but I do keep a document or two with important decisions and examples]
  • "The implementation is the spec"
    • The implementation is closed-source [never! but I might modify the license to be non-free in case the user is called Vedran Miletić if he annoys me too much in the future]
    • covered by patents
    • not owned by you
  • Your type system is unsound
  • Your language cannot be unambiguously parsed
    • a proof of same is attached
    • invoking this proof crashes the compiler
  • The name of your language makes it impossible to find on Google
  • Interpreted languages will never be as fast as C
  • Compiled languages will never be "extensible"
  • Writing a compiler that understands English is AI-complete
  • Your language relies on an optimization which has never been shown possible [like, probably]
  • There are less than 100 programmers on Earth smart enough to use your language
  • ____________________________ takes exponential time
  • ____________________________ is known to be undecidable

Your implementation has the following flaws:

  • CPUs do not work that way
  • RAM does not work that way [lol]
  • VMs do not work that way
  • Compilers do not work that way
  • Compilers cannot work that way
  • Shift-reduce conflicts in parsing seem to be resolved using rand()
  • You require the compiler to be present at runtime
  • You require the language runtime to be present at compile-time
  • Your compiler errors are completely inscrutable [whispers there aren't any]
  • Dangerous behavior is only a warning
  • The compiler crashes if you look at it funny
  • The VM crashes if you look at it funny
  • You don't seem to understand basic optimization techniques
  • You don't seem to understand basic systems programming
  • You don't seem to understand pointers
  • You don't seem to understand functions

Additionally, your marketing has the following problems:

  • Unsupported claims of increased productivity
  • Unsupported claims of greater "ease of use"
  • Obviously rigged benchmarks [soon]
    • Graphics, simulation, or crypto benchmarks where your code just calls handwritten assembly through your FFI
    • String-processing benchmarks where you just call PCRE
    • Matrix-math benchmarks where you just call BLAS
  • Noone really believes that your language is faster than:
    • assembly
    • C
    • FORTRAN
    • Java
    • Ruby
    • Prolog
  • Rejection of orthodox programming-language theory without justification
  • Rejection of orthodox systems programming without justification [x1000, basically Mona's raison d'etre]
  • Rejection of orthodox algorithmic theory without justification
  • Rejection of basic computer science without justification

Taking the wider ecosystem into account, I would [probably] like to note that:

  • Your complex sample code would be one line in: _______________________
  • We already have an unsafe imperative language
  • We already have a safe imperative OO language
  • We already have a safe statically-typed eager functional language
  • You have reinvented Lisp but worse
  • You have reinvented Javascript but worse
  • You have reinvented Java but worse
  • You have reinvented C++ but worse
  • You have reinvented PHP but worse
  • You have reinvented PHP better, but that's still no justification
  • You have reinvented Brainfuck but non-ironically [lol]

[Do this one yourself]

In conclusion, this is what I think of you:

  • You have some interesting ideas, but this won't fly.
  • This is a bad language, and you should feel bad for inventing it.
  • Programming in this language is an adequate punishment for inventing it.
comments powered by Disqus