Tuesday, March 27, 2012

5. Creating an AST for Wally


Parts: 1 2 3 4 5 6 7

In this part we'll mix tree rewrite rules to our parser rules to let our parser create a proper AST, and some rules will be rewritten in such a way that it would become easier for us to walk the tree. For example, take the shorthand assignment operator += in a += 42. We could handle such += cases separately, but it's far easier to rewrite a += 42 into a = a + 42 making it a simple assignment: on the left there's a lookup (a), and on the right side there's an expression.

In our parser, we'd have the rules (including rewrite rules):

assignment
 : lookup ( '=' expr   -> ^(ASSIGN lookup expr)
          | '||=' expr -> ^(ASSIGN lookup ^(OR   lookup expr))
          | '&&=' expr -> ^(ASSIGN lookup ^(AND  lookup expr))
          | '+=' expr  -> ^(ASSIGN lookup ^(ADD  lookup expr))
          | '-=' expr  -> ^(ASSIGN lookup ^(SUB  lookup expr))
          | '*=' expr  -> ^(ASSIGN lookup ^(MULT lookup expr))
          | '/=' expr  -> ^(ASSIGN lookup ^(DIV  lookup expr))
          | '%=' expr  -> ^(ASSIGN lookup ^(MOD  lookup expr))
          | '^=' expr  -> ^(ASSIGN lookup ^(POW  lookup expr))
          )
 ;

causing only the following rule to be present in out tree grammar:

assignment
 : ^(ASSIGN lookup expr)
 ;

since all the ^(OR ...), ^(AND ...), etc. are expressions themselves.

If you haven't already done so, download the most recent version of the project here.

So, without further a due, here is the grammar including the rewrite rules:

grammar Wally;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  VARS;
  FUNCTIONS;
  FUNCTION;
  IDS;
  BLOCK;
  ASSIGN;
  RETURN;
  VOID;
  U_MINUS;
  LOOKUP;
  EXPRESSIONS;
  INDEX;
  INVOKE;
  CALL;
  ADD;
  SUB;
  MULT;
  DIV;
  MOD;
  POW;
  NONE;
  OBJ;
  LIST;
  TERNARY;
  OR;
  AND;
  ALT;
  TRUE;
  ATTRIBUTE;
  FOR_IN;
}

@parser::header {
  package wally.parser;

  import java.io.File;
  import java.util.Map;
  import java.util.LinkedHashMap;
  import wally.*;
}

@lexer::header {
  package wally.parser;
  
  import java.util.Queue;
  import java.util.LinkedList;
}

@parser::members { 
  private File file;

  public WallyParser(File f, CommonTokenStream stream) {
    super(stream);
    file = f;
  }
}

@lexer::members {
  private Stack<Integer> indents = new Stack<Integer>();
  private Queue<Token> tokens = new LinkedList<Token>();
  private int openBrace = 0;
  
  @Override
  public void emit(Token t) {
    state.token = t;
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    super.nextToken();
    return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
  }
}

parse
 : EOL* use_stats? file_atom* return_stat EOF -> ^(BLOCK file_atom* return_stat)
 ;

use_stats
 : use_stat+
 ;

use_stat
 : Use (t=Id | t='*') From Str EOL+
 ;

file_atom
 : object     -> /* omit from AST */
 | function   -> /* omit from AST */
 | block_atom
 ;

object
 : Obj Id EOL INDENT object_atom* DEDENT EOL* 
 ;

object_atom
 : vars
 | function
 ;

vars
 : Id (',' Id)* EOL
 ;

function
 : any_id id_list EOL block
 ;

id_list
 : '(' (Id (',' Id)*)? ')' -> ^(IDS Id*)
 ;

block
 : INDENT block_atom* return_stat DEDENT -> ^(BLOCK block_atom* return_stat)
 ;

block_atom
 : (assignment)=> assignment EOL -> assignment
 | function_call EOL             -> function_call
 | Break EOL                     -> Break
 | for_stat
 | if_stat
 | while_stat
 ;

for_stat
 : For Id '=' expr To expr EOL block -> ^(For Id expr expr block)
 | For Id In expr EOL block          -> ^(FOR_IN Id expr block)
 ;

if_stat
 : If expr EOL block elif_stat* else_stat? 
   -> ^(If ^(ALT expr block) elif_stat* else_stat?)
 ;

elif_stat
 : Elif expr EOL block -> ^(ALT expr block)
 ;

else_stat
 : Else EOL block -> ^(ALT TRUE block)
 ;

while_stat
 : While expr EOL block -> ^(While expr block)
 ;

assignment
 : lookup ( '=' expr   -> ^(ASSIGN lookup expr)
          | '||=' expr -> ^(ASSIGN lookup ^(OR   lookup expr))
          | '&&=' expr -> ^(ASSIGN lookup ^(AND  lookup expr))
          | '+=' expr  -> ^(ASSIGN lookup ^(ADD  lookup expr))
          | '-=' expr  -> ^(ASSIGN lookup ^(SUB  lookup expr))
          | '*=' expr  -> ^(ASSIGN lookup ^(MULT lookup expr))
          | '/=' expr  -> ^(ASSIGN lookup ^(DIV  lookup expr))
          | '%=' expr  -> ^(ASSIGN lookup ^(MOD  lookup expr))
          | '^=' expr  -> ^(ASSIGN lookup ^(POW  lookup expr))
          )
 ;

function_call
 : lookup 
 | Print '(' expr ')'              -> ^(Print expr)
 | Println '(' expr? ')'           -> ^(Println expr?)
 | Assert '(' expr (',' expr)? ')' -> ^(Assert expr expr?)
 ;

return_stat
 : Return expr EOL -> ^(RETURN expr)
 |                 -> ^(RETURN VOID)
 ;

expr
 : (or -> or) ( '?' a=expr ':' b=expr -> ^(TERNARY or $a $b)
              | In expr               -> ^(In expr or)
              )?
 ;

or
 : and (Or^ and)*
 ;

and
 : rel (And^ eq)*
 ;

rel
 : eq ((Lt | Gt | LtEq | GtEq)^ eq)?
 ;

eq
 : add ((Eq | NEq)^ add)*
 ;

add
 : mult ((Add | Sub)^ mult)*
 ;

mult
 : unary ((Mult | Div | Mod)^ unary)*
 ;

unary
 : Sub pow   -> ^(U_MINUS pow)
 | Not unary -> ^(Not unary)
 | pow
 ;

pow
 : (dummy_atom -> dummy_atom) ((Pow atom)+ -> ^(Pow dummy_atom atom+))?
 ;

dummy_atom
 : atom
 ;

atom
 : '(' expr ')'       -> expr
 | New Id expr_params -> ^(New Id expr_params)
 | lookup
 | list
 | Param
 | Str
 | Num
 | True
 | False
 | None
 ;

lookup
 : Id tail*           
   -> ^(LOOKUP Id tail*)
 | This tail*         
   -> ^(LOOKUP {new CommonTree(new CommonToken(Id, "this"))} tail*) 
 | any_id expr_params 
   -> ^(
         LOOKUP {new CommonTree(new CommonToken(Id, "this"))} 
         ^(CALL {new CommonTree(new CommonToken(Id, $any_id.text))} expr_params)
       )
 ;

tail
 : '.' Id                 -> ^(ATTRIBUTE Id)
 | '[' expr ']'           -> ^(INDEX expr)
 | '.' any_id expr_params -> ^(CALL {new CommonTree(new CommonToken(Id, $any_id.text))} expr_params)
 ;

any_id
 : ReservedId -> {new CommonTree(new CommonToken(Id, $ReservedId.text))}
 | In         -> {new CommonTree(new CommonToken(Id, "in"))}
 | Id
 ;

expr_params
 : '(' (expr (',' expr)*)? ')' -> ^(EXPRESSIONS expr*)
 ;

list
 : OBrace (expr (',' expr)*)? CBrace -> ^(LIST expr*)
 ;

/* lexer rules */
ReservedId 
 : 'add'    // + 
 | 'and'    // &&
 | 'div'    // /
 | 'eq'     // ==
 | 'get'    // ID[index]
 | 'gt'     // > 
 | 'gteq'   // >=
 | 'lt'     // <
 | 'lteq'   // <=
 | 'mod'    // %
 | 'mult'   // *
 | 'not'    // !
 | 'or'     // ||
 | 'pow'    // ^
 | 'set'    // ID[index] = expr
 | 'size'   // the size as a WNumber
 | 'str'    // str() is called when concatenating strings
 | 'sub'    // -
 | 'type'   // implemented by default: returns the type as a WString
 ;

In      : 'in';
New     : 'new';
For     : 'for';
To      : 'to';
Return  : 'return';
None    : 'none';
Obj     : 'obj';
Println : 'println';
Print   : 'print';
True    : 'true';
False   : 'false';
Assert  : 'assert';
If      : 'if';
Elif    : 'elif';
Else    : 'else';
While   : 'while';
Break   : 'break';
This    : 'this';
Use     : 'use';
From    : 'from';

AddAssign  : '+=';
SubAssign  : '-=';
MultAssign : '*=';
DivAssign  : '/=';
OrAssign   : '||=';
AndAssign  : '&&=';
PowAssign  : '^=';
ModAssign  : '%=';
Assign     : '=';
Add        : '+';
Sub        : '-';
Mult       : '*';
Div        : '/';
Eq         : '==';
NEq        : '!=';
Or         : '||';
And        : '&&';
Not        : '!';
Mod        : '%';
Pow        : '^';
Lt         : '<';
Gt         : '>';
LtEq       : '<=';
GtEq       : '>=';
OBrace     : '{' {openBrace++;};
CBrace     : '}' {openBrace--;};

Num
 : '0'..'9'+ ('.' '0'..'9'*)?
 ;

Param
 : '$' '0'..'9'+
 ;

Id
 : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
 ;

Str
 : '"' ('""' | ~('\r' | '\n' | '"'))* '"'
   {
     String s = getText().replace("\\n", "\n").replace(
                "\\r", "\r").replace("\\t", "\t");
     setText(s.substring(1, s.length()-1).replace("\"\"", "\""));
   }
 ;

EOL
 : ('\r'? '\n' | '\r') Spaces?
   {
     int next = input.LA(1);
     
     if(openBrace > 0 || next == '\r' || next == '\n' || next == '#') {
       // if we're inside a list or on a blank line, ignore 
       // all indents, dedents and line breaks
       skip();
     }
     else {
       emit(new CommonToken(EOL, "EOL"));

       int indent = $Spaces == null ? 0 : $Spaces.text.length();
       int previous = indents.isEmpty() ? 0 : indents.peek();

       if(indent == previous) {
         // skip indents of the same size as the present indent-size
         skip();
       }
       else if(indent > previous) {
         indents.push(indent);
         emit(new CommonToken(INDENT, "INDENT"));
       }
       else {
         while(!indents.isEmpty() && indents.peek() > indent) {
           emit(new CommonToken(DEDENT, "DEDENT"));
           indents.pop();
         }
       }
     }
   }
 ;

Skip
 : Spaces  {skip();}
 | Comment {skip();}
 ;

Any
 : .
 ;

/* fragments */
fragment Spaces
 : (' ' | '\t')+ {setText(getText().replace("\t", "    "));}
 ;

fragment Comment
 : '#' ~('\r' | '\n')*
 ;

fragment DEDENT : ;
fragment INDENT : ;
As you might have noticed, objects and functions are being removed from the AST:
file_atom
 : object     -> /* omit from AST */
 | function   -> /* omit from AST */
 | block_atom
 ;

In the next part of this tutorial, we'll be storing these types separately.

In order to test if the parser creates a proper AST, open the file src/main/wally/Wally.java and paste the following in it:

package wally;

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
import wally.parser.*;
import java.io.File;
import java.util.Scanner;

public class Wally {

    public static WallyParser getParser(File file) throws Exception {
        WallyLexer lexer = new WallyLexer(new ANTLRStringStream(contentsOf(file)));
        return new WallyParser(file, new CommonTokenStream(lexer));
    }

    private static String contentsOf(File file) throws Exception {
        StringBuilder b = new StringBuilder();
        Scanner scan = new Scanner(file);
        while(scan.hasNextLine()) {
            String line = scan.nextLine();
            b.append(line).append('\n');
        }
        return b.toString();
    }

    public static void main(String[] args) {

        if(args.length == 0) {
            System.err.println(
                "usage: java -jar Wally-0.1.jar SCRIPT-FILE [PARAMS]");
            System.exit(1);
        }

        try {
            WallyParser parser = getParser(new File(args[0]));
            CommonTree tree = (CommonTree) parser.parse().getTree();
            DOTTreeGenerator gen = new DOTTreeGenerator();
            StringTemplate st = gen.toDOT(tree);
            System.out.println(st);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The lines:

CommonTree tree = (CommonTree) parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);

cause a DOT file being printed to the console that represents the AST the parser created from the input source.

Now edit the file src/scripts/Test.wy and paste the following in it:

use Tuple from "Tuple.wy"

list = {1,2,3,4}

# comment
for x in list
  print(x + "*" + x + " = " + square(x))

square(n)
  return n*n

Now recreate a JAR file, execute the test script and pipe the output to a file called ast.dot:

ant jar
java -jar Wally.jar src/scripts/Test.wy > ast.dot

If you now open the file ast.dot, you will see a DOT representation of the AST. If you paste this in a DOT viewer, like this one, you will see an image that looks like this:

As you can see, the function square is not present in the AST as is the use statement at the start of the file. You can edit the test script a bit and test various input to see if the proper AST is created.

Download the project with the addition from this part of the tutorial here.

Next up, we'll be doing some ground work by creating custom Java objects that represent functions and objects of Wally and store these objects during the parsing phase, and we'll also handle the use statements. Continue reading here

3 comments:

Anonymous said...

That's really helpful, thank you! Looking forward to your future updates.

code terminator said...

great!

I'm looking forward the next parts

Bart Kiers said...

Thanks!

It's been too long: I'm working on part 6. I hope to finish it in the next week or so.