Wednesday, February 15, 2012

3. Lexical analysis of Wally


Parts: 1 2 3 4 5 6 7

In this part we'll be creating the lexer of Wally. Before writing any code or grammar, let's set up a small project structure with an Ant build file that will handle generating the parser and lexer, compiling the source files, and creating an executable JAR file.

Create the following folder structure:

+-- build/
|
+-- lib/
|    |
|    +-- antlr-3.3-complete.jar
|
+-- src/
|    |
|    +-- grammar/
|    |    |
|    |    +-- Wally.g
|    |
|    +-- main/
|    |    |
|    |    +-- wally/
|    |         |
|    |         +-- parser/
|    |         |
|    |         +-- Wally.java
|    |
|    +-- scripts/
|         |
|         +-- Test.wy
|
+-- build.xml

Download version 3.3 of ANTLR here: http://www.antlr.org/download/antlr-3.3-complete.jar
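
If you're on the command line anyway, something like the following should fetch it straight into the lib folder (assuming curl is available and the lib folder already exists):

curl -o lib/antlr-3.3-complete.jar http://www.antlr.org/download/antlr-3.3-complete.jar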

Wally.g

Initialize Wally.g with the following:

grammar Wally;
 
@parser::header {
  package wally.parser;
}
 
@lexer::header {
  package wally.parser;
}
 
parse
 : .* EOF
 ;
 
Any
 : .
 ;

This is just a simple stub, of course: we'll fill it in later on.

build.xml

Copy the following contents inside the build.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<project name="Wally">
 
    <path id="classpath">
        <pathelement location="build/classes"/>
        <fileset dir="lib">
            <include name="*.jar"/>
        </fileset>
    </path>
 
    <target name="clean">
        <delete file="Wally.jar"/>
        <delete dir="build/classes/"/>
        <mkdir dir="build/classes/"/>
    </target>
 
    <target name="compile" depends="clean">
        <javac srcdir="src/main/" destdir="build/classes" includeantruntime="false">
            <classpath refid="classpath"/>
        </javac>
    </target>
 
    <target name="generate" depends="clean">
        <echo>Generating the lexer and parser...</echo>
        <java classname="org.antlr.Tool" fork="true" failonerror="true">
            <arg value="-fo"/>
            <arg value="src/main/wally/parser/"/>
            <arg value="src/grammar/Wally.g"/>
            <classpath refid="classpath"/>
        </java>
        <antcall target="compile"/>
    </target>
 
    <target name="jar" depends="generate, compile">
        <jar destfile="Wally.jar" whenmanifestonly="fail">
            <fileset dir="build/classes">
            </fileset>
            <manifest>
                <attribute name="Class-Path" value="lib/antlr-3.3.jar"/>
                <attribute name="Main-Class" value="wally.Wally"/>
            </manifest>
        </jar>
    </target>
 
</project>

It contains 4 targets:

  • clean - removes all compiled files;
  • compile - compiles all Java source files;
  • generate - generates the lexer and parser from the grammar file (Wally.g);
  • jar - bundles all compiled Java files in a JAR file to be able to run the main class more easily.
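
Each target can also be invoked on its own from the project's root directory. For example, to only regenerate and recompile the lexer and parser without building the JAR:

ant generate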

Wally.java

And the main Java file, Wally.java, will look like this:

package wally;
 
import org.antlr.runtime.*;
import wally.parser.*;
import java.io.File;
import java.util.Scanner;
 
public class Wally {
 
  public static WallyParser getParser(String fileName) throws Exception {
    WallyLexer lexer = new WallyLexer(new ANTLRStringStream(contentsOf(fileName)));
    return new WallyParser(new CommonTokenStream(lexer));
  }
 
  // Note that the input must have a trailing line break: otherwise the lexer 
  // will not be able to produce DEDENT tokens once we reach the EOF as you 
  // will see later. So don't trim() the String returned by this method!
  private static String contentsOf(String fileName) throws Exception {
    StringBuilder b = new StringBuilder();
    Scanner file = new Scanner(new File(fileName));
    while(file.hasNextLine()) {
      String line = file.nextLine();
      b.append(line).append('\n');
    }
    file.close();
    return b.toString();
  }
 
  public static void main(String[] args) {
 
    if(args.length == 0) {
      System.err.println("usage: java -jar Wally-0.1.jar SCRIPT-FILE [PARAMS]");
      System.exit(1);
    }
 
    try {
      WallyParser parser = getParser(args[0]);
      parser.parse();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Test.wy

The file Test.wy will hold the Wally source. Copy the following into it:

a = 42

Test run

Okay, the project structure is in place, so let's execute the jar target with Ant. Open a shell/command prompt, navigate to the project's root directory and run:

ant jar

You should see the following output:

Buildfile: .../build.xml

clean:
   [delete] Deleting: .../Wally.jar
   [delete] Deleting directory .../build/classes
    [mkdir] Created dir: .../build/classes

generate:
     [echo] Generating the lexer and parser...

clean:
   [delete] Deleting directory .../build/classes
    [mkdir] Created dir: .../build/classes

compile:
    [javac] Compiling 3 source files to .../build/classes

compile:

jar:
      [jar] Building jar: .../Wally.jar

BUILD SUCCESSFUL

The jar target did all the dull stuff: it generated a lexer and parser from our grammar and put them in the folder src/main/wally/parser, compiled all source files, and packaged everything into a JAR file whose manifest sets the classpath and points to the class containing the main method.
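
If you're curious, peek inside Wally.jar (for example with: unzip -p Wally.jar META-INF/MANIFEST.MF): besides a few version attributes that vary per machine, the generated manifest should contain roughly this:

Manifest-Version: 1.0
Class-Path: lib/antlr-3.3-complete.jar
Main-Class: wally.Wally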

Now run the JAR file and provide the Test.wy script as a parameter:

java -jar Wally.jar src/scripts/Test.wy

If all goes well, nothing happens. This means the lexer and parser had no problem tokenizing and parsing the input file. Not surprising, of course: our grammar tokenizes every character as an Any token, and the parser rule parse accepts zero or more of such Any tokens. No matter what the input is, it will always pass.
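
You can convince yourself of this by feeding it something that is clearly not Wally (a hypothetical throwaway script):

echo 'this is definitely not valid Wally' > src/scripts/Junk.wy
java -jar Wally.jar src/scripts/Junk.wy

Again: no errors, no output.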

In- and dedent tokens

Let's immediately dive into the deep end of the pool by implementing the rule for in- and dedent tokens. An in- or dedent will always occur right after a line break. But we'll also want to create line break tokens, and since an ANTLR lexer rule can only create a single token by default, we'll override some methods that will let us emit more than one token from a single lexer rule. The following should come directly after the lexer::header section:

@lexer::members {
  private Queue<Token> tokens = new LinkedList<Token>();
  private Stack<Integer> indents = new Stack<Integer>();
  private int openBrace = 0;
 
  @Override
  public void emit(Token t) {
    state.token = t;
    tokens.offer(t);
  }
 
  @Override
  public Token nextToken() {
    super.nextToken();
    return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
  }
}
 
// more rules...
 
OBrace     : '{' {openBrace++;};
CBrace     : '}' {openBrace--;};
 
EOL : /* todo */ ;

Notice that I also added a stack that keeps track of the size of the indentation levels, and a counter that keeps track of the number of open list-braces we've encountered: we won't be creating any indent, dedent or line break tokens while openBrace > 0.
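
Wally's list syntax is only introduced later in this series, so take the following as a hypothetical illustration: assuming braces delimit list literals, a list spread over multiple lines, like the snippet below, should produce no EOL, INDENT or DEDENT tokens between the braces, because openBrace is greater than zero there:

a = {1 2
     3 4}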

Now the EOL (end-of-line) rule:

EOL
 : ('\r'? '\n' | '\r') Spaces? // match a line break followed by zero or more spaces
   {
     // Custom code here to inspect what the `Spaces` fragment matched
     // and produce EOL, INDENT or DEDENT tokens (or nothing at all).
     // Scroll down to see what Java code to put here.
   }
 ;
 
fragment Spaces  // replace tabs with 4 spaces
 : (' ' | '\t')+ {setText(getText().replace("\t", "    "));}
 ;

The custom code follows (I put it outside the ANTLR code for nicer formatting):

// Look at the next character in the char-stream
int next = input.LA(1);
 
if(openBrace > 0 || next == '\r' || next == '\n' || next == '#') {
  // If we're inside a list or on a blank line (a commented line is blank 
  // as well!), ignore all indents, dedents and line breaks
  skip();
}
else {
  // Push an EOL token on the token-queue by invoking our overridden 
  // `emit(...)` method
  emit(new CommonToken(EOL, "EOL"));
 
  // Get the size of this line's indentation
  int indent = $Spaces == null ? 0 : $Spaces.text.length();
 
  // Find out how large the previous indentation level was
  int previous = indents.isEmpty() ? 0 : indents.peek();
 
  if(indent == previous) {
    // The indentation didn't change: no INDENT or DEDENT tokens needed
    skip();
  }
  else if(indent > previous) {
    // We encountered an INDENT: put it in the token-queue and remember this
    // indentation size by pushing it onto the stack
    indents.push(indent);
    emit(new CommonToken(INDENT, "INDENT"));
  }
  else {
    // If we get here, it must be a dedent. Keep emitting DEDENT tokens until
    // either the stack is empty, or the value on top of the stack is no longer
    // greater than the current indentation size
    while(!indents.isEmpty() && indents.peek() > indent) {
      emit(new CommonToken(DEDENT, "DEDENT"));
      indents.pop();
    }
  }
}
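
To make this concrete, here's a hand trace of the code above for the three-line input "a\n  b\nc\n" (derived by simply stepping through the branches; nothing new is introduced):

'a' is matched          -> Id(a)
EOL matches "\n  "      -> next is 'b': emit EOL; indent (2) > previous (0): push 2, emit INDENT
'b' is matched          -> Id(b)
EOL matches "\n"        -> next is 'c': emit EOL; indent (0) < previous (2): emit DEDENT, pop 2
'c' is matched          -> Id(c)
EOL matches final "\n"  -> next is EOF: emit EOL; indent (0) == previous (0): skip()

So the resulting token stream is: Id EOL INDENT Id EOL DEDENT Id EOL. Note that the final skip() doesn't retract the EOL that was already queued, which is why equal-indentation line breaks still yield an EOL token.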

That should do the trick. Here's the grammar with all lexer rules and a parse rule that will print all the tokens it encounters on the console:

grammar Wally;
 
@parser::header {
  package wally.parser;
}
 
@lexer::header {
  package wally.parser;
 
  import java.util.Queue;
  import java.util.LinkedList;
}
 
@lexer::members {
  private Stack<Integer> indents = new Stack<Integer>();
  private Queue<Token> tokens = new LinkedList<Token>();
  private int openBrace = 0;
 
  @Override
  public void emit(Token t) {
    state.token = t;
    tokens.offer(t);
  }
 
  @Override
  public Token nextToken() {
    super.nextToken();
    return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
  }
}
 
parse
 : (t=. 
     {System.out.printf("\%-20s \%s\n", 
          tokenNames[$t.type], $t.text.replace("\n", "\\n"));}
   )* EOF
 ;
 
ReservedId
 : 'add'    // + 
 | 'and'    // &&
 | 'div'    // /
 | 'eq'     // ==
 | 'get'    // ID[index]
 | 'gt'     // > 
 | 'gteq'   // >=
 | 'lt'     // <
 | 'lteq'   // <=
 | 'mod'    // %
 | 'mult'   // *
 | 'not'    // !
 | 'or'     // ||
 | 'pow'    // ^
 | 'set'    // ID[index] = expr
 | 'size'   // the size as a number
 | 'str'    // str() is called when concatenating strings
 | 'sub'    // -
 | 'type'   // implemented by default: returns the type as a string
 ;
 
In      : 'in';
New     : 'new';
For     : 'for';
To      : 'to';
Return  : 'return';
None    : 'none';
Obj     : 'obj';
Println : 'println';
Print   : 'print';
True    : 'true';
False   : 'false';
Assert  : 'assert';
If      : 'if';
Elif    : 'elif';
Else    : 'else';
While   : 'while';
Break   : 'break';
This    : 'this';
Use     : 'use';
From    : 'from';
 
AddAssign  : '+=';
SubAssign  : '-=';
MultAssign : '*=';
DivAssign  : '/=';
OrAssign   : '||=';
AndAssign  : '&&=';
PowAssign  : '^=';
ModAssign  : '%=';
Assign     : '=';
Add        : '+';
Sub        : '-';
Mult       : '*';
Div        : '/';
Eq         : '==';
NEq        : '!=';
Or         : '||';
And        : '&&';
Not        : '!';
Mod        : '%';
Pow        : '^';
Lt         : '<';
Gt         : '>';
LtEq       : '<=';
GtEq       : '>=';
OBrace     : '{' {openBrace++;};
CBrace     : '}' {openBrace--;};
 
Num
 : '0'..'9'+ ('.' '0'..'9'*)?
 ;
 
Param
 : '$' '0'..'9'+
 ;
 
Id
 : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
 ;
 
Str
 : '"' ('""' | ~('\r' | '\n' | '"'))* '"'
   {
     String s = getText().replace("\\n", "\n")
                         .replace("\\r", "\r")
                         .replace("\\t", "\t");
     setText(s.substring(1, s.length()-1).replace("\"\"", "\""));
   }
 ;
 
EOL
 : ('\r'? '\n' | '\r') Spaces?
   {
     int next = input.LA(1);
 
     if(openBrace > 0 || next == '\r' || next == '\n' || next == '#') {
       skip();
     }
     else {
       emit(new CommonToken(EOL, "EOL"));
 
       int indent = $Spaces == null ? 0 : $Spaces.text.length();
       int previous = indents.isEmpty() ? 0 : indents.peek();
 
       if(indent == previous) {
         skip();
       }
       else if(indent > previous) {
         indents.push(indent);
         emit(new CommonToken(INDENT, "INDENT"));
       }
       else {
         while(!indents.isEmpty() && indents.peek() > indent) {
           emit(new CommonToken(DEDENT, "DEDENT"));
           indents.pop();
         }
       }
     }
   }
 ;
 
Skip
 : Spaces  {skip();}
 | Comment {skip();}
 ;
 
fragment Spaces
 : (' ' | '\t')+ {setText(getText().replace("\t", "    "));}
 ;
 
fragment Comment
 : '#' ~('\r' | '\n')*
 ;
 
fragment DEDENT : ;
fragment INDENT : ;

The empty fragment rules DEDENT and INDENT at the bottom never match anything themselves: they exist only to define the token types that we pass to CommonToken inside the EOL rule. To test whether the INDENT and DEDENT tokens are properly created, paste the following inside Test.wy:

a
  b
    c
    cc
  bb
    ccc

where we'd expect the following INDENT (>) and DEDENT (<) tokens to be created:

a
> b
  > c
    cc
< bb
  > ccc
< <

Recreate the JAR file and execute the test script:

ant jar
java -jar Wally.jar src/scripts/Test.wy

which would print the following to the console:

Id                   a
EOL                  EOL
INDENT               INDENT
Id                   b
EOL                  EOL
INDENT               INDENT
Id                   c
EOL                  EOL
Id                   cc
EOL                  EOL
DEDENT               DEDENT
Id                   bb
EOL                  EOL
INDENT               INDENT
Id                   ccc
EOL                  EOL
DEDENT               DEDENT
DEDENT               DEDENT

which corresponds to our expectations perfectly!

There are still some lexer rules missing, like parentheses and square brackets. But since these will be discarded from the AST our parser will be creating, we'll simply define such tokens on the fly inside the parser rules. Which brings us to the next part of this tutorial: the syntactic analysis of Wally.

Download the project created so far here.
