In this part we'll be creating the lexer of Wally. Before writing any code, or grammar, let's set up a small project structure with an Ant build file that will handle the generating of the parser and lexer, the compilation of the source files and creation of an executable JAR file.
Create the following folder structure:
+-- build/ | +-- lib/ | | | +-- antlr-3.3-complete.jar | +-- src/ | | | +-- grammar/ | | | | | +-- Wally.g | | | +-- main/ | | | | | +-- wally/ | | | | | +-- parser/ | | | | | +-- Wally.java | | | +-- scripts/ | | | +-- Test.wy | +-- build.xml
Download version 3.3 of ANTLR here: http://www.antlr.org/download/antlr-3.3-complete.jar
Wally.g
Initialize Wally.g with the following:
grammar Wally;
@parser::header {
package wally.parser;
}
@lexer::header {
package wally.parser;
}
parse
: .* EOF
;
Any
: .
;
This is just a simple stub, of course: we'll be filling it later on.
build.xml
Copy the following contents inside the build.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project name="Wally">
<path id="classpath">
<pathelement location="build/classes"/>
<fileset dir="lib">
<include name="*.jar"/>
</fileset>
</path>
<target name="clean">
<delete file="Wally.jar"/>
<delete dir="build/classes/"/>
<mkdir dir="build/classes/"/>
</target>
<target name="compile" depends="clean">
<javac srcdir="src/main/" destdir="build/classes" includeantruntime="false">
<classpath refid="classpath"/>
</javac>
</target>
<target name="generate" depends="clean">
<echo>Generating the lexer and parser...</echo>
<java classname="org.antlr.Tool" fork="true" failonerror="true">
<arg value="-fo"/>
<arg value="src/main/wally/parser/"/>
<arg value="src/grammar/Wally.g"/>
<classpath refid="classpath"/>
</java>
<antcall target="compile"/>
</target>
<target name="jar" depends="generate, compile">
<jar destfile="Wally.jar" whenmanifestonly="fail">
<fileset dir="build/classes">
</fileset>
<manifest>
<attribute name="Class-Path" value="lib/antlr-3.3.jar"/>
<attribute name="Main-Class" value="wally.Wally"/>
</manifest>
</jar>
</target>
</project>
It contains 4 targets:
clean- removes all compiled files;compile- compiles all Java source files;generate- generates the lexer and parser from the grammar file (Wally.g);jar- bundles all compiled Java files in a JAR file to be able to run the main class more easily.
Wally.java
And the main Java file, Wally.java, will look like this:
package wally;
import org.antlr.runtime.*;
import wally.parser.*;
import java.io.File;
import java.util.Scanner;
public class Wally {
public static WallyParser getParser(String fileName) throws Exception {
WallyLexer lexer = new WallyLexer(new ANTLRStringStream(contentsOf(fileName)));
return new WallyParser(new CommonTokenStream(lexer));
}
// Note that the input must have a trailing line break: otherwise the lexer
// will not be able to produce DEDENT tokens once we reach the EOF as you
// will see later. So don't trim() the String returned by this method!
private static String contentsOf(String fileName) throws Exception {
StringBuilder b = new StringBuilder();
Scanner file = new Scanner(new File(fileName));
while(file.hasNextLine()) {
String line = file.nextLine();
b.append(line).append('\n');
}
return b.toString();
}
public static void main(String[] args) {
if(args.length == 0) {
System.err.println("usage: java -jar Wally-0.1.jar SCRIPT-FILE [PARAMS]");
System.exit(1);
}
try {
WallyParser parser = getParser(args[0]);
parser.parse();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Test.wy
The file Test.wy will hold the Wally source. Copy the following into it:
a = 42
Test run
Okay, the project structure is in place, let's execute the jar target with Ant. Open a shell/command, navigate to the project's root directory and do:
ant jar
You should see the following output:
Buildfile: .../build.xml
clean:
[delete] Deleting: .../Wally.jar
[delete] Deleting directory .../build/classes
[mkdir] Created dir: .../build/classes
generate:
[echo] Generating the lexer and parser...
clean:
[delete] Deleting directory .../build/classes
[mkdir] Created dir: .../build/classes
compile:
[javac] Compiling 3 source files to .../build/classes
compile:
jar:
[jar] Building jar: .../Wally.jar
BUILD SUCCESSFUL
The jar target did all the dull stuff: it generated a parser and lexer from our grammar and put it in the folder src/main/wally/parser, compiled all source files, and packaged it all in a JAR file whose classpath was properly set, and it also points to the class containing the main method.
Now run the JAR file and provide the Test.wy script as a parameter:
java -jar Wally.jar src/scripts/Test.wy
If all goes well, nothing happens. This means the lexer and parser had no problem tokenizing and parsing the input file. Not surprising of course: our grammar tokenizes every character as an Any token, and the parser rule parse accepts zero or more of such Any tokens. No matter what the input is, it would always pass.
in- and dedent tokens
Let's immediately dive into the deep end of the pool by implementing the rule for in- and dedent tokens. An in- or dedent will always occur right after a line break. But we'll also want to create line break tokens, and since an ANTLR lexer rule can always create a single token, we'll override some methods that will let us emit more than a single token inside a single lexer rule. The following should come directly after the lexer::header section:
@lexer::members {
private Queue<Token> tokens = new LinkedList<Token>();
private Stack<Integer> indents = new Stack<Integer>();
private int openBrace = 0;
@Override
public void emit(Token t) {
state.token = t;
tokens.offer(t);
}
@Override
public Token nextToken() {
super.nextToken();
return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
}
}
// more rules...
OBrace : '{' {openBrace++;};
CBrace : '}' {openBrace--;};
EOL : /* todo */ ;
Notice that I also added a stack that keep track of the size of the indentation levels and a counter that will keep track of the number of open list-braces we've encountered: we won't be creating any in-, dedent and line break tokens when openBrace > 0.
Now the EOL (end-of-line) rule:
EOL
: ('\r'? '\n' | '\r') Spaces? // match a line break followed by zero or more spaces
{
// Custom code here to inspect what the `Spaces` fragment matched
// and produce EOL, INDENT or DEDENT tokens (or nothing at all).
// Scroll down to see what Java code to put here.
}
;
fragment Spaces // replace tabs with 4 spaces
: (' ' | '\t')+ {setText(getText().replace("\t", " "));}
;
The custom code follows (I put it outside the ANTLR code for nicer formatting):
// Look at the next character in the char-stream
int next = input.LA(1);
if(openBrace > 0 || next == '\r' || next == '\n' || next == '#') {
// If we're inside a list or on a blank line (a commented line is blank
// as well!), ignore all indents, dedents and line breaks
skip();
}
else {
// Push an EOL token on the token-queue by invoking our overridden
// `emit(...)` method
emit(new CommonToken(EOL, "EOL"));
// Get the length of the indents
int indent = $Spaces == null ? 0 : $Spaces.text.length();
// Find out how much indents the previous INDENT token was
int previous = indents.isEmpty() ? 0 : indents.peek();
if(indent == previous) {
// Skip indents of the same size as the present indent-size
skip();
}
else if(indent > previous) {
// We encountered an INDENT, add it in the token-queue and remember the number
// of indents by pushing it on the stack
indents.push(indent);
emit(new CommonToken(INDENT, "INDENT"));
}
else {
// If we get here, it must be a dedent token. Keep emitting DEDENT tokens
// until either the queue is empty, or the last value on the stack is more than
// the current indent-value
while(!indents.isEmpty() && indents.peek() > indent) {
emit(new CommonToken(DEDENT, "DEDENT"));
indents.pop();
}
}
}
That should do the trick. Here's the grammar with all lexer rules and a parse rule that will print all the tokens it encounters on the console:
grammar Wally;
@parser::header {
package wally.parser;
}
@lexer::header {
package wally.parser;
import java.util.Queue;
import java.util.LinkedList;
}
@lexer::members {
private Stack<Integer> indents = new Stack<Integer>();
private Queue<Token> tokens = new LinkedList<Token>();
private int openBrace = 0;
@Override
public void emit(Token t) {
state.token = t;
tokens.offer(t);
}
@Override
public Token nextToken() {
super.nextToken();
return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
}
}
parse
: (t=.
{System.out.printf("\%-20s \%s\n",
tokenNames[$t.type], $t.text.replace("\n", "\\n"));}
)* EOF
;
ReservedId
: 'add' // +
| 'and' // &&
| 'div' // /
| 'eq' // ==
| 'get' // ID[index]
| 'gt' // >
| 'gteq' // >=
| 'lt' // <
| 'lteq' // <=
| 'mod' // %
| 'mult' // *
| 'not' // !
| 'or' // ||
| 'pow' // ^
| 'set' // ID[index] = expr
| 'size' // the size as a number
| 'str' // str() is called when concatenating strings
| 'sub' // -
| 'type' // implemented by default: returns the type as a string
;
In : 'in';
New : 'new';
For : 'for';
To : 'to';
Return : 'return';
None : 'none';
Obj : 'obj';
Println : 'println';
Print : 'print';
True : 'true';
False : 'false';
Assert : 'assert';
If : 'if';
Elif : 'elif';
Else : 'else';
While : 'while';
Break : 'break';
This : 'this';
Use : 'use';
From : 'from';
AddAssign : '+=';
SubAssign : '-=';
MultAssign : '*=';
DivAssign : '/=';
OrAssign : '||=';
AndAssign : '&&=';
PowAssign : '^=';
ModAssign : '%=';
Assign : '=';
Add : '+';
Sub : '-';
Mult : '*';
Div : '/';
Eq : '==';
NEq : '!=';
Or : '||';
And : '&&';
Not : '!';
Mod : '%';
Pow : '^';
Lt : '<';
Gt : '>';
LtEq : '<=';
GtEq : '>=';
OBrace : '{' {openBrace++;};
CBrace : '}' {openBrace--;};
Num
: '0'..'9'+ ('.' '0'..'9'*)?
;
Param
: '$' '0'..'9'+
;
Id
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Str
: '"' ('""' | ~('\r' | '\n' | '"'))* '"'
{
String s = getText().replace("\\n", "\n")
.replace("\\r", "\r")
.replace("\\t", "\t");
setText(s.substring(1, s.length()-1).replace("\"\"", "\""));
}
;
EOL
: ('\r'? '\n' | '\r') Spaces?
{
int next = input.LA(1);
if(openBrace > 0 || next == '\r' || next == '\n' || next == '#') {
skip();
}
else {
emit(new CommonToken(EOL, "EOL"));
int indent = $Spaces == null ? 0 : $Spaces.text.length();
int previous = indents.isEmpty() ? 0 : indents.peek();
if(indent == previous) {
skip();
}
else if(indent > previous) {
indents.push(indent);
emit(new CommonToken(INDENT, "INDENT"));
}
else {
while(!indents.isEmpty() && indents.peek() > indent) {
emit(new CommonToken(DEDENT, "DEDENT"));
indents.pop();
}
}
}
}
;
Skip
: Spaces {skip();}
| Comment {skip();}
;
fragment Spaces
: (' ' | '\t')+ {setText(getText().replace("\t", " "));}
;
fragment Comment
: '#' ~('\r' | '\n')*
;
fragment DEDENT : ;
fragment INDENT : ;
To test if the INDENT and DEDENT tokens are properly created, paste the following inside Test.wy:
a
b
c
cc
bb
ccc
where we'd expect the following INDENT (>) and DEDENT (<) tokens to be created:
a
> b
> c
cc
< bb
> ccc
< <
Recreate the JAR file and execute the test script:
ant jar java -jar Wally.jar src/scripts/Test.wy
which would print the following to the console:
Id a EOL EOL INDENT INDENT Id b EOL EOL INDENT INDENT Id c EOL EOL Id cc EOL EOL DEDENT DEDENT Id bb EOL EOL INDENT INDENT Id ccc EOL EOL DEDENT DEDENT DEDENT DEDENT
which corresponds to our expectations perfectly!
There are still some lexer rules missing, like parenthesis and square brackets. But since these will be discarded from the AST our parser will be creating, we'll create such tokens on the fly inside parser rules. Which brings us to the next part of this tutorial: the syntactic analysis of Wally.
Download the project created so far here.
No comments:
Post a Comment