
TABLE OF CONTENTS

1. Table of Contents
2. Introduction
   1. Summary
   2. About The Author
   3. Before We Begin
3. Overview
   1. The Four Parts of a Language
   2. Meet Awesome: Our Toy Language
4. Lexer
   1. Lex (Flex)
   2. Ragel
   3. Python Style Indentation For Awesome
   4. Do It Yourself I
5. Parser
   1. Bison (Yacc)
   2. Lemon
   3. ANTLR
   4. PEGs
   5. Operator Precedence
   6. Connecting The Lexer and Parser in Awesome
   7. Do It Yourself II
6. Runtime Model
   1. Procedural
   2. Class-based
   3. Prototype-based
   4. Functional
   5. Our Awesome Runtime
   6. Do It Yourself III
7. Interpreter
   1. Do It Yourself IV
8. Compilation
   1. Using LLVM from Ruby
   2. Compiling Awesome to Machine Code
9. Virtual Machine
   1. Byte-code
   2. Types of VM
   3. Prototyping a VM in Ruby
10. Going Further
   1. Homoiconicity
   2. Self-Hosting
   3. What’s Missing?
11. Resources
   1. Books & Papers
   2. Events
   3. Forums and Blogs
   4. Interesting Languages
12. Solutions to Do It Yourself
   1. Solutions to Do It Yourself I
   2. Solutions to Do It Yourself II
   3. Solutions to Do It Yourself III
   4. Solutions to Do It Yourself IV
13. Appendix: Mio, a minimalist homoiconic language
   1. Homoicowhat?
   2. Messages all the way down
   3. The Runtime
   4. Implementing Mio in Mio
   5. But it’s ugly
14. Farewell!

Published November 2011. Cover background image © Asja Boros. Content of this book is © Marc-André Cournoyer. All rights reserved. This eBook copy is for a single user. You may not share it in any way unless you have written permission of the author.

INTRODUCTION

When you don’t create things, you become defined by your tastes rather than ability. Your tastes only narrow & exclude people. So create.
- Why the Lucky Stiff

Creating a programming language is the perfect mix of art and science. You’re creating a way to express yourself, but at the same time applying computer science principles to implement it. Since we aren’t the first ones to create a programming language, some well established tools are around to ease most of the exercise. Nevertheless, it can still be hard to create a fully functional language because it’s impossible to predict all the ways in which someone will use it. That’s why making your own language is such a great experience. You never know what someone else might create with it!

I’ve written this book to help other developers discover the joy of creating a programming language. Coding my first language was one of the most amazing experiences in my programming career. I hope you’ll enjoy reading this book, but mostly, I hope you’ll write your own programming language.

If you find an error or have a comment or suggestion while reading the following pages, please send me an email at macournoyer@gmail.com.

SUMMARY

This book is divided into ten sections that will walk you through each step of language-building. Each section will introduce a new concept and then apply its principles to a language that we’ll build together throughout the book. All technical chapters end with a Do It Yourself section that suggests some language-extending exercises. You’ll find solutions to those at the end of this book.

Our language will be dynamic and very similar to Ruby and Python. All of the code will be in Ruby, but I’ve taken great care to keep the code as simple as possible so that you can understand what’s happening even if you don’t know Ruby. The focus of this book is not on how to build a production-ready language. Instead, it should serve as an introduction to building your first toy language.

ABOUT THE AUTHOR

I’m Marc-André Cournoyer, a coder from Montréal, Québec passionate about programming languages and tequila, but usually not at the same time. I coded tinyrb, the smallest Ruby Virtual Machine, Min, a toy language running on the JVM, Thin, the high performance Ruby web server, and a bunch of other stuff. You can find most of my projects on GitHub. You can find me online, blogging or tweeting and offline, snowboarding and learning guitar.

BEFORE WE BEGIN

You should have received a code.zip file with this book including all the code samples. To run the examples you must have the following installed:

▪ Ruby 1.8.7 or 1.9.2
▪ Racc 1.4.6, install with: gem install racc -v=1.4.6 (optional, to recompile the parser in the exercises).

Other versions might work, but the code was tested with those.

OVERVIEW

Although the way we design a language has evolved since its debuts, most of the core principles haven’t changed. Contrary to Web Applications, where we’ve seen a growing number of frameworks, languages are still built from a lexer, a parser and a compiler. Some of the best books in this sphere have been written a long time ago: Principles of Compiler Design was published in 1977 and Smalltalk-80: The Language and its Implementation in 1983.

THE FOUR PARTS OF A LANGUAGE

Most dynamic languages are designed in four parts that work in sequence: the lexer, the parser, the interpreter and the runtime. Each one transforms the input of its predecessor until the code is run. Figure 1 shows an overview of this process. A chapter of this book is dedicated to each part.

Figure 1
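To make the four-part sequence concrete, here is a minimal sketch of the whole pipeline for expressions such as "1 + 2 + 3". It is not the Awesome implementation built later in the book: the names lex, parse, interpret and RUNTIME are hypothetical and chosen only for illustration, and the "tree" is a plain nested array.

```ruby
# Hypothetical mini-pipeline; none of these names come from the book's code.

# Lexer: source text -> array of [type, value] tokens.
def lex(source)
  source.scan(/\d+|\+/).map do |text|
    text == "+" ? [:PLUS, text] : [:NUMBER, text.to_i]
  end
end

# Parser: tokens -> nested arrays used as a tiny tree, e.g. [:add, left, right].
def parse(tokens)
  tokens = tokens.dup
  tree = [:number, tokens.shift[1]]
  until tokens.empty?
    tokens.shift # drop the :PLUS token
    tree = [:add, tree, [:number, tokens.shift[1]]]
  end
  tree
end

# Runtime: defines what the operations of the language actually do.
RUNTIME = { add: ->(a, b) { a + b } }

# Interpreter: walks the tree and asks the runtime to do the work.
def interpret(node)
  case node[0]
  when :number then node[1]
  when :add    then RUNTIME[:add].call(interpret(node[1]), interpret(node[2]))
  end
end

tokens = lex("1 + 2 + 3")
tree   = parse(tokens)
puts interpret(tree)
```

Each stage only consumes the output of the previous one, which is why the parts can be studied, and replaced, one at a time.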

MEET AWESOME: OUR TOY LANGUAGE

The language we’ll be coding in this book is called Awesome, because it is! It’s a mix of Ruby syntax and Python’s indentation:

    class Awesome:
      def name:
        "I'm Awesome"

      def awesomeness:
        100

    awesome = Awesome.new
    print(awesome.name)
    print(awesome.awesomeness)

A couple of rules for our language:

▪ As in Python, blocks of code are delimited by their indentation.
▪ Classes are declared with the class keyword.
▪ Methods can be defined anywhere using the def keyword.
▪ Identifiers starting with a capital letter are constants which are globally accessible.
▪ Lower-case identifiers are local variables or method names.
▪ If a method takes no arguments, parentheses can be skipped, much like in Ruby.
▪ The last value evaluated in a method is its return value.
▪ Everything is an object.

Some parts will be incomplete, but the goal with a toy language is to educate, experiment and set the foundation on which to build more.

LEXER

The lexer, or scanner, or tokenizer is the part of a language that converts the input, the code you want to execute, into tokens the parser can understand. Let’s say you have the following code:

    print "I ate",
          3,
          pies

Once this code goes through the lexer, it will look something like this:

    [IDENTIFIER print] [STRING "I ate"] [COMMA]
                       [NUMBER 3] [COMMA]
                       [IDENTIFIER pies]

What the lexer does is split the code and tag each part with the type of token it contains. This makes it easier for the parser to operate since it doesn’t have to bother with details such as parsing a floating point number or parsing a complex string with escape sequences (\n, \t, etc.).

Lexers can be implemented using regular expressions, but more appropriate tools exist.

LEX (FLEX)

Flex is a modern version of Lex (which was co-written by Eric Schmidt, later CEO of Google, by the way) for generating C lexers. Along with Yacc, Lex is the most commonly used lexer for parsing. It has been ported to several target languages.

▪ Rex for Ruby
▪ JFlex for Java

Lex and friends are not lexers per se. They are lexer compilers. You supply a grammar and they output a lexer. Here’s what that grammar looks like:

    BLANK [ \t\n]+

    %%

    // Whitespace
    {BLANK}      /* ignore */

    // Literals
    [0-9]+       yylval = atoi(yytext); return T_NUMBER;

    // Keywords
    "end"        yylval = yytext; return T_END;
    // ...

On the left side a regular expression defines how the token is matched. On the right side, the action to take. The value of the token is stored in yylval and the type of token is returned. More details in the Flex manual.

A Ruby equivalent, using the rexical gem (a port of Lex to Ruby), would be:

    macro
      BLANK         [\ \t]+

    rule
      # Whitespace
      {BLANK}       # ignore

      # Literals
      [0-9]+        { [:NUMBER, text.to_i] }

      # Keywords
      end           { [:END, text] }

Rexical follows a grammar similar to Lex’s: regular expression on the left and action on the right. However, an array of two items is used to return the type and value of the matched token. More details on the rexical project page.

RAGEL

A powerful tool for creating a scanner is Ragel. It’s described as a State Machine Compiler: lexers, like regular expressions, are state machines. Being very flexible, Ragel grammars can handle languages of varying complexity and output a parser in several languages. Here’s what a Ragel grammar looks like:

    %%{
      machine lexer;

      # Machine
      number      = [0-9]+;
      whitespace  = " ";
      keyword     = "end" | "def" | "class" | "if" | "else" | "true" | "false" | "nil";

      # Actions
      main := |*
        whitespace;  # ignore
        number       => { tokens << [:NUMBER, data[ts...te].to_i] };
        keyword      => { tokens << [data[ts...te].upcase.to_sym, data[ts...te]] };
      *|;
    }%%

    class Lexer
      def initialize
        %% write data;
      end

      def run(data)
        eof = data.size
        line = 1
        tokens = []
        %% write init;
        %% write exec;
        tokens
      end
    end

More details in the Ragel manual (PDF). Here are a few real-world examples of Ragel grammars used as language lexers:

▪ Min’s lexer (Java)
▪ Potion’s lexer (C)

PYTHON STYLE INDENTATION FOR AWESOME

If you intend to build a fully functioning language, you should use one of the two previous tools. Since Awesome is a simplistic language and we want to illustrate the basic concepts of a scanner, we will build the lexer from scratch using regular expressions.

To make things more interesting, we’ll use indentation to delimit blocks in our toy language, as in Python. All of the indentation magic takes place within the lexer. Parsing blocks of code delimited with { ... } is no different from parsing indentation when you know how to do it.

Tokenizing the following Python code:

    if tasty == True:
      print "Delicious!"

will yield these tokens:

    [IDENTIFIER if] [IDENTIFIER tasty] [EQUAL] [IDENTIFIER True]
    [INDENT] [IDENTIFIER print] [STRING "Delicious!"]
    [DEDENT]

The block is wrapped in INDENT and DEDENT tokens instead of { and }.

The indentation-parsing algorithm is simple. You need to track two things: the current indentation level and the stack of indentation levels. When you encounter a line break followed by spaces, you update the indentation level. Here’s our lexer for the Awesome language:

lexer.rb

    class Lexer
      KEYWORDS = ["def", "class", "if", "true", "false", "nil"]

      def tokenize(code)
        # Cleanup code by removing extra line breaks
        code.chomp!

        # Current character position we're parsing
        i = 0

        # Collection of all parsed tokens in the form [:TOKEN_TYPE, value]
        tokens = []

        # Current indent level is the number of spaces in the last indent.
        current_indent = 0
        # We keep track of the indentation levels we are in so that when we dedent, we can
        # check if we're on the correct level.
        indent_stack = []

        # This is how to implement a very simple scanner.
        # Scan one character at a time until you find something to parse.
        while i < code.size
          chunk = code[i..-1]

          # Matching standard tokens.
          #
          # Matching if, print, method names, etc.
          if identifier = chunk[/\A([a-z]\w*)/, 1]
            # Keywords are special identifiers tagged with their own name, 'if' will result
            # in an [:IF, "if"] token
            if KEYWORDS.include?(identifier)
              tokens << [identifier.upcase.to_sym, identifier]
            # Non-keyword identifiers include method and variable names.
            else
              tokens << [:IDENTIFIER, identifier]
            end
            # skip what we just parsed
            i += identifier.size

          # Matching class names and constants starting with a capital letter.
          elsif constant = chunk[/\A([A-Z]\w*)/, 1]
            tokens << [:CONSTANT, constant]
            i += constant.size

          elsif number = chunk[/\A([0-9]+)/, 1]
            tokens << [:NUMBER, number.to_i]
            i += number.size

          elsif string = chunk[/\A"(.*?)"/, 1]
            tokens << [:STRING, string]
            i += string.size + 2

          # Here's the indentation magic!
          #
          # We have to take care of 3 cases:
          #
          #   if true:  # 1) the block is created
          #     line 1
          #     line 2  # 2) new line inside a block
          #   continue  # 3) dedent
          #
          # This elsif takes care of the first case. The number of spaces will determine
          # the indent level.
          elsif indent = chunk[/\A\:\n( +)/m, 1] # Matches ": <newline> <spaces>"
            # When we create a new block we expect the indent level to go up.
            if indent.size <= current_indent
              raise "Bad indent level, got #{indent.size} indents, " +
                    "expected > #{current_indent}"
            end
            # Adjust the current indentation level.
            current_indent = indent.size
            indent_stack.push(current_indent)
            tokens << [:INDENT, indent.size]
            i += indent.size + 2

          # This elsif takes care of the two last cases:
          # Case 2: We stay in the same block if the indent level (number of spaces) is the
          #         same as current_indent.
          # Case 3: Close the current block, if indent level is lower than current_indent.
          elsif indent = chunk[/\A\n( *)/m, 1] # Matches "<newline> <spaces>"
            if indent.size == current_indent # Case 2
              # Nothing to do, we're still in the same block
              tokens << [:NEWLINE, "\n"]
            elsif indent.size < current_indent # Case 3
              while indent.size < current_indent
                indent_stack.pop
                # The enclosing block is the one now on top of the stack.
                current_indent = indent_stack.last || 0
                tokens << [:DEDENT, indent.size]
              end
              tokens << [:NEWLINE, "\n"]
            else # indent.size > current_indent, error!
              # Cannot increase indent level without using ":", so this is an error.
              raise "Missing ':'"
            end
            i += indent.size + 1

          # Match long operators such as ||, &&, ==, !=, <= and >=.
          # One character long operators are matched by the catch all `else` at the bottom.
          elsif operator = chunk[/\A(\|\||&&|==|!=|<=|>=)/, 1]
            tokens << [operator, operator]
            i += operator.size

          # Ignore whitespace
          elsif chunk.match(/\A /)
            i += 1

          # Catch all single characters
          # We treat all other single characters as a token. Eg.: ( ) , . ! + - <
          else
            value = chunk[0,1]
            tokens << [value, value]
            i += 1

          end

        end

        # Close all open blocks
        while indent = indent_stack.pop
          tokens << [:DEDENT, indent_stack.last || 0]
        end

        tokens
      end
    end

You can test the lexer yourself by running the test file included with the book. Run ruby -Itest test/lexer_test.rb from the code directory and it should output 0 failures, 0 errors. Here’s an excerpt from that test file.

test/lexer_test.rb

    code = <<-CODE
    if 1:
      print "..."
      if false:
        pass
      print "done!"
    print "The End"
    CODE
    tokens = [
      [:IF, "if"], [:NUMBER, 1],
      [:INDENT, 2],
        [:IDENTIFIER, "print"], [:STRING, "..."], [:NEWLINE, "\n"],
        [:IF, "if"], [:FALSE, "false"],
        [:INDENT, 4],
          [:IDENTIFIER, "pass"],
        [:DEDENT, 2], [:NEWLINE, "\n"],
        [:IDENTIFIER, "print"], [:STRING, "done!"],
      [:DEDENT, 0], [:NEWLINE, "\n"],
      [:IDENTIFIER, "print"], [:STRING, "The End"]
    ]
    assert_equal tokens, Lexer.new.tokenize(code)

Some parsers take care of both lexing and parsing in their grammar. We’ll see more about those in the next section.

DO IT YOURSELF I

a. Modify the lexer to parse: while condition: ... control structures.
b. Modify the lexer to delimit blocks with { ... } instead of indentation.

Solutions to Do It Yourself I.
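The indentation algorithm described in this chapter (a current level plus a stack of open levels) can be easier to follow in isolation. The following is a minimal sketch under simplifying assumptions: it works line by line, ignores every token other than INDENT and DEDENT, and does not check for the ":" that opens a block. The helper name indent_tokens is hypothetical and is not part of the book’s lexer.

```ruby
# Hypothetical helper: emits only the block tokens for a piece of code.
def indent_tokens(code)
  stack  = [] # indentation levels of the blocks currently open
  tokens = []

  code.each_line do |line|
    next if line.strip.empty?          # blank lines do not change the level
    indent  = line[/\A */].size        # number of leading spaces
    current = stack.last || 0

    if indent > current                # a new block is opened
      stack.push(indent)
      tokens << [:INDENT, indent]
    else                               # same block, or one or more blocks closed
      while indent < (stack.last || 0)
        stack.pop
        tokens << [:DEDENT, stack.last || 0]
      end
    end
  end

  # Close every block still open at the end of the input.
  tokens << [:DEDENT, stack.last || 0] while stack.pop
  tokens
end

code = "if 1:\n  a\n  if 2:\n    b\n  c\nd\n"
p indent_tokens(code)
```

The level to return to after a DEDENT is always the one on top of the stack once the closed block has been popped, which is why the sketch reads stack.last.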

PARSER

By themselves, the tokens output by the lexer are just building blocks. The parser contextualizes them by organizing them in a structure. The lexer produces an array of tokens, the parser produces a tree of nodes.

Let’s take those tokens from the previous section:

    [IDENTIFIER print] [STRING "I ate"] [COMMA]
                       [NUMBER 3] [COMMA]
                       [IDENTIFIER pies]

The most common parser output is an Abstract Syntax Tree, or AST. It’s a tree of nodes that represents what the code means to the language. The previous lexer tokens will produce the following:

    [<Call name=print,
           arguments=[<String value="I ate">,
                      <Number value=3>,
                      <Local name=pies>]
    >]

Or as a visual tree:

Figure 2

The parser found that print was a method call and the following tokens are the arguments.

Parser generators are commonly used to accomplish the otherwise tedious task of building a parser. Much like the English language, a programming language needs a grammar to define its rules. The parser generator will convert this grammar into a parser that will compile lexer tokens into AST nodes.

BISON (YACC)

Bison is a modern version of Yacc, the most widely used parser. Yacc stands for Yet Another Compiler Compiler, because it compiles the grammar to a compiler of tokens. It’s used in several mainstream languages, like Ruby. Most often used with Lex, it has been ported to several target languages.

▪ Racc for Ruby
▪ Ply for Python
▪ JavaCC for Java

Like Lex, from the previous chapter, Yacc compiles a grammar into a parser. Here’s how a Yacc grammar rule is defined:

    Call: /* Name of the rule */
        Expression '.' IDENTIFIER                  { $$ = CallNode_new($1, $3, NULL); }
      | Expression '.' IDENTIFIER '(' ArgList ')'  { $$ = CallNode_new($1, $3, $5); }
      /*  $1       $2  $3         $4  $5      $6   <= values from the rule are stored in
                                                      these variables. */
      ;

On the left is defined how the rule can be matched using tokens and other rules. On the right side, between braces, is the action to execute when the rule matches. In that block, we can reference tokens being matched using $1, $2, etc. Finally, we store the result in $$.

LEMON

Lemon is quite similar to Yacc, with a few differences. From its website:

▪ Using a different grammar syntax which is less prone to programming errors.
▪ The parser generated by Lemon is both re-entrant and thread-safe.
▪ Lemon includes the concept of a non-terminal destructor, which makes it much easier to write a parser that does not leak memory.

For more information, refer to the manual or check real examples inside Potion.

ANTLR

ANTLR is another parsing tool. This one lets you declare lexing and parsing rules in the same grammar. It has been ported to several target languages.

PEGS

Parsing Expression Grammars, or PEGs, are very powerful at parsing complex languages. I’ve used a PEG generated from peg/leg in tinyrb to parse Ruby’s infamous syntax with encouraging results (tinyrb’s grammar). Treetop is an interesting Ruby tool for creating PEGs.

OPERATOR PRECEDENCE

One of the common pitfalls of language parsing is operator precedence. Parsing x + y * z should not produce the same result as (x + y) * z, same for all other operators. Each language has an operator precedence table, often based on the mathematical order of operations. Several ways to handle this exist. Yacc-based parsers handle it much like the Shunting Yard algorithm does: you give a precedence level to each kind of operator. Operators are declared in Bison and Yacc with %left and %right macros. Read more in Bison’s manual.

Here’s the operator precedence table for our language, based on the C language operator precedence:

    left  '.'
    right '!'
    left  '*' '/'
    left  '+' '-'
    left  '>' '>=' '<' '<='
    left  '==' '!='
    left  '&&'
    left  '||'
    right '='
    left  ','

The higher the precedence (top is higher), the sooner the operator will be parsed. If the line a + b * c is being parsed, the part b * c will be parsed first since * has higher precedence than +. Now, if several operators having the same precedence are competing to be parsed all at once, the conflict is resolved using associativity, declared with the left and right keyword before the token. For example, with the expression a = b = c: since = has right-to-left associativity, it will start parsing from the right, b = c, resulting in a = (b = c).

For other types of parsers (ANTLR and PEG) a simpler but less efficient alternative can be used. Simply declaring the grammar rules in the right order will produce the desired result:

    expression:     equality
    equality:       additive ( ( '==' | '!=' ) additive )*
    additive:       multiplicative ( ( '+' | '-' ) multiplicative )*
    multiplicative: primary ( ( '*' | '/' ) primary )*
    primary:        '(' expression ')' | NUMBER | VARIABLE | '-' primary

The parser will try to match rules recursively, starting from expression and finding its way to primary. Since multiplicative is the last rule called in the parsing process, it will have greater precedence.

CONNECTING THE LEXER AND PARSER IN AWESOME

For our Awesome parser we’ll use Racc, the Ruby version of Yacc. It’s much harder to build a parser from scratch than it is to create a lexer. However, most languages end up writing their own parser because the result is faster and provides better error reporting.

The input file you supply to Racc contains the grammar of your language and is very similar to a Yacc grammar.

grammar.y

    class Parser

    # Declare tokens produced by the lexer
    token IF ELSE
    token DEF
    token CLASS
    token NEWLINE
    token NUMBER
    token STRING
    token TRUE FALSE NIL
    token IDENTIFIER
    token CONSTANT
    token INDENT DEDENT

    # Precedence table
    # Based on http://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B#Operator_precedence
    prechigh
      left  '.'
      right '!'
      left  '*' '/'
      left  '+' '-'
      left  '>' '>=' '<' '<='
      left  '==' '!='
      left  '&&'
      left  '||'
      right '='
      left  ','
    preclow

    rule
      # All rules are declared in this format:
      #
      #   RuleName:
      #     OtherRule TOKEN AnotherRule    { code to run when this matches }
      #   | OtherRule                      { ... }
      #   ;
      #
      # In the code section (inside the {...} on the right):
      # - Assign to "result" the value returned by the rule.
      # - Use val[index of expression] to reference expressions on the left.

      # All parsing will end in this rule, being the trunk of the AST.
      Root:
        /* nothing */                      { result = Nodes.new([]) }
      | Expressions                        { result = val[0] }
      ;

      # Any list of expressions, class or method body, separated by line breaks.
      Expressions:
        Expression                         { result = Nodes.new(val) }
      | Expressions Terminator Expression  { result = val[0] << val[2] }
        # To ignore trailing line breaks
      | Expressions Terminator             { result = val[0] }
      | Terminator                         { result = Nodes.new([]) }
      ;

      # All types of expressions in our language
      Expression:
        Literal
      | Call
      | Operator
      | Constant
      | Assign
      | Def
      | Class
      | If
      | '(' Expression ')'    { result = val[1] }
      ;

      # All tokens that can terminate an expression
      Terminator:
        NEWLINE
      | ";"
      ;

      # All hard-coded values
      Literal:
        NUMBER                        { result = NumberNode.new(val[0]) }
      | STRING                        { result = StringNode.new(val[0]) }
      | TRUE                          { result = TrueNode.new }
      | FALSE                         { result = FalseNode.new }
      | NIL                           { result = NilNode.new }
      ;

      # A method call
      Call:
        # method
        IDENTIFIER                    { result = CallNode.new(nil, val[0], []) }
        # method(arguments)
      | IDENTIFIER "(" ArgList ")"    { result = CallNode.new(nil, val[0], val[2]) }
        # receiver.method
      | Expression "." IDENTIFIER     { result = CallNode.new(val[0], val[2], []) }
        # receiver.method(arguments)
      | Expression "."
          IDENTIFIER "(" ArgList ")"  { result = CallNode.new(val[0], val[2], val[4]) }
      ;

      ArgList:
        /* nothing */                 { result = [] }
      | Expression                    { result = val }
      | ArgList "," Expression        { result = val[0] << val[2] }
      ;

      Operator:
      # Binary operators
        Expression '||' Expression    { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '&&' Expression    { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '==' Expression    { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '!=' Expression    { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '>' Expression     { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '>=' Expression    { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '<' Expression     { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '<=' Expression    { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '+' Expression     { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '-' Expression     { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '*' Expression     { result = CallNode.new(val[0], val[1], [val[2]]) }
      | Expression '/' Expression     { result = CallNode.new(val[0], val[1], [val[2]]) }
      ;

      Constant:
        CONSTANT                      { result = GetConstantNode.new(val[0]) }
      ;

      # Assignment to a variable or constant
      Assign:
        IDENTIFIER "=" Expression     { result = SetLocalNode.new(val[0], val[2]) }
      | CONSTANT "=" Expression       { result = SetConstantNode.new(val[0], val[2]) }
      ;

      # Method definition
      Def:
        DEF IDENTIFIER Block          { result = DefNode.new(val[1], [], val[2]) }
      | DEF IDENTIFIER
          "(" ParamList ")" Block     { result = DefNode.new(val[1], val[3], val[5]) }
      ;

      ParamList:
        /* nothing */                 { result = [] }
      | IDENTIFIER                    { result = val }
      | ParamList "," IDENTIFIER      { result = val[0] << val[2] }
      ;

      # Class definition
      Class:
        CLASS CONSTANT Block          { result = ClassNode.new(val[1], val[2]) }
      ;

      # if block
      If:
        IF Expression Block           { result = IfNode.new(val[1], val[2]) }
      ;

      # A block of indented code. You see here that all the hard work was done by the
      # lexer.
      Block:
        INDENT Expressions DEDENT     { result = val[1] }
      # If you don't like indentation you could replace the previous rule with the
      # following one to separate blocks w/ curly brackets. You'll also need to remove the
      # indentation magic section in the lexer.
      # "{" Expressions "}"           { result = val[1] }
      ;
    end

    ---- header
      require "lexer"
      require "nodes"

    ---- inner
      # This code will be put as-is in the Parser class.
      def parse(code, show_tokens=false)
        @tokens = Lexer.new.tokenize(code) # Tokenize the code using our lexer
        puts @tokens.inspect if show_tokens
        do_parse # Kickoff the parsing process
      end

      def next_token
        @tokens.shift
      end

We then generate the parser with: racc -o parser.rb grammar.y. This will create a Parser class that we can use to parse our code. Run ruby -Itest test/parser_test.rb from the code directory to test the parser. Here’s an excerpt from this file.

test/parser_test.rb

    code = <<-CODE
    def method(a, b):
      true
    CODE

    nodes = Nodes.new([
      DefNode.new("method", ["a", "b"],
        Nodes.new([TrueNode.new])
      )
    ])

    assert_equal nodes, Parser.new.parse(code)

Parsing code will return a tree of nodes. The root node will always be of type Nodes which contains children nodes.

DO IT YOURSELF II

a. Add a rule in the grammar to parse while blocks.
b. Add a grammar rule to handle the ! unary operator, e.g.: !x. Making the following test pass (test_unary_operator):

Solutions to Do It Yourself II.
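The rule-ordering approach to operator precedence described in this chapter (expression, additive, multiplicative, primary) maps directly onto a hand-written recursive-descent parser. Here is a minimal sketch, assuming the tokens are already split into plain strings; the class name TinyParser and the nested-array tree format are hypothetical and unrelated to the Racc grammar of Awesome.

```ruby
# Hypothetical recursive-descent parser: one method per grammar rule.
# Rules called later in the chain bind tighter.
class TinyParser
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # expression: additive
  def expression
    additive
  end

  # additive: multiplicative ( ( '+' | '-' ) multiplicative )*
  def additive
    node = multiplicative
    while ["+", "-"].include?(@tokens.first)
      node = [@tokens.shift, node, multiplicative]
    end
    node
  end

  # multiplicative: primary ( ( '*' | '/' ) primary )*
  def multiplicative
    node = primary
    while ["*", "/"].include?(@tokens.first)
      node = [@tokens.shift, node, primary]
    end
    node
  end

  # primary: '(' expression ')' | NUMBER | VARIABLE
  def primary
    token = @tokens.shift
    if token == "("
      node = expression
      @tokens.shift # drop the closing ")"
      node
    else
      token
    end
  end
end

p TinyParser.new(%w[x + y * z]).expression
p TinyParser.new(%w[( x + y ) * z]).expression
```

Because additive asks multiplicative for its operands, y * z is grouped before the + is applied, and parentheses restart the chain from expression.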

RUNTIME MODEL

The runtime model of a language is how we represent its objects, its methods, its types, its structure in memory. If the parser determines how you talk to the language, the runtime defines how the language behaves. Two languages could share the same parser but have different runtimes and be very different.

When designing your runtime, there are three factors you will want to consider:

▪ Speed: most of the speed will be due to the efficiency of the runtime.
▪ Flexibility: the more you allow the user to modify the language, the more powerful it is.
▪ Memory footprint: of course, all of this while using as little memory as possible.

As you might have noticed, these three constraints are mutually conflictual. Designing a language is always a game of give-and-take.

With those considerations in mind, there are several ways you can model your runtime.

PROCEDURAL

One of the simplest runtime models, like C and PHP (before version 4). Everything is centered around methods (procedures). There aren’t any objects and all methods often share the same namespace. It gets messy pretty quickly!
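As a rough illustration of why a single shared namespace gets messy, here is a minimal sketch of a procedural runtime: one global table of procedures and nothing else. The names PROCEDURES, define_procedure and call_procedure are hypothetical and exist only for this example.

```ruby
# Hypothetical procedural runtime: one global table, no objects.
PROCEDURES = {}

def define_procedure(name, &body)
  PROCEDURES[name] = body
end

def call_procedure(name, *arguments)
  PROCEDURES.fetch(name).call(*arguments)
end

# "area" for rectangles.
define_procedure("area") { |width, height| width * height }
puts call_procedure("area", 2, 3)

# A second "area", this time for circles, silently replaces the first one
# because both live in the same namespace.
define_procedure("area") { |radius| 3.14159 * radius * radius }
puts call_procedure("area", 1)
```

Without objects or classes to scope them, the only way to keep both procedures is to invent distinct names by hand.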

CLASS-BASED

The class-based model is the most popular at the moment. Think of Java, Python, Ruby, etc. It might be the easiest model to understand for the users of your language.

PROTOTYPE-BASED

Except for Javascript, no Prototype-based languages have reached widespread popularity yet. This model is the easiest one to implement and also the most flexible because everything is a clone of an object.

Ian Piumarta describes how to design an Open, Extensible Object Model that allows the language’s users to modify its behavior at runtime.

Look at the appendix at the end of this book for a sample prototype-based language: Appendix: Mio, a minimalist homoiconic language.

FUNCTIONAL

The functional model, used by Lisp and other languages, treats computation as the evaluation of mathematical functions and avoids state and mutable data. This model has its roots in Lambda Calculus.

OUR AWESOME RUNTIME

Since most of us are familiar with Class-based runtimes, I decided to use that for our Awesome language. The following code defines how objects, methods and classes are stored and how they interact together.

The AwesomeObject class is the central object of our runtime. Since everything is an object in our language, everything we will put in the runtime needs to be an object, thus an instance of this class. AwesomeObjects have a class and can hold a Ruby value. This will allow us to store data such as a string or a number in an object to keep track of its Ruby representation.

runtime/object.rb

    # Represents an Awesome object instance in the Ruby world.
    class AwesomeObject
      attr_accessor :runtime_class, :ruby_value

      # Each object has a class (named runtime_class to prevent errors with Ruby's class
      # method). Optionally an object can hold a Ruby value (eg.: numbers and strings).
      def initialize(runtime_class, ruby_value=self)
        @runtime_class = runtime_class
        @ruby_value = ruby_value
      end

      # Call a method on the object.
      def call(method, arguments=[])
        # Like a typical Class-based runtime model, we store methods in the class of the
        # object.
        @runtime_class.lookup(method).call(self, arguments)
      end
    end

Remember that in Awesome, everything is an object. Even classes are instances of the Class class. AwesomeClasses hold the methods and can be instantiated via their new method.

runtime/class.rb

    # Represents an Awesome class in the Ruby world. Classes are objects in Awesome so they
    # inherit from AwesomeObject.
    class AwesomeClass < AwesomeObject
      attr_reader :runtime_methods

      # Creates a new class. Number is an instance of Class for example.
      def initialize
        @runtime_methods = {}

        # Check if we're bootstrapping (launching the runtime). During this process the
        # runtime is not fully initialized and core classes do not yet exist, so we defer
        # using those once the language is bootstrapped.
        # This solves the chicken-or-the-egg problem with the Class class. We can
        # initialize Class then set Class.class = Class.
        if defined?(Runtime)
          runtime_class = Runtime["Class"]
        else
          runtime_class = nil
        end

        super(runtime_class)
      end

      # Lookup a method
      def lookup(method_name)
        method = @runtime_methods[method_name]
        unless method
          raise "Method not found: #{method_name}"
        end
        method
      end

      # Create a new instance of this class
      def new
        AwesomeObject.new(self)
      end

      # Create an instance of this Awesome class that holds a Ruby value. Like a String,
      # Number or true.
      def new_with_value(value)
        AwesomeObject.new(self, value)
      end
    end

And here’s the method object which will store methods defined from within our runtime.

runtime/method.rb

    # Represents a method defined in the runtime.
    class AwesomeMethod
      def initialize(params, body)
        @params = params
        @body = body
      end

      def call(receiver, arguments)
        # Create a context of evaluation in which the method will execute.
        context = Context.new(receiver)

        # Assign arguments to local variables
        @params.each_with_index do |param, index|
          context.locals[param] = arguments[index]
        end

        @body.eval(context)
      end
    end

Notice that we use the call method for evaluating a method. That will allow us to define runtime methods from Ruby using Procs. Here’s why:

    p = proc do |arg1, arg2|
      # ...
    end
    p.call(1, 2) # execute the block of code passed to proc (everything between do ... end)

Procs can be executed via their call method. We’ll see how this is used to define runtime methods from Ruby in the bootstrapping section of the runtime.

Before we bootstrap our runtime, there is one missing object we need to define and that is the context of evaluation. The Context object encapsulates the environment of evaluation of a specific block of code. It will keep track of the following:

▪ Local variables.
▪ The current value of self, the object on which methods with no receivers are called, eg.: print is like self.print.
▪ The current class, the class on which methods are defined with the def keyword.

This is also where our constants (ie. classes) will be stored.

runtime/context.rb

    # The evaluation context.
    class Context
      attr_reader :locals, :current_self, :current_class

      # We store constants as class variable (class variables start with @@ and instance
      # variables start with @ in Ruby) since they are globally accessible. If you want to
      # implement namespacing of constants, you could store it in the instance of this
      # class.
      @@constants = {}

      def initialize(current_self, current_class=current_self.runtime_class)
        @locals = {}
        @current_self = current_self
        @current_class = current_class
      end

      # Shortcuts to access constants, Runtime[...] instead of Runtime.constants[...]
      def [](name)
        @@constants[name]
      end
      def []=(name, value)
        @@constants[name] = value
      end
    end

Finally, we bootstrap the runtime. At first, no objects exist in the runtime. Before we can execute our first expression, we need to populate that runtime with a few objects: Class, Object, true, false, nil and a few core methods.

runtime/bootstrap.rb

    # Bootstrap the runtime. This is where we assemble all the classes and objects together
    # to form the runtime.
                                                    # What's happening in the runtime:
    awesome_class = AwesomeClass.new                #   Class
    awesome_class.runtime_class = awesome_class     #   Class.class = Class
    object_class = AwesomeClass.new                 #   Object = Class.new
    object_class.runtime_class = awesome_class      #   Object.class = Class

    # Create the Runtime object (the root context) on which all code will start its
    # evaluation.
    Runtime = Context.new(object_class.new)

    Runtime["Class"] = awesome_class
    Runtime["Object"] = object_class
    Runtime["Number"] = AwesomeClass.new
    Runtime["String"] = AwesomeClass.new

    # Everything is an object in our language, even true, false and nil. So they need
    # to have a class too.
    Runtime["TrueClass"] = AwesomeClass.new
    Runtime["FalseClass"] = AwesomeClass.new
    Runtime["NilClass"] = AwesomeClass.new

    Runtime["true"] = Runtime["TrueClass"].new_with_value(true)
    Runtime["false"] = Runtime["FalseClass"].new_with_value(false)
    Runtime["nil"] = Runtime["NilClass"].new_with_value(nil)

    # Add a few core methods to the runtime.

    # Add the `new` method to classes, used to instantiate a class:
    #  eg.: Object.new, Number.new, String.new, etc.
    Runtime["Class"].runtime_methods["new"] = proc do |receiver, arguments|
      receiver.new
    end

    # Print an object to the console.
    # eg.: print("hi there!")
    Runtime["Object"].runtime_methods["print"] = proc do |receiver, arguments|
      puts arguments.first.ruby_value
      Runtime["nil"]
    end

Now that we’ve got all the pieces together we can call methods and create objects inside our runtime.

Solutions to Do It Yourself III.call("new") test/runtime_test. DO IT YOURSELF III a. . object. b. Implement inheritance by adding a superclass to each Awesome class. Add the method to handle x + 2.1 2 3 4 # Mimic Object.new in the language object = Runtime["Object"].rb assert_equal Runtime["Object"].runtime_class # assert object is an Object Can you feel the language coming alive? We’ll learn how to map that runtime to the nodes we created from our parser in the next section.
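The reason for giving methods a call(receiver, arguments) interface can be illustrated in isolation. The following is only a minimal sketch, not the book’s runtime: SketchMethod, METHODS and dispatch are hypothetical names introduced here, and the only thing assumed from the text is that a method table may hold either a Proc or a method object, as long as both respond to call.

```ruby
# Hypothetical stand-in for a method defined inside the language.
# It only mimics the call(receiver, arguments) interface of AwesomeMethod.
class SketchMethod
  def initialize(params, &body)
    @params = params
    @body = body
  end

  def call(receiver, arguments)
    # Bind each parameter name to the argument at the same position.
    locals = {}
    @params.each_with_index { |param, index| locals[param] = arguments[index] }
    @body.call(receiver, locals)
  end
end

# A method table mixing both kinds of callable objects.
METHODS = {
  # Defined "from Ruby" with a Proc.
  "shout" => proc { |receiver, arguments| "#{receiver}: #{arguments.first.upcase}" },
  # Defined "from the language" with a method object.
  "greet" => SketchMethod.new(["name"]) { |receiver, locals| "#{receiver}: hello #{locals["name"]}" }
}

# Hypothetical dispatcher: it never checks which kind of object it found,
# it just sends call to whatever is stored under the method name.
def dispatch(receiver, name, arguments)
  method = METHODS[name]
  raise "Method not found: #{name}" unless method
  method.call(receiver, arguments)
end

puts dispatch("obj", "shout", ["hi"])
puts dispatch("obj", "greet", ["world"])
```

Because the dispatcher relies only on call, core methods written as Ruby Procs and methods defined in the language can live side by side in the same table.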

INTERPRETER
The interpreter is the module that evaluates the code. It reads the AST produced by the parser and executes each action associated with the nodes, modifying the runtime.

Figure 3 recapitulates the path of a string in our language. The lexer creates the tokens, the parser takes those tokens and converts them into nodes. Finally, the interpreter evaluates the nodes.

Figure 3

A common approach to execute an AST is to implement a Visitor class that visits all the nodes one by one, running the appropriate code. This makes things even more modular and eases the optimization efforts on the AST. But for the purpose of this book, we’ll keep things simple and let each node handle its evaluation.

Remember the nodes we created in the parser: StringNode for a string, ClassNode for a class definition? Here we’re reopening those classes and adding a new method to each one: eval. This method will be responsible for interpreting that particular node.

interpreter.rb

    require "parser"
    require "runtime"

    class Interpreter
      def initialize
        @parser = Parser.new
      end

      def eval(code)
        @parser.parse(code).eval(Runtime)
      end
    end

    class Nodes
      # This method is the "interpreter" part of our language. All nodes know how to eval
      # themselves and return the result of their evaluation by implementing the "eval" method.
      # The "context" variable is the environment in which the node is evaluated (local
      # variables, current class, etc.).
      def eval(context)
        return_value = nil
        nodes.each do |node|
          return_value = node.eval(context)
        end
        # The last value evaluated in a method is the return value. Or nil if none.
        return_value || Runtime["nil"]
      end
    end

    class NumberNode
      def eval(context)
        # Here we access the Runtime, which we built in the previous section, to create a new
        # instance of the Number class.
        Runtime["Number"].new_with_value(value)
      end
    end

    class StringNode
      def eval(context)
        Runtime["String"].new_with_value(value)
      end
    end

    class TrueNode
      def eval(context)
        Runtime["true"]
      end
    end

    class FalseNode
      def eval(context)
        Runtime["false"]
      end
    end

    class NilNode
      def eval(context)
        Runtime["nil"]
      end
    end

    class CallNode
      def eval(context)
        # If there's no receiver and the method name is the name of a local variable, then
        # it's a local variable access. This trick allows us to skip the () when calling a
        # method.
        if receiver.nil? && context.locals[method] && arguments.empty?
          context.locals[method]

        # Method call
        else
          if receiver
            value = receiver.eval(context)
          else
            # In case there's no receiver we default to self, calling "print" is like
            # "self.print".
            value = context.current_self
          end

          eval_arguments = arguments.map { |arg| arg.eval(context) }
          value.call(method, eval_arguments)
        end
      end
    end

    class GetConstantNode
      def eval(context)
        context[name]
      end
    end

    class SetConstantNode
      def eval(context)
        context[name] = value.eval(context)
      end
    end

    class SetLocalNode
      def eval(context)
        context.locals[name] = value.eval(context)
      end
    end

    class DefNode
      def eval(context)
        # Defining a method is adding a method to the current class.
        method = AwesomeMethod.new(params, body)
        context.current_class.runtime_methods[name] = method
      end
    end

    class ClassNode
      def eval(context)
        # Try to locate the class. Allows reopening classes to add methods.
        awesome_class = context[name]

        unless awesome_class # Class doesn't exist yet
          awesome_class = AwesomeClass.new
          # Register the class as a constant in the runtime.
          context[name] = awesome_class
        end

        # Evaluate the body of the class in its context. Providing a custom context allows
        # to control where methods are added when defined with the def keyword. In this
        # case, we add them to the newly created class.
        class_context = Context.new(awesome_class, awesome_class)

        body.eval(class_context)

        awesome_class
      end
    end

    class IfNode
      def eval(context)
        # We turn the condition node into a Ruby value to use Ruby's "if" control
        # structure.
        if condition.eval(context).ruby_value
          body.eval(context)
        end
      end
    end

The interpreter part (the eval method) is the connector between the parser and the runtime of our language. Once we call eval on the root node, all children nodes are evaluated recursively. This is why we call the output of the parser an AST, for Abstract Syntax Tree. It is a tree of nodes. And evaluating the top level node of that tree will have the cascading effect of evaluating each of its children.

Let’s run our first full program!

test/interpreter_test.rb

    code = <<-CODE
    class Awesome:
      def does_it_work:
        "yeah!"

    awesome_object = Awesome.new
    if awesome_object:
      print(awesome_object.does_it_work)
    CODE

    assert_prints("yeah!\n") { Interpreter.new.eval(code) }

To complete our language we can create a script that either runs a file or starts a REPL (read-eval-print loop), also called an interactive interpreter.
awesome

    #!/usr/bin/env ruby
    # The Awesome language!
    #
    # usage:
    #   ./awesome example.awm # to eval a file
    #   ./awesome             # to start the REPL
    #
    # on Windows run with: ruby awesome [options]

    $:.unshift "." # Fix for Ruby 1.9
    require "interpreter"
    require "readline"

    interpreter = Interpreter.new

    # If a file is given we eval it.
    if file = ARGV.first
      interpreter.eval File.read(file)

    # Start the REPL, read-eval-print-loop, or interactive interpreter
    else
      puts "Awesome REPL, CTRL+C to quit"
      loop do
        line = Readline::readline(">> ")
        Readline::HISTORY.push(line)
        value = interpreter.eval(line)
        puts "=> #{value.ruby_value.inspect}"
      end
    end

Run the interactive interpreter by running ./awesome and type a line of Awesome code, eg.: print("It works!"). Here’s a sample Awesome session:
    Awesome REPL, CTRL+C to quit
    >> m = "This is Awesome!"
    => "This is Awesome!"
    >> print(m)
    This is Awesome!
    => nil

Also, try running a file: ./awesome example.awm.
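A REPL loop like the one in the script can be made a little more forgiving. The sketch below is an assumption-laden variation, not the book’s script: repl is a hypothetical helper, it reads from any IO-like object instead of Readline so that it can be exercised without a terminal, and the evaluator is a stand-in lambda rather than the Awesome interpreter. It shows two small safeguards: stopping at end of input and reporting errors without leaving the loop.

```ruby
require "stringio"

# Hypothetical REPL helper. `input` and `output` are IO-like objects and
# `evaluator` is anything that responds to call(line).
def repl(input, output, evaluator)
  loop do
    output.print ">> "
    line = input.gets
    break if line.nil?          # end of input (CTRL+D on a terminal)
    line = line.chomp
    next if line.empty?         # ignore blank lines

    begin
      output.puts "=> #{evaluator.call(line).inspect}"
    rescue StandardError => e
      # Report the problem and keep the session alive.
      output.puts "Error: #{e.message}"
    end
  end
end

# Stand-in evaluator: doubles an integer, raises on anything else.
doubler = ->(line) { Integer(line) * 2 }

session = StringIO.new
repl(StringIO.new("21\noops\n"), session, doubler)
puts session.string
```

In the real script the evaluator would be interpreter.eval, and the input would come from Readline.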

DO IT YOURSELF IV
a. Implement the WhileNode eval method.
Solutions to Do It Yourself IV.
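For comparison, the Visitor approach mentioned at the start of this chapter could be sketched as follows. This is a minimal illustration under stated assumptions, not part of Awesome: NumberLit, AddOp, PrintOp and EvalVisitor are hypothetical names, and values are plain Ruby integers instead of runtime objects. The point is that the nodes stay pure data while a separate object holds all the evaluation logic.

```ruby
# Hypothetical node classes: plain data, no eval method.
NumberLit = Struct.new(:value)
AddOp     = Struct.new(:left, :right)
PrintOp   = Struct.new(:expression)

# The visitor owns all the evaluation logic, one method per node type.
class EvalVisitor
  attr_reader :output

  def initialize
    @output = [] # collected instead of printed, to keep the sketch testable
  end

  # Route each node to the method that knows how to evaluate it.
  def visit(node)
    case node
    when NumberLit then visit_number(node)
    when AddOp     then visit_add(node)
    when PrintOp   then visit_print(node)
    else raise "Unknown node: #{node.inspect}"
    end
  end

  def visit_number(node)
    node.value
  end

  def visit_add(node)
    visit(node.left) + visit(node.right)
  end

  def visit_print(node)
    value = visit(node.expression)
    @output << value.to_s
    value
  end
end

# AST for: print(1 + 2)
ast = PrintOp.new(AddOp.new(NumberLit.new(1), NumberLit.new(2)))
visitor = EvalVisitor.new
visitor.visit(ast)
puts visitor.output.last
```

With this layout, a second visitor (an optimizer or a compiler, for instance) can walk the same nodes without touching their classes, which is the modularity the text alludes to.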

COMPILATION
Dynamic languages like Ruby and Python have many advantages, like development speed and flexibility. But sometimes, compiling a language to a more efficient format might be more appropriate. When you write C code, you compile it to machine code. Java is compiled to a specific byte-code that you can run on the Java Virtual Machine (JVM).

These days, most dynamic languages also compile source code to machine code or byte-code on the fly; this is called Just In Time (JIT) compilation. It yields faster execution because executing code by walking an AST of nodes (like in Awesome) is less efficient than running machine code or byte-code. Byte-code execution is faster because byte-code is closer to machine code than an AST is. Bringing your execution model closer to the machine generally produces faster results.

If you want to compile your language you have multiple choices. You can:
▪ compile to your own byte-code format and create a Virtual Machine to run it,
▪ compile it to JVM byte-code using tools like ASM,
▪ write your own machine code compiler,
▪ use an existing compiler framework like LLVM.

Compiling to a custom byte-code format is the most popular option for dynamic languages. Python and new Ruby implementations are going this way. But this only makes sense if you’re coding in a low level language like C. Because Virtual Machines are very simple (basically a loop and a switch-case), you need total control over the structure of your program in memory to make them as fast as possible. If you want more details on Virtual Machines, jump to the next chapter.

USING LLVM FROM RUBY
We’ll be using LLVM Ruby bindings to compile a subset of Awesome to machine code on the fly. To compile a full language, most parts of the runtime have to be rewritten to be accessible from inside LLVM.

First, you’ll need to install LLVM and the Ruby bindings. You can find instructions on the ruby-llvm project page. Installing LLVM can take quite some time, so make sure you have a tasty beverage at hand before launching compilation.

Here’s how to use LLVM from Ruby.

    # INT and PCHAR are the LLVM types for int and char*, defined in the
    # Compiler class of the next section.

    # Creates a new module to hold the code
    mod = LLVM::Module.create("awesome")
    # Creates the main function that will be called
    main = mod.functions.add("main", [INT, LLVM::Type.pointer(PCHAR)], INT)
    # Create a block of code to build machine code within
    builder = LLVM::Builder.create
    builder.position_at_end(main.basic_blocks.append)
    # Find the function we're calling in the module
    func = mod.functions.named("puts")
    # Call the function
    builder.call(func, builder.global_string_pointer("hello"))
    # Return
    builder.ret(LLVM::Int(0))

This is the equivalent of the following C code. In fact, it will generate similar machine code.

    int main (int argc, char const *argv[])
    {
      puts("hello");
      return 0;
    }

The difference is that with LLVM we can dynamically generate the machine code. Because of that, we can create machine code from Awesome code.

COMPILING AWESOME TO MACHINE CODE
To compile Awesome to machine code, we’ll create a Compiler class that will encapsulate the logic of calling LLVM to generate the byte-code. Then, we’ll extend the nodes created by the parser to make them use the compiler.

compiler.rb

    require "rubygems"
    require "parser"
    require "nodes"

    require 'llvm/core'
    require 'llvm/execution_engine'
    require 'llvm/transforms/scalar'
    require 'llvm/transforms/ipo'

    LLVM.init_x86

    # Compiler is used in a similar way as the runtime. But, instead of executing code, it
    # will generate LLVM byte-code for later execution.
    class Compiler
      # Initialize LLVM types
      PCHAR = LLVM::Type.pointer(LLVM::Int8) # equivalent to char* in C
      INT   = LLVM::Int # equivalent to int in C

      attr_reader :locals

      def initialize(mod=nil, function=nil)
        # Create the LLVM module in which to store the code
        @module = mod || LLVM::Module.create("awesome")

        # To track local names during compilation
        @locals = {}

        # Function in which the code will be put
        @function = function ||
                    # By default we create a main function as it's the standard entry point
                    @module.functions.named("main") ||
                    @module.functions.add("main", [INT, LLVM::Type.pointer(PCHAR)], INT)

        # Create an LLVM byte-code builder
        @builder = LLVM::Builder.create
        @builder.position_at_end(@function.basic_blocks.append)

        @engine = LLVM::ExecutionEngine.create_jit_compiler(@module)
      end

      # Initial header to initialize the module.
      def preamble
        define_external_functions
      end

      def finish
        @builder.ret(LLVM::Int(0))
      end

      # Create a new string.
      def new_string(value)
        @builder.global_string_pointer(value)
      end

      # Create a new number.
      def new_number(value)
        LLVM::Int(value)
      end

      # Call a function.
      def call(func, args=[])
        f = @module.functions.named(func)
        @builder.call(f, *args)
      end

      # Assign a local variable
      def assign(name, value)
        # Allocate the memory and return a pointer to it
        ptr = @builder.alloca(value.type)
        # Store the value inside the pointer
        @builder.store(value, ptr)
        # Keep track of the pointer so the compiler can find it back by name later.
        @locals[name] = ptr
      end

      # Load the value of a local variable.
      def load(name)
        @builder.load(@locals[name])
      end

      # Defines a function.
      def function(name)
        func = @module.functions.add(name, [], INT)
        generator = Compiler.new(@module, func)
        yield generator
        generator.finish
      end

      # Optimize the generated LLVM byte-code.
      def optimize
        @module.verify!
        pass_manager = LLVM::PassManager.new(@engine)
        pass_manager.simplifycfg! # Simplify the CFG
        pass_manager.mem2reg!     # Promote Memory to Register
        pass_manager.gdce!        # Dead Global Elimination
      end

      # JIT compile and run the LLVM byte-code.
      def run
        @engine.run_function(@function, 0, 0)
      end

      def dump
        @module.dump
      end

      private
        def define_external_functions
          fun = @module.functions.add("printf", [LLVM::Type.pointer(PCHAR)], INT, { :varargs => true })
          fun.linkage = :external

          fun = @module.functions.add("puts", [PCHAR], INT)
          fun.linkage = :external

          fun = @module.functions.add("read", [INT, PCHAR, INT], INT)
          fun.linkage = :external

          fun = @module.functions.add("exit", [INT], INT)
          fun.linkage = :external
        end
    end

    # Reopen the classes supported by the compiler to implement how each node is compiled
    # (compile method).

    class Nodes
      def compile(compiler)
        nodes.map { |node| node.compile(compiler) }.last
      end
    end

    class NumberNode
      def compile(compiler)
        compiler.new_number(value)
      end
    end

    class StringNode
      def compile(compiler)
        compiler.new_string(value)
      end
    end

    class CallNode
      def compile(compiler)
        raise "Receiver not supported for compilation" if receiver

        # Local variable access
        if receiver.nil? && arguments.empty? && compiler.locals[method]
          compiler.load(method)

        # Method call
        else
          compiled_arguments = arguments.map { |arg| arg.compile(compiler) }
          compiler.call(method, compiled_arguments)
        end
      end
    end

    class SetLocalNode
      def compile(compiler)
        compiler.assign(name, value.compile(compiler))
      end
    end

    class DefNode
      def compile(compiler)
        raise "Parameters not supported for compilation" if !params.empty?
        compiler.function(name) do |function|
          body.compile(function)
        end
      end
    end

With the compiler integrated with the nodes, we can now compile a simple program.

test/compiler_test.rb

    code = <<-CODE
    def say_it:
      x = "This is compiled!"
      puts(x)
    say_it
    CODE

    # Parse the code
    node = Parser.new.parse(code)

    # Compile it
    compiler = Compiler.new
    compiler.preamble
    node.compile(compiler)
    compiler.finish

    # Uncomment to output LLVM byte-code
    # compiler.dump

    # Optimize the LLVM byte-code
    compiler.optimize

    # JIT compile & execute
    compiler.run

Remember that the compiler only supports a subset of our Awesome language. For example, object-oriented programming is not supported. To implement this, the runtime and structures used to store the classes and objects have to be loaded from inside the LLVM module. You can do this by compiling your runtime to LLVM byte-code, either by writing it in C and using the C-to-LLVM compiler shipped with LLVM or by writing your runtime in a subset of your language that can be compiled to LLVM byte-code.

VIRTUAL MACHINE
If you’re serious about speed and are ready to implement your language in C/C++, you need to introduce a VM (short for Virtual Machine) in your design.

When running your language on a VM you have to compile your AST nodes to byte-code. Despite adding an extra step, execution of the code is faster because byte-code format is closer to machine code than an AST. Figure 4 shows how you can introduce a byte-code compiler into your language design.

Figure 4

Look at tinypy and tinyrb in the Interesting Languages section for complete examples of small VM-based languages.
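Compiling AST nodes to byte-code can be pictured with a tiny example. The sketch below is hypothetical and independent of Awesome: Num, Add and Print are made-up node types, compile_node is a made-up helper, and opcodes are Ruby symbols for readability (a real VM would use small integers). It assumes a stack-based machine like the one prototyped later in this chapter, where operands are pushed before the instruction that consumes them.

```ruby
# Hypothetical AST nodes for a tiny expression language.
Num   = Struct.new(:value)
Add   = Struct.new(:left, :right)
Print = Struct.new(:expression)

# Flatten a tree of nodes into a linear array of instructions.
def compile_node(node, bytecode = [])
  case node
  when Num
    bytecode << :push << node.value       # opcode followed by its operand
  when Add
    compile_node(node.left, bytecode)     # left operand ends up on the stack
    compile_node(node.right, bytecode)    # then the right operand
    bytecode << :add                      # the VM pops both and pushes the sum
  when Print
    compile_node(node.expression, bytecode)
    bytecode << :print
  else
    raise "Cannot compile #{node.inspect}"
  end
  bytecode
end

# AST for: print 1 + (2 + 3)
program = Print.new(Add.new(Num.new(1), Add.new(Num.new(2), Num.new(3))))
p compile_node(program)
```

The nesting of the tree disappears in the output: the order of the instructions alone encodes which operation happens first.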

BYTE-CODE
The purpose of a VM is to bring the code as close to machine code as possible, but without straying so far from the original language that it becomes hard (slow) to compile. An ideal byte-code is fast to compile from the source language and fast to execute while being very compact in memory.

The byte-code of a program consists of instructions. Each instruction starts with an opcode, specifying what this instruction does, followed by operands, which are the arguments of the instruction.

Figure 5

Most VMs share a similar set of instructions. Common ones include getlocal to push a local variable value to the stack, putstring to push a string to the stack, pop to remove an item from the stack, dup to duplicate it, etc. You can see the Ruby 1.9 (YARV) instruction set at lifegoo.pluskid.org, and the JVM instruction set at xs4all.nl.

TYPES OF VM
Stack-based virtual machines are the most common and the simplest to implement. Python, Ruby 1.9 and the JVM are all stack-based VMs. Their execution is based around a stack in which values are stored to be passed around to instructions. Instructions will often pop values from the stack, modify them and push the result back to the stack.

Register-based VMs are becoming increasingly popular. Lua is register-based. They are closer to how a real machine works but produce bigger instructions with lots of operands, though often fewer total instructions than a stack-based VM.

PROTOTYPING A VM IN RUBY
For the sake of understanding the inner workings of a VM, we’ll look at a prototype written in Ruby. It should be noted that a real VM should never be written in a high level language like Ruby. It defeats the purpose of bringing the code closer to the machine. Each line of code in a Virtual Machine is optimized to require as few machine instructions as possible to execute. High level languages do not provide that control but C and C++ do.

vm/vm.rb

    # Bytecode
    PUSH   = 0
    ADD    = 1
    PRINT  = 2
    RETURN = 3

    class VM
      def run(bytecode)
        # Stack to pass values between instructions (the top is the last element).
        stack = []
        # Instruction Pointer, index of current instruction being executed in bytecode.
        ip = 0

        while true
          case bytecode[ip]
          when PUSH
            stack.push bytecode[ip+=1]
          when ADD
            stack.push stack.pop + stack.pop
          when PRINT
            puts stack.pop
          when RETURN
            return
          end

          # Continue to next instruction
          ip += 1
        end
      end
    end

    VM.new.run [
      # Here is the bytecode of our program, the equivalent of: print 1 + 2.
      # Opcode, Operand    # Status of the stack after execution of the instruction.
      PUSH,   1,           # stack = [1]
      PUSH,   2,           # stack = [1, 2]
      ADD,                 # stack = [3]
      PRINT,               # stack = []
      RETURN
    ]

As you can see, a Virtual Machine is simply a loop and a switch-case. The byte-code that we execute is the equivalent of print 1 + 2. To achieve this, we push 1 and 2 to the stack and execute the ADD instruction, which pops the two values at the top of the stack and pushes their sum. Look in the code/vm directory for the full language using this VM.
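To contrast the two types of VM, here is a rough sketch of what a register-based variant of the same program could look like. Everything in it is an assumption made for illustration: the RegisterVM class, the LOADK, ADDR, PRINTR and HALT opcodes and their operand layout are hypothetical and are not taken from Lua or any real VM. Instead of pushing and popping, each instruction names the registers it reads from and writes to.

```ruby
# Hypothetical opcodes for a register-based machine.
LOADK  = 0 # LOADK  dest, constant : registers[dest] = constant
ADDR   = 1 # ADDR   dest, a, b     : registers[dest] = registers[a] + registers[b]
PRINTR = 2 # PRINTR src            : output registers[src]
HALT   = 3 # HALT                  : stop execution

class RegisterVM
  attr_reader :output

  def initialize
    @output = [] # collected instead of printed, to keep the sketch testable
  end

  def run(bytecode)
    registers = []
    ip = 0 # instruction pointer

    loop do
      case bytecode[ip]
      when LOADK
        dest, constant = bytecode[ip + 1, 2]
        registers[dest] = constant
        ip += 3 # opcode + 2 operands
      when ADDR
        dest, a, b = bytecode[ip + 1, 3]
        registers[dest] = registers[a] + registers[b]
        ip += 4 # opcode + 3 operands
      when PRINTR
        @output << registers[bytecode[ip + 1]].to_s
        ip += 2 # opcode + 1 operand
      when HALT
        return registers
      else
        raise "Unknown opcode: #{bytecode[ip].inspect}"
      end
    end
  end
end

vm = RegisterVM.new
vm.run [
  # The equivalent of: print 1 + 2
  LOADK,  0, 1,     # r0 = 1
  LOADK,  1, 2,     # r1 = 2
  ADDR,   2, 0, 1,  # r2 = r0 + r1
  PRINTR, 2,        # print r2
  HALT
]
puts vm.output
```

Each instruction is wider than its stack-based counterpart because it carries its source and destination registers as operands, which is the trade-off described in the Types of VM section.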

GOING FURTHER
Building your first language is fun, but it’s only the tip of the iceberg. There’s so much to discover in that field. Here are a few things I’ve been playing with in the last few years.

HOMOICONICITY
That’s the word you want to ostentatiously throw in a geek conversation. While it sounds obscure and complex, it means that the primary representation of your program (the AST) is accessible as a data structure inside the runtime of the language. You can inspect and modify the program as it’s running. This gives you godlike powers.

Look in the Interesting Languages section of the Resources chapter for the Io language and in the Appendix: Mio, a minimalist homoiconic language of this book for a sample homoiconic language implementing if and boolean logic in itself.

SELF-HOSTING
A self-hosting, or metacircular, interpreter aims to implement the interpreter in the target language. This is very tedious since you need to implement an interpreter first to run the language, which causes a circular dependency between the two.

CoffeeScript is a little language that compiles into JavaScript. The CoffeeScript compiler is itself written in CoffeeScript.

Rubinius is a Ruby implementation that aims to be self-hosted in the future. At the moment, some parts of the runtime are still not written in Ruby.

PyPy is trying to achieve this in a much simpler way: by using a restrictive subset of the Python language to implement Python itself.

WHAT’S MISSING?
If you’re serious about building a real language (real as in production-ready), then you should consider implementing it in a faster and more robust environment. Ruby is nice for quick prototyping but horrible for language implementation. The two obvious choices are Java on the JVM, which gives you a garbage collector and a nice collection of portable libraries, or C/C++, which gives you total control over what you’re doing.

Now get out there and make your own awesome language!

RESOURCES
BOOKS & PAPERS
Language Implementation Patterns, by Terence Parr, from The Pragmatic Programmers.
Smalltalk-80: The Language and its Implementation, by Adele Goldberg et al., published by Addison-Wesley, May 1983.
A No-Frills Introduction to Lua 5.1 VM Instructions, by Kein-Hong Man.
The Implementation of Lua 5.0, by Roberto Ierusalimschy et al.

EVENTS
OOPSLA, The International Conference on Object Oriented Programming, Systems, Languages and Applications, is a gathering of many programming language authors.
The JVM Language Summit is an open technical collaboration among language designers, compiler writers, tool builders, runtime engineers, and VM architects for sharing experiences as creators of programming languages for the JVM.

FORUMS AND BLOGS
Lambda the Ultimate, The Programming Languages Weblog, discusses new trends, research papers and various programming language topics.

INTERESTING LANGUAGES
Io:
Io is a small, prototype-based programming language. The ideas in Io are mostly inspired by Smalltalk (all values are objects, all messages are dynamic), Self (prototype-based), NewtonScript (differential inheritance), Act1 (actors and futures for concurrency), LISP (code is a runtime inspectable/modifiable tree) and Lua (small, embeddable).

A few things to note about Io. It doesn’t have any parser, only a lexer that converts the code to Message objects. This language is Homoiconic.

Factor is a concatenative programming language where references to dynamically-typed values are passed between words (functions) on a stack.

Lua:
Lua is a powerful, fast, lightweight, embeddable scripting language. Lua combines simple procedural syntax with powerful data description constructs based on associative arrays and extensible semantics. Lua is dynamically typed, runs by interpreting bytecode for a register-based virtual machine, and has automatic memory management with incremental garbage collection, making it ideal for configuration, scripting, and rapid prototyping.

tinypy and tinyrb are both subsets of more complete languages (Python and Ruby respectively) running on Virtual Machines inspired by Lua. Their code is only a few thousand lines long. If you want to introduce a VM in your language design, those are good starting points.

Need inspiration for your awesome language? Check out Wikipedia’s programming lists: List of programming languages (http://en.wikipedia.org/wiki/List_of_programming_languages) and Esoteric programming languages (http://en.wikipedia.org/wiki/Esoteric_programming_language).

In addition to the languages we know and use every day (C, C++, Perl, Java, etc.), you’ll find many lesser-known languages, many of which are very interesting. You’ll even find some esoteric languages such as Piet, a language that is programmed using images that look like abstract art. You’ll also find some languages that are voluntarily impossible to use like Malbolge and BrainFuck and some amusing languages like LOLCODE, whose sole purpose is to be funny.

While some of these languages aren’t practical, they can widen your horizons, and alter your conception of what constitutes a computer language. If you’re going to design your own language, that can only be a good thing.

SOLUTIONS TO DO IT YOURSELF
SOLUTIONS TO DO IT YOURSELF I
a. Modify the lexer to parse: while condition: ... control structures.
Simply add while to the KEYWORDS array on line 2.

    KEYWORDS = ["def", "class", "if", "else", "true", "false", "nil", "while"]

b. Modify the lexer to delimit blocks with { ... } instead of indentation.
Remove all indentation logic and add an elsif to parse line breaks.

bracket_lexer.rb

    class BracketLexer
      KEYWORDS = ["def", "class", "if", "else", "true", "false", "nil"]

      def tokenize(code)
        code.chomp!
        i = 0
        tokens = []

        while i < code.size
          chunk = code[i..-1]

          if identifier = chunk[/\A([a-z]\w*)/, 1]
            if KEYWORDS.include?(identifier)
              tokens << [identifier.upcase.to_sym, identifier]
            else
              tokens << [:IDENTIFIER, identifier]
            end
            i += identifier.size

          elsif constant = chunk[/\A([A-Z]\w*)/, 1]
            tokens << [:CONSTANT, constant]
            i += constant.size

          elsif number = chunk[/\A([0-9]+)/, 1]
            tokens << [:NUMBER, number.to_i]
            i += number.size

          elsif string = chunk[/\A"(.*?)"/, 1]
            tokens << [:STRING, string]
            i += string.size + 2

          ######
          # All indentation magic code was removed and only this elsif was added.
          elsif chunk.match(/\A\n+/)
            tokens << [:NEWLINE, "\n"]
            i += 1
          ######

          elsif chunk.match(/\A /)
            i += 1

          else
            value = chunk[0,1]
            tokens << [value, value]
            i += 1
          end
        end

        tokens
      end
    end

SOLUTIONS TO DO IT YOURSELF II
a. Add a rule in the grammar to parse while blocks.
This rule is very similar to If.

    # At the top add:
    token WHILE

    # ...

    Expression:
      # ...
    | While
    ;

    # ...

    While:
      WHILE Expression Block { result = WhileNode.new(val[1], val[2]) }
    ;

And in the nodes.rb file, you will need to create the class:

    class WhileNode < Struct.new(:condition, :body); end

b. Add a grammar rule to handle the ! unary operators.
Similar to the binary operator. Calling !x is like calling x.!.

    Operator:
      # ...
    | '!' Expression { result = CallNode.new(val[1], val[0], []) }
    ;

SOLUTIONS TO DO IT YOURSELF III
a. Implement inheritance by adding a superclass to each Awesome class.

    class AwesomeClass < AwesomeObject
      # ...

      def initialize(superclass=nil)
        @runtime_methods = {}
        @runtime_superclass = superclass
        # ...
      end

      def lookup(method_name)
        method = @runtime_methods[method_name]
        unless method
          if @runtime_superclass
            return @runtime_superclass.lookup(method_name)
          else
            raise "Method not found: #{method_name}"
          end
        end
        method
      end
    end

    # ...

    Runtime["Number"] = AwesomeClass.new(Runtime["Object"])

b. Add the method to handle x + 2.

    Runtime["Number"].runtime_methods["+"] = proc do |receiver, arguments|
      result = receiver.ruby_value + arguments.first.ruby_value
      Runtime["Number"].new_with_value(result)
    end

SOLUTIONS TO DO IT YOURSELF IV
a. Implement the WhileNode.
while is very similar to if.

    class WhileNode
      def eval(context)
        # WhileNode is a Struct, so we read its members through their accessors,
        # like IfNode does.
        while condition.eval(context).ruby_value
          body.eval(context)
        end
      end
    end

APPENDIX: MIO, A MINIMALIST HOMOICONIC LANGUAGE
HOMOICOWHAT?
Homoiconicity is a hard concept to grasp. The best way to understand it fully is to implement it. That is the purpose of this section. It should also give you a glimpse at an unconventional language.

We’ll build a tiny language called Mio (for mini-Io). It is derived from the Io language. The central component of our language will be messages. Messages are a data type in Mio and also how programs are represented and parsed, thus its homoiconicity. We’ll again implement the core of our language in Ruby, but this one will take less than 200 lines of code.

MESSAGES ALL THE WAY DOWN
Like in Awesome, everything is an object in Mio. Additionally, a program, being method calls and literals, is simply a series of messages. And messages are separated by spaces, not dots, which makes our language look a lot like plain English.

    object method1 method2(argument)

Is the semantic equivalent of the following Ruby code:

    object.method1.method2(argument)
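The idea that a program is just a chain of messages, and that this chain is ordinary data, can be previewed with a few lines of Ruby. This is a deliberately simplified sketch and not Mio’s implementation: ChainMessage, parse_chain and message_names are hypothetical names, arguments and literals are ignored, and messages are simply split on whitespace.

```ruby
# Hypothetical, simplified message: a name plus a link to the message that follows.
ChainMessage = Struct.new(:name, :next_message)

# Split the source on whitespace and link the tokens from right to left,
# so that the first token ends up at the head of the chain.
def parse_chain(source)
  source.split.reverse.inject(nil) do |following, name|
    ChainMessage.new(name, following)
  end
end

# Walk the chain and collect the message names.
def message_names(message)
  names = []
  while message
    names << message.name
    message = message.next_message
  end
  names
end

program = parse_chain("object method1 method2")
p message_names(program) # the program is a data structure we can inspect...

# ...and modify: splice a new message in after the first one.
program.next_message = ChainMessage.new("inspect", program.next_message)
p message_names(program)
```

Since the parsed program and the data the language manipulates share one representation, code that rewrites programs needs no special machinery, which is the essence of homoiconicity.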

THE RUNTIME Unlike Awesome but like Javascript. :value def initialize(proto=nil. Thus. Mio is prototype-based. # By default objects eval to themselves. Objects don’t have classes. Mio objects are like dictionaries or hashes (again. numbers and other objects.each { |proto| return message if message = proto[name] } raise Mio::Error.key?(name) message = nil @protos. We create new objects by cloning existing ones.rb . They contain slots in which we can store methods and values such as strings. def call(*) self end mio/object.compact @value = value @slots = {} end # Lookup a slot in the current object and protos. message) @slots[name] = message end # The call method is used to eval an object. value=nil) @protos = [proto]. it doesn’t have any classes or instances. def [](name) return @slots[name] if @slots. "Missing slot: #{name. their parent objects.inspect}" end # Set a slot def []=(name. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 module Mio class Object attr_accessor :slots. :protos. but prototypes (protos). much like Javascript).

        def clone(val=nil)
          # Copy the value when possible; values that can't be dup'ed are shared.
          val ||= (@value && @value.dup rescue @value)
          Object.new(self, val)
        end
      end
    end

Mio programs are a chain of messages, each message being a token. The following piece of code:

    "hello" print
    1 to_s print

is parsed as the following chain of messages:

    Message.new('"hello"',
      Message.new("print",
        Message.new("\n",
          Message.new("1",
            Message.new("to_s",
              Message.new("print"))))))

Notice line breaks (and dots) are also messages. When executed, they simply reset the receiver of the message.

    self print # <= Line break resets the receiver to self
    self print # So now it looks as if we're starting a new expression
               # with the same receiver as before.

This results in the same behaviour as in languages such as Awesome, where each line is an expression.

The unification of all types of expression into one data type makes our language extremely easy to parse (see the parse_all method in the code below).
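The receiver reset can be sketched in a few lines of plain Ruby. This is only an illustration of the idea, assuming a hypothetical `run` helper that works on a flat list of message names and sends them to ordinary Ruby objects; Mio’s actual evaluation lives in `Message#call`.

```ruby
# Hypothetical stand-in for Mio's evaluation loop: names are sent to the
# current receiver, while "." and "\n" reset the receiver to the context.
def run(names, context)
  receiver = context
  names.each do |name|
    if [".", "\n"].include?(name)
      receiver = context # terminator: start over from the context
    else
      receiver = receiver.public_send(name)
    end
  end
  receiver
end

# Two "lines": `upcase reverse`, then `length`, both starting from "mio".
puts run(["upcase", "reverse", "\n", "length"], "mio") # prints 3
```

Without the line break in the list, `length` would be sent to the result of `reverse` rather than to the original context.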

Messages are much like tokens, thus our parsing code will be similar to the one of our lexer in Awesome. We don’t even need a grammar with parsing rules!

mio/message.rb

    module Mio
      # Message is a chain of tokens produced when parsing.
      #   1 print.
      # is parsed to:
      #   Message.new("1",
      #               Message.new("print"))
      # You can then +call+ the top level Message to eval it.
      class Message < Object
        attr_accessor :next, :name, :args, :line, :cached_value

        def initialize(name, line)
          @name = name
          @args = []
          @line = line

          # Literals are static values, we can eval them right
          # away and cache the value.
          @cached_value = case @name
          when /^\d+/
            Lobby["Number"].clone(@name.to_i)
          when /^"(.*)"$/
            Lobby["String"].clone($1)
          end

          @terminator = [".", "\n"].include?(@name)

          super(Lobby["Message"])
        end

        # Call (eval) the message on the +receiver+.
        def call(receiver, context=receiver, *args)
          if @terminator
            # reset receiver to object at beginning of the chain.
            # eg.:
            #   hello there. yo
            #   ^          ^__ "." resets back to the receiver here
            #   \__________/
            value = context

          elsif @cached_value
            # We already got the value
            value = @cached_value
          else
            # Lookup the slot on the receiver
            slot = receiver[name]

            # Eval the object in the slot
            value = slot.call(receiver, context, *@args)
          end

          # Pass to next message if some
          if @next
            @next.call(value, context)
          else
            value
          end
        rescue Mio::Error => e
          # Keep track of the message that caused the error to output
          # line number and such.
          e.current_message ||= self
          raise
        end

        def to_s(level=0)
          s = "  " * level
          s << "<Message @name=#{@name}"
          s << ", @args=" + @args.inspect unless @args.empty?
          s << ", @next=\n" + @next.to_s(level + 1) if @next
          s + ">"
        end

        # Parse a string into a chain of messages
        def self.parse(code)
          parse_all(code, 1).last
        end

        private
        def self.parse_all(code, line)
          code = code.strip
          i = 0
          message = nil

          messages = []

          # Marrrvelous parsing code!
          while i < code.size
            case code[i..-1]
            when /\A("[^"]*")/, # string
                 /\A(\d+)/,     # number
                 /\A(\.)+/,     # dot
                 /\A(\n)+/,     # line break
                 /\A(\w+)/      # name
              m = Message.new($1, line)
              if messages.empty?
                messages << m
              else
                message.next = m
              end
              line += $1.count("\n")
              message = m
              i += $1.size - 1
            when /\A(\(\s*)/ # arguments
              start = i + $1.size
              level = 1
              while level > 0 && i < code.size
                i += 1
                level += 1 if code[i] == ?\(
                level -= 1 if code[i] == ?\)
              end
              line += $1.count("\n")
              code_chunk = code[start..i-1]
              message.args = parse_all(code_chunk, line)
              line += code_chunk.count("\n")
            when /\A,(\s*)/
              line += $1.count("\n")
              messages.concat parse_all(code[i+1..-1], line)
              break
            when /\A(\s+)/, # ignore whitespace
                 /\A(#.*$)/ # ignore comments
              line += $1.count("\n")
              i += $1.size - 1
            else
              raise "Unknown char #{code[i].inspect} at line #{line}"
            end

. For example. This is called lazy evaluation.123 124 125 126 127 128 i += 1 end messages end end end The only missing part of our language at this point is a method. you have to eval them explicitly. It will allow us to implement control structure right from inside our language.. calling_context. calling method(x) won’t evaluate x when calling the method.. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 module Mio class Method < Object def initialize(context. Our little language only has lazy argument # evaluation. If you pass args to a method. When an argument needs to be evaluated. They won’t be implicitly evaluated. we do so explicitly by calling the method eval_arg(arg_index). ..clone method_context["self"] = receiver method_context["arguments"] = Lobby["List"]. lots of contexts here.rb method_context: where the method body (message) is executing method_context = @definition_context. This will allow us to store a block of code and execute it later in its original context and on the receiver. there will be one special thing about our method arguments.clone(args) # Note: no argument is evaluated here. *args) # Woo. But. message) @definition_context = context @message = message super(Lobby["Method"]) end def call(receiver. lets clear that up: # # # @definition_context: where the method was defined calling_context: where the method was called mio/method. it will pass it as a message. # using the following method.

          method_context["eval_arg"] = proc do |receiver, context, at|
            (args[at.call(context).value] || Lobby["nil"]).call(calling_context)
          end
          @message.call(method_context)
        end
      end
    end

Now that we have all the objects in place we’re ready to bootstrap our runtime. Our Awesome language had a Context object, which served as the environment of execution. In Mio, we’ll simply use an object as the context of evaluation. Local variables will be stored in the slots of that object. The root object is called the Lobby. Because … it’s where all the objects meet, in the lobby. (Actually, the term is taken from Io.)

mio/bootstrap.rb

    module Mio
      # Bootstrap
      object = Object.new

      object["clone"] = proc { |receiver, context| receiver.clone }
      object["set_slot"] = proc do |receiver, context, name, value|
        receiver[name.call(context).value] = value.call(context)
      end
      object["print"] = proc do |receiver, context|
        puts receiver.value
        Lobby["nil"]
      end

      # Introducing the Lobby! Where all the fantastic objects live and also the root context
      # of evaluation.
      Lobby = object.clone

      Lobby["Lobby"]   = Lobby
      Lobby["Object"]  = object
      Lobby["nil"]     = object.clone(nil)
      Lobby["true"]    = object.clone(true)
      Lobby["false"]   = object.clone(false)

      Lobby["Number"]  = object.clone(0)
      Lobby["String"]  = object.clone("")
      Lobby["List"]    = object.clone([])
      Lobby["Message"] = object.clone
      Lobby["Method"]  = object.clone

      # The method we'll use to define methods.
      Lobby["method"] = proc { |receiver, context, message| Method.new(context, message) }
    end

IMPLEMENTING MIO IN MIO

This is all we need to start implementing our language in itself.

First, here’s what we’re already able to do: cloning objects, setting and getting slot values.

test/mio/oop.mio

    # Create a new object, by cloning the master Object
    set_slot("dude", Object clone)
    # Set a slot on it
    dude set_slot("name", "Bob")
    # Call the slot to retrieve its value
    dude name print
    # => Bob

    # Define a method
    dude set_slot("say_name", method(
      # Print unevaluated arguments (messages)
      arguments print
      # => <Message @name="hello...">

      # Eval the first argument
      eval_arg(0) print
      # => hello...

      # Access the receiver via `self`
      self name print
      # => Bob
    ))

    # Call that method
    dude say_name("hello...")

Here’s where the lazy argument evaluation comes in. We’re able to implement the and and or operators from inside our language.

mio/boolean.mio

    # An object is always truish
    Object set_slot("and", method(
      eval_arg(0)
    ))
    Object set_slot("or", method(
      self
    ))

    # ... except nil and false which are false
    nil set_slot("and", nil)
    nil set_slot("or", method(
      eval_arg(0)
    ))

    false set_slot("and", false)
    false set_slot("or", method(
      eval_arg(0)
    ))

test/mio/boolean.mio

    "yo" or("hi") print
    # => yo

    nil or("hi") print
    # => hi

    "yo" and("hi") print
    # => hi

    1 and(2 or(3)) print
    # => 2
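For readers more at home in Ruby, the same trick can be approximated with lambdas standing in for unevaluated arguments. This is only a rough analogue, not how Mio works internally: the `mio_and` and `mio_or` method names and the `Falsy` module are invented for this sketch, and reopening `Object` is done purely for illustration.

```ruby
# Truish receivers: `and` evaluates its argument, `or` ignores it.
class Object
  def mio_and(arg)
    arg.call
  end

  def mio_or(_arg)
    self
  end
end

# Falsy receivers: `and` ignores its argument, `or` evaluates it.
module Falsy
  def mio_and(_arg)
    self
  end

  def mio_or(arg)
    arg.call
  end
end

class NilClass
  include Falsy
end

class FalseClass
  include Falsy
end

evaluated = []
a = "yo".mio_or(-> { evaluated << :skipped; "hi" }) # argument never runs
b = nil.mio_or(-> { "hi" })
c = "yo".mio_and(-> { "hi" })
d = 1.mio_and(-> { 2.mio_or(-> { 3 }) })

puts a, b, c, d
```

As in Mio, no conditional appears anywhere: which branch runs is decided entirely by which object receives the message, and an argument is evaluated only if the receiving method asks for it.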

Using those two operators, we can implement if.

mio/if.mio

    # Implement if using boolean logic
    set_slot("if", method(
      # eval condition
      set_slot("condition", eval_arg(0))
      condition and( # if true
        eval_arg(1)
      )
      condition or( # if false (else)
        eval_arg(2)
      )
    ))

And now… holy magical pony!

test/mio/if.mio

    if(true,
      "condition is true" print,
      # else
      "nope" print
    )
    # => condition is true

    if(false,
      "nope" print,
      # else
      "condition is false" print
    )
    # => condition is false

if defined from inside our language!
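It is worth seeing why this only works with lazy arguments. The following Ruby sketch contrasts an eager `if`, where both branches are evaluated before the call, with a lazy one where branches are passed as lambdas. The `eager_if` and `lazy_if` helpers are hypothetical and exist only for this comparison.

```ruby
log = []

# Eager version: Ruby evaluates both branch expressions before the call,
# so both side effects happen no matter what the condition is.
def eager_if(condition, then_value, else_value)
  condition ? then_value : else_value
end

eager_if(true, (log << "then"), (log << "else"))
eager_log = log.dup

log.clear

# Lazy version: branches arrive unevaluated and only the chosen one runs.
def lazy_if(condition, then_branch, else_branch)
  condition ? then_branch.call : else_branch.call
end

lazy_if(true, -> { log << "then" }, -> { log << "else" })
lazy_log = log.dup

puts "eager ran: #{eager_log.inspect}"
puts "lazy ran:  #{lazy_log.inspect}"
```

In Mio every argument behaves like the lambdas above, which is what lets `if` be an ordinary method rather than a special form.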

BUT IT’S UGLY

All right… it’s working, but the syntax is not as nice as Awesome. An addition would be written as 1 +(2), for example, and we need to use set_slot for assignment. Nothing to impress your friends and foes.

To solve this problem, we can again borrow from Io and implement operator shuffling. This simply means reordering operators. During the parsing phase, we would turn 1 + 2 into 1 +(2). Same goes for ternary operators such as assignment: x = 1 would be rewritten as =(x, 1). This introduces syntactic sugar into our language without impacting its homoiconicity and awesomeness.

You can find all the source code for Mio under the code/mio directory and run its unit tests with the command: ruby -Itest test/mio_test.rb.
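As a rough idea of what such a rewrite could look like, here is a small Ruby sketch working on a flat list of tokens. It is not taken from Mio or Io: the `shuffle` helper is hypothetical, it handles a single assignment and the four arithmetic operators only, and it ignores operator precedence and nesting entirely.

```ruby
# Hypothetical operator shuffling on an array of string tokens.
def shuffle(tokens)
  operators = %w[+ - * /]

  # Assignment: `x = 1` becomes `=(x, 1)`.
  if tokens[1] == "="
    name, _eq, *value = tokens
    return "=(#{name}, #{shuffle(value)})"
  end

  out = []
  i = 0
  while i < tokens.size
    if operators.include?(tokens[i]) && i + 1 < tokens.size
      # Turn `+ 2` into the message `+(2)`.
      out << "#{tokens[i]}(#{tokens[i + 1]})"
      i += 2
    else
      out << tokens[i]
      i += 1
    end
  end
  out.join(" ")
end

puts shuffle(%w[1 + 2]) # prints 1 +(2)
puts shuffle(%w[x = 1]) # prints =(x, 1)
```

A real implementation would run on the message chain produced by the parser and would need precedence rules, since this sketch simply applies operators from left to right.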

FAREWELL!

That is all for now. I hope you enjoyed my book!

If you find an error or have a comment or suggestion, please send me an email at macournoyer@gmail.com.

If you end up creating a programming language let me know, I’d love to see it!

Thanks for reading.

- Marc