Sunday, June 1, 2008

Simple way of AST transformation using Scheme

Abstract
This article demonstrates simple way of transforming abstract syntax trees represented in S-expressions using Scheme programming language [1, 2]. Arithmetical expression evaluation domain is used as an example.

Problem definition
Imagine that we have following problem:
We need to build simple arithmetical expression processing engine. Expressions in our domain consists of integer numbers and operations of addition, subtraction and multiplication. We required to represent expression in some unified (canonical) way. We need to allow direct evaluation to show value. We are are required to generate C code for evaluating and printing the expression. We are required to print formula in infix way. Additionally we required to generate code for some stack machine that allows operations of pushing integer numbers to the stack, adding subtracting and multiplying and printing pushed values. Moreover, we don't have specification of machine codes or assembler for this machine right now, but we are required to start solving the problem...

Representations of arithmetical expressions
There are different possible ways of representation of expressions. Let's take into attention classical infix notation of some expression is 2+3*4-5. To interpret this expression we need information about operation priority and order of evaluation. This expression could be represented by following equivalent abstract syntax tree:
    expression
|
sub
/ \
add 5
/ \
2 mul
/ \
3 4
Where add/sub/mul is just selected shorthands for addition/substration/multiplication. To understand this tree we don't need information about order/priorities. This tree could be represented in alternative sequential form using S-expression:
(expression (sub (add 2 (mul 3 4)) 5))
For complex expressions indentation is commonly used to keep it human-readable:
(expression
(sub
(add
2
(mul 3 4))
5))
Let this notation will be our DSL to represent expressions.

Calculating value
Now let's try to solve problem of calculating expression value. We will use Scheme programming language [1,2]. Scheme was selected by following reasons:
  • Scheme allows its syntax extension; directly supports homogeneous metaprogramming
  • Scheme is very simple to learn and use
  • Scheme is minimal dialect of LISP that is standardised and high-portable
Because Scheme programs are also written in S-expressions we could say that programming in Scheme is performing directly in abstract syntax tree.
Basic idea of method of handling AST is following: Let define mul/add/sub/expression as scheme procedures and evaluate our DSL as scheme program:
(define (expression v) v)
(define (add x y) (+ x y))
(define (sub x y) (- x y))
(define (mul x y) (* x y))

(define (calculate-value expr)
(eval expr))

We've just got procedure calculate-value that calculates value of provided expression in run-time. Result 9 will be evaluated.if one call it like this:
(calculate-value '(expression (sub (add 2 (mul 3 4)) 5)) )
Same way we can solve other parts of our problem.

Generating printable infix representation
Here is code for generation of printable infix representation:
(define (expression v) v)
(define (add x y) (format "(~a+~a)" x y))
(define (sub x y) (format "(~a/~a)" x y))
(define (mul x y) (format "(~a*~a)" x y))

(define (to-infix expr)
(eval expr))
(to-infix '(expression (sub (add 2 (mul 3 4)) 5))) produces string "((2+(3*4))-5)"

Generating C code
Here is generation of C code:
(define (expression v)
(format
#include
int main()
{
printf(\"%d\",~a);
return 0;
}" v))
(define (add x y) (format "(~a+~a)" x y))
(define (sub x y) (format "(~a/~a)" x y))
(define (mul x y) (format "(~a*~a)" x y))

(define (generate-c expr)
(eval expr))
For our expression this code produces following code that could be directly passed to C-compiler:
#include 
int main()
{
printf("%d",((2+(3*4))/5));
return 0;
}
Note: in non-toys problems it may be better to consider of two-stage transformation: at first stage we can generate AST of C language. At second stage this AST could be transformed to plain C code. By implementing and testing two stages separately we are modularizing our meta-system and make it more maintainable.

Targeting to the stack machine
Target machine codes or assembler syntax not yet specified, so let's transform our expression to abstract stack machine AST. For our expression we expect to get something like this:
(stack-machine-code
(push-integer 2)
(push-integer 3)
(push-integer 4)
(mul)
(add)
(push-integer 5)
(sub)
(print)
)
When stack machine specification is available we can create transformation from this AST to representation required by specification.
Here is a code:
(define (expression v) `(stack-machine-code ,@(walk v) (print)))
(define (walk v)
(if (number? v)
`((push-integer ,v))
v))
(define (add x y) `(,@(walk x) ,@(walk y) (add)))
(define (sub x y) `(,@(walk x) ,@(walk y) (sub)))
(define (mul x y) `(,@(walk x) ,@(walk y) (mul)))

(define (to-stack-machine expr)
(eval expr))
alternative implementation without using splicing (@) is here:
(define (expression v) (append '((stack-machine-code)) (walk v) '((print))))
(define (walk v)
(if (number? v)
`((push-integer ,v))
v))
(define (add x y) (append (walk x) (walk y) '((add))))
(define (sub x y) (append (walk x) (walk y) '((sub))))
(define (mul x y) (append (walk x) (walk y) '((mul))))

(define (to-stack-machine expr)
(eval expr))
Making it practical and... more simple
Above four transformations are defined in obvious way by mapping AST nodes to Scheme procedures. However this solution is practically unusable: if we use these transformations simultaneously we get into trouble of overlapping of different definitions of add/sub/mul/expression.
Possible solution is to make definition local:
(define (calculate-value expr) ;; INVALID CODE
(define (expression v) v))
(define (add x y) (+ x y))
(define (sub x y) (- x y))
(define (mul x y) (* x y))
(eval expr))
Unfortunately (or fortunately?) this doesn't work: by Scheme conventions eval "doesn't see" local definitions. That's not big deal: it is allowed to directly specify environment as a second argument of eval and create isolated environments. But doing such low-level things every time is not a good idea. So I created simple macro define-ast-handler that is intended to do all this stuff. Final code is here:
(load "define-ast-handler.scm")

(define-ast-handler (calculate-value)
(define (expression v) v)
(define (add x y) (+ x y))
(define (sub x y) (- x y))
(define (mul x y) (* x y))
)

(define-ast-handler (to-infix)
(define (expression v) v)
(define (add x y) (format "(~a+~a)" x y))
(define (sub x y) (format "(~a/~a)" x y))
(define (mul x y) (format "(~a*~a)" x y))
)
;; ... generate-c and to-stack-machine are defined similar way
This code declares transformation procedures named calculate-value, to-infix, etc. There is no more overlapping of names. As you can see there is no more "eval" there. So final version became more simpler in comparison with original.

Substituting of variable
Now Imagine that our requirements was extended. Now we are required to handle 'x' variable in expression notation and allow its substitution during evaluation.
All what we need is to add argument x to calculate-value procedure:
(define-ast-handler (calculate-value x)
(define (expression v) v))
(define (add a b) (+ a b))
(define (sub a b) (- a b))
(define (mul a b) (* a b))
)
Now we can use it by passing value of x in runtime:
(calculate-value 3 '(add 2 x)) ;; --> evaluates to 5
For purposes of transforming AST to printable infix notation we might not want to substitute variable. We just want to print x symbol. All what we need is to define it's printable value:
(define-ast-handler (to-infix)
(define x "x")
(define (expression v) v)
(define (add a b) (format "(~a+~a)" a b))
(define (sub a b) (format "(~a/~a)" a b))
(define (mul a b) (format "(~a*~a)" a b))
)
Limitations
Not any AST of arbitrary DSL could be transformed easily this way. Here is discussion of some limitations:

We might be required to walk tree other way but walking order of syntax tree is locked by Scheme.

Syntax is not directly mappable to procedure application. In this situation we may consider to map nodes not to procedures but macros using local define-syntax definitions.

Not all nodes can be allowed to be included to each other. In this case define-syntax is able to help define rules.

We have no access to the parent nodes while evaluating child nodes. In this case closures might be helpful. Method is following: for child node we create closure and when parent node is parsed child's closure is evaluated with passing necessary info via parameters.

We use scheme-report-environment to evaluate user's AST. That's mean we haven't clean DSL. For our implementation of arithmetic expression DSL folowing expression is sill valid: (expression (plus 3 (if (eq? x 1) 2 3))). In some cases this can be feature, not bug ;). But in some cases we might need to restrict syntax by some reason. We can try overcome this issue by modifying define-ast-handler macro and using null-environment with some additional clean-up instead of scheme-report-environment. However this is still non-ideal solution: in this case bodies of node-handling procedures don't have access to scheme primitives.

Conclusion
Method of transforming abstract syntax tree has just been presented. Direct reusing of Scheme's infrastructure allows error diagnostic and debugging for free. Method has significant limitations but primary its advantage is simplicity.

References
1. Wikipedia article: http://en.wikipedia.org/wiki/Scheme_(programming_language)
2. For this article PLT Scheme (mzscheme) was used: http://www.plt-scheme.org/

No comments: