Sunday, November 15, 2009

Runtime IL code generation made simple

An introductory-level article about dynamic runtime code generation in .NET using System.Reflection.Emit. Starting with some examples of successful stories article gives simple informal introduction to Common Intermediate Language and practical hints how to learn it fast. Article gives examples of code, links to useful tools and libraries, DOs and DON’Ts, highlights some interesting use cases and finally compares System.Reflection.Emit with System.CodeDom.

Download article:

Sunday, June 22, 2008

Distributed systems design guidelines

Building high performance high available fault tolerant and scalable distributed system is a challenge problem.

Following are three primary design guidelines:

1. Avoid centralized components
2. Avoid centralized data
3. Avoid centralized processes

Sounds obviously, doesn't it?

Other design issues (from the "Distributed Systems: Principles and Paradigms" book):

1. No machine has complete information about the system state
2. Machines make decision based only on local information
3. Failure of one machines does not ruin the algorithm
4. There is no explicit assumption that a global clock exists (it's impossible to get all the clocks exactly synchronized)

Tuesday, June 3, 2008

Using .NET as a target plaform in staged metaprogramming

Abstract
This article is a brief proof-of-concept which illustrates how to use .NET as a target platform in staged metaprogramming. Arithmetic expression DSL defined in previous article is used as an example. Process of building of simple compiler of our DSL is presented.

Definition of problem
Our arithmetical expression DSL allows to specify expressions like this:
(expression (sub (add 2 (mul 3 4)) 5))
Let's make compiler of our DSL to the CLR (runtime of the .NET platform).

Quick introduction to the CLR
AFAIK CLR core language is the Intermediate Language (IL). It can be treated as a stack-based assembler.
Here is a short introduction to the subset of IL we are interested in:

IL program text consists of instructions.
ldc.i8 value instruction is used to push 64-bit integer constant value to the stack.
call instruction is used to call functions: callee gets arguments from the stack and if need it lefts return value at the top at exit.
We can use add/sub/mul instructions to add, subtract and multiply values.
void[ mscorlib]System.Console::WriteLine(int64) function allows us to display result.
ret instruction is used to return from the function.

IL code can be compiled to an executable (called assembly) using ilasm.exe utility.

If you have experience with C# (or other .NET high-level language) then easiest way to get idea what IL looks like is to write some C# code and look what IL code is produced. This can be done using IL disassembler utility (ildasm.exe) or .NET Reflector.

Strategy of compiling
What is easiest way to deliver our expression DSL to CLR? In previous article we defined transformation of our DSL to the abstract stack machine AST. What we need is to transform it to the IL.

Let's do this in two steps: first step will be transformation of abstract stack machine AST to IL AST. Second step will be transformation of IL AST to IL text.
After that we could use ilasm.exe to generate executable file (.NET assembly) from IL text.

Of course one could implement all of this in single step. However different steps is better solution because it allows us to separate aspects of transforming and text-generation. These aspects could be implemented and tested separately. Moreover implementation of the second step has independent value: it can be reused in the future for the purposes of other tasks.

First step: transformation of stack machine AST to AST of IL:
We need transform from the abstract stack machine code like this:
(stack-machine-code
(push-integer 2)
(push-integer 3)
(push-integer 4)
(mul)
(add)
(push-integer 5)
(sub)
(print)
)
to the IL AST like this:
(assembly Example
(method 'main
(entrypoint)
(body
(il
(ldc.i8 2)
(ldc.i8 3)
(ldc.i8 4)
(mul)
(add)
(ldc.i8 5)
(sub)
(call "void [mscorlib]System.Console::WriteLine(int64)")
(ret)))))

Here we've specified that we need to create an executable assembly with method (function) main which is entry point and which contains specified IL code as a body. Note: this is not standard representation but just-invented AST of the IL which reflects our current thinking about a subset of the CLR platform.

We can peform this transformation using following code:
(define-ast-handler (stack-machine-to-il)
(define (stack-machine-code . instructions)
`(assembly Example
(method 'main
(entrypoint)
(body
(il ,@instructions (ret))))))
(define (push-integer value) `(ldc.i8 ,value))
(define (mul) '(mul))
(define (add) '(add))
(define (sub) '(sub))
(define (print) '(call "void [mscorlib]System.Console::WriteLine(int32)"))
)
Second step: generating IL text
Generating IL text is simply transforming fragments of IL AST to appropriate strings. With the input of the IL AST shown in above section we expect to got following text:
.assembly extern mscorlib {}
.assembly Example { .ver 1:0:0:0 }
.method static void main() {
.entrypoint
ldc.i8 2
ldc.i8 3
ldc.i8 4
mul
add
ldc.i8 5
sub
call void [mscorlib]System.Console::WriteLine(int32)
ret
}
Here is a code that does all this stuff:
(require (lib "list.ss"))
(define (concat-list l) (foldr string-append "" l))

(define-ast-handler (to-il-text)
(define concat-list ,concat-list)
(define (assembly name . rest)
(format
".assembly extern mscorlib {}\n.assembly ~a { .ver 1:0:0:0 }\n~a"
name
(concat-list rest)))
(define (method name . rest)
(format ".method static void ~s() {\n~a\n}" name (concat-list rest)))
(define (body il) il)
(define (il . instructions) (concat-list instructions))
(define (ldc.i8 value) (format "ldc.i8 ~a\n" value))
(define (mul) (format "mul\n"))
(define (add) (format "add\n"))
(define (sub) (format "sub\n"))
(define (call name) (format "call ~a\n" name))
(define (ret) (format "ret\n"))
(define (entrypoint) ".entrypoint\n")
)
Compiling
Compiling is simply performed by invocation of ilasm.exe utility.
This can be done like this:
(define (compile il-text)
(if (file-exists? "test.il") (delete-file "test.il"))
(with-output-to-file example.il" (lambda () (write-string il-text)))
(system "ilasm.exe example.il"))
Now we have executable file example.exe.

Conclusion
We've just implemented proof-of-concept compiler of custom domain specific language to the .NET platform. Simple method of transformation of abstract syntax trees was used to map our language to IL. Speaking other words we've actively used staged metaprogramming.
Doing this in top-down way we haven't needed to know all details and all features of the target platform. For large set of real-word tasks we need just small subset of target platform.
Of course to build effective implementation we need deep understanding of target platform and underlying hardware. Top-down way can help us in this challenge:
  • It allows us to learn target platform in systematic way in the context of our task
  • When our knowledge about target platform is increased and we found new way to build more effective solution we can refactor or even completely rewrite low-level implementation without affecting the high level

Sunday, June 1, 2008

Simple way of AST transformation using Scheme

Abstract
This article demonstrates simple way of transforming abstract syntax trees represented in S-expressions using Scheme programming language [1, 2]. Arithmetical expression evaluation domain is used as an example.

Problem definition
Imagine that we have following problem:
We need to build simple arithmetical expression processing engine. Expressions in our domain consists of integer numbers and operations of addition, subtraction and multiplication. We required to represent expression in some unified (canonical) way. We need to allow direct evaluation to show value. We are are required to generate C code for evaluating and printing the expression. We are required to print formula in infix way. Additionally we required to generate code for some stack machine that allows operations of pushing integer numbers to the stack, adding subtracting and multiplying and printing pushed values. Moreover, we don't have specification of machine codes or assembler for this machine right now, but we are required to start solving the problem...

Representations of arithmetical expressions
There are different possible ways of representation of expressions. Let's take into attention classical infix notation of some expression is 2+3*4-5. To interpret this expression we need information about operation priority and order of evaluation. This expression could be represented by following equivalent abstract syntax tree:
    expression
|
sub
/ \
add 5
/ \
2 mul
/ \
3 4
Where add/sub/mul is just selected shorthands for addition/substration/multiplication. To understand this tree we don't need information about order/priorities. This tree could be represented in alternative sequential form using S-expression:
(expression (sub (add 2 (mul 3 4)) 5))
For complex expressions indentation is commonly used to keep it human-readable:
(expression
(sub
(add
2
(mul 3 4))
5))
Let this notation will be our DSL to represent expressions.

Calculating value
Now let's try to solve problem of calculating expression value. We will use Scheme programming language [1,2]. Scheme was selected by following reasons:
  • Scheme allows its syntax extension; directly supports homogeneous metaprogramming
  • Scheme is very simple to learn and use
  • Scheme is minimal dialect of LISP that is standardised and high-portable
Because Scheme programs are also written in S-expressions we could say that programming in Scheme is performing directly in abstract syntax tree.
Basic idea of method of handling AST is following: Let define mul/add/sub/expression as scheme procedures and evaluate our DSL as scheme program:
(define (expression v) v)
(define (add x y) (+ x y))
(define (sub x y) (- x y))
(define (mul x y) (* x y))

(define (calculate-value expr)
(eval expr))

We've just got procedure calculate-value that calculates value of provided expression in run-time. Result 9 will be evaluated.if one call it like this:
(calculate-value '(expression (sub (add 2 (mul 3 4)) 5)) )
Same way we can solve other parts of our problem.

Generating printable infix representation
Here is code for generation of printable infix representation:
(define (expression v) v)
(define (add x y) (format "(~a+~a)" x y))
(define (sub x y) (format "(~a/~a)" x y))
(define (mul x y) (format "(~a*~a)" x y))

(define (to-infix expr)
(eval expr))
(to-infix '(expression (sub (add 2 (mul 3 4)) 5))) produces string "((2+(3*4))-5)"

Generating C code
Here is generation of C code:
(define (expression v)
(format
#include
int main()
{
printf(\"%d\",~a);
return 0;
}" v))
(define (add x y) (format "(~a+~a)" x y))
(define (sub x y) (format "(~a/~a)" x y))
(define (mul x y) (format "(~a*~a)" x y))

(define (generate-c expr)
(eval expr))
For our expression this code produces following code that could be directly passed to C-compiler:
#include 
int main()
{
printf("%d",((2+(3*4))/5));
return 0;
}
Note: in non-toys problems it may be better to consider of two-stage transformation: at first stage we can generate AST of C language. At second stage this AST could be transformed to plain C code. By implementing and testing two stages separately we are modularizing our meta-system and make it more maintainable.

Targeting to the stack machine
Target machine codes or assembler syntax not yet specified, so let's transform our expression to abstract stack machine AST. For our expression we expect to get something like this:
(stack-machine-code
(push-integer 2)
(push-integer 3)
(push-integer 4)
(mul)
(add)
(push-integer 5)
(sub)
(print)
)
When stack machine specification is available we can create transformation from this AST to representation required by specification.
Here is a code:
(define (expression v) `(stack-machine-code ,@(walk v) (print)))
(define (walk v)
(if (number? v)
`((push-integer ,v))
v))
(define (add x y) `(,@(walk x) ,@(walk y) (add)))
(define (sub x y) `(,@(walk x) ,@(walk y) (sub)))
(define (mul x y) `(,@(walk x) ,@(walk y) (mul)))

(define (to-stack-machine expr)
(eval expr))
alternative implementation without using splicing (@) is here:
(define (expression v) (append '((stack-machine-code)) (walk v) '((print))))
(define (walk v)
(if (number? v)
`((push-integer ,v))
v))
(define (add x y) (append (walk x) (walk y) '((add))))
(define (sub x y) (append (walk x) (walk y) '((sub))))
(define (mul x y) (append (walk x) (walk y) '((mul))))

(define (to-stack-machine expr)
(eval expr))
Making it practical and... more simple
Above four transformations are defined in obvious way by mapping AST nodes to Scheme procedures. However this solution is practically unusable: if we use these transformations simultaneously we get into trouble of overlapping of different definitions of add/sub/mul/expression.
Possible solution is to make definition local:
(define (calculate-value expr) ;; INVALID CODE
(define (expression v) v))
(define (add x y) (+ x y))
(define (sub x y) (- x y))
(define (mul x y) (* x y))
(eval expr))
Unfortunately (or fortunately?) this doesn't work: by Scheme conventions eval "doesn't see" local definitions. That's not big deal: it is allowed to directly specify environment as a second argument of eval and create isolated environments. But doing such low-level things every time is not a good idea. So I created simple macro define-ast-handler that is intended to do all this stuff. Final code is here:
(load "define-ast-handler.scm")

(define-ast-handler (calculate-value)
(define (expression v) v)
(define (add x y) (+ x y))
(define (sub x y) (- x y))
(define (mul x y) (* x y))
)

(define-ast-handler (to-infix)
(define (expression v) v)
(define (add x y) (format "(~a+~a)" x y))
(define (sub x y) (format "(~a/~a)" x y))
(define (mul x y) (format "(~a*~a)" x y))
)
;; ... generate-c and to-stack-machine are defined similar way
This code declares transformation procedures named calculate-value, to-infix, etc. There is no more overlapping of names. As you can see there is no more "eval" there. So final version became more simpler in comparison with original.

Substituting of variable
Now Imagine that our requirements was extended. Now we are required to handle 'x' variable in expression notation and allow its substitution during evaluation.
All what we need is to add argument x to calculate-value procedure:
(define-ast-handler (calculate-value x)
(define (expression v) v))
(define (add a b) (+ a b))
(define (sub a b) (- a b))
(define (mul a b) (* a b))
)
Now we can use it by passing value of x in runtime:
(calculate-value 3 '(add 2 x)) ;; --> evaluates to 5
For purposes of transforming AST to printable infix notation we might not want to substitute variable. We just want to print x symbol. All what we need is to define it's printable value:
(define-ast-handler (to-infix)
(define x "x")
(define (expression v) v)
(define (add a b) (format "(~a+~a)" a b))
(define (sub a b) (format "(~a/~a)" a b))
(define (mul a b) (format "(~a*~a)" a b))
)
Limitations
Not any AST of arbitrary DSL could be transformed easily this way. Here is discussion of some limitations:

We might be required to walk tree other way but walking order of syntax tree is locked by Scheme.

Syntax is not directly mappable to procedure application. In this situation we may consider to map nodes not to procedures but macros using local define-syntax definitions.

Not all nodes can be allowed to be included to each other. In this case define-syntax is able to help define rules.

We have no access to the parent nodes while evaluating child nodes. In this case closures might be helpful. Method is following: for child node we create closure and when parent node is parsed child's closure is evaluated with passing necessary info via parameters.

We use scheme-report-environment to evaluate user's AST. That's mean we haven't clean DSL. For our implementation of arithmetic expression DSL folowing expression is sill valid: (expression (plus 3 (if (eq? x 1) 2 3))). In some cases this can be feature, not bug ;). But in some cases we might need to restrict syntax by some reason. We can try overcome this issue by modifying define-ast-handler macro and using null-environment with some additional clean-up instead of scheme-report-environment. However this is still non-ideal solution: in this case bodies of node-handling procedures don't have access to scheme primitives.

Conclusion
Method of transforming abstract syntax tree has just been presented. Direct reusing of Scheme's infrastructure allows error diagnostic and debugging for free. Method has significant limitations but primary its advantage is simplicity.

References
1. Wikipedia article: http://en.wikipedia.org/wiki/Scheme_(programming_language)
2. For this article PLT Scheme (mzscheme) was used: http://www.plt-scheme.org/

Wednesday, May 28, 2008

Introduction to staged metaprogramming and metalinguistic abstraction

What is metaprogramming?
According to Wikipedia definition of metaprogramming is the writing of computer programs that write or manipulate other programs (or themselves) as their data... In many cases, this allows programmers to get more done in the same amount of time as they would take to write all the code manually.
Metaprogramming is not a "silver bullet" but it's a well known as a technique that could significantly improve productivity of programmers.
Probably you don't write machine code manually. You use the best programming language insert_name_here. Compiler or interpreter of your favorite programming language is the brightest example of metaprogramming that really improves your productivity.
Moreover I suppose that metaprogramming is the only known way to significantly improve productivity by reducing complexity1.

Metaprogramming and software development industry
Why metaprogramming is not in mainstream now? I tried to address this question to my colleagues and got following answers:
  • "Metaprogramming is only for GURUs"
  • "To do metaprogramming we need to write parser/compiler/interpreter/translator/whatever. These are too complicated"
  • "Programs that writes other program are hard to understand and debug"
  • "Metaprogramming - most people don't know what is it..... This is both reason and consequence."
  • "Many people think that metaprograming means that C++ templates will be used :)"
  • "It is hard to find people that can do this. It's more practical to hire 100 Indians and solve the task in traditional way"
  • "Metaprogramming is suitable only for large-scale projects/systems/problems"
  • "It's idiotically (and horrible) to learn hundred languages..."
  • Advertising from big companies enforces to use other methods"
  • "Lack of tools. Lack of theory"
Some of those statements are true. But some of them are just only commonly shared myth.
One of purposes of this series of articles is to collect and systematize materials which show following:
  • Metaprogramming can be simple. It can be available not only for GURUs
  • Metaprogramming can be effectively used for different scale of domains and ranges of sizes of problems
  • Entering to mainstream is happening right now together with rising of complexity of problems
Metalinguistic abstraction
Metalinguistic abstraction is the process of solving complex problems by creating a new language or vocabulary to better understand the problem space. Such languages in recent time are called domain-specific languages (DSL). While DSL is relatively new term metalinguistic abstraction is old and well-known way of solving problems2.
Metalinguistic abstraction uses language as main instrument of abstraction in the same way as procedural approach uses procedures, OOP uses objects/classes and AOP uses aspects.
By using metalinguistic abstraction we use metaprogramming in cascaded way: we translate program written in high-level DSL to program in another DSL, than we translate to lower-level DSL's until we reach target platform. This is called staged metaprogramming. Algorithm of solving problems using this approach in top-down way can look like this:
  1. Describe solution of you problem using high level DSL
  2. Chose the target platform. This can be whatever you want or whatever required (examples: other DSL, machine language, general purpose programming language)
  3. If your DSL is easily and naturally can be mapped to platform primitives - do this and finish
  4. Substitute original problem to (smaller) problem of making DSL runnable at selected platform. Goto 1.
In simple cases DSLs can be stacked one onto another. In more complex cases DSLs could create hierarchy.
What are benefits of metalinguistic abstraction?
  • Programs can be more readable, supportable if they are specified using language which is close to requirements
  • Avoiding of duplication: ability to follow DRY (Don't Repeat Yourself) rule.
  • Solutions tend to be more ready to radical and unpredicted requirement changes.
    Why this can be true? Functional requirements mostly reside at higher metalinguistic abstraction layers. These are "encoded" in platform independent and close-to-domain form.
    Non-functional requirements tend to reside at lower metalinguistic abstraction layers. Even unpredictable requirement of moving program to exotic platform can often be satisfied by replacing low-level layers without rewriting or affecting high-level.
Is really metaprogramming complicated?
Any simple problem can be solved in complex way using any technique. Metaprogramming isn't an exception. Just don't go this way. If one doesn't want go this way (s)he should not:
  • Invent yet another general purpose language while solving problem of controlling coffee-machine
  • Invent exotic syntax for you languages that is hard to parse
  • Select tools/languages/frameworks which doesn't allow you to easily manipulate/translate syntax trees smoothly and naturally without any overhead (Example of such extra overhead are maintaining classes for any AST node with stuff to implement visitor pattern, manually-supported bridge class hierarchies, so on...)
  • Select tools/languages/instruments that require 5+ years training before programmer can do something serious.
To avoid of building extra-complicated solutions it's reasonable to:
  • Put to DSL as little as possible.
  • Think about any DSL in terms of its AST and keep its syntax as much close to AST as possible. This might allow you to avoid writing and debugging parsers at all.
  • Select simple easy-to-learn tools and languages (or subsets of languages) to allow quick joining of new people to the project
  • Choose one of "programmable programming languages". I.e. extensible language that directly supports homogeneous metaprogramming3.
Should we minimize the number of metalinguistic abstraction layers in order to reduce complexity? If your answer is "yes" consider following analogy: should we minimize the number of functions in the structural approach or minimize the number of classes in OOP-way to reduce complexity? Of course this answer is "no". We can obtain benefits with clean-fine-granted design with thin and simple high-isolated metalinguistic abstraction layers. But of course complexity of support should not grow exponentially with count of these languages. For example procedural programming languages allow easy splitting procedure into several parts when needed. Analogous operation should be easy against DSL.

To be continued...

References:
1. Similar idea is highlighted in the article: "Using a hierarchy of Domain Specific Languages in complex software systems design" by V.S. Lugovsky
2. "Structure and Interpretation of Computer Programs" book aka SICP
3. A Taxonomy of meta-programming systems -
4. Lambda the Ultimate
- The Programming Languageds Weblog