In principle it takes an array of instructions (INST) and executes them one by one, operating on the current evaluation stack (or stack for short). Most of the execution proceeds in order, because the compiler arranges the code to be as linear as possible. This makes the execution loop relatively simple and efficient: it is just a large while(dont_need_to_exit) execute_next_instruction;. The actual instruction execution is one large switch on the instruction opcode.
A typical example of how this is implemented in practice can be obtained by running lmawk -Wdump -f test.awk on a simple arithmetic script:
| awk | asm (VM instructions) |
|---|---|
| BEGIN { a = 3 + 4 * 5 } | BEGIN |
| | 000 pusha a |
| | 002 pushd 3 |
| | 004 pushd 4 |
| | 006 pushd 5 |
| | 008 mul |
| | 009 add |
| | 010 assign |
| | 011 pop |
| | 012 exit0 |
First the lvalue (target variable, left side) of the assignment is pushed, then the expression (right side). By the time mul runs, the stack is, from top to bottom: {5, 4, 3, a}; the top of the stack is 5 and the second element is 4. Mul replaces these two elements with 20, decreasing the stack size by one and leaving the result on the top. Next, add does a similar job, replacing the top two items of the stack {20, 3} with their sum, 23. Finally, assign runs on the stack {23, a}, removing both items and copying the value 23 to the global variable a. It then puts the result back on the top of the stack, leaving the stack as {23}: this is the result (output) of the assignment operation. Since the script doesn't use the result, it runs a pop that removes and discards the top item, leaving the stack empty. Since the script has no main or END parts, it can quit at this point by executing the exit0 instruction (exiting with value 0, the implicit exit).
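The walkthrough above can be reproduced with a toy stack machine. This is only a minimal sketch under assumptions: the opcode names mirror the dump, but the `cell_t` and `inst_t` types, the fixed-size stack and the function names are hypothetical and are not libmawk's real data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes named after the dump; not libmawk's real enum. */
enum { OP_PUSHA, OP_PUSHD, OP_MUL, OP_ADD, OP_ASSIGN, OP_POP, OP_EXIT0 };

/* A stack cell holds either a number or the address of a variable. */
typedef struct { double num; double *addr; } cell_t;

/* An instruction: opcode plus an optional numeric or address operand. */
typedef struct { int op; double num; double *addr; } inst_t;

/* Runs code until OP_EXIT0 and returns the final stack depth. */
int toy_run(const inst_t *code)
{
    cell_t stack[16];
    int sp = 0;                          /* number of cells in use */
    const inst_t *ip = code;             /* "next instruction" pointer */

    for (;;) {
        const inst_t *i = ip++;
        switch (i->op) {
        case OP_PUSHA:                   /* push the address of the lvalue */
            assert(sp < 16);
            stack[sp].addr = i->addr; stack[sp].num = 0; sp++;
            break;
        case OP_PUSHD:                   /* push a numeric constant */
            assert(sp < 16);
            stack[sp].addr = NULL; stack[sp].num = i->num; sp++;
            break;
        case OP_MUL:                     /* replace top two items with their product */
            assert(sp >= 2);
            stack[sp - 2].num *= stack[sp - 1].num; sp--;
            break;
        case OP_ADD:                     /* replace top two items with their sum */
            assert(sp >= 2);
            stack[sp - 2].num += stack[sp - 1].num; sp--;
            break;
        case OP_ASSIGN:                  /* store value, leave it as the result */
            assert(sp >= 2);
            *stack[sp - 2].addr = stack[sp - 1].num;
            stack[sp - 2].num = stack[sp - 1].num;
            stack[sp - 2].addr = NULL;
            sp--;
            break;
        case OP_POP:                     /* discard the unused result */
            assert(sp >= 1);
            sp--;
            break;
        case OP_EXIT0:
            return sp;
        default:
            return -1;                   /* unknown opcode */
        }
    }
}

/* Executes the equivalent of BEGIN { a = 3 + 4 * 5 } and returns a. */
double toy_demo_arith(void)
{
    double a = 0;
    const inst_t code[] = {
        { OP_PUSHA, 0, &a }, { OP_PUSHD, 3, NULL }, { OP_PUSHD, 4, NULL },
        { OP_PUSHD, 5, NULL }, { OP_MUL, 0, NULL }, { OP_ADD, 0, NULL },
        { OP_ASSIGN, 0, NULL }, { OP_POP, 0, NULL }, { OP_EXIT0, 0, NULL }
    };
    int depth = toy_run(code);
    assert(depth == 0);                  /* the stack is empty at exit0 */
    return a;
}
```

Calling `toy_demo_arith()` runs the nine instructions in order and returns the value stored in `a`. The stack is empty when exit0 is reached, as described above.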
NOTE: currently there's absolutely no optimization in the parser: everything is calculated as written in the script and some values are saved just to be discarded by the next instruction.
An interesting and important feature of execute_() is that it can save all states and return to the caller at any point of the execution, i.e. between any two instructions in the code. It can also resume execution from the next instruction. This gives the host application full control over scheduling the script, while the script itself can be built of sequential, blocking instructions.
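One way to picture suspend and resume is to keep every piece of interpreter state in a structure instead of in C local variables, so the loop can return between any two instructions and pick up later. The sketch below only illustrates that idea: the `r_vm_t` structure, the instruction budget and all names are assumptions, not libmawk's API.

```c
#include <assert.h>

/* Hypothetical opcodes for the illustration. */
enum { R_PUSH, R_ADD, R_HALT };
enum { R_SUSPENDED, R_DONE };

typedef struct { int op; double num; } r_inst_t;

/* All execution state lives here, none of it on the C stack. */
typedef struct {
    const r_inst_t *code;
    int ip;                 /* index of the next instruction */
    double stack[16];
    int sp;                 /* number of stack cells in use */
} r_vm_t;

/* Executes at most `budget` instructions, then returns to the caller.
   Calling it again continues from the saved ip. */
int r_run(r_vm_t *vm, int budget)
{
    while (budget-- > 0) {
        const r_inst_t *i = &vm->code[vm->ip];
        switch (i->op) {
        case R_PUSH:
            assert(vm->sp < 16);
            vm->stack[vm->sp++] = i->num;
            vm->ip++;
            break;
        case R_ADD:
            assert(vm->sp >= 2);
            vm->stack[vm->sp - 2] += vm->stack[vm->sp - 1];
            vm->sp--;
            vm->ip++;
            break;
        case R_HALT:
            return R_DONE;
        }
    }
    return R_SUSPENDED;     /* budget used up between two instructions */
}

/* Runs 1 + 2 + 3 in slices of two instructions; returns the slice count. */
int r_demo(double *result)
{
    static const r_inst_t code[] = {
        { R_PUSH, 1 }, { R_PUSH, 2 }, { R_ADD, 0 },
        { R_PUSH, 3 }, { R_ADD, 0 }, { R_HALT, 0 }
    };
    r_vm_t vm;
    int slices = 1;

    vm.code = code; vm.ip = 0; vm.sp = 0;
    while (r_run(&vm, 2) == R_SUSPENDED)
        slices++;           /* the host could do other work here */
    *result = vm.stack[vm.sp - 1];
    return slices;
}
```

Between two calls to `r_run()` the host is free to do anything else. The script code itself stays sequential and knows nothing about the slicing.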
Some of the above are implemented using conditional and unconditional jumps to direct addresses (the first column of the asm listing). For example, a simple if is compiled into code that contains two jumps:
| awk | asm (VM instructions) |
|---|---|
| BEGIN { if (bool) a = 6 else a = 7 } | BEGIN |
| | 000 pushi bool |
| | 002 jz 012 |
| | 004 pusha a |
| | 006 pushd 6 |
| | 008 assign |
| | 009 pop |
| | 010 jmp 018 |
| | 012 pusha a |
| | 014 pushd 7 |
| | 016 assign |
| | 017 pop |
| | 018 exit0 |
The first one is a conditional jump, "jump if [top of the stack is] zero" (jz): this makes the VM jump to the else branch at address 012. The then branch ends in an unconditional jump (jmp 018) to the next instruction after the if (which is the implicit exit in this example), bypassing the code of the else branch.
A jump is carried out by a simple modification of the "next instruction" pointer before running the next iteration of the execution loop.
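A minimal sketch of this mechanism follows. It is an illustration under assumptions: addresses are plain instruction indices rather than the offsets shown in the dump, `J_STORE` is a hypothetical shortcut for the pusha/assign/pop sequence, and jz is assumed to pop the condition it tests.

```c
#include <assert.h>

/* Hypothetical opcodes; J_STORE stands in for pusha + assign + pop. */
enum { J_PUSHD, J_JZ, J_JMP, J_STORE, J_EXIT0 };

typedef struct { int op; double arg; } j_inst_t;

/* Runs code until J_EXIT0 and returns the value of the variable a. */
double j_run(const j_inst_t *code)
{
    double stack[8], a = 0;
    int sp = 0, ip = 0;

    for (;;) {
        const j_inst_t *i = &code[ip++];     /* default: continue with the next one */
        switch (i->op) {
        case J_PUSHD:
            assert(sp < 8);
            stack[sp++] = i->arg;
            break;
        case J_JZ:                           /* conditional: jump if top is zero */
            assert(sp >= 1);
            if (stack[--sp] == 0)
                ip = (int)i->arg;            /* overwrite the next-instruction index */
            break;
        case J_JMP:                          /* unconditional jump */
            ip = (int)i->arg;
            break;
        case J_STORE:
            assert(sp >= 1);
            a = stack[--sp];
            break;
        case J_EXIT0:
            return a;
        default:
            assert(0);                       /* unknown opcode */
            return a;
        }
    }
}

/* Equivalent of BEGIN { if (cond) a = 6 else a = 7 }. */
double j_demo(double cond)
{
    j_inst_t code[] = {
        /* 0 */ { J_PUSHD, cond },
        /* 1 */ { J_JZ, 5 },                 /* to the else branch */
        /* 2 */ { J_PUSHD, 6 },
        /* 3 */ { J_STORE, 0 },
        /* 4 */ { J_JMP, 7 },                /* skip the else branch */
        /* 5 */ { J_PUSHD, 7 },
        /* 6 */ { J_STORE, 0 },
        /* 7 */ { J_EXIT0, 0 }
    };
    return j_run(code);
}
```

Neither jump needs special handling in the loop: the loop always fetches whatever `ip` points at, so a jump is just an assignment to `ip`.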
The original mawk implementation simply called mawk_execute_() recursively. This meant the C compiler took care of saving all internal states on the C stack for the detour. However, this wouldn't allow the code to be suspended during such a detour, as it would be problematic to rebuild the C stack on a resume.
Thus libmawk's mawk_execute_() does not recurse on C level but on VM level. For example when a function is called (using the call instruction):
Upon a ret instruction from the function:
It may be that the execution is interrupted in the middle of running a large block of code, for example in BEGIN. The top of the stack holds the current execution state so that mawk_execute_() will be able to continue execution. The application may decide to run an awk function before resuming the code: this operation would push a new set of execution states on top of the stack and call mawk_execute_(). When the current state finishes at the _RET instruction, mawk_execute_() would take the next frame from the stack and would automatically resume execution of the interrupted BEGIN block. This would cause the return value of the function to be lost, and BEGIN would be resumed as a side effect of the function call!
To avoid such confusion, any new entry into mawk_execute_() is required to push two sets of states: an EXEST_EXIT and the actual state it wants to "resume" at (start execution at). When mawk_execute_() hits the _RET instruction in the above example, it does pop the next frame, but that frame is the EXEST_EXIT, which causes it to return to the caller immediately. This leaves the stack exactly as it looked before the function call, and the application may later decide to resume execution.
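The effect of the extra exit frame can be shown with a small model. This is a sketch under assumptions, not libmawk code: `FRAME_EXIT` plays the role of EXEST_EXIT, frames live on a separate array instead of the evaluation stack, and `X_WORK` is a placeholder for any ordinary instruction.

```c
#include <assert.h>
#include <stddef.h>

enum { X_WORK, X_RET };                  /* hypothetical opcodes */
enum { FRAME_EXIT, FRAME_RESUME };       /* FRAME_EXIT models EXEST_EXIT */

typedef struct { int kind; const int *code; int ip; } frame_t;

typedef struct {
    frame_t frames[8];
    int depth;                           /* number of frames on the stack */
    int work_done;                       /* executed X_WORK instructions */
} x_vm_t;

void x_push(x_vm_t *vm, int kind, const int *code, int ip)
{
    assert(vm->depth < 8);
    vm->frames[vm->depth].kind = kind;
    vm->frames[vm->depth].code = code;
    vm->frames[vm->depth].ip = ip;
    vm->depth++;
}

/* Pops a frame and runs it up to its X_RET, then pops the next frame.
   Returns to the caller as soon as an exit frame is popped. */
void x_execute(x_vm_t *vm)
{
    while (vm->depth > 0) {
        frame_t f = vm->frames[--vm->depth];
        if (f.kind == FRAME_EXIT)
            return;
        while (f.code[f.ip] != X_RET) {
            vm->work_done++;
            f.ip++;
        }
    }
}

/* Models a function call made while BEGIN is suspended at instruction 1.
   Returns the amount of work done by the call; *depth_after receives
   the number of frames left on the stack. */
int x_demo(int with_exit, int *depth_after)
{
    static const int begin_code[] = { X_WORK, X_WORK, X_WORK, X_RET };
    static const int func_code[]  = { X_WORK, X_RET };
    x_vm_t vm;

    vm.depth = 0;
    vm.work_done = 0;
    x_push(&vm, FRAME_EXIT, NULL, 0);            /* pushed when BEGIN was started */
    x_push(&vm, FRAME_RESUME, begin_code, 1);    /* the suspended BEGIN */
    if (with_exit)
        x_push(&vm, FRAME_EXIT, NULL, 0);        /* guard for the new entry */
    x_push(&vm, FRAME_RESUME, func_code, 0);     /* the function to call */
    x_execute(&vm);
    *depth_after = vm.depth;
    return vm.work_done;
}
```

With the guard frame, the call executes only the function's single instruction and leaves the two original frames in place. Without it, the same call also runs the remaining two instructions of BEGIN and empties the stack, which is the confusion described above.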
Fresh start entries: