aboutsummaryrefslogtreecommitdiff
path: root/docs/buzzard/buzzard.2.design
diff options
context:
space:
mode:
Diffstat (limited to 'docs/buzzard/buzzard.2.design')
-rw-r--r--docs/buzzard/buzzard.2.design780
1 files changed, 780 insertions, 0 deletions
diff --git a/docs/buzzard/buzzard.2.design b/docs/buzzard/buzzard.2.design
new file mode 100644
index 0000000..49b2e2d
--- /dev/null
+++ b/docs/buzzard/buzzard.2.design
@@ -0,0 +1,780 @@
+ FIRST & THIRD
+ almost FORTH
+
+ FORTH is a language mostly familiar to users of "small" machines.
+FORTH programs are small because they are interpreted--a function
+call in FORTH takes two bytes. FORTH is an extendable language--
+built-in primitives are indistinguishable from user-defined
+_words_. FORTH interpreters are small because much of the system
+can be coded in FORTH--only a small number of primitives need to
+be implemented. Some FORTH interpreters can also compile defined
+words into machine code, resulting in a fast system.
+
+ FIRST is an incredibly small language which is sufficient for
+defining the language THIRD, which is mostly like FORTH. There are
+some differences, and THIRD is probably just enough like FORTH for
+those differences to be disturbing to regular FORTH users.
+
+ The only existing FIRST interpreter is written in obfuscated C,
+and rings in at under 800 bytes of source code, although through
+deletion of whitespace and unobfuscation it can be brought to about
+650 bytes.
+
+ This document FIRST defines the FIRST environment and primitives,
+with relevent design decision explanations. It secondly documents
+the general strategies we will use to implement THIRD. The THIRD
+section demonstrates how the complete THIRD system is built up
+using FIRST.
+
+
+Section 1: FIRST
+
+
+Environment
+
+ FIRST implements a virtual machine. The machine has three chunks
+of memory: "main memory", "the stack", and "string storage". When
+the virtual machine wishes to do random memory accesses, they come
+out of main memory--it cannot access the stack or string storage.
+
+ The stack is simply a standard LIFO data structure that is used
+implicitly by most of the FIRST primitives. The stack is made up
+of ints, whatever size they are on the host machine.
+
+ String storage is used to store the names of built-in and defined
+primitives. Separate storage is used for these because it allows
+the C code to use C string operations, reducing C source code size.
+
+ Main memory is a large array of ints. When we speak of
+addresses, we actually mean indices into main memory. Main memory
+is used for two things, primarily: the return stack and the dictionary.
+
+ The return stack is a LIFO data structure, independent of
+the abovementioned "the stack", which is used by FIRST to keep
+track of function call return addresses.
+
+ The dictionary is a list of words. Each word contains a header
+and a data field. In the header is the address of the previous word,
+an index into the string storage indicating where the name of this
+word is stored, and a "code pointer". The code pointer is simply
+an integer which names which "machine-language-primitive" implements
+this instruction. For example, for defined words the code pointer
+names the "run some code" primitive, which pushes the current program
+counter onto the return stack and sets the counter to the address of
+the data field for this word.
+
+ There are several important pointers into main memory. There is
+a pointer to the most recently defined word, which is used to start
+searches back through memory when compiling words. There is a pointer
+to the top of the return stack. There is a pointer to the current
+end of the dictionary, used while compiling.
+
+ For the last two pointers, namely the return stack pointer and
+the dictionary pointer, there is an important distinction: the pointers
+themselves are stored in main memory (in FIRST's main memory). This
+is critical, because it means FIRST programs can get at them without
+any further primitives needing to be defined.
+
+
+Instructions
+
+ There are two kinds of FIRST instructions, normal instructions and
+immediate instructions. Immediate instructions do something significant
+when they are used. Normal instructions compile a pointer to their
+executable part onto the end of the dictionary. As we will see, this
+means that by default FIRST simply compiles things.
+
+ Integer Operations
+Symbol Name Function
+ - binary minus pop top 2 elements of stack, subtract, push
+ * multiply pop top 2 elements of stack, multiply, push
+ / divide pop top 2 elements of stack, divide, push
+ <0 less than 0 pop top element of stack, push 1 if < 0 else 0
+
+Note that we can synthesize addition and negation from binary minus,
+but we cannot synthesize a time efficient divide or multiply from it.
+<0 is synthesizable, but only nonportably.
+
+ Memory Operations
+Symbol Name Function
+ @ fetch pop top of stack, treat as address to push contents of
+ ! store top of stack is address, 2nd is value; store to memory
+ and pop both off the stack
+
+ Input/Output Operations
+Name Function
+echo output top of stack through C's putchar()
+key push C's getchar() onto top of stack
+_read read a space-delimited word, find it in the
+ dictionary, and compile a pointer to
+ that word's code pointer onto the
+ current end of the dictionary
+
+Although _read could be synthesized from key, we need _read to be able
+to compile words to be able to start any syntheses.
+
+ Execution Operations
+Name Function
+exit leave the current function: pop the return stack
+ into the program counter
+
+ Immediate (compilation) Operations
+Symbol Name Function
+ : define read in the next space-delimited word, add it to
+ the end of our string storage, and generate
+ a header for the new word so that when it
+ is typed it compiles a pointer to itself
+ so that it can be executed.
+immediate immediate when used immediately after a name following a ':',
+ makes the word being defined run whenever
+ it is typed.
+
+: cannot be synthesized, because we could not synthesize anything.
+immediate has to be an immediate operation, so it could not be
+synthesized unless by default operations were immediate; but that
+would preclude our being able to do any useful compilation.
+
+ Stack Operations
+Name Function
+pick pop top of stack, use as index into stack and copy up
+ that element
+
+If the data stack were stored in main memory, we could synthesize pick;
+but putting the stack and stack pointer in main memory would significantly
+increase the C source code size.
+
+ There are three more primitives, but they are "internal only"--
+they have no names and no dictionary entries. The first is
+"pushint". It takes the next integer out of the instruction stream
+and pushes it on the stack. This could be synthesized, but probably
+not without using integer constants. It is generated by _read when
+the input is not a known word. The second is "compile me". When
+this instruction is executed, a pointer to the word's data field is
+appended to the dictionary. The third is "run me"--the word's data
+field is taken to be a stream of pointers to words, and is executed.
+
+ One last note about the environment: FIRST builds a very small
+word internally that it executes as its main loop. This word calls
+_read and then calls itself. Each time it calls itself, it uses
+up a word on the return stack, so it will eventually trash things.
+This is discussed some more in section 2.
+
+
+Here's a handy summary of all the FIRST words:
+
+ - * / binary integer operations on the stack
+ <0 is top of stack less than 0?
+ @ ! read from or write to memory
+ echo key output or input one character
+ _read read a word from input and compile a pointer to it
+ exit stop running the current function
+ : compile the header of a definition
+ immediate modify the header to create an immediate word
+
+ Here is a sample FIRST program. I'm assuming you're using
+the ASCII character set. FIRST does not depend upon ASCII, but
+since FIRST has no syntax for character constants, one normally has
+to use decimal values. This can be gotten around using getchar, though.
+Oh. One other odd thing. FIRST initially builds its symbol table
+by calling : several times, so it needs to get the names of the base
+symbols as its first 13 words of input. You could even name them
+differently if you wanted.
+ These FIRST programs have FORTH comments in them: they are contained
+inside parentheses. FIRST programs cannot have FORTH comments; but I need
+some device to indicate what's going on. (THIRD programs are an entirely
+different subject.)
+
+ ( Our first line gives the symbols for the built-ins )
+: immediate _read @ ! - * / <0 exit echo key _pick
+
+ ( now we define a simple word that will print out a couple characters )
+
+: L ( define a word named 'L' )
+ 108 echo ( output an ascii 'l' )
+ exit
+
+: hello ( define a word named 'hello')
+ 72 echo ( output an ascii 'H' )
+ 101 echo ( output an ascii 'e' )
+ 111 ( push ascii 'o' onto the stack )
+ L L ( output two ascii 'l's )
+ echo ( output the 'o' we pushed on the stack before )
+ 10 echo ( print a newline )
+ exit ( stop running this routine )
+
+: test immediate ( define a word named 'test' that runs whenever typed )
+ hello ( call hello )
+ exit
+
+test
+
+( The result of running this program should be:
+Hello
+)
+
+
+Section 2: Motivating THIRD
+
+ What is missing from FIRST? There are a large number of
+important primitives that aren't implemented, but which are
+easy to implement. drop , which throws away the top of the
+stack, can be implemented as { 0 * + } -- that is, multiply
+the top of the stack by 0 (which turns the top of the stack
+into a 0), and then add the top two elements of the stack.
+
+ dup , which copies the top of the stack, can be easily
+implemented using temporary storage locations. Conveniently,
+FIRST leaves memory locations 3, 4, and 5 unused. So we can
+implement dup by writing the top of stack into 3, and then
+reading it out twice: { 3 ! 3 @ 3 @ }.
+
+ we will never use the FIRST primitive 'pick' in building THIRD,
+just to show that it can be done; 'pick' is only provided because
+pick itself cannot be built out of the rest of FIRST's building
+blocks.
+
+ So, instead of worrying about stack primitives and the
+like, what else is missing from FIRST? We get recursion, but
+no control flow--no conditional operations. We cannot at the
+moment write a looping routine which terminates.
+
+ Another glaring dissimilarity between FIRST and FORTH is
+that there is no "command mode"--you cannot be outside of a
+: definition and issue some straight commands to be executed.
+Also, as we noted above, we cannot do comments.
+
+ FORTH also provides a system for defining new data types,
+using the words [in one version of FORTH] <builds and does> .
+We would like to implement these words as well.
+
+ As the highest priority thing, we will build control flow
+structures first. Once we have control structures, we can
+write recursive routines that terminate, and we are ready to
+tackle tasks like parsing, and the building of a command mode.
+
+ By the way, location 0 holds the dictionary pointer, location
+1 holds the return stack pointer, and location 2 should always
+be 0--it's a fake dictionary entry that means "pushint".
+
+
+Section 3: Building THIRD
+
+ In this section, I'm going to keep my conversation
+ indented to this depth, rather than using fake comments--
+ because we'll have real comments eventually.
+
+ The first thing we have to do is give the symbols for our
+ built-ins.
+
+: immediate _read @ ! - * / < exit echo key _pick
+
+ Next we want to be mildly self commenting, so we define
+ the word 'r' to push the *address of the return stack
+ pointer* onto the stack--NOT the value of the return
+ stack pointer. (In fact, when we run r, the value of
+ the return stack pointer is temporarily changed.)
+
+: r 1 exit
+
+ Next, we're currently executing a short loop that contains
+ _read and recursion, which is slowly blowing up the return
+ stack. So let's define a new word, from which you can
+ never return. What it does is drops the top value off
+ the return stack, calls _read, then calls itself. Because
+ it kills the top of the return stack, it can recurse
+ indefinitely.
+
+: ]
+ r @ Get the value of the return stack pointer
+ 1 - Subtract one
+ r ! Store it back into the return stack pointer
+ _read Read and compile one word
+ ] Start over
+
+ Notice that we don't need to exit, since we never come
+ back. Also, it's possible that an immediate word may
+ get run during _read, and that _read will never return!
+
+ Now let's get compile running.
+
+: main immediate ]
+main
+
+ Next off, I'm going to do this the easy but non-portable
+ way, and put some character constant definitions in.
+ I wanted them at the top of the file, but that would have
+ burned too much of the return stack.
+
+: '"' 34 exit
+: ')' 41 exit
+: '\n' 10 exit
+: 'space' 32 exit
+: '0' 48 exit
+: '-' 45 exit
+
+: cr '\n' echo exit
+
+ Next, we want to define some temporary variables for
+ locations 3, 4, and 5, since this'll make our code look
+ clearer.
+: _x 3 @ exit
+: _x! 3 ! exit
+: _y 4 @ exit
+: _y! 4 ! exit
+
+ Ok. Now, we want to make THIRD look vaguely like FORTH,
+ so we're going to define ';'. What ; ought to do is
+ terminate a compilation, and turn control over to the
+ command-mode handler. We don't have one, so all we want
+ ';' to do for now is compile 'exit' at the end of the
+ current word. To do this we'll need several other words.
+
+ Swap by writing out the top two elements into temps, and
+ then reading them back in the other order.
+: swap _x! _y! _x _y exit
+ Take another look and make sure you see why that works,
+ since it LOOKS like I'm reading them back in the same
+ order--in fact, it not only looks like it, but I AM!
+
+ Addition might be nice to have. To add, we need to
+ negate the top element of the stack, and then subtract.
+ To negate, we subtract from 0.
+: +
+ 0 swap -
+ -
+ exit
+
+ Create a copy of the top of stack
+: dup _x! _x _x exit
+
+ Get a mnemonic name for our dictionary pointer--we need
+ to compile stuff, so it goes through this.
+: h 0 exit
+
+ We're going to need to advance that pointer, so let's
+ make a generic pointer-advancing function.
+ Given a pointer to a memory location, increment the value
+ at that memory location.
+: inc
+ dup @ Get another copy of the address, and get the value
+ so now we have value, address on top of stack.
+ 1 + Add one to the value
+ swap Swap to put the address on top of the stack
+ ! exit Write it to memory
+
+ , is a standard FORTH word. It should write the top of
+ stack into the dictionary, and advance the pointer
+: ,
+ h @ Get the value of the dictionary pointer
+ ! Write the top of stack there
+ h inc And increment the dictionary pointer
+ exit
+
+ ' is a standard FORTH word. It should push the address
+ of the word that follows it onto the stack. We could
+ do this by making ' immediate, but then it'd need to
+ parse the next word. Instead, we compile the next word
+ as normal. When ' is executed, the top of the return
+ stack will point into the instruction stream immediately
+ after the ' . We push the word there, and advance the
+ return stack pointer so that we don't execute it.
+: '
+ r @ Get the address of the top of return stack
+ We currently have a pointer to the top of return stack
+ @ Get the value from there
+ We currently have a pointer to the instruction stream
+ dup Get another copy of it--the bottom copy will stick
+ around until the end of this word
+ 1 + Increment the pointer, pointing to the NEXT instruction
+ r @ ! Write it back onto the top of the return stack
+ We currently have our first copy of the old pointer
+ to the instruction stream
+ @ Get the value there--the address of the "next word"
+ exit
+
+ Now we're set. ; should be an immediate word that pushes
+ the address of exit onto the stack, then writes it out.
+: ; immediate
+ ' exit Get the address of exit
+ , Compile it
+ exit And we should return
+
+ Now let's test out ; by defining a useful word:
+: drop 0 * + ;
+
+ Since we have 'inc', we ought to make 'dec':
+: dec dup @ 1 - swap ! ;
+
+ Our next goal, now that we have ;, is to implement
+ if-then. To do this, we'll need to play fast and
+ loose with the return stack, so let's make some
+ words to save us some effort.
+
+ First we want a word that pops off the top of the normal
+ stack and pushes it on top of the return stack. We'll
+ call this 'tor', for TO-Return-stack. It sounds easy,
+ but when tor is running, there's an extra value on the
+ return stack--tor's return address! So we have to pop
+ that off first... We better just bite the bullet and
+ code it out--but we can't really break it into smaller
+ words, because that'll trash the return stack.
+: tor
+ r @ @ Get the value off the top of the return stack
+ swap Bring the value to be pushed to the top of stack
+ r @ ! Write it over the current top of return stack
+ r @ 1 + r ! Increment the return stack pointer--but can't use inc
+ r @ ! Store our return address back on the return stack
+;
+
+ Next we want the opposite routine, which pops the top
+ of the return stack, and puts it on the normal stack.
+: fromr
+ r @ @ Save old value
+ r @ 1 - r ! Decrement pointer
+ r @ @ Get value that we want off
+ swap Bring return address to top
+ r @ ! Store it and return
+;
+
+ Now, if we have a routine that's recursing, and we
+ want to be polite about the return stack, right before
+ we recurse we can run { fromr drop } so the stack won't
+ blow up. This means, though, that the first time we
+ enter this recursive routine, we blow our *real* return
+ address--so when we're done, we'll return up two levels.
+ To save a little, we make 'tail' mean { fromr drop };
+ however, it's more complex since there's a new value on
+ top of the return stack.
+: tail fromr fromr drop tor ;
+
+ Now, we want to do 'if'. To do this, we need to convert
+ values to boolean values. The next few words set this
+ up.
+
+ minus gives us unary negation.
+: minus 0 swap - ;
+
+ If top of stack is boolean, bnot gives us inverse
+: bnot 1 swap - ;
+
+ To compare two numbers, subtract and compare to 0.
+: < - <0 ;
+
+ logical turns the top of stack into either 0 or 1.
+: logical
+ dup Get two copies of it
+ 0 < 1 if < 0, 0 otherwise
+ swap minus Swap number back up, and take negative
+ 0 < 1 if original was > 0, 0 otherwise
+ + Add them up--has to be 0 or 1!
+;
+
+ not returns 1 if top of stack is 0, and 0 otherwise
+: not logical bnot ;
+
+ We can test equality by subtracting and comparing to 0.
+: = - not ;
+
+ Just to show how you compute a branch: Suppose you've
+ compiled a call to branch, and immediately after it is
+ an integer constant with the offset of how far to branch.
+ To branch, we use the return stack to read the offset, and
+ add that on to the top of the return stack, and return.
+: branch
+ r @ Address of top of return stack
+ @ Our return address
+ @ Value from there: the branch offset
+ r @ @ Our return address again
+ + The address we want to execute at
+ r @ ! Store it back onto the return stack
+;
+
+ For conditional branches, we want to branch by a certain
+ amount if true, otherwise we want to skip over the branch
+ offset constant--that is, branch by one. Assuming that
+ the top of the stack is the branch offset, and the second
+ on the stack is 1 if we should branch, and 0 if not, the
+ following computes the correct branch offset.
+: computebranch 1 - * 1 + ;
+
+ Branch if the value on top of the stack is 0.
+: notbranch
+ not
+ r @ @ @ Get the branch offset
+ computebranch Adjust as necessary
+ r @ @ + Calculate the new address
+ r @ ! Store it
+;
+
+ here is a standard FORTH word which returns a pointer to
+ the current dictionary address--that is, the value of
+ the dictionary pointer.
+: here h @ ;
+
+ We're ALL SET to compile if...else...then constructs!
+ Here's what we do. When we get 'if', we compile a call
+ to notbranch, and then compile a dummy offset, because
+ we don't know where the 'then' will be. On the *stack*
+ we leave the address where we compiled the dummy offset.
+ 'then' will calculate the offset and fill it in for us.
+: if immediate
+ ' notbranch , Compile notbranch
+ here Save the current dictionary address
+ 0 , Compile a dummy value
+;
+
+ then expects the address to fixup to be on the stack.
+: then immediate
+ dup Make another copy of the address
+ here Find the current location, where to branch to
+ swap - Calculate the difference between them
+ swap ! Bring the address to the top, and store it.
+;
+
+ Now that we can do if...then statements, we can do
+ some parsing! Let's introduce real FORTH comments.
+ find-) will scan the input until it finds a ), and
+ exit.
+: find-)
+ key Read in a character
+ ')' = Compare it to close parentheses
+ not if If it's not equal
+ tail find-) repeat (popping R stack)
+ then Otherwise branch here and exit
+;
+
+: ( immediate
+ find-)
+;
+
+( we should be able to do FORTH-style comments now )
+
+( now that we've got comments, we can comment the rest of the code
+ in a legitimate [self parsing] fashion. Note that you can't
+ nest parentheses... )
+
+: else immediate
+ ' branch , ( compile a definite branch )
+ here ( push the backpatching address )
+ 0 , ( compile a dummy offset for branch )
+ swap ( bring old backpatch address to top )
+ dup here swap - ( calculate the offset from old address )
+ swap ! ( put the address on top and store it )
+;
+
+: over _x! _y! _y _x _y ;
+
+: add
+ _x! ( save the pointer in a temp variable )
+ _x @ ( get the value pointed to )
+ + ( add the incremement from on top of the stack )
+ _x ! ( and save it )
+;
+
+: allot h add ;
+
+: maybebranch
+ logical ( force the TOS to be 0 or 1 )
+ r @ @ @ ( load the branch offset )
+ computebranch ( calculate the condition offset [either TOS or 1])
+ r @ @ + ( add it to the return address )
+ r @ ! ( store it to our return address and return )
+;
+
+: mod _x! _y! ( get x then y off of stack )
+ _y _y _x / _x * ( y - y / x * x )
+ -
+;
+
+: printnum
+ dup
+ 10 mod '0' +
+ swap 10 / dup
+ if
+ printnum
+ echo
+ else
+ drop
+ echo
+ then
+;
+
+: .
+ dup 0 <
+ if
+ '-' echo minus
+ then
+ printnum
+ 'space' echo
+;
+
+: debugprint dup . cr ;
+
+( the following routine takes a pointer to a string, and prints it,
+ except for the trailing quote. returns a pointer to the next word
+ after the trailing quote )
+
+: _print
+ dup 1 +
+ swap @
+ dup '"' =
+ if
+ drop exit
+ then
+ echo
+ tail _print
+;
+
+: print _print ;
+
+ ( print the next thing from the instruction stream )
+: immprint
+ r @ @
+ print
+ r @ !
+;
+
+: find-"
+ key dup ,
+ '"' =
+ if
+ exit
+ then
+ tail find-"
+;
+
+: " immediate
+ key drop
+ ' immprint ,
+ find-"
+;
+
+: do immediate
+ ' swap , ( compile 'swap' to swap the limit and start )
+ ' tor , ( compile to push the limit onto the return stack )
+ ' tor , ( compile to push the start on the return stack )
+ here ( save this address so we can branch back to it )
+;
+
+: i r @ 1 - @ ;
+: j r @ 3 - @ ;
+
+: > swap < ;
+: <= 1 + < ;
+: >= swap <= ;
+
+: inci
+ r @ 1 - ( get the pointer to i )
+ inc ( add one to it )
+ r @ 1 - @ ( find the value again )
+ r @ 2 - @ ( find the limit value )
+ <=
+ if
+ r @ @ @ r @ @ + r @ ! exit ( branch )
+ then
+ fromr 1 +
+ fromr drop
+ fromr drop
+ tor
+;
+
+: loop immediate ' inci @ here - , ;
+
+: loopexit
+
+ fromr drop ( pop off our return address )
+ fromr drop ( pop off i )
+ fromr drop ( pop off the limit of i )
+; ( and return to the caller's caller routine )
+
+: execute
+ 8 !
+ ' exit 9 !
+ 8 tor
+;
+
+: :: ; ( :: is going to be a word that does ':' at runtime )
+
+: fix-:: immediate 3 ' :: ! ;
+fix-::
+
+ ( Override old definition of ':' with a new one that invokes ] )
+: : immediate :: ] ;
+
+: command
+ here 5 ! ( store dict pointer in temp variable )
+ _read ( compile a word )
+ ( if we get control back: )
+ here 5 @
+ = if
+ tail command ( we didn't compile anything )
+ then
+ here 1 - h ! ( decrement the dictionary pointer )
+ here 5 @ ( get the original value )
+ = if
+ here @ ( get the word that was compiled )
+ execute ( and run it )
+ else
+ here @ ( else it was an integer constant, so push it )
+ here 1 - h ! ( and decrement the dictionary pointer again )
+ then
+ tail command
+;
+
+: make-immediate ( make a word just compiled immediate )
+ here 1 - ( back up a word in the dictionary )
+ dup dup ( save the pointer to here )
+ h ! ( store as the current dictionary pointer )
+ @ ( get the run-time code pointer )
+ swap ( get the dict pointer again )
+ 1 - ( point to the compile-time code pointer )
+ ! ( write run-time code pointer on compile-time pointer )
+;
+
+: <build immediate
+ make-immediate ( make the word compiled so far immediate )
+ ' :: , ( compile '::', so we read next word )
+ 2 , ( compile 'pushint' )
+ here 0 , ( write out a 0 but save address for does> )
+ ' , , ( compile a push that address onto dictionary )
+;
+
+: does> immediate
+ ' command , ( jump back into command mode at runtime )
+ here swap ! ( backpatch the build> to point to here )
+ 2 , ( compile run-code primitive so we look like a word )
+ ' fromr , ( compile fromr, which leaves var address on stack )
+;
+
+
+: _dump ( dump out the definition of a word, sort of )
+ dup " (" . " , "
+ dup @ ( save the pointer and get the contents )
+ dup ' exit
+ = if
+ " ;)" cr exit
+ then
+ . " ), "
+ 1 +
+ tail _dump
+;
+
+: dump _dump ;
+
+: # . cr ;
+
+: var <build , does> ;
+: constant <build , does> @ ;
+: array <build allot does> + ;
+
+: [ immediate command ;
+: _welcome " Welcome to THIRD.
+Ok.
+" ;
+
+: ; immediate ' exit , command exit
+
+[
+
+_welcome
+