1.0 Literate Programming - A Gentle Introduction 1.1 What is the problem? 1.2 Is there a better way? 1.3 Literate Programming -- The basic idea 1.3.1 Literate Programming is NOT documentation 1.3.2 Literate Programming is a change in mindset 1.3.3 Literate Programming reduces bugs 1.4 Understanding the literate programming tangle tool 1.4.1 Motivation, or getchunk 1.4.2 All of the literate programming tools 2.0 The Tangle Program 2.1 Understanding the story of the program 2.2 The main task 2.3 A complication with HTML 2.4 Finding the chunk 2.5 Finding the next line in the buffer 2.6 At last, finding the chunk 2.7 Printing the chunk we found 2.8 The simple case, print the line 2.9 Checking chunks for getchunk tags 2.10 Getting a new chunk name from the getchunk tag 2.11 Finding the end of the chunk we are printing 2.12 Final cleanup and housekeeping. The preamble 3.0 The tangle source code 4.0 What have we learned? 5.0 Credits
In the 1970s I worked on a PDP11/40, which is the same class of machine that was used to develop UNIX. My PDP had 8192 bytes of memory. The editor used 4096 bytes and left the other 4096 bytes for an edit buffer.
This meant that the largest file I could create at any time was 4096 bytes. As a result, C programs were limited in size. To get around this problem we used include files, libraries of code, linkers, and overlay loaders. Segment registers exist on the x86 processors to help overlay loaders.
People program today the same way I programmed in the 1970s. We write programs that are made of tiny pieces of sand in hundreds of little files spread across dozens of directories. In order to find anything we use grep to search files, or we create tools to pre-search the files such as Eclipse and IntelliJ.
Imagine taking the same approach to calculus. Create a directory for each chapter. Take each equation in the chapter and put it in a file. Do this for every chapter. Throw away the textbook. Now try to learn calculus.
That's the way we program today, just like I did in the 1970s.
We program to communicate to the machine, not to communicate to people.
That's the problem.
In my life as an architect, I find that the single thing which inhibits young professionals, new students most severely, is their acceptance of standards that are too low. If I ask a student whether her design is as good as Chartres, she often smiles tolerantly at me as if to say, "Of course not, that isn't what I am trying to do...I could never do that."
Then, I express my disagreement, and tell her: "That standard must be our standard. If you are going to be a builder, no other standard is worthwhile."
--Christopher Alexander, May 1996 in (Patterns of Software, Gabriel)
Consider the best possible world. You've been hired at a company and join a team that is already working on a program. They hand you a book, tell you to go home and read it over the next two weeks. At the end of the two weeks you can work on the program as effectively anyone on the team. The team has successfully communicated from one human to another.
What is in the book?
Remember our calculus textbook? It started from the ideas like limits and gradually developed the ideas until they could be expressed in equations. By the time you got to the equations you already understood the concepts. You could look at the equations and see why they matched the text. It is the why that is the important part. It is the part that our programs are missing.
The book you took home uses the same method. You started with the problem in chapter 1. Chapter 2 expresses the ideas needed to solve the problem. The next few chapters expand on each idea, gradually becoming more specific until the idea is reduced to code. By the time you get to the code it should be perfectly clear what the code should look like. Any part of the code you don't understand means that the book needs some additional words.
The best programming language is English. Everything else is notation
It is rather pointless to have code in a book, right? Someone could change the code and then the book would be out of date.
The way to solve this problem is to extract the code directly from the book. That way, every time you change the code in the book you change the actual program.
How hard is it to write an extraction program that can find the code in the book and extract it to some files? Well, lets suppose we name each section of code and call it a chunk.
We need a program that lets us find a chunk in a document and extract it. The traditional name for such a program is tangle. The tangle program takes two arguments, the name of the book and the name of the chunk:
tangle mybookname chunkname
With this simple tool you can now write books that contain the actual source code.
The first reaction is that this is a new, painful form of documentation. It is not. Documentation is a how construct. It is a way of explaining how a program works or how an application interface should be used.
Literate programming is about why. It explains the ideas. It is communication from one person to another.
The why becomes important when you have to maintain and modify a program. You can perfectly understand how a subroutine, module, or class works. You can explain its input and outputs. What you don't understand is why it exists. Nobody writes that down. So when you are joining a group and given the directory containing thousand of pieces of code, you have no idea why the program is structured the way it is.
In order to do literate programming you have to change the way you think. You have to write english text to communicate ideas. You have to move from ideas to implementation in a way that make the story clear.
If programmers were writing a James Bond novel they would create a file for JamesBond, MissMoneyPenny, Q, and TheBadGuys. They would create a directory for Scenes with subdirectories for Scene1, Scene2, and so on. Then they would box up the whole tree, send it to the audience, and tell them
When this program is run, the bad guy dies
We live by stories. We tell stories and we remember stories. Characters do things and the author tries to convey their motivations (the why). Scenes, flashy cars, trick watches, and other items are added to solve local problems in the story. The whole story moves from the beginning to the end. You can remember the story and you can find places where you think it is weak or cheesy or bad.
Subroutines, modules, or classes are like characters in a story. They need motivation. They need the why. The factory classes, the singletons, the garbage collection routines, and other "items" exist to solve local problems in the story of the program.
Literate programming is about writing the story of your program. What are you trying to solve? How are you solving it? Why did you introduce piece of code? Is it part of the main story or is it an item that helps solve a local problem?
We are all conditioned to follow stories. We do it all the time. If I present you with a piece of code without the story then you have to struggle to understand it. Take, for example, this:
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/mman.h> #include <sys/types.h> #include <fcntl.h> /* forward reference for the C compiler */ int getchunk(char *chunkname); char *chunkbegin = "<pre id=\""; int chunkbeginlen = 9; char *chunkend = "</pre>"; int chunkendlen = 6; char *chunkget = "<getchunk id=\""; int chunkgetlen = 14; /* a memory mapped buffer copy of the file */ char *buffer; int bufsize; /* return the length of the next line */ int nextline(int i) { int j; if (i >= bufsize) return(-1); for (j=0; ((i+j < bufsize) && (buffer[i+j] != '\n')); j++); return(j); } /* output the line we need */ int printline(int i, int length) { int j; for (j=0; j<length; j++) { putchar(buffer[i+j]); } printf("\n"); } /* handle <pre id="chunkname"> */ /* is this chunk name we are looking for? */ int foundchunk(int i, char *chunkname) { if ((strncmp(&buffer[i],chunkbegin,chunkbeginlen) == 0) && (strncmp(&buffer[i+9],chunkname,strlen(chunkname)) == 0) && (buffer[i+chunkbeginlen+strlen(chunkname)] == '"') && (buffer[i+chunkbeginlen+strlen(chunkname)+1] == '>')) return(1); return(0); } /* handle </pre> */ /* is it really an end? */ int foundEnd(int i) { if (strncmp(&buffer[i],chunkend,chunkendlen) == 0) { return(1); } return(0); } /* handle <getchunk id="chunkname"/> */ /* is this line a getchunk? */ int foundGetchunk(int i, int linelen) { int len; if (strncmp(&buffer[i],chunkget,chunkgetlen) == 0) { for(len=1; ((len < linelen) && (buffer[i+chunkgetlen+len] != '\"')); len++); return(len); } return(0); } /* Somebody did a getchunk and we need a copy of the name */ /* malloc string storage for a copy of the getchunk name */ char *getChunkname(int k, int getlen) { char *result = (char *)malloc(getlen+1); strncpy(result,&buffer[k+chunkgetlen],getlen); result[getlen]='\0'; return(result); } /* print lines in this chunk, possibly recursing into getchunk */ int printchunk(int i, int chunklinelen, char *chunkname) { int j; int k; int linelen; char *getname; int getlen = 0; for (k=i+chunklinelen+1; ((linelen=nextline(k)) != -1); ) { if ((getlen=foundGetchunk(k,linelen)) > 0) { getname = getChunkname(k,getlen); getchunk(getname); free(getname); k=k+getlen+17; } else { if ((linelen >= chunkendlen) && (foundEnd(k) == 1)) { return(k+chunkbeginlen); } else { printline(k,linelen); k=k+linelen+1; } }} return(k); } /* find the named chunk and call printchunk on it */ int getchunk(char *chunkname) { int i; int j; int linelen; int chunklen = strlen(chunkname); for (i=0; ((linelen=nextline(i)) != -1); ) { if ((linelen >= chunklen+11) && (foundchunk(i,chunkname) == 1)) { i=printchunk(i,linelen,chunkname); } else { i=i+linelen+1; } } return(i); } void fixHTMLcode() { int point = 0; int mark = 0; int i=0; for(point = 0; point < bufsize;) { if ((buffer[point] == '&') && (strncmp(&buffer[point+1],"lt;",3) == 0)) { buffer[mark++] = 60; point = point + 4; } else if ((buffer[point] == '&') && (strncmp(&buffer[point+1],"gt;",3) == 0)) { buffer[mark++] = 62; point = point + 4; } else buffer[mark++] = buffer[point++]; } bufsize = mark; } /* memory map the input file into the global buffer and get the chunk */ int main(int argc, char *argv[]) { int fd; struct stat filestat; if ((argc != 2)) { perror("Usage: tangle filename chunkname"); exit(-1); } fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("Error opening file for reading"); exit(-2); } if (fstat(fd,&filestat) < 0) { perror("Error getting input file size"); exit(-3); } bufsize = (int)filestat.st_size; buffer = (char *)malloc(bufsize); read(fd,buffer,bufsize); fixHTMLcode(); getchunk(argv[2]); close(fd); return(0); }
Odds are good you just skipped over it, right? It is pretty much just a long pile of noise. It really doesn't matter if you are a C programmer or not. It is just noise.
We write literate programs to communicate to both people and machines. Since they are combined in a single document we need a markup language to distinguish human text from machine text as well as a markup language to write readable english.
Literate programs don't care what you use as a documentation markup language. I prefer latex but for this example we'll use HTML.
HTML has a set of markup tags. One of the markup tags is the pre tag. Anything between the beginning of a <pre> tag and the trailing </pre> tag is quoted and printed exactly as written.
The pre tag also has an id field which lets us put a name on the tag. So we can have many named pre tags in a file.
The tangle program reads the HTML file and extracts the text in a pre chunk. That's almost all of the magic.
Remember, though, that we wanted to motivate ideas in a program. Motivations get introduced in an order that humans can understand. So ideas have to be presented in a particular order for humans and a different order for machines. We need a way to reorganize the program into pieces. The trick is getchunk.
We introduce a special tag called getchunk. It only occurs within a pre tag block. The getchunk tag has an id field which names another chunk. When we find a getchunk tag, we read the id name and replace the getchunk tag with the named chunk.
This getchunk tag allows us to "clip out" a piece of a program from the middle of a subroutine, put it into its own chunk, and wrap some text around it.
This is useful if you have some tricky loop in the middle of a large subroutine that you need to explain. You can clip the loop out, put it in a pre tag of its own and replace it with a getchunk.
So we have a single program, called tangle. It works with three HTML tags, the <pre> tag, the </pre> and the <getchunk> tag. They look like:
<pre id="main.c"> </pre> <getchunk id="main.c">
That's all there is.
Now you can embed your subroutines in an HTML file, surround it by pre tags, replace nasty sections with getchunk tags so you can explain them separately, and you can extract the whole set of subroutines using the tangle program.
The big loop of program development is now:
do forever { edit the html file tangle thehtmlfile sub1.c >sub1.c tangle thehtmlfile sub2.c >sub2.c tangle thehtmlfile main.c >main.c gcc -o myfile sub1.c sub2.c main.c ./myfile }
When you are finished for the day, the week, or the project your whole program is properly explained as well as properly maintained.
If you did it right you can just point someone at the HTML file, let them go away for a while, and when they return they understand enough about your program to take over the whole project.
But wait, you say... you skipped over the hard part.
Indeed, I did. The hard part is that you need to communicate your ideas to another human. And you have to do it in a way that reduces your idea to practice. In fact, you need to clearly reduce your ideas to practice.
We will try to illustrate literate programming using the tangle program as an example. However, if you want to see a real literate program I strongly recommend the book:
Lisp in Small Pieces by Christian Queinnec (ISBN 0-521-54566-8)
We would like to write programs embedded in documents which surround the program text with explanations. Since the compiler cannot understand the markup language we need a program, which we will call tangle, that wil extract code from a document in a given markup language.
The tangle program is specific to the markup language used, whether it is latex, HTML, or some other language. For this example we are assuming that the markup is HTML.
The markup language has to have some sort of verbatim quoting mechanism which allows us to put code inline. It also has to have a way to name the quoted pieces of code. We call the quoted pieces of code a chunk. We call the label the chunkname.
HTML uses the pre tag as its quoting mechanism. Items between the opening pre tag and the closing /pre tag are unchanged by HTML processors. The pre tag has an id parameter. We will use the id parameter as a chunk name.
The tangle program takes two arguments, the name of the document and the name of the chunk. It must print the chunk on standard output.
The main thing we have to do is find a chunk somewhere in our book. We specified the book and the chunk name on the command line so we ought to check that we have the right inputs.
There can only be two inputs and they are both required so first we check that condition. If there are not two then the user might not understand what was required so we print the usual usage message.
We have huge amounts of memory so we will simply load the file in a single chunk using the read function. This requires that we can open the file, which we check. It requires a handle to the file, which we get from the open call. It requires the size of the file so it can allocate memory, so we use fstat.
Now that we have all of the read parameters we can load the file into memory. The buffer variable is described as a pointer to a character which is C-speak for a string.
We also need to know the size of the buffer at all times so we allocate the bufsize variable at global scope.
/* a memory mapped buffer copy of the file */ char *buffer; int bufsize;
Since the buffer is being used everywhere we make it global by defining it at the top of the program, outside the scope of the functions.
/* memory map the input file into the global buffer and get the chunk */ int main(int argc, char *argv[]) { int fd; struct stat filestat; if ((argc != 2)) { perror("Usage: tangle filename chunkname"); exit(-1); } fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("Error opening file for reading"); exit(-2); } if (fstat(fd,&filestat) < 0) { perror("Error getting input file size"); exit(-3); } bufsize = (int)filestat.st_size; buffer = (char *)malloc(bufsize); read(fd,buffer,bufsize); fixHTMLcode(); getchunk(argv[2]); close(fd); return(0); }
If the read succeeds we move on to the problem of finding the named chunk in the book.
HTML is not clever about ignoring HTML symbols in <pre> blocks of code. This causes us three complications which do not normally occur with a tangle program.
The first complication can be seen by viewing the source for this page. We need to replace the angle brackets with the HTML escape codes everywhere, including in the verbatim sections. So we have to be careful using certain language constructs. Reasonable markup languages such as latex do not have this problem.
The second complication is that the string search for the pre, /pre and getchunk tags need to use the escape code sequence for matching.
The third complication is that the tangle program has to reverse this translation when printing the program.
There is no such thing as a simple job.
The trick here will be to walk the buffer and replace every use of the HTML code with the corresponding character, appropriately adjusting the space. Since HTML codes are always longer than the character they replace we can just compress the buffer a bit. And since the buffer is never used for anything else we can even replace the HTML codes in parts of the buffer where it might be inappropriate.
The point variable is where we are looking in the buffer. The mark variable is where we are writing in the buffer. Normally we just walk them in parallel. They diverge when they hit an HTML code. The point is copied to the mark. In the absence of any codes this is a copy of the buffer to itself.
When we hit a code we place the corresponding character at the mark, bump the point over the HTML code, and continue the copy.
When we hit the end of the buffer we rewrite the bufsize variable to reflect the new, shorter buffer size.
There is another really subtle point here. Note that we have to do the comparison in two pieces otherwise the combined string is a valid HTML code and we end up destroying our own search string. If we break it into two parts then the substring "lt;" is not a valid HTML code and will not be replaced. This is the kind of information that literate programming preserves. A clever programmer would combine the two tests into the single strncmp and everything will fail.
Michael Albaugh points out that this code does not handle the complete list of escape codes. For a more general implementation we should create a table mapping the code from the HTML version to the characters. Since they do not occur in this example I have not done this in its full generality. Escape codes can be numeric as well as character based and there are several hundred of them.
void fixHTMLcode() { int point = 0; int mark = 0; int i=0; for(point = 0; point < bufsize;) { if ((buffer[point] == '&') && (strncmp(&buffer[point+1],"lt;",3) == 0)) { buffer[mark++] = 60; point = point + 4; } else if ((buffer[point] == '&') && (strncmp(&buffer[point+1],"gt;",3) == 0)) { buffer[mark++] = 62; point = point + 4; } else buffer[mark++] = buffer[point++]; } bufsize = mark; }
We are given the chunkname as a string and we have to search the
file for the
<pre id="name"> tag.
Since the string came from the command line we know that it ends with a null character so we can just call strlen to get the length.
By design, we require the <pre> tag to start in the first character of the line. We could remove this restriction if we felt like being clever but code is usually left-aligned in the output so left-aligning the tag is reasonable.
Given this design decision we can walk the book one line at a time. We delegate the task of finding the line length linelen to the nextline function. The "magic number" 11 exists because the smallest pre tag contains 11 characters:
<pre id="">
If the line is long enough to contain the <pre> tag and its associated id tag and we find the named chunk then we print it.
Another design decision is that we will allow multiple chunks with the same name and we will print them in order. This allows us to break a big block into smaller blocks and insert text in the middle. So you can write
<pre id="somename"> the first part of a function </pre> some explanation <pre id="somename"> the next part of the function </pre> more explanationWhen you run tangle filename somename you will see the output:
the first part of a function the next part of the function
This is trivial to implement. All we have to do is keep searching the document and printing the same named chunk.
/* find the named chunk and call printchunk on it */ int getchunk(char *chunkname) { int i; int j; int linelen; int chunklen = strlen(chunkname); for (i=0; ((linelen=nextline(i)) != -1); ) { if ((linelen >= chunklen+11) && (foundchunk(i,chunkname) == 1)) { i=printchunk(i,linelen,chunkname); } else { i=i+linelen+1; } } return(i); }
The nextline function just walks the buffer starting at any character position and returns the character position after the next newline. We only have to check that we don't run off the end of the buffer, otherwise we just keep a count of the character position.
/* return the length of the next line */ int nextline(int i) { int j; if (i >= bufsize) return(-1); for (j=0; ((i+j < bufsize) && (buffer[i+j] != '\n')); j++); return(j); }
We are positioned at the beginning of a line so we are looking at something that looks like:
<pre id="somechunkname">
We need to check the format of the line. The first 9 characters must be:
<pre id="
The next substring must be the chunkname we seek and the rest of line must be the
">
So if we find a properly formed line with the chunk name then we return a 1 indicating success, otherwise we return a 0 indicating failure.
In a fit of hasty generality we allocate the chunkbegin information at the global scope. In theory we would like to be able to just change the patterns for different markup languages.
char *chunkbegin = "<pre id=\""; int chunkbeginlen = 9;
/* handle <pre id="chunkname"> */ /* is this chunk name we are looking for? */ int foundchunk(int i, char *chunkname) { if ((strncmp(&buffer[i],chunkbegin,chunkbeginlen) == 0) && (strncmp(&buffer[i+9],chunkname,strlen(chunkname)) == 0) && (buffer[i+chunkbeginlen+strlen(chunkname)] == '"') && (buffer[i+chunkbeginlen+strlen(chunkname)+1] == '>')) return(1); return(0); }
So we found our chunk and all we have to do is print it. We just print each line until we find the </pre> tag, a task we delegate to the foundEnd function.
Of course, there is no such thing as a simple job and, indeed, there is a complication here. We made a design decision that we will allow a getchunk tag to have special meaning within the pre tag.
We delegate getting the id from the getchunk tag to the getChunkname function which we will examine next.
Recall that getchunk will search the document for the pre tag named by the getchunk's id label. This means that we need to
1) stop printing the current chunk 2) find the chunk named by getchunk and print it 3) continue printing the original chunk where we left off
We already have a function to do step 2, the getchunk function mentioned above. We can just call that function which handles both finding and printing a chunk.
The magic number 17 occurs because we need to account for all of the characters in the line which are not part of the id string in the getchunk. So an empty string, including the trailing newline is:<getchunk id="">\n a total of 17 characters
/* print lines in this chunk, possibly recursing into getchunk */ int printchunk(int i, int chunklinelen, char *chunkname) { int j; int k; int linelen; char *getname; int getlen = 0; for (k=i+chunklinelen+1; ((linelen=nextline(k)) != -1); ) { if ((getlen=foundGetchunk(k,linelen)) > 0) { getname = getChunkname(k,getlen); getchunk(getname); free(getname); k=k+getlen+17; } else { if ((linelen >= chunkendlen) && (foundEnd(k) == 1)) { return(k+chunkbeginlen); } else { printline(k,linelen); k=k+linelen+1; } }} return(k); }
/* output the line we need */ int printline(int i, int length) { int j; for (j=0; j<length; j++) { putchar(buffer[i+j]); } printf("\n"); }
Because we treat the getchunk tag as a special case when we are printing we need a predicate to look for it. We are positioned at the beginning of a line when we are called so we only need to check for the string we call chunkget, that is:
char *chunkget = "<getchunk id=\""; int chunkgetlen = 14;This predicate returns the length of the chunkname or 0 if it is not a getchunk tag.
/* handle*/ /* is this line a getchunk? */ int foundGetchunk(int i, int linelen) { int len; if (strncmp(&buffer[i],chunkget,chunkgetlen) == 0) { for(len=1; ((len < linelen) && (buffer[i+chunkgetlen+len] != '\"')); len++); return(len); } return(0); }
We have encountered a <getChunkName> tag. We need to strip out the id string. We malloc a small piece of memory to contain the name. Note that this is a potential memory leak since we are letting allocated memory leave this function. The caller must remember to free it. If we check the caller, the printchunk function we see that the buffer is freed and memory does not leak.
/* Somebody did a getchunk and we need a copy of the name */ /* malloc string storage for a copy of the getchunk name */ char *getChunkname(int k, int getlen) { char *result = (char *)malloc(getlen+1); strncpy(result,&buffer[k+chunkgetlen],getlen); result[getlen]='\0'; return(result); }
The printchunk routine delegated the task of finding the </pre> tag to foundEnd. If we find the tag we return 1 otherwise we return 0.
The chunkend and chunkendlen variables are allocated in the global scope.
char *chunkend = "</pre>"; int chunkendlen = 6;
/* handle </pre> */ /* is it really an end? */ int foundEnd(int i) { if (strncmp(&buffer[i],chunkend,chunkendlen) == 0) { return(1); } return(0); }
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/mman.h> #include <sys/types.h> #include <fcntl.h> /* forward reference for the C compiler */ int getchunk(char *chunkname);
The big chunk of code at the beginning of this book is a working tangle program. You can clip it out of a web page, save it as tangle.c, type
gcc -o tangle tangle.cand you now have a working tangle program for HTML files. If you want to improve it just modify this web page and run tangle again to extract your changes.
Alternatively you can get the tangle source code from here
When we want the whole program we output this chunk with the command:
tangle litprog.html tangle.c >tangle.cand it recreates the tangle.c function.
<getchunk id="include"> <getchunk id="chunkbegin"> <getchunk id="chunkend"> <getchunk id="chunkget"> <getchunk id="buffer"> <getchunk id="nextline"> <getchunk id="printline"> <getchunk id="foundchunk"> <getchunk id="foundEnd"> <getchunk id="foundGetchunk"> <getchunk id="getChunkname"> <getchunk id="printchunk"> <getchunk id="getchunk"> <getchunk id="fixHTMLcode"> <getchunk id="main">
We have walked through a fairly simple program in great detail. You now understand why HTML causes problems, you understand what the magic numbers in the program mean, and you understand the design decisions such as always have pre tags in the first column.
We have learned that we only need a simple 160 line tangle program to handle all of the literate programming tasks. You can make this more complex if you like. You could create a tangle plugin for Eclipse. You could add automatic indexing. Many people have done similar things with their literate tools.
Ultimately, though, the important point is that you need to write programs to communicate with other people. We do not do that now. The result is that there are tens of thousands of programs on Sourceforge and Savannah and Github that are never going to be used because nobody understands them.
We can do better. We can make programs that live. We can communicate our ideas in ways that others can understand without talking to us.
We can be better programmers.
Tim Daly, November 18, 2011
Guilherme Reis found that the check for the number of arguments was incorrect. It should have been exactly 2. Michael Albaugh points out that this code does not handle the complete of escape codes, as noted above. A production version of this code would have to handle that.