Friday, December 28, 2007

Noticing Patterns versus Proving Theorems and Another Angle versus The Truth

In responding to Smart Languages and Dumb Parsers, several people have pointed to languages like Haskell or Ruby and said "counterexample." This seemed like a strange response to me, until I realized that they were viewing my post as an attempt at proving a theorem. On rereading my post, I realize that by using phrases like "necessary ingredient" and "what makes a highly extensible language possible" I unwittingly invited this interpretation. (Although I did also use phrases like "I'm starting to wonder" and "maybe.")

Smart Languages and Dumb Parsers, and several of the posts that I have planned, are more like observations of possible patterns than theorems. I have found some interesting angles from which to look at things. I'm not trying to convince anyone that these are the only ways to look at the world, or that they represent the Absolute Truth, but rather that there is something interesting and/or fun to be noticed if you squint the right way.

Good titles matter

I think that I made a mistake when I chose the title for Smart Languages and Dumb Parsers. Several people seem to have interpreted "Smart" to mean "good" or "high level" or "expressive," and I must admit that those are all perfectly valid interpretations of the word even though they aren't what I meant. Furthermore, since a reader sees the title before reading the post, I realize that it is foolish of me to expect said reader to interpret the title in the context of the post. Of course, I didn't do this consciously, but since I came up with the title after I wrote the post, I think I did have that expectation. In retrospect, I probably should have chosen a less pithy but more descriptive title like "Maximally Extensible Languages and Simple Parsers." I'll try to do better next time.

Monday, December 24, 2007

Smart Languages and Dumb Parsers

I used to find it supremely ironic that some of the languages which receive the most praise for placing the programmer on equal footing with the language designer, which give the programmer the most power to extend the language in any way the programmer wants, seem to have some of the dumbest parsers. Now I'm starting to wonder if a dumb parser is a necessary ingredient in a highly extensible language.

Consider, for example, Lisp, Forth, and Smalltalk. None of these languages support Fortran style formulas. None of them will evaluate "10-2*3" and give you 4. Lisp will reject it outright. Forth will leave (x-10)*2 and 3 on the stack, where x is whatever was on the stack before evaluating the expression. Smalltalk will give you 24.

Lisp macros allow you to add your own syntax to Lisp, but the reason why they are so much cooler than Perl's source filters or the C preprocessor is the fact that they operate on the syntax trees of your already parsed source code, instead of just text strings. But how can the parser ("reader" in Lisp terminology) parse your code before it knows what it means? Because the parser doesn't know anything about what any code means. It follows some fairly simple rules about how to parse s-expressions and leaves the meaning ("semantics") of those s-expressions entirely up to the macro expander and the interpreter or code generator.
Forth's defining words and immediate words allow you to extend the compiler in arbitrary ways. Defining words are words that you can write which will create new words when they are run. Immediate words allow you to use the full power of the language during compilation. Both of these features are made possible by the fact that the Forth compiler is incredibly simple minded. It has a simple rule for figuring out what the next word on the input is, and a simple rule for deciding what to do with it (run it if it's immediate, and compile its address if it's not immediate).

Smalltalk automatically turns any method whose name matches some simple criteria (1 or two non-alphanumeric characters) into a binary operator, allowing you to implement any binary operator that you like. It is able to do this because the parser treats all binary operators the same, making them all left associative. This simple minded approach places user created operators on the same level as those that come with the language. The programmer can implement any desired control structure by writing methods that expect blocks as arguments. The built in control structures are also all methods that take blocks as arguments. These programmer defined control structures look exactly like the built in control structures because the parser doesn't know about any control structures.

In all three of these cases, having a simple syntax that the compiler can parse without any knowledge of the semantics is what allows the parser to parse programmer created extensions to the language without knowing what they mean.

When a parser knows too much about the meaning of a program, for example the fact that + means addition and * means multiplication, a fact which is implicit in the common assumption that * has a higher level of precedence than +, maybe that "freezes" the language in a way. Maybe a really simple syntax makes it easier to completely separate syntax from semantics, and maybe a strong separation between syntax and semantics is what makes a highly extensible language possible.

Saturday, December 22, 2007

So my wife looks over my shoulder...

So my wife looks over my shoulder while I'm checking to see if anyone replied to my first blog post, and she says "Glomek's Blog? Do you have a secret life as a blogger that I don't know about? I guess it doesn't matter. It's probably all about computers anyway."

Friday, December 21, 2007

Unix makes Computer Science easy

As I've learned more Computer Science over the years, I've grown to realize that Unix packs a whole lot of CS power into a bunch of seemingly small commands with simple interfaces.

Consider the diff utility. It figures out the smallest number of changes required to transform one file into another. It uses some pretty heavy computer science, but the end user just tells it which files to compare and gets back the differences.

Or how about bc and dc? Arbitrary precision arithmetic was a bit of a trick in those days, but with Unix, you just run "bc" and type in your expressions, including exponential or trigonometric functions, and out pop your answers.

External sorting takes some skill and work to pull off. To sort a file that's ten times the size of available RAM, you need to sort lots of little chunks of the file and merge them together over multiple passes. Unix's sort utility just does it for you, creating temporary files as needed and hiding all of the complexity from you. It even does variable length records without batting an eye.

Once you've sorted your text file, and want to find something in it quickly, Unix lets you do a binary search in a file with variable length records with just a single command. look. And you don't even have to know what log(n) means.

Shell scripting provides something similar to functional programming, with files treated as lists of lines. The map, reduce, and filter operations from functional programming are made available in things like sed, wc, and grep.

Awk can do all three. Patterns without actions are a filter. Actions that do output perform mappings. Actions that accumulate information to be spat out by an END block do reductions.

For programmers, things get even better. Topological sorting is where you take a bunch of things, and a bunch of rules saying that one thing must come before another thing, and come up with an ordering of the things that satisfies all of the rules. It's kind of useful when you need to figure out what order in which to run all of the commands that build your program. make handles it all in a way that completely hides the complexity of the topological sort from the programmer.

Compilers, and parsing in particular, used to be deep magic, understood fully by very few people. lex and yacc made it easy for anyone to write a few regular expressions and a grammar and get out a parser (especially if they owned a copy of The Unix Programming Environment), without any need to worry about the ins and outs of finite state machines or push down automata or a whole bunch of stuff in the dragon book.

And all of this was available in Unix Seventh Edition. In 1979. On a computer that only gave each program 64k of memory in which to run.

[edit: Changed to reflect the fact that I was talking more about parsing than compiling in general. From what I have been able to gather from old books on compilers, parsing used to dominate the compiler writing effort much more than it does nowadays, but the distinction is still a valid one. Thanks to simen and Blackheart for pointing this out in their comments on reddit.]