Issue 5

by Alex Shinn

2010-09-27 +0000

0. Introduction

Welcome to issue 5 of the Chicken Gazette! We're tentatively switching the Gazette publication to Monday this week, to give people something to read after coming back from the weekend.

1. The Hatching Farm - New Eggs & The Egg Repository

This week Mario Domenech Goulart released a new egg called accents-substitute to replaced accented Latin characters with either HTML entities or their non-accented ASCII equivalents, for when you need to work with ASCII-only text.

Ivan Raikov has been busy, and in addition to many egg updates has released a new egg called cis (compact integer sets) as a possible alternate to last week's featured egg iset. It's less efficient in terms of time and space, but has a simpler implementation for when performance doesn't matter.

Our fearless leader Felix also added a new egg system, inspired by the CL defsystem macro. system lets you define groups of files and their dependencies which can be loaded or compiled, and even re-loaded or compiled keeping track of modified files. Use it for a make alternative in your .setup files, or for rapid development in the repl!

2. The Core - Bleeding Edge Development

It's been another busy week for core development:

Overflow-detection for basic arithmetic ops (+, -, * and /) has been changed to use bit-twiddling instead of "parallel flonum" computations, since 64-bit IEEE doubles have not enough precision to hold the full range of fixnums on 64-bit systems

A serious compiler bug related to inlining was fixed (found with much help by Sven Hartrumpf), and several other bugs reported by Kon Lovett were fixed.

A new foreign type pointer-vectors (vectors of native unboxed pointers) was added, with an API in lolevel.

A simpler alternative to er-macro-transformer, ir-macro-transformer (implicit renaming macros) was added by Peter Bex. See ticket #394 on trac.

But the biggest change: irregex is now the official regex API, and has full library unit status, regex unit is removed and available as an egg (should be fully backwards compatible, as long as (use regex) (import irregex) idiom is used; dependencies on regex unit not in egg repo, though), upgraded irregex version, many upstream bug-fixes and optimizations, with many thanks to Peter Bex and Alex Shinn.

And thanks to Felix for help with the summary!

3. Chicken Talk

The exciting news on the mailing list this week was a performance boost mentioned by Mario Domenech Goulart where the awful web framework ran a benchmark almost 7x faster. This is likely due to a new GC improvement by Felix.

Taylor Venable brought up an issue in the new coops object system involving class redefinition and define-method on a generic not first provided with define-generic. It turns out Chicken will do an implicit define-generic for you as a convenience, but it's probably best to define each generic once explicitly. Also be aware that redefining a class will create a completely new class which instances of the old class will not belong to.

Richard Hollos reported an error compiling on AMD64 Linux, which turned out to be just a chicken and egg problem.

Markus Klotzbuecher provided a patch for the cairo egg, showing activity in the graphical library front.

Finally Felix just announced coops version 1.0! If you've been using tinyclos, give coops a try.

4. Omelette Recipes - Tips and Tricks

We've got a wide variety of options for parsing in Chicken, from the built-in read and extensions thereon, to regular expressions, to a plethora of both specific and general purpose parsing libraries. This week, I want to take a look at one of the general purpose libraries, the packrat egg by Tony Garnock-Jones.

A packrat parser is a parser for Parsing Expression Grammars (PEGs), which is essentially a recursive decent parser with backtracking plus memoization. What does all that mean? Well, a recursive decent parser just looks ahead a fixed number of tokens and dispatches accordingly. Scheme's own s-expressions are a simple example of this, since a datum can be determined by just the first few characters. Since you just determine the rule once in advance and then recurse, this lends itself to very simple parsers, and indeed read is very easy to implement in Scheme.

Adding backtracking removes the restriction on fixing the rule in advance, allowing more complex grammars resembling those of typical parser generators. With backtracking we usually call the library a parser combinator, and there are already a number of examples available including Taylor Campbell's parscheme. The disadvantage of a naive backtracking approach is that it may require exponential time to parse. A packrat parser just adds memoization, which guarantees a linear time parse.

The rules for PEGs look similar to Context Free Grammars (CFGs), but are unambiguous by virtue of being ordered - the leftmost matching rule is always chosen. PEGs also allow and and not rules, which allow them to parse languages that can't be described by CFGs.

The documentation for packrat is not too user-friendly - among other things doesn't include any examples of parsing from actual text. This is because packrat is written to work over abstract streams of tokens, not just text, but text is usually what people want to work with when looking at parsers. So we'll start by writing a version of the expression parser example to work on text.

  ;; Start with the base textual generator from the documentation:
  (define (generator filename port)
    (let ((ateof #f)
          (pos (top-parse-position filename)))
      (lambda ()
        (if ateof
            (values pos #f)
            (let ((x (read-char port)))
              (if (eof-object? x)
                  (begin
                    (set! ateof #t)
                    (values pos #f))
                  (let ((old-pos pos))
                    (set! pos (update-parse-position pos x))
                    (values old-pos (cons x x)))))))))

You can just use this as-is without worrying about the details for now.

  ;; Now define a character-oriented version of the expression parser:
  (define expr-parser
    (packrat-parser expr
      (expr ((a <- mulexp '#\+ b <- mulexp) (+ a b)) ((a <- mulexp) a))
      (mulexp ((a <- simple '#\* b <- simple) (* a b)) ((a <- simple) a))
      (simple ((a <- num) a) (('#\( a <- expr '#\)) a))
      (num ((a <- digits) (string->number (list->string a))))
      (digits ((a <- digit b <- digits) (cons a b)) ((a <- digit) (list a)))
      (digit ((a <- '#\0) a) ((a <- '#\1) a) ((a <- '#\2) a)
             ((a <- '#\3) a) ((a <- '#\4) a) ((a <- '#\5) a)
             ((a <- '#\6) a) ((a <- '#\7) a) ((a <- '#\8) a)
             ((a <- '#\9) a))))

Here we describe our grammar as returning an expr (the initial rule). Following this is a list of definitions for the non-terminals expr, mulexp, simple, num, digits and digit, which may recursively refer to each other.

Each definition has one or more clauses which contain a sequence and an action. The action is normal Scheme code. The sequence is a list of other non-terminals referred to by name, and non-terminals referred to by quoting them. You can also bind a value to be bound in the action by writing <id> <- <pattern>.

So the simplest (though longest) rule is for digit which simply matches any digit character, binds is, and returns it. These are collected by the digits rule which conses a list of one or more digits. Notice that digits first tries to bind more than one digit by recursively matching itself, defaulting to just a single digit only when this fails - the rules are tried in left-to-right order, only backtracking on failure. Finally, num takes the result of digits and constructs a number from it.

The other rules are from the original documentation, and are fairly straightforward parser rules. We've chosen to compute the values directly, using (+ a b) and (* a b), but could just have easily returned the results as s-expressions, e.g. with `(+ ,a ,b) giving the user an AST to work with.

Now to put together the parser and generator, let's write a simple utility function:

  ;; Utility function to parse from strings or ports,
  ;; and return the semantic results on success or #f on failure:
  (define (expr-parse x)
    (let ((g (base-generator->results
              (generator #f (if (string? x) (open-input-string x) x)))))
      (let ((res (expr-parser g)))
        (and (parse-result-successful? res)
             (parse-result-semantic-value res)))))

  ;; Some examples:
  (expr-parse "2") => 2
  (expr-parse "2+2") => 4
  (expr-parse "2+2*7") => 16
  (expr-parse "(2+2)*7") => 28
  (expr-parse "3*4+5*6") => 42

We pass a dummy filename and a port to the generator, and pass the result of this to base-generator->results, then call the parser on it. The parser results are available in parse-result-semantic-value.

Note there's no need for a separate lexer, as the rules are matching directly on characters. You can use a separate lexer if you want, though, by providing a higher-level generator than the one above.

For a more complex example see the packrat alternative in the uri-generic egg.

The egg is still rather primitive, and is missing some useful features such as charsets (which would simplify our "digit" rule greatly), and arbitrary predicates, but is definitely something that can be built on. It would be especially nice if rules could be composed - maybe someone can add that!

5. About the Chicken Gazette

The Gazette is produced weekly by a volunteer from the Chicken community. The latest issue can be found at http://gazette.call-cc.org or you can follow it in your feed reader at http://gazette.call-cc.org/feed.atom. If you'd like to write an issue, check out the instructions and come and find us in #chicken on Freenode!