Chapter 12 レキサー・パーザージェネレータ (ocamllex, ocamlyacc)

このページは最後に更新されてから1年以上経過しています。情報が古い可能性がありますので、ご注意ください。

last mod. 2008-10-01 (水) 17:38:16

この章では ocamllex と ocamlyacc の解説をします。ocamllex は、正規表現の集合と、それに対応するセマンティクスから字句解析器を生成し、ocamlyacc は文法と、それに対応するセマンティクスから構文解析器を生成します。

このプログラム生成ツールは、C 言語環境で有名な lex と yacc ととてもよく似ています。この章は lex と yacc の知識があることを前提としています。ocamllex と ocamlyacc に与えるソースの文法や lex と yacc との主な相違点などは解説しますが、lex と yacc での字句解析器や構文解析器の基本的な書き方は解説しません。lex と yacc をよく知らない方は、Compilers: principles, techniques, and tools(Aho, Sethi and Ullman (Addison-Wesley, 1986)) や Lex & Yacc(Levine, Mason and Brown (O'Reilly, 1992)) などを参照してください。

Overview of ocamllex

ocamllex は lex のように、正規表現の集合と、それに対応するセマンティクスから字句解析器を生成します。入力ファイルが lexer.mll であるとき、以下のようにして字句解析器の Caml コードファイル lexer.ml を生成することができます。

       ocamllex lexer.mll

このファイルでは字句解析器の定義で、エントリーポイントあたり 1 つの関数が定義されています。それぞれの関数名はエントリーポイントの名前と同じです。それぞれの字句解析関数は字句解析バッファを引数に取り、それぞれ関連付けられたエントリーポイントの文法属性を返します。

字句解析バッファは標準ライブラリ Lexing モジュール内で抽象データ型 (abstract data type) として実装されています。Lexing.from_channel 、Lexing.from_string 、Lexing.from_function はそれぞれ入力チャンネル、文字列、読み込み関数を読んで、字句解析バッファを返します (Lexing モジュールの解説は 20 章にあります) 。

使用する際は ocamlyacc で生成されるパーザと結合して、構文解析器で定義されている型 token に属する値をセマンティクスに基づいて計算します (ocamlyacc については後述) 。

↑

Options

ocamllexは、以下のコマンドラインオプションをサポートします。

-o output-file: ocamllexによって生成されるファイルの名前を指定します。デフォルトは、ocamllex lexer.mllで起動された場合はlexer.mlです。
-ml: Camlのビルトインのオートマトンインタプリタを使用しないコード生成します。その代わり、オートマトンはCamlの関数にエンコードされます。このオプションは、ocamllexのデバッグに便利ですが、通常用いることはお勧めしません。
-q: Quietモード。ocamllexは通常、標準出力に情報を表示します。-qが使われた場合はこれらが表示されなくなります。
-version: バージョンを表示し終了します。

↑

Syntax of lexer definitions

字句解析の定義の記述法は以下の通りです。

{ header }
let ident = regexp ...
rule entrypoint [arg_1... arg_n] =
  parse regexp { action }
      | ...
      | regexp { action }
and entrypoint [arg_1... arg_n] =
  parse ...
and ...
{ trailer }

コメントは Caml と同じように (* と *) で区切られます。 parseキーワードはshortestキーワードで置き換えられますが、その意味は以下で解説します。

↑

Header and trailer

ヘッダ(header) と trailer セクションには中カッコ({ ... })でくくって任意の Caml コードを入れることが出来ます。これらはどちらも省略可能です。存在すれば、header は出力ファイルの先頭に、trailer は出力ファイルの終端に置かれます。大抵、header セクションには action で必要な openディレクティブを置きます。actions 内で使う補助関数を定義することもあります。

↑

Naming regular expressions

header と entry points のあいだで、頻繁に発生する正規表現に名前を与えることが出来ます。let ident = regexpのように書きます。今後、正規表現を書くところに regexpの代わりに識別子identを使うことが出来ます。

↑

Entry points

エントリーポイントの名前は Caml の値の名前 (小文字で始まる) として有効な識別子である必要があります。同様に、引数arg_1... arg_nもCamlとして有効な識別子でなければなりません。それぞれのエントリーポイントはn+1個の引数を受け問り、最後に追加される暗黙の引数の型はLexing.lexbufです。文字を Lexing.lexbuf から読み、rule にある正規表現にマッチさせます。その正規表現に対応する action が評価され、関数の結果として返されます。

先頭文字列にマッチする正規表現が複数ある場合は、"最大マッチ" する正規表現が適用され、先頭文字列を長くして最後までマッチする正規表現を選択します。それでもふるいきれないときは、rule のなかで先に出現したものが優先されます。

しかし、parseキーワードの代わりにshortestキーワードが用いられている場合は、"最小マッチ"が適用されます:つまり最短の先頭文字列が選択されます。それでもふるいきれないときは、ruleの中で先に出現したものが優先されます。この機能は通常の字句解析で用いられることを意図しておらず、ocamllexを単なるテキスト処理ツールとして用いることを想定しています。

↑

Regular expressions

正規表現は lex と同じですが、より Caml らしい文法です。

 regexp ::=  ...

' regular-char | escape-sequence ': 文字定数。Objective Caml の文字定数と同じ文法です。その文字とマッチします。
_: (アンダースコア) どんな文字とでもマッチします。
eof: 入力終端にマッチします。

警告: システムによっては対話式入力において、end-of-file のあとにさらに文字列が続く場合がありますが、ocamllex では eof の後に何かが続く正規表現を正しく扱うことは出来ません。
" { string-character } ": 文字列定数。Objective Caml の文字列定数と同じ文法です。文字列にマッチします。
[ character-set ]: 指定された文字集合に属する 1 文字とマッチします。有効な文字集合は、文字定数 1 つの ' c ' 、範囲 ' c1 ' - ' c2 ' (c1 と c2 自身を含むその間の文字すべて) 、または 2 つ以上の文字集合を結合したもののどれかです。
[ ^ character-set ]: 指定された文字集合に属さない 1 文字とマッチします。
regexp_1 # regexp_2: (文字集合の差). regexp_1とregexp_2は[... ] (もしくは_以外の文字定数)で定義される文字集合でなければなりません。2つの指定された集合の差とマッチします。
regexp *: (反復) regexp とマッチする文字列が 0 個以上連なった文字列にマッチします。
regexp +: (厳しい反復) regexp とマッチする文字列が 1 個以上連なった文字列にマッチします。
regexp ?: (オプション) 空文字列か、regexp とマッチする文字列にマッチします。
regexp_1 | regexp_2: (どちらか) regexp1 にマッチする文字列か、 regexp2 にマッチする文字列のどちらかにマッチします。
regexp_1 regexp_2: (結合) regexp1 にマッチする文字列のあとに regexp2 にマッチする文字列を結合した文字列にマッチします。
( regexp ): regexp と同じです。
ident: 前もって let ident = regexp で定義された ident に割り当てられている正規表現になります。
regexp as ident: regexpにマッチした部分文字列を識別子identに束縛します

演算子の優先順位は、* と + が最優先、次に ? 、次に結合、最後に | になります。

↑

Actions

action は Caml の任意な式です。これらの式は lexbuf 識別子に現在の字句解析バッファが割り当てられた環境で評価されます。lexbuf の主な利用法を、Lexing 標準ライブラリモジュールの字句解析バッファ処理関数と関連付けながら以下に示します。

Lexing.lexeme lexbuf: マッチした文字列を返します。
Lexing.lexeme_char lexbuf n: マッチした文字列の n 番目の文字を返します。最初の文字は n = 0 になります。
Lexing.lexeme_start lexbuf: マッチした文字列の先頭の文字が入力文字列全体において何番目の文字であるかを返します。入力文字列の先頭は 0 です。
Lexing.lexeme_end lexbuf: マッチした文字列の末尾の文字が入力文字列全体において何番目の文字であるかを返します。入力文字列の先頭は 0 です。
entrypoint [exp_1 ... exp_n] lexbuf: (entrypoint には、同じ字句解析器の定義内にある別のエントリーポイント名が入ります) 字句解析器の指定されたエントリーポイントを再帰的に呼びます。入れ子のコメントなどの字句解析に便利です。

↑

正規表現中の変数

'as'コンストラクタは多くの正規表現パッケージが提供している"グループ"に似ています。これらの変数の型は、string、char、string、option、char optionのいずれかです。

まず線形なパターン、つまりすべての束縛変数がお互いに異なる場合について考えてみましょう。regexp as identにおいて、identの型は通常string(もしくはstring option)になります。例外はregexpが文字列定数、アンダースコア、長さ1の文字列定数、文字集合、これらの選択の場合です。その場合、identの型はchar(もしくはchar option)になります。option型は、ルールのマッチングが必ず成功するとは限らない場合です。例えば、( regexp as ident) ?やregexp_1 | ( regexp_2 as ident )のような場合です。

変数束縛が非線形の場合について考えてみましょう。変数が一度以上束縛される場合、前述のルールは以下のように拡張さえれます。

前述したルールで、全ての変数出現が文字に束縛されている場合、変数は文字変数になります。
いずれかの式がこの変数を束縛しない場合は、変数はoptionになります。

例えば、 ('a' as x) | ( 'a' (_ as x) )' において、変数xはchar型になります。一方、 ("ab" as x) | ( 'a' (_ as x) ? )において変数xはstring option型になります。

多くの場合、連続したマッチがユニークな束縛を生まないことがあります。例えば、abaを (('a'|"ab") as x) (("ba"|'a') as y)はxを"ab"にyを"a"に束縛するか、xを"a"にyを"ba"に束縛するかのいずれかになります。 camllexはこのような曖昧な正規表現の場合、ありうる束縛のうち、いずれかを選びます。ただし、選択の方法は規定されていません。

↑

Reserved identifiers

__ocaml_lex で始まるすべての識別子は ocamllex が予約しています。プログラム中にそのような識別子は使用しないでください。

↑

Overview of ocamlyacc

The ocamlyacc command produces a parser from a context-free grammar specification with attached semantic actions, in the style of yacc. Assuming the input file is grammar.mly, executing

        ocamlyacc options grammar.mly

produces Caml code for a parser in the file grammar.ml, and its interface in file grammar.mli.

The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the ocamllex program. Lexer buffers are an abstract data type implemented in the standard library module Lexing. Tokens are values from the concrete type token, defined in the interface file grammar.mli produced by ocamlyacc.

↑

Syntax of grammar definitions

Grammar definitions have the following format:

%{
  header
%}
  declarations
%%
  rules
%%
  trailer

Comments are enclosed between `/*' and `*/' (as in C) in the "declarations" and "rules" sections, and between `(*' and `*)' (as in Caml) in the "header" and "trailer" sections.

↑

Header and trailer

The header and the trailer sections are Caml code that is copied as is into file grammar.ml. Both sections are optional. The header goes at the beginning of the output file; it usually contains open directives and auxiliary functions required by the semantic actions of the rules. The trailer goes at the end of the output file.

↑

Declarations

Declarations are given one per line. They all start with a `%' sign.

%token constr ...  constr  Declare the given symbols constr ...  constr as

tokens (terminal symbols). These symbols are added as constant constructors for the token concrete type.

%token < typexpr >  constr ...  constr  Declare the given symbols constr ...

constr as tokens with an attached attribute of the given type. These symbols are added as constructors with arguments of the given type for the token concrete type. The typexpr part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified (e.g.

Modname.typename) for all types except standard built-in types, even if the proper `open' directives (e.g. `open Modname') were given in the header section. That's because the header is copied only to the .ml output file, but not to the .mli output file, while the typexpr part of a `%token' declaration is copied to both.

%start symbol ...  symbol  Declare the given symbols as entry points for the

grammar. For each entry point, a parsing function with the same name is defined in the output module. Non-terminals that are not declared as entry points have no such parsing function. Start symbols must be given a type with the `%type' directive below.

%type < typexpr >  symbol ...  symbol  Specify the type of the semantic

attributes for the given symbols. This is mandatory for start symbols only.

Other nonterminal symbols need not be given types by hand: these types will be inferred when running the output files through the Objective Caml compiler (unless the `-s' option is in effect). The typexpr part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified, as explained above for %token.

%left symbol ...  symbol  
%right symbol ...  symbol  
%nonassoc symbol ...  symbol

Associate precedences and associativities to the given symbols. All symbols on the same line are given the same precedence. They have higher precedence than symbols declared before in a `%left', `%right' or `%nonassoc' line.

They have lower precedence than symbols declared after in a `%left', `%right' or `%nonassoc' line. The symbols are declared to associate to the left (`%left'), to the right (`%right'), or to be non-associative (`%nonassoc'). The symbols are usually tokens. They can also be dummy nonterminals, for use with the `%prec' directive inside the rules.

The precedence declarations are used in the following way to resolve reduce/reduce and shift/reduce conflicts:

   - Tokens and rules have precedences. By default, the precedence of a rule
     is the precedence of its rightmost terminal. You can override this
     default by using the %prec directive in the rule. 
   - A reduce/reduce conflict is resolved in favor of the first rule (in the
     order given by the source file), and ocamlyacc outputs a warning. 
   - A shift/reduce conflict is resolved by comparing the precedence of the
     rule to be reduced with the precedence of the token to be shifted. If the
     precedence of the rule is higher, then the rule will be reduced; if the
     precedence of the token is higher, then the token will be shifted. 
   - A shift/reduce conflict between a rule and a token with the same
     precedence will be resolved using the associativity: if the token is
     left-associative, then the parser will reduce; if the token is
     right-associative, then the parser will shift. If the token is
     non-associative, then the parser will declare a syntax error. 
   - When a shift/reduce conflict cannot be resolved using the above method,
     then ocamlyacc will output a warning and the parser will always shift.

↑

Rules

The syntax for rules is as usual:

nonterminal :
    symbol ... symbol { semantic-action }
  | ...
  | symbol ... symbol { semantic-action }
;

Rules can also contain the `%prec 'symbol directive in the right-hand side part, to override the default precedence and associativity of the rule with the precedence and associativity of the given symbol.

Semantic actions are arbitrary Caml expressions, that are evaluated to produce the semantic attribute attached to the defined nonterminal. The semantic actions can access the semantic attributes of the symbols in the right-hand side of the rule with the `$' notation: `$1' is the attribute for the first (leftmost) symbol, `$2' is the attribute for the second symbol, etc.

The rules may contain the special symbol error to indicate resynchronization points, as in yacc.

Actions occurring in the middle of rules are not supported.

Nonterminal symbols are like regular Caml symbols, except that they cannot end with ' (single quote).

↑

Error handling

Error recovery is supported as follows: when the parser reaches an error state (no grammar rules can apply), it calls a function named parse_error with the string "syntax error" as argument. The default parse_error function does nothing and returns, thus initiating error recovery (see below). The user can define a customized parse_error function in the header section of the grammar file.

The parser also enters error recovery mode if one of the grammar actions raises the Parsing.Parse_error exception.

In error recovery mode, the parser discards states from the stack until it reaches a place where the error token can be shifted. It then discards tokens from the input until it finds three successive tokens that can be accepted, and starts processing with the first of these. If no state can be uncovered where the error token can be shifted, then the parser aborts by raising the Parsing.Parse_error exception.

Refer to documentation on yacc for more details and guidance in how to use error recovery.

↑

Options

The ocamlyacc command recognizes the following options:

bprefix Name the output files prefix.ml, prefix.mli, prefix.output, instead of the default naming convention.
v Generate a description of the parsing tables and a report on conflicts resulting from ambiguities in the grammar. The description is put in file grammar.output.
version Print version and exit.

At run-time, the ocamlyacc-generated parser can be debugged by setting the p option in the OCAMLRUNPARAM environment variable (see section 10.2). This causes the pushdown automaton executing the parser to print a trace of its action (tokens shifted, rules reduced, etc). The trace mentions rule numbers and state numbers that can be interpreted by looking at the file grammar.output generated by ocamlyacc -v.

↑

A complete example

The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition:

        /* File parser.mly */
        %token <int> INT
        %token PLUS MINUS TIMES DIV
        %token LPAREN RPAREN
        %token EOL
        %left PLUS MINUS        /* lowest precedence */
        %left TIMES DIV         /* medium precedence */
        %nonassoc UMINUS        /* highest precedence */
        %start main             /* the entry point */
        %type <int> main
        %%
        main:
            expr EOL                { $1 }
        ;
        expr:
            INT                     { $1 }
          | LPAREN expr RPAREN      { $2 }
          | expr PLUS expr          { $1 + $3 }
          | expr MINUS expr         { $1 - $3 }
          | expr TIMES expr         { $1 * $3 }
          | expr DIV expr           { $1 / $3 }
          | MINUS expr %prec UMINUS { - $2 }
        ;

Here is the definition for the corresponding lexer:

        (* File lexer.mll *)
        {
        open Parser        (* The type token is defined in parser.mli *)
        exception Eof
        }
        rule token = parse
            [' ' '\t']     { token lexbuf }     (* skip blanks *)
          | ['\n' ]        { EOL }
          | ['0'-'9']+ as lxm { INT(int_of_string lxm) }
          | '+'            { PLUS }
          | '-'            { MINUS }
          | '*'            { TIMES }
          | '/'            { DIV }
          | '('            { LPAREN }
          | ')'            { RPAREN }
          | eof            { raise Eof }

Here is the main program, that combines the parser with the lexer:

        (* File calc.ml *)
        let _ =
          try
            let lexbuf = Lexing.from_channel stdin in
            while true do
              let result = Parser.main Lexer.token lexbuf in
                print_int result; print_newline(); flush stdout
            done
          with Lexer.Eof ->
            exit 0

To compile everything, execute:

        ocamllex lexer.mll       # generates lexer.ml
        ocamlyacc parser.mly     # generates parser.ml and parser.mli
        ocamlc -c parser.mli
        ocamlc -c lexer.ml
        ocamlc -c parser.ml
        ocamlc -c calc.ml
        ocamlc -o calc lexer.cmo parser.cmo calc.cmo

↑

Common errors

ocamllex: transition table overflow, automaton is too big

The deterministic automata generated by ocamllex are limited to at most 32767 transitions. The message above indicates that your lexer definition is too complex and overflows this limit. This is commonly caused by lexer definitions that have separate rules for each of the alphabetic keywords of the language, as in the following example.

rule token = parse

      "keyword1"   { KWD1 }
    | "keyword2"   { KWD2 }
    | ...
    | "keyword100" { KWD100 }
    | ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
                   { IDENT id}

To keep the generated automata small, rewrite those definitions with only one general "identifier" rule, followed by a hashtable lookup to separate keywords from identifiers:

{ let keyword_table = Hashtbl.create 53
      let _ =

List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok)

                  [ "keyword1", KWD1;
                    "keyword2", KWD2; ...
                    "keyword100", KWD100 ]
    }
    rule token = parse
      ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
                   { try

Hashtbl.find keyword_table id

                     with Not_found ->

IDENT id }

ocamllex: Position memory overflow, too many bindings  The deterministic
automata generated by ocamllex maintains a table of positions inside the scanned lexer buffer. The size of this table is limited to at most 255 cells. This error should not show up in normal situations.

メニュー

Chapter 12 レキサー・パーザージェネレータ (ocamllex, ocamlyacc)

Overview of ocamllex

Options

Syntax of lexer definitions

Header and trailer

Naming regular expressions

Entry points

Regular expressions

Actions

正規表現中の変数

Reserved identifiers

Overview of ocamlyacc

Syntax of grammar definitions

Header and trailer

Declarations

Rules

Error handling

Options

A complete example

Common errors