Lexer：把源码切成 token

本节阅读量:

上一节先看了目标：我们希望最后得到 AST。

但源码一开始不是 AST，它只是文本：

1

(+ 40 2)

如果直接把这段文本交给 parser，parser 要一边处理字符，一边理解结构，会很乱。

所以第一步先做一件更小的事：

1

source text -> tokens

这一步由 lexer 完成。

token 是什么

token 可以先理解成：

1

源码里有意义的小块。

源码：

1

(+ 40 2)

切成 token 后是：

1

(   +   40   2   )

第一章只有这几类 token：

1
2
3
4
5


(       左括号
)       右括号
+       加号
42      整数字符串
-7      整数字符串

第一章先不单独定义 Token 类型，直接用 std::string 保存每个 token。也就是说，lexer 返回的是一串字符串小块。

注意，lexer 还不理解整棵程序。

它不会判断 (+ 40 2) 是不是一个合法加法，也不会判断 + 后面是不是正好有两个表达式。它只负责把字符切成字符串小块。

1
2


lexer 问：源码里下一个小块是什么？
parser 才问：这些小块组成了什么表达式？

先让括号独立出来

代码在：

1

code/01_numbers/src/front/lexer.cpp

如果只按空白切分，(+ 40 2) 会遇到一个麻烦：

1
2


(+      会粘在一起
2)      也会粘在一起

但在 Lisp 风格语法里，括号本身就是重要 token。我们希望它们独立出来：

1

(   +   40   2   )

所以代码先给括号两边补空格：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


std::string space_around_parens(const std::string& source) {
    std::string text;
    for (char c : source) {
        if (c == '(') {
            text += " ( ";
        } else if (c == ')') {
            text += " ) ";
        } else {
            text += c;
        }
    }
    return text;
}

这段代码可以读成：

1
2
3


如果看到左括号，就放进 " ( "。
如果看到右括号，就放进 " ) "。
其他字符原样放进去。

所以：

1

(+ 40 2)

会先变成类似：

1

 ( + 40 2 )

现在括号已经不会和旁边的字符粘在一起了。

stringstream 的抽取逻辑

接下来用 std::istringstream 按空白抽出一个个小块：

1
2
3
4
5
6


std::istringstream words(space_around_parens(source));

std::string word;
while (words >> word) {
    tokens.push_back(word);
}

std::istringstream 可以先理解成：

1

把一个 string 包装成可以读取的输入流。

它有点像 std::cin，只是 std::cin 从键盘读，std::istringstream 从字符串读。

这里最核心的是这一句：

1

words >> word

当右边是 std::string 时，它会做三件事：

1
2
3


先跳过前面的空白。
再读取一段连续的非空白字符，放进 word。
遇到下一个空白就停。

比如输入流里是：

1

 ( + 40 2 )

抽取过程可以这样看：

1
2
3
4
5
6


第一次 words >> word   word = "("
第二次 words >> word   word = "+"
第三次 words >> word   word = "40"
第四次 words >> word   word = "2"
第五次 words >> word   word = ")"
第六次 words >> word   没有更多小块，while 结束

while (words >> word) 的意思不是“永远循环”。它的意思是：

1
2


只要还能成功抽出一个 word，就进入循环。
抽不出来了，就停止。

每次成功抽出一个 word，就把它放进 tokens：

1

tokens.push_back(word);

最后得到：

1

["(", "+", "40", "2", ")"]

这里的空格、换行、tab 都只是分隔符，不会进入 token。

完整 lexer

第一章的 lex 函数很短：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


std::vector<std::string> lex(const std::string& source) {
    std::vector<std::string> tokens;
    std::istringstream words(space_around_parens(source));

    std::string word;
    while (words >> word) {
        tokens.push_back(word);
    }

    return tokens;
}

这段代码做的事情可以压成三步：

1
2
3


1. 给括号两边补空格。
2. 用 istringstream 按空白抽取 word。
3. 把每个 word 存进 tokens。

所以：

1

(+ (+ 1 2) 3)

会被切成：

1

(   +   (   +   1   2   )   3   )

也就是：

1

["(", "+", "(", "+", "1", "2", ")", "3", ")"]

lexer 不负责判断结构

第一章的 lexer 很朴素。它只切字符串，不做深判断。

比如：

1

-7

lexer 只会得到：

1

["-7"]

它不会在这里决定 -7 是不是合法整数。这个判断交给 parser 里的 is_integer。

再看这个：

1

(+40 2)

给括号补空格后，会变成类似：

1

 ( +40 2 )

所以 token 是：

1

["(", "+40", "2", ")"]

lexer 不会报错。它只是照实切出来。

后面的 parser 看到左括号后，会期待下一个 token 是 "+"。但它实际看到的是 "+40"，于是才报错。

这也是第一章的一个约定：

1

普通小块要用空白或括号边界分开。

所以：

1
2


(+ 40 2)      合法
(+40 2)       不作为合法写法

下一节的 parser 会接过这串 token，把它们组装成真正的表达式结构。

1.2 AST：从字符串到 C++ 可操作的结构

1.4 Parser：把 token 组成 AST

本节目录