Laboratory 12

Calcolo numerico per la generazione di immagini fotorealistiche

Maurizio Tomasi

Implementing a lexer

Example

# Declare a floating-point variable named "clock"
float clock(150)

# Declare a few new materials. Each of them includes a BRDF and a pigment

# We can split a definition over multiple lines and indent them as we like
material sky_material(
    diffuse(image("sky-dome.pfm")),
    uniform(<0.7, 0.5, 1>)
)

material ground_material(
    diffuse(checkered(<0.3, 0.5, 0.1>,
                      <0.1, 0.2, 0.5>, 4)),
    uniform(<0, 0, 0>)
)

material sphere_material(
    specular(uniform(<0.5, 0.5, 0.5>)),
    uniform(<0, 0, 0>)
)

# Define a few shapes
sphere(sphere_material, translation([0, 0, 1]))

# The language is flexible enough to permit spaces before "("
plane (ground_material, identity)

# Here we use the "clock" variable! Note that vectors are notated using
# square brackets ([]) instead of angular brackets (<>) like colors, and
# that we can compose transformations through the "*" operator
plane(sky_material, translation([0, 0, 100]) * rotation_y(clock))

# Define a camera
camera(perspective, rotation_z(30) * translation([-4, 0, 1]), 1.0, 1.0)

This is the kind of file for which our lexer will have to produce a list of tokens.

Error Handling

Error Conditions

  • Even while writing the lexer, before dealing with the syntactic and semantic aspects, it is possible to encounter errors in the code.

  • For example, a character like @ is not allowed in our language, and the lexer can already detect this kind of error.

  • Another example is forgetting to close a double quote " at the end of a string.

How to Report Errors

  • In modern compilers, the Token type contains information about the position of the token in the source file (see for example the Token type in version 10.0.0 of the Clang compiler: it’s not a very elegant implementation, but it is optimized for efficiency!).

  • This information is used by the lexer and the parser to print error messages like the following (produced by Clang 10):

    test.cpp:31:15: error: no viable conversion from 'int' to 'std::string'
          (aka 'basic_string<char>')
      std::string message = 127;
                  ^         ~~~

    where the file name (test.cpp), the line number (31), and the column number (15) of the error are indicated.

Tracking Positions

  • The position of a token in a file is identified by three pieces of information:

    1. The name of the source file (a string);
    2. The line number (an integer, starting from 1);
    3. The column number (again an integer, starting from 1).
  • The Token type should therefore contain these three fields. In PyTracer I created a SourceLocation type. (But it would be more efficient to keep a list of file names, and use an index to the file here!)

    from dataclasses import dataclass

    @dataclass
    class SourceLocation:
        file_name: str = ""
        line_num: int = 0
        col_num: int = 0

Positions and Tokens

  • If you use a class hierarchy, put a field of type SourceLocation in the base class Token:

    @dataclass
    class Token:
        """A lexical token, used when parsing a scene file"""
        location: SourceLocation
  • If you use sum types, remember to properly use tags. (In C++, implement them using an enum class, since this language does not support tags natively.)

Reporting Errors

  • The most practical way to report errors is to raise an exception: pytracer defines GrammarError (sketched after this list).

  • The error message and a SourceLocation are associated with the exception: this way you can tell the user the location where the error was found.

  • Using exceptions implies that as soon as an error is found, compilation stops. This is not very user-friendly: if compilation is a slow process (e.g., C++, Rust), it would be better to produce a list of errors.

  • We will not do this: our compiler will be very fast, and it is not easy to implement a method to produce a list of errors, because of the way parsing is performed (see the next lesson).
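  • A minimal sketch of such an exception type, assuming the SourceLocation dataclass defined above (the exact definition in pytracer may differ):

    class GrammarError(Exception):
        """An error found while parsing a scene file

        The `location` field points to the position in the source file
        where the error was detected."""

        def __init__(self, location: SourceLocation, message: str):
            # Format the position as file:line:column, in the same style
            # as the Clang message shown above
            super().__init__(
                f"{location.file_name}:{location.line_num}:{location.col_num}: {message}"
            )
            self.location = location
            self.message = message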

Definition of InputStream

Use of streams in lexing

  • We saw in the theory lesson that it is necessary to read one character at a time from the source file.

  • It is therefore ideal to use a stream to read from a file, taking care to open the file in text mode (it is not a binary file like the PFM format!):

    with open(file_name, "rt") as f:  # "rt" stands for "*R*ead *T*ext"
    
  • We also need the possibility of look-ahead, that is, to read a character and put it back, as well as the ability to keep track of the position in the stream where we have arrived (to produce error messages).

Definition of InputStream

  • The InputStream type must contain these fields:

    1. A stream field, whose type depends on the language you use: for example std::istream in C++;
    2. A location field of type SourceLocation.
  • In addition to these fields, others are needed to implement look-ahead.

Character look-ahead

  • Recall that InputStream must offer the possibility of «putting back» a character through the unread_char function.

  • When putting back a character, you must also put back the position in the file, i.e., the location field.

  • This means that InputStream must also contain the following members:

    1. saved_char (which will contain the «un-read» character, or an empty/null value if unread_char has not been called);
    2. saved_location, which contains the value of SourceLocation associated with saved_char.

Constructor of InputStream

class InputStream:
    def __init__(self, stream, file_name="", tabulations=8):
        self.stream = stream

        # Note that we start counting lines/columns from 1, not from 0
        self.location = SourceLocation(file_name=file_name, line_num=1, col_num=1)

        self.saved_char = ""
        self.saved_location = self.location
        self.tabulations = tabulations
  • If your language supports nullable types (C#, Kotlin), use them to define saved_char.
  • In C++, Rust, or Nim, use optional types (e.g., std::optional in C++).

Position Tracking

  • It is very important to correctly track the position in the file, i.e., to update the location field appropriately.

  • These are the rules to follow:

    1. In the presence of a newline character like \n, increment line_num and reset col_num to 1;
    2. In the presence of \t (tab), increment col_num by a conventional value (usually 4 or 8);
    3. In all other cases, increment col_num and leave line_num untouched.
  • This approach would require additional measures for Unicode characters, but our format only allows ASCII characters (fortunately!).

from copy import copy   # Needed to duplicate SourceLocation objects below

class InputStream:
    def _update_pos(self, ch):
        """Update `location` after having read `ch` from the stream"""
        if ch == "":
            # Nothing to do!
            return
        elif ch == "\n":
            self.location.line_num += 1
            self.location.col_num = 1
        elif ch == "\t":
            self.location.col_num += self.tabulations
        else:
            self.location.col_num += 1

    def read_char(self) -> str:
        """Read a new character from the stream"""
        if self.saved_char != "":
            # Recover the «unread» character and return it
            ch = self.saved_char
            self.saved_char = ""
        else:
            # Read a new character from the stream
            ch = self.stream.read(1)

        self.saved_location = copy(self.location)
        self._update_pos(ch)

        return ch

    def unread_char(self, ch):
        """Push a character back to the stream"""
        assert self.saved_char == ""
        self.saved_char = ch
        self.location = copy(self.saved_location)

The Token Type

The Token Type

  • Decide whether to use a class hierarchy or a tagged union (sum type); in Python I used hierarchies because sum types don’t exist (link to the code).

  • The token types to define are the following (a Python sketch follows the list):

    1. Keyword: use an enumerated type (enum in Nim, D, C#, Rust, Java, enum class in Kotlin), because it will make the parser more efficient;
    2. Identifier: it’s a string;
    3. Literal string: it’s again a string;
    4. Literal number: a floating-point value;
    5. Symbol: a character;
    6. Stop token (see next slide).
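
  • Here is a rough sketch of such a hierarchy, assuming the Token base class and the SourceLocation type shown earlier; the name KeywordEnum and the keywords listed in it are just examples, and the exact definitions in pytracer may differ:

    from enum import Enum
    from dataclasses import dataclass

    class KeywordEnum(Enum):
        # A few sample keywords of our language; add one entry per keyword
        MATERIAL = 1
        SPHERE = 2
        PLANE = 3
        CAMERA = 4

    @dataclass
    class KeywordToken(Token):
        keyword: KeywordEnum

    @dataclass
    class IdentifierToken(Token):
        identifier: str

    @dataclass
    class LiteralStringToken(Token):
        string: str

    @dataclass
    class LiteralNumberToken(Token):
        value: float

    @dataclass
    class SymbolToken(Token):
        symbol: str

    @dataclass
    class StopToken(Token):
        """Signal the end of the input file (see the next slide)"""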

End of File

  • The lexer needs to be able to signal when a file has ended.

  • In the pytracer code I implemented a new «special» token: StopToken. It is emitted when the end of the file is reached (a short usage sketch follows this list).

  • The StopToken trick is not essential: you could simply check when the stream has reached the end…

  • …but it shows how flexible the concept of token can be. Tricks like this are very common in compilers (see for example how Python handles indentation changes at the lexer level).
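
  • For example, a lexer driver could loop like this (a sketch; the file name scene.txt is made up, while InputStream, read_token and StopToken are the types and methods defined in these slides):

    with open("scene.txt", "rt") as f:
        input_file = InputStream(f, file_name="scene.txt")
        while True:
            token = input_file.read_token()
            if isinstance(token, StopToken):
                # End of the file: stop reading tokens
                break
            print(token)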

Whitespace and Newlines

  • Our format ignores spaces and newlines between tokens, as well as comments.

  • To implement read_token, we need a function that skips these characters:

    WHITESPACE = " \t\n\r"
    
    def skip_whitespaces_and_comments(self):
        "Keep reading characters until a non-whitespace/non-comment character is found"
        ch = self.read_char()
        while ch in WHITESPACE or ch == "#":
            if ch == "#":
                # It's a comment! Keep reading until the end of the line
                # (include the case "", the end-of-file)
                while self.read_char() not in ["\r", "\n", ""]:
                    pass
    
            ch = self.read_char()
            if ch == "":
                return
    
        # Put the non-whitespace character back
        self.unread_char(ch)

Reading a token

  • Divide the read_token method into simple functions as I did in pytracer, so that it is clearer to read.

  • After skipping any whitespace and comments, read_token must read the first character ch and decide which token should be created:

    1. If it is a symbol (comma, parenthesis, etc.), it returns a SymbolToken;
    2. If it is a digit, it returns a LiteralNumberToken;
    3. If it is ", it returns a LiteralStringToken;
    4. If it is a sequence of alphabetic characters (a–z) or underscores, it returns a KeywordToken if the sequence is a keyword, or an IdentifierToken otherwise;
    5. If the file is finished, it returns StopToken.
SYMBOLS = "()<>[],*"

def read_token(self) -> Token:
    self.skip_whitespaces_and_comments()

    # At this point we're sure that ch does *not* contain a whitespace character
    ch = self.read_char()
    if ch == "":
        # No more characters in the file, so return a StopToken
        return StopToken(location=self.location)

    # At this point we must check which kind of token begins with `ch`
    # First, we save the position in the stream
    token_location = copy(self.location)

    if ch in SYMBOLS:
        # One-character symbol, like '(' or ','
        return SymbolToken(token_location, ch)
    elif ch == '"':
        # A literal string (used for file names)
        return self._parse_string_token(token_location=token_location)
    elif ch.isdecimal() or ch in ["+", "-", "."]:
        # A floating-point number
        return self._parse_float_token(first_char=ch, token_location=token_location)
    elif ch.isalpha() or ch == "_":
        # Since it begins with an alphabetic character, it must either be a keyword
        # or an identifier
        return self._parse_keyword_or_identifier_token(
            first_char=ch,
            token_location=token_location,
        )
    else:
        # We got some weird character, like '@' or '&'
        raise GrammarError(self.location, f"Invalid character {ch}")

Test

  • Implement two families of tests:
    1. A set of tests for InputStream, which verifies that the position in a file is tracked correctly even when there are newlines or when unread_char is called (a sketch follows this list);
    2. A set of tests for read_token, which verifies that spaces and comments are skipped and that the token sequence is produced correctly.
  • Writing these tests will allow you to familiarize yourself with the types you have defined (especially if you use sum types!), in preparation for the next lesson.
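
  • As an example, here is a minimal sketch of the first kind of test; it uses io.StringIO instead of a real file, and the test name, the file name "memory", and the characters being read are made up:

    from io import StringIO

    def test_input_stream():
        stream = InputStream(StringIO("abc   \nd"), file_name="memory")

        assert stream.location.line_num == 1
        assert stream.location.col_num == 1

        assert stream.read_char() == "a"
        assert stream.location.col_num == 2

        # «Un-read» the character and check that the position was restored
        stream.unread_char("a")
        assert stream.location.col_num == 1
        assert stream.read_char() == "a"

        # Skip "bc" and the three spaces
        for _ in range(5):
            stream.read_char()

        # The newline must move us to the beginning of the next line
        assert stream.read_char() == "\n"
        assert stream.location.line_num == 2
        assert stream.location.col_num == 1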

What to do today

  1. Release version 0.3.0 if you did not already do it last week;
  2. Create a new branch called scenefiles;
  3. Implement SourceLocation;
  4. Implement GrammarError;
  5. Implement InputStream and the associated functions/methods (especially unread_char!);
  6. Implement the Token type, making sure that all token types present in pytracer are included (there are six in total);
  7. Implement the function/method read_token.