A programming language lexer is the part in charge of converting textual code
representation into a structured memory representation. In Red, it is
accomplished by the
load function, which calls
the lower-level transcode
native. Until now, Red relied on a lexer written entirely in the Parse
dialect. However, its parsing rules were designed for easy maintenance rather
than for performance. Rewriting those rules to speed them up could have been
possible, but rewriting the lexer entirely in Red/System would give the
ultimate performance. It might not matter for most user scripts, but given
that Red is also a data format, we need a solution for fast (near-instant)
loading of huge quantities of Red values stored in files or transferred
through the network.
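For example, in the Red console, loading a short source string returns the corresponding Red values:

>> load "print 1 + 2"
== [print 1 + 2]
>> type? first load "3.14 foo"
== float!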
The new lexer's main features are:
- High performance, typically 50 to 200 times faster than the older one.
- New scanning features: identify values and their datatypes without loading them.
- Instrumentation: customize the lexer's behavior at will using an event-oriented API.
The reference documentation is available there. This new lexer has been available in Red's auto-builds since June.
Performance
Vastly increased performance is the main driver for this new lexer. Here is a
small benchmark to give you an idea of how big the gains are.
The benchmarking tasks are:
- 100 x compiler.r: loads 100 times compiler.r source file from memory (~126KB, so about ~12MB in total).
- 1M short integers: loads a string of 1 million `1` separated by a space.
- 1M long integers: loads a string of 1 million `123456789` separated by a space.
- 1M dates: loads a string of 1 million `26/12/2019/10:18:25` separated by a space.
- 1M characters: loads a string of 1 million `#"A"` separated by a space.
- 1M escaped characters: loads a string of 1 million `#"^(1234)"` separated by a space.
- 1M words: loads a string of 1 million `random "abcdefghijk"` separated by a space.
- 100K words: loads a string of 100 thousand `random "abcdefghijk"` separated by a space.
And the results are (on a Core i7-4790K):
Loading Task           v0.6.4 (sec)   Current (sec)   Gain factor
-----------------------------------------------------------------
100 x compiler.r             41.871           0.463            90
1M short integers            14.295           0.071           201
1M long integers             18.105           0.159           114
1M dates                     29.319           0.389            75
1M characters                14.865           0.092           162
1M escaped characters        14.909           0.120           124
1M words                        n/a           1.216           n/a
100K words                   23.183           0.070           331
Notes:
- Only transcode is used in the loading tasks (system/lexer/transcode in 0.6.4).
- The "1M words" task fails on 0.6.4 as the symbol table expansion time is
exponential due to some hashtable bugs. That also explains the big gap for the
"100K words" task. Those issues are fixed in the current version and the
symbol table further optimized for speed. Though, the execution time increase
between 100K and 1M words tests in new lexer is not linear which may be
explained by a high number of collisions in the internal hashtable due to
limited input variability.
- The 0.6.4 lexer can only process strings as input, while the new lexer
internally processes only UTF-8 binary inputs. The input strings were
therefore converted to the new lexer's native format in order to compare their
speed more accurately. Providing a string instead of a binary series as input
to the new lexer incurs, on average, a ~10% speed penalty.
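As a rough sketch of how such a measurement can be reproduced in the console (a simplified illustration, not the exact harness used for the table above), the "1M long integers" task boils down to:

input: make binary! 11'000'000                      ;-- pre-allocate ~11MB for the UTF-8 input
loop 1'000'000 [append input "123456789 "]          ;-- 1 million long integers separated by a space
start: now/precise
transcode input                                     ;-- load all values in one pass
print ["elapsed:" difference now/precise start]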
Scanning
It is now possible to only scan tokens instead of loading them. Basically,
that means identifying a token's length and type without loading it (so
without requiring extra memory and processing time). This is achieved by using
the new scan native.
>> scan "123"
== integer!
>> scan "w:"
== set-word!
>> scan "user@domain.com"
== email!
>> scan "123a"
== error!
It is possible to achieve even higher scanning speed by giving up a bit on
accuracy. That is the purpose of the scan/fast
refinement. It trades some type recognition accuracy for maximum performance. You
can find the list of "guessed" types in the table
there.
>> scan/fast "123"
== integer!
>> scan/fast "a:"
== word!
>> scan/fast "a/b"
== path!
Scanning applies to the first token in the input series. When all tokens from
a given input need to be scanned iteratively, the
/next refinement can be
used. It returns the input series past the current token, allowing the precise
token size in the input string to be determined. It can be combined with
/fast if required. For example:
src: "hello 123 world 456.789" until [ probe first src: scan/next src empty? src: src/2 ]
Outputs:
word!
integer!
word!
float!
Matching by datatype in Parse
The new lexer also enables matching by datatype directly from the Parse dialect.
However, this feature is limited to binary input only.
>> parse to-binary "Hello 2020 World!" [word! integer! word!]
== true
>> parse to-binary "My IP is 192.168.0.1" [3 word! copy ip tuple!]
== true
>> ip
== #{203139322E3136382E302E31}
>> load ip
== 192.168.0.1
Notice that the whitespace in front of tokens is skipped automatically in this
matching mode.
Instrumentation
Lexers in the Red and Rebol world used to be black boxes; this is no longer the
case with Red's new lexer and its tracing capabilities. It is now possible to
provide a callback function that will be called on the lexer events triggered
while parsing tokens. This gives users deeper control, for example allowing
them to:
- Trace the behavior of the lexer for debugging or statistical purposes.
- Catch errors and resume loading by skipping invalid data.
- Transform the input on the fly (to remove/alter some non-loadable parts).
- Extend the lexer with new lexical forms.
- Process serialized Red data without having to fully load the input.
- Extract line comments that would otherwise be lost.
Lexer's tracing mode is activated by using the
/trace refinement on transcode. The syntax is:
transcode/trace <input> <callback>

<input>    : series to load (binary! string!).
<callback> : a callback function to process lexer events (function!).
That function is called on specific events generated by the lexer:
prescan, scan, load, open, close, error. The
callback function and events specification can be found there.
A default tracing callback is provided in
system/lexer/tracer:
>> transcode/trace "hello 123" :system/lexer/tracer
prescan word 1x6 1 " 123"
scan word 1x6 1 " 123"
load word hello 1 " 123"
prescan integer 7x10 1 ""
scan integer 7x10 1 ""
load integer 123 1 ""
== [hello 123]
That tracing function simply prints the lexer event info. If a syntax error
occurs, it cancels the error and resumes loading on the next character after
the error position.
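As an illustration of a custom callback, here is a minimal sketch that silently skips syntax errors instead of throwing them. It assumes the five-argument callback spec described in the reference documentation (event name, input series, token type, line number and token), with a logic! return value where false cancels the default action for the current event:

skip-errors: function [event input type line token][   ;-- assumed spec, per the reference documentation
    event <> 'error                                    ;-- return false only on error events
]
transcode/trace "hello 123a 456" :skip-errors          ;-- errors no longer throw; loading resumes
                                                       ;-- past the error position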
Several more sophisticated examples can be found in our
red/code repository.
Implementation notes
This new lexer has been specifically prototyped and designed for performance.
It relies on a token-oriented pipelined approach consisting of 3 stages:
prescanning, scanning and loading.
Prescanning is achieved using only a
tight loop
and a finite state machine (FSM). The loop reads the UTF-8 encoded input one byte at a time. Each
byte is mapped to a lexical class. The lexical class is then used to transition from one state to another in
the
FSM, using a big
transition table. Once a terminal state (state names
with a `T_` prefix) or input's end is reached, the loop exits, leading to the
next stage. The result of the prescanning stage is to locate a token's
begin/end positions and to give a fairly accurate guess about the token's
datatype. It can also detect some syntax errors if the FSM cannot reach a
proper datatype terminal state. This approach provides the fastest possible
speed for token detection, but it cannot be fully accurate, nor can it deeply
validate the token content for some complex types (e.g. dates).
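To make the idea more concrete, here is a small self-contained toy written in Red (not the actual Red/System code, and with made-up classes, states and table sizes) that follows the same pattern: map each byte to a lexical class, then walk a flat transition table until a terminal state or the end of input is reached, yielding a type guess without loading anything:

classes: make vector! 256                       ;-- toy byte -> class table: 0 delimiter, 1 digit, 2 other
repeat i 256 [
    c: i - 1
    poke classes i case [
        all [c >= 48 c <= 57]  [1]              ;-- '0'..'9'
        find [32 9 10 13] c    [0]              ;-- space, tab, LF, CR
        true                   [2]
    ]
]
;-- toy states: 0 start, 1 integer, 2 word; terminal states: 3 T_INTEGER, 4 T_WORD
transitions: make vector! [0 1 2  3 1 2  4 2 2]

prescan-toy: function [input [binary!]][
    state: 0
    while [all [state < 3  not tail? input]][   ;-- exit on a terminal state or at end of input
        class: pick classes 1 + first input     ;-- first on a binary! yields the byte value
        state: pick transitions 1 + class + (state * 3)
        input: next input
    ]
    if state < 3 [state: state + 2]             ;-- promote an unfinished token at end of input
    pick [integer! word!] state - 2             ;-- the terminal state encodes the guessed type
]

prescan-toy to binary! "123 foo"                ;-- == integer!
prescan-toy to binary! "hello"                  ;-- == word!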
Adding more states would provide greater accuracy and cover more syntactic
forms, but at the cost of growing the transition table a lot, due to the need
to duplicate many states. Currently the table weighs 2440 bytes, which is
already quite big to be kept entirely in the CPU data cache (usually 8, 16 or
32KB per core; the lexical table uses 1024 bytes and there are two other minor
tables used in the tight loop). The data cache also needs to hold the parsed
input data and part of the native stack, so the available space is limited.
The tight loop code is also optimized to keep branch mispredictions as low as
possible. It currently relies on only two branches. The loop code could also
be further reduced by, for example, pre-multiplying the state values to avoid
the multiplication when calculating the table entry offset. However, we need
to wait for a fully optimizing code generation backend before trying to
extract more performance from that loop code, or we might take wrong
directions.
The scanning stage happens when a token has been identified. It consists of
calling, when needed, a scanner function to deep-check the token for errors
and determine its datatype more accurately. The loading stage then follows
(unless only scanning was requested by the user). It calls, when needed, a
loader function that constructs the Red value out of the token. In the case of
any-block series, the scanners actually do the series construction upon
reaching the ending delimiter (which requires
special handling
for paths), so no loader is needed there. Conversely, loaders can be invoked
in validating mode only (not constructing the value), in order to avoid code
duplication when complex code is required for decoding/validating the
token (e.g. date!, time!, strings with UTF-8 decoding, ...).
For the record, there was an
attempt
at creating a specific FSM for parsing date! and time! literal forms, to
reduce the number of rules that need to be handled by pure code. The results
were not conclusive, as the amount of code required for special-case handling
was still significant and the performance of the FSM parsing loop was below
that of the current pure-code version. This approach can be reexamined once we
get the fully optimizing backend.
The FSM states, lexical classes and transitions are documented in the
lexer-states.txt
file. A simple syntax is used to describe the transitions and possible
branching from one state to others. The FSM has three possible entry points:
S_START, S_PATH and
S_M_STRING. Parsing path items requires specific
states even for common types. For curly-braced strings, it is necessary to
exit the FSM on each occurrence of open/close curly braces in order to count
the nested ones and accurately determine where the string ends. In both the
path and string cases, the FSM needs to be re-entered in a state other than
S_START.
In order to build the FSM transition table, there is a workflow that goes from that lexer-states.txt file to the final transition
table data in binary. It basically looks like this:
FSM graph -> Excel file -> CSV file -> binary table
The more detailed steps are:
- Manually edit changes in the lexer-states.txt file.
- Port those changes into the lexer.xlsx file by properly setting the transition values.
- Save that Excel table in CSV format as lexer.csv.
- Run the generate-lexer-table.red script from the Red repo root folder. The lexer-transitions.reds file is then regenerated.
The lexer code relies on several other tables for specific type handling, like
path-ending detection, floating-point number syntax validation, and binary
series and escaped character decoding. Those tables are either written
manually (not expected to ever change) or generated using
this script.
Various other points worth mentioning:
- The lexer works natively with UTF-8 encoded binary buffers as input. If a
string! is provided as input, there is an overhead for internally converting
that string to binary before passing it to the lexer (see the sketch after
this list). A single internal buffer is used for those conversions, with
support for recursive calls.
- The lexer uses a single accumulative cells buffer for storing loaded values,
with an inlined any-block stack.
- The lexer and lexer callbacks are fully recursive and GC-compliant.
Currently, callbacks can only be of type function!; this could be extended in
the future to also support routines, for much faster processing.
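Following up on the first point above: when loading large data sets, it can be worth reading the source as binary, so the lexer receives its native input format and the string conversion overhead is avoided entirely. A minimal sketch, with a hypothetical file name:

data: read/binary %huge-dataset.red     ;-- hypothetical file; read/binary returns a binary! value
values: transcode data                  ;-- UTF-8 binary input, no internal string conversion needed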