Lexing of MATLAB looks easy enough at first, but there are many language feature that make a traditional approach (Dragon style lexing + parsing) impossible.
This file attempts to document these, so that others don't make the same mistakes (e.g. "trust me, it's easy, I'll just write a flex lexer and fix some stuff afterwards").
Doesn't exist. This is the first hurdle.
A function without outputs, that takes string arguments such as "disp" or "clear", is normally called like this:
disp('Potato!');
However, you may omit the brackets (and optionally the quotation):
disp Potato!
This is called command form, and requires special treatment in the
lexer. For example \
is not a valid character to appear in MATLAB
program text, but you can write:
cd C:\potato
So, we need to have special rules to detect command form. Thankfully this got a bit easier in a recent MATLAB release and it is documented.
If command form wasn't weird enough already, there are some surprising
properties involving brackets. They are classed into two: opening
({[
, and closing )}]
. If there is an imbalance, whitespace is
included in the string lexed.
For example, this should lex as IDENT(foo) CARRAY(bar
) CARRAY(baz
):
foo bar baz % potato
However, this should lex as IDENT(foo) CARRAY(b)ar baz
). Please
note the trailing whitespace here, it's intentional.
foo b)ar baz % potato
To add to this, you can single-quote strings in command form, and they disappear.
For example, this again lexes as IDENT(foo) CARRAY(bar
) CARRAY(baz
):
foo 'bar' baz % potato
Imbalanced strings create lex errors:
foo 'bar baz % potato
^ error here
Quotations without separating whitespace are concatenated into
whatever is lexed. For example, this lexes as IDENT(foo)
CARRAY(barbaz
). Note no quotations here.
foo bar'baz'
The creates counter-intuitive things like:
foo a' 'b % actually a single CARRAY 'a b'
A double single quote escapes to a single quote:
foo pot''''ato % pot'ato
Finally, this does not work with double-quotes. The following lexes as
IDENT(foo) CARRAY("bar"
). Note the double-quotes are part of the
single-quoted string.
foo "bar"
The final cherry on top is that continuations can appear in command
forms as well, and they carry on to the next line. This lexes as
IDENT(foo) CARRAY(bar
) CONTINUATION CARRAY(baz
).
foo bar ...
baz
However, continuations are not allowed to be the first thing in a command-form. This lexes as IDENT(foo) CONTINUATION IDENT(bar) CONTINUATION IDENT(baz).
foo ...
bar ...
baz
The allowed syntax for numbers is extremely free, which creates ambiguities that need to be manually resolved. For example the following are all valid number tokens:
1
2.0
3.
.4
In a classic lexer, this would create an ambiguity for this:
1.1.1
In older versions of MATLAB (R) this used to parse, but now thankfully it doesn't. But you still need to add a special error checking for number lexing to raise an error in this specific case.
Also note that the ./
operator (and the other 4 operators that start
with a '.') create an ambiguity here. Consider the following expression:
x = 1./b
In MATLAB and Octave this appears to be resolved by giving the ./
higher precedence, so the individual tokens here should be 1
, ./
,
and b
.
The single quote has two meanings. It can either be a transpose
operation (e.g. a'
) or a character array (e.g. 'foo'
). It also
means some expressions require you to use intermediate values. For
example the following cannot be simplified to y = 'foo'';
:
x = 'foo';
y = x';
The rules as to when a '
introduces a string are complex and depend
on the previously lexed token and whitespace before this
token. The following rules are given in order of precedence:
If there is preceding whitespace or we're the first token on the current line then we're always a CARRAY:
'foo'
x = 'foo'
x = [a' 'foo']
If the last token is an identifier, number, transpose operator, or any closing bracket then we deal with a transpose operator.
x = y';
y = 1'';
z = 1.''.';
w = size(1)';
Otherwise we have a character array.
MATLAB (R) features very user friendly ways to write matrices, but they are painful to parse and lex correctly. Inside matrices whitespace matters. The lexer needs to keep track of all brackets so that it can always know if we're inside a matrix (or cell) or not.
The canonical way to write a 2x2 identity matrix is this:
m = [1, 0; 0, 1];
However there are many other ways this can be written:
m = [1 0
0 1];
Or perhaps:
m = [1,0;;;
% potato
,0 1,;];
Rows are separated by either ;
or a newline. Items on each row are
separated by whitespace or commas, or both. Unfortunately this leads
to trouble:
m = [+1 +0
+ 0 +1 + 0];
Why is the final +0
added to the +1
? Why is it not its own element
creating a malformed matrix? Or even worse:
x = [1 + 1] % this is [2]
y = [1 ++ 1] % this is [1 1]
z = [1 +++ 1] % this is also [1 1]
As this depends on whitespace, to deal with this in the parser we have decided to add anonymous commas in the lexer. So the parser sees the following instead:
m = [+1, +0
+ 0, +1 + 0];
x = [1 + 1] % this is [2]
y = [1, ++ 1] % this is [1 1]
z = [1, +++ 1] % this is also [1 1]
The rules when to add such a comma are probably the single most complex feature of the lexer.
First we need to find the first two characters after any
whitespace. Following the 1
in x, this would be +
and for y it
would be ++
. Note that this search must also correctly skip line
continuations.
Second, commas can only appear after some tokens. Specifically identifiers, numbers, char arrays or strings, closing brackets, the 'end' keyword or transpose operations.
Third, we never add commas if no whitespace follows the current token.
So if we have whitespace, the previous token is relevant, and the next
characters exists and is not */\^<>^|=.:!
(indicating a binary
operation) but is not .NUMBER
, or the next two characters are not
==
(again indicating a binary operation) then maybe we need to add a
comma. In order of precedence, based on the next non-whitespace
character:
No comma in these situations:
,
,;
, newline%
(and#
for octave)
Add an anonymous comma in these situations:
- any alphanumeric character
'
or"
- any opening bracket
@
or?
.
(it must be a number literal)-
or+
or~
and the character after is also+
or-
or(
-
or+
or~
and the character after is alphanumeric
Do not add a comma otherwise.
There is one more exception to these already confusing rules, and these are lambda functions. If we encounter a closing bracket of a lambda parameter list, then no comma is inserted:
x = {@(x) 12};
Note that your lexer will need to keep track of, and match, brackets; and distinguish between normal round brackets and a lambda function parameter list.
Assignment targets look like matrices, but they are not. Many things that are permitted in a matrix are not valid in an assignment target. For example the following is not valid:
[x; y] = size(1)';
Ideally, in the parser we know if we're dealing with a matrix or not;
so we can directly reject some expressions. Otherwise we'd have to go
back after we've seen the =
and either re-parse or do much deeper
analysis; or reject during semantic analysis.
To make this easier the lexer should distinguish between a matrix and
assignment targets. This essentially involves looking ahead matching
brackets to find the first non-whitespace character following the
closing ]
. If it is a =
then we deal with an assignment target, if
not then it must be a matrix.
This look-ahead is not easy, since strings can appear legitimately. For example:
[x.('fo)o')] = 10;
While this is not semantically legal (field names cannot have brackets) it should be lexed correctly.
Classic dragon-style lexing and parsing benefits greatly from having keywords. But what is a keyword in MATLAB (R)? It depends on the context, and it is highly malleable.
The documentation gives us a list via the iskeyword
function,
listing e.g. if
, else
, and end
.
However the following is legal:
x = struct()
x.end = 12;
In lexing terms, if the previous token was a selection token (.
)
then the following token is always an identifier.
Having a function named end
is also legal if the function is a class
method.
~
is an operator, but it can be a legal identifier in an assignment
target (to indicate "don't care"):
[~, x] = size(1);
In lexing terms, we have decided to keep it as an operator, but have a
special rule in the parser rule for identifier that allows the ~
in
some cases. A similar rule exists for end
to allow the following
expressions:
y = x(1:end, end-1:end-1);
Some keywords are keywords only in some contexts. Specifically
classdef
introduces properties
, enumeration
, events
, and
methods
as keywords. The methods
block removes this limitation
again. Hence you can have a method called properties
, but not a
property called properties
.
Since 2019b the function
keyword makes arguments
become a keyword
in the top-level block.
function potato(x)
arguments % keyword
x uint
end
try
arguments = 12; % identifier
end
arguments = 42; % keyword again not allowed in MISS_HIT
Note that keeping track of blocks is essential anyway, since the
argument validation x uint
would normally be lexed as
command-form. But inside properties
, events
, enumeration
, and
arguments
blocks the command-form detection must always be turned
off.
Block comments can be nested (this is not documented officially). The following will print 1 and 5:
disp 1
%{
disp 2
%{
disp 3
%}
disp 4
%}
disp 5
If any text appears before or after the %{
or %}
the block comment
is ignored. For example this will print 1 2 4 5.
disp 1 %{
disp 2
%{
disp 3
%}
disp 4
%}
disp 5
MISS_HIT Lint will emit warnings about comments that contain %{
and
%}
unless they are proper block comments.
While unicode is not permitted in identifiers, it is permitted in program texts and literals. This can lead to surprising programs:
potato foo bar ٪ This will print foo and bar
function potato(varargin)
for i = 1:nargin
disp(varargin{i});
end
end
Here we have a clever use of the arabic percent ٪
. This is not %
,
which starts a comment. Instead we print:
foo
bar
٪
This
will
print
foo
and
bar
If the ٪
is a %
then the program would only print:
foo
bar