Parse quoted token like String #392

ScottFreeCode · 2021-10-19T02:54:58Z

I am having trouble figuring out how I can write a token like the built-in String but with different quotation marks and, in my Abs.hs result, only get the content inside the quotes like String does. If I create a token with some kind of quotes it seems that whatever interprets the Abs members has to remove them, whereas this is not necessary for String as far as I've noticed. Is there a way to achieve this?

My first thought was to define the token as just the stuff inside the quotes, then use it a la MyQuotedType . MyQuotedType ::= "'" MyToken "'"; But this causes other keywords to be parsed (or perhaps lexed? I'm not terribly familiar with the distinction) as MyToken even though they are not preceded by the quote; and that breaks (such that it won't even parse) code that was successfully parsing in the version where the quotes are part of the token and get manually unquoted in the interpreter.


ScottFreeCode · 2021-10-19T03:07:04Z

(I realize ' is used by Char, but I don't happen to be using Char. And, I could always change it to backticks or something if I ended up needing to. Some languages use / as quotes for regular expressions. Ideally I'd be able to specify the quotation marks to use/remove.)

ScottFreeCode · 2021-10-19T03:28:42Z

Relatedly, does the content of strings need to be unescaped (e.g. \" -> " and \\ -> \), or does the parser also handle that? If it does, could the parser handle it similarly for custom quoted tokens?

andreasabel · 2021-10-19T09:57:08Z

> I am having trouble figuring out how I can write a token like the built-in String but with different quotation marks and, in my Abs.hs result, only get the content inside the quotes like String does.

This is unfortunately not possible with BNFC. The token types Char, Double, Integer, String are hard-wired and do something special.
All user-defined token types are represented as strings in the abstract syntax, and these strings contain the whole text that matched the respective regular expression.
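
Since a user-defined token arrives with its quotes still attached, the unquoting has to happen in whatever code consumes the abstract syntax. The following is only a minimal sketch of such a step, not anything BNFC generates: `unquote` is a hypothetical helper, and the escape set (`\'`, `\\`, `\n`, `\t`, `\r`, `\f`) is an assumption chosen to mirror the single-quoted token discussed in this thread.

```haskell
-- Hypothetical helper, not part of BNFC's generated code.
-- Strips one leading and one trailing quote character `q` and resolves
-- backslash escapes. Returns Nothing if the input is not well-formed.
unquote :: Char -> String -> Maybe String
unquote q (c : rest)
  | c == q = go rest
  where
    -- A lone quote at the very end closes the token.
    go [x] | x == q = Just []
    -- A backslash consumes the next character as an escape.
    go ('\\' : e : cs) = (decode e :) <$> go cs
    -- Any other character except a bare quote or backslash is kept as-is.
    go (x : cs) | x /= q && x /= '\\' = (x :) <$> go cs
    -- Missing closing quote, bare quote in the middle, or dangling backslash.
    go _ = Nothing

    decode 'n' = '\n'
    decode 't' = '\t'
    decode 'r' = '\r'
    decode 'f' = '\f'
    decode e   = e   -- \' and \\ decode to the character itself
unquote _ _ = Nothing
```

For example, `unquote '\'' "'y'"` yields `Just "y"`, while input without the surrounding quotes yields `Nothing`. Returning `Maybe` is a design choice for the sketch; the lexer's regular expression should already guarantee well-formed input, so a real interpreter might prefer a total `String -> String` function.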

ScottFreeCode · 2021-10-19T10:23:59Z

Thanks for the clarification @andreasabel !

Do you think it would be difficult to add either a directive to the grammar such as quoted '<open>' '<close>' "<escapes>" token … or, more flexibly, a hook to embed Haskell postprocessing (ideally even allow type conversion, but at least String -> String/Text -> Text) in the parsing of a given type or constructor without having to manually modify the generated code?

andreasabel · 2021-10-19T10:36:26Z

A workaround would be to record a patch and apply it each time after BNFC has run (for instance by patching the Makefile to do so).

> a hook to embed Haskell postprocessing (ideally even allow type conversion, but at least String -> String/Text -> Text) in the parsing of a given type or constructor

One design for this would be #267, but I welcome spinning more design ideas!

ScottFreeCode · 2021-10-21T01:57:40Z

I tried patching the lexer like so based on the string handling:

diff --git a/src/MyPackage/Lex.x b/src/MyPackage/Lex.x
index 1234567..1234567 100644
--- a/src/MyPackage/Lex.x
+++ b/src/MyPackage/Lex.x
@@ -25,19 +25,19 @@ $u = [. \n]          -- universal: any character
    \, | \[ | \] | \{ | \} | \: | \[ \]
 
 :-
 
 
 $white+ ;
 @rsyms
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \' ([$u # [\' \\]] | \\ [\' \\ f n r t]) * \'
-    { tok (\p s -> PT p (eitherResIdent T_MyToken s)) }
+    { tok (\p s -> PT p (eitherResIdent T_MyToken $ unescapeInitTail s)) }
 
 $l $i*
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \" ([$u # [\" \\ \n]] | (\\ (\" | \\ | \' | n | t | r | f)))* \"
     { tok (\p s -> PT p (TL $ unescapeInitTail s)) }
 
 $d+
     { tok (\p s -> PT p (TI s))    }
 $d+ \. $d+ (e (\-)? $d+)?
@@ -117,18 +117,19 @@ unescapeInitTail :: Data.Text.Text -> Data.Text.Text
 unescapeInitTail = Data.Text.pack . unesc . tail . Data.Text.unpack
   where
   unesc s = case s of
     '\\':c:cs | elem c ['\"', '\\', '\''] -> c : unesc cs
     '\\':'n':cs  -> '\n' : unesc cs
     '\\':'t':cs  -> '\t' : unesc cs
     '\\':'r':cs  -> '\r' : unesc cs
     '\\':'f':cs  -> '\f' : unesc cs
     '"':[]    -> []
+    '\'':[]   -> []
     c:cs      -> c : unesc cs
     _         -> []
 
 -------------------------------------------------------------------
 -- Alex wrapper code.
 -- A modified "posn" wrapper.
 -------------------------------------------------------------------
 
 data Posn = Pn !Int !Int !Int

This mostly works. However, without the patch I am able to use a grammar that can match two forms:

x y

OR

x 'thing'

And 'thing' can be 'y'.

With the patch, the one case that now fails is that x 'y' is interpreted the same as x y.

I'm guessing I didn't correctly modify the lexer to do what strings are doing.

Any ideas?

(I can rig up a minimal reproducible test if that would help.)

ScottFreeCode · 2021-10-21T02:10:42Z

Ah, never mind. I took a second look at the code, saw the definition of eitherResIdent, and realized it was overriding 'y' with y whenever the unquoted contents match an identifier. So I changed + { tok (\p s -> PT p (eitherResIdent T_MyToken $ unescapeInitTail s)) } to + { tok (\p s -> PT p (T_MyToken $ unescapeInitTail s)) }

Well, I guess I should probably leave these comments up here in case anyone else makes the same mistake. Correct patch:

diff --git a/src/MyPackage/Lex.x b/src/MyPackage/Lex.x
index 1234567..1234567 100644
--- a/src/MyPackage/Lex.x
+++ b/src/MyPackage/Lex.x
@@ -25,19 +25,19 @@ $u = [. \n]          -- universal: any character
    \, | \[ | \] | \{ | \} | \: | \[ \]
 
 :-
 
 
 $white+ ;
 @rsyms
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \' ([$u # [\' \\]] | \\ [\' \\ f n r t]) * \'
-    { tok (\p s -> PT p (eitherResIdent T_MyToken s)) }
+    { tok (\p s -> PT p (T_MyToken $ unescapeInitTail s)) }
 
 $l $i*
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \" ([$u # [\" \\ \n]] | (\\ (\" | \\ | \' | n | t | r | f)))* \"
     { tok (\p s -> PT p (TL $ unescapeInitTail s)) }
 
 $d+
     { tok (\p s -> PT p (TI s))    }
 $d+ \. $d+ (e (\-)? $d+)?
@@ -117,18 +117,19 @@ unescapeInitTail :: Data.Text.Text -> Data.Text.Text
 unescapeInitTail = Data.Text.pack . unesc . tail . Data.Text.unpack
   where
   unesc s = case s of
     '\\':c:cs | elem c ['\"', '\\', '\''] -> c : unesc cs
     '\\':'n':cs  -> '\n' : unesc cs
     '\\':'t':cs  -> '\t' : unesc cs
     '\\':'r':cs  -> '\r' : unesc cs
     '\\':'f':cs  -> '\f' : unesc cs
     '"':[]    -> []
+    '\'':[]   -> []
     c:cs      -> c : unesc cs
     _         -> []
 
 -------------------------------------------------------------------
 -- Alex wrapper code.
 -- A modified "posn" wrapper.
 -------------------------------------------------------------------
 
 data Posn = Pn !Int !Int !Int
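
The eitherResIdent pitfall described above can be modelled in a few lines. This is a deliberately simplified sketch, not the generated Lex.x: the `Tok` type is cut down to three constructors, `reservedWords` is a hypothetical list standing in for the grammar's keywords, and `stripQuotes` is a stand-in for `unescapeInitTail` that ignores escapes.

```haskell
-- Simplified stand-in for the lexer's token type (assumption: the real
-- generated type has more constructors and carries extra data).
data Tok
  = TS String          -- reserved word or symbol
  | TV String          -- identifier
  | T_MyToken String   -- user-defined quoted token
  deriving (Eq, Show)

-- Hypothetical reserved words for a grammar accepting "x y" and "x 'thing'".
reservedWords :: [String]
reservedWords = ["x", "y"]

-- Model of eitherResIdent: if the string is a reserved word, that wins;
-- otherwise the supplied constructor is applied.
eitherResIdent :: (String -> Tok) -> String -> Tok
eitherResIdent mk s
  | s `elem` reservedWords = TS s
  | otherwise              = mk s

-- Stand-in for unescapeInitTail: drops the first and last character.
stripQuotes :: String -> String
stripQuotes s = drop 1 (take (length s - 1) s)

-- Unpatched rule: the quotes are still present, so no keyword can match.
unpatched :: String -> Tok
unpatched = eitherResIdent T_MyToken

-- First patch: unquoting happens before the reserved-word lookup,
-- so 'y' collapses into the keyword y.
firstPatch :: String -> Tok
firstPatch = eitherResIdent T_MyToken . stripQuotes

-- Corrected patch: no reserved-word lookup for the quoted token.
correctedPatch :: String -> Tok
correctedPatch = T_MyToken . stripQuotes
```

In this model `firstPatch "'y'"` produces the keyword token `TS "y"`, whereas `correctedPatch "'y'"` produces `T_MyToken "y"`. In the unpatched rule the lookup is harmless only because the string still carries its quotes.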

ScottFreeCode · 2021-10-21T04:32:11Z

Judging from playing with the Test program, the printer might also need adjusting to restore the quotes around MyToken. I'll have to take a look at that at whatever point I need the printer.
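
For the printer side, the missing piece would be the inverse of the unescaping done in the patched lexer. A rough sketch of what that could look like follows; `quoteMyToken` is a hypothetical helper name, and where exactly it would be called from the generated printer is not covered here.

```haskell
-- Hypothetical inverse of the lexer-side unescaping: wraps the content
-- in single quotes and escapes the characters the lexer rule treats
-- specially.
quoteMyToken :: String -> String
quoteMyToken s = "'" ++ concatMap escape s ++ "'"
  where
    escape '\'' = "\\'"    -- quote must be escaped to not end the token
    escape '\\' = "\\\\"   -- backslash must be escaped to stay literal
    escape '\n' = "\\n"
    escape '\t' = "\\t"
    escape '\r' = "\\r"
    escape '\f' = "\\f"
    escape c    = [c]
```

With this, content `y` prints as `'y'` again, so it no longer collides with the bare keyword y when the printed output is re-parsed.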

andreasabel added the builtin (concerning built-in tokens like Integer, String etc.) and token (concerning token categories) labels Oct 19, 2021

andreasabel added the faq User question label Oct 19, 2021
