Parse quoted token like String #392

Open
ScottFreeCode opened this issue Oct 19, 2021 · 8 comments
Labels
builtin (Concerning built-in tokens like Integer, String etc.), faq (User question), token (Concerning token categories.)

Comments

@ScottFreeCode

I am having trouble figuring out how I can write a token like the built-in String but with different quotation marks and, in my Abs.hs result, only get the content inside the quotes like String does. If I create a token with some kind of quotes it seems that whatever interprets the Abs members has to remove them, whereas this is not necessary for String as far as I've noticed. Is there a way to achieve this?

My first thought was to define the token as just the text inside the quotes and then wrap it in a rule such as MyQuotedType. MyQuotedType ::= "'" MyToken "'"; But this causes other keywords to be parsed (or perhaps lexed? I'm not terribly familiar with the distinction) as MyToken even when they are not preceded by a quote. That breaks code (it won't even parse) that parsed successfully in the version where the quotes are part of the token and are removed manually in the interpreter.

@ScottFreeCode
Author

(I realize ' is used by Char, but I don't happen to be using Char. And, I could always change it to backticks or something if I ended up needing to. Some languages use / as quotes for regular expressions. Ideally I'd be able to specify the quotation marks to use/remove.)

@ScottFreeCode
Author

Relatedly, does the content of strings need to be unescaped (e.g. \" -> " and \\ -> \), or does the parser already handle that? If so, could the parser handle it similarly for custom quoted tokens?

@andreasabel added the builtin and token labels Oct 19, 2021
@andreasabel
Member

> I am having trouble figuring out how I can write a token like the built-in String but with different quotation marks and, in my Abs.hs result, only get the content inside the quotes like String does.

This is unfortunately not possible with BNFC. The token types Char, Double, Integer, String are hard-wired and do something special.
All user-defined token types are represented as strings in the abstract syntax, and each such string contains the entire text matched by the token's regular expression, quotes included.
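
To illustrate what "remove them in the interpreter" can look like, here is a minimal sketch. It assumes the generated abstract syntax wraps the token in a newtype around String; the real wrapper may hold Text or a position instead. The names MyToken and unquote are hypothetical, and the newtype is redefined locally only so the snippet stands alone.

```haskell
-- Stand-in for the wrapper generated for a user-defined token
-- (assumed shape; the generated type may differ).
newtype MyToken = MyToken String
  deriving (Eq, Show)

-- Strip one leading and one trailing single quote when both are present.
-- Escape sequences inside the quotes are left untouched.
unquote :: MyToken -> Maybe String
unquote (MyToken s) = case s of
  '\'' : rest@(_ : _) | last rest == '\'' -> Just (init rest)
  _ -> Nothing
```

Returning Maybe keeps the function total if a token ever arrives without its quotes; an interpreter that trusts the lexer could fall back to the raw text instead.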

@andreasabel added the faq label Oct 19, 2021
@ScottFreeCode
Author

Thanks for the clarification @andreasabel !

Do you think it would be difficult to add either a directive to the grammar such as quoted '<open>' '<close>' "<escapes>" token … or, more flexibly, a hook to embed Haskell postprocessing (ideally even allow type conversion, but at least String -> String/Text -> Text) in the parsing of a given type or constructor without having to manually modify the generated code?

@andreasabel
Member

A workaround would be to record a patch and apply it each time after BNFC has run (e.g. by adding the patch step to the Makefile).

> a hook to embed Haskell postprocessing (ideally even allow type conversion, but at least String -> String/Text -> Text) in the parsing of a given type or constructor

One design for this would be #267, but I welcome spinning more design ideas!

@ScottFreeCode
Author

I tried patching the lexer like so based on the string handling:

diff --git a/src/MyPackage/Lex.x b/src/MyPackage/Lex.x
index 1234567..1234567 100644
--- a/src/MyPackage/Lex.x
+++ b/src/MyPackage/Lex.x
@@ -25,19 +25,19 @@ $u = [. \n]          -- universal: any character
    \, | \[ | \] | \{ | \} | \: | \[ \]
 
 :-
 
 
 $white+ ;
 @rsyms
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \' ([$u # [\' \\]] | \\ [\' \\ f n r t]) * \'
-    { tok (\p s -> PT p (eitherResIdent T_MyToken s)) }
+    { tok (\p s -> PT p (eitherResIdent T_MyToken $ unescapeInitTail s)) }
 
 $l $i*
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \" ([$u # [\" \\ \n]] | (\\ (\" | \\ | \' | n | t | r | f)))* \"
     { tok (\p s -> PT p (TL $ unescapeInitTail s)) }
 
 $d+
     { tok (\p s -> PT p (TI s))    }
 $d+ \. $d+ (e (\-)? $d+)?
@@ -117,18 +117,19 @@ unescapeInitTail :: Data.Text.Text -> Data.Text.Text
 unescapeInitTail = Data.Text.pack . unesc . tail . Data.Text.unpack
   where
   unesc s = case s of
     '\\':c:cs | elem c ['\"', '\\', '\''] -> c : unesc cs
     '\\':'n':cs  -> '\n' : unesc cs
     '\\':'t':cs  -> '\t' : unesc cs
     '\\':'r':cs  -> '\r' : unesc cs
     '\\':'f':cs  -> '\f' : unesc cs
     '"':[]    -> []
+    '\'':[]   -> []
     c:cs      -> c : unesc cs
     _         -> []
 
 -------------------------------------------------------------------
 -- Alex wrapper code.
 -- A modified "posn" wrapper.
 -------------------------------------------------------------------
 
 data Posn = Pn !Int !Int !Int

This mostly works. However, without the patch I am able to use a grammar that can match two forms:

x y

OR

x 'thing'

And 'thing' can be 'y'.

With the patch, the one case that now fails is that x 'y' is interpreted the same as x y.

I'm guessing I didn't correctly modify the lexer to do what strings are doing.

Any ideas?

(I can rig up a minimal reproducible test if that would help.)

@ScottFreeCode
Author

Ah, never mind. I took a second look at the code, saw the definition of eitherResIdent, and realized it was overriding 'y' with y whenever the unquoted contents match an identifier. So I changed { tok (\p s -> PT p (eitherResIdent T_MyToken $ unescapeInitTail s)) } to { tok (\p s -> PT p (T_MyToken $ unescapeInitTail s)) }.
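
For anyone reading along, the effect can be reproduced with a simplified model. This is not the generated lexer code: it assumes eitherResIdent behaves as a reserved-word lookup that falls back to the supplied constructor, and the Tok type, the resWords table, and the name eitherResIdentModel are hypothetical stand-ins.

```haskell
-- Simplified token type: a reserved word or a user-defined token.
data Tok = TS String | T_MyToken String
  deriving (Eq, Show)

-- Hypothetical reserved-word table; "y" is assumed to be a keyword
-- of the grammar, as in the failing example.
resWords :: [String]
resWords = ["x", "y"]

-- Model of the lookup: a reserved word wins over the constructor.
eitherResIdentModel :: (String -> Tok) -> String -> Tok
eitherResIdentModel mk s
  | s `elem` resWords = TS s
  | otherwise         = mk s
```

Under this model, the quoted text 'y' never collides with a keyword, but once the quotes are stripped the lookup sees y and returns the keyword token. Applying T_MyToken directly skips the lookup, which is what the corrected patch does.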

Well, I guess I should probably leave these comments up here in case anyone else makes the same mistake. Correct patch:

diff --git a/src/MyPackage/Lex.x b/src/MyPackage/Lex.x
index 1234567..1234567 100644
--- a/src/MyPackage/Lex.x
+++ b/src/MyPackage/Lex.x
@@ -25,19 +25,19 @@ $u = [. \n]          -- universal: any character
    \, | \[ | \] | \{ | \} | \: | \[ \]
 
 :-
 
 
 $white+ ;
 @rsyms
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \' ([$u # [\' \\]] | \\ [\' \\ f n r t]) * \'
-    { tok (\p s -> PT p (eitherResIdent T_MyToken s)) }
+    { tok (\p s -> PT p (T_MyToken $ unescapeInitTail s)) }
 
 $l $i*
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \" ([$u # [\" \\ \n]] | (\\ (\" | \\ | \' | n | t | r | f)))* \"
     { tok (\p s -> PT p (TL $ unescapeInitTail s)) }
 
 $d+
     { tok (\p s -> PT p (TI s))    }
 $d+ \. $d+ (e (\-)? $d+)?
@@ -117,18 +117,19 @@ unescapeInitTail :: Data.Text.Text -> Data.Text.Text
 unescapeInitTail = Data.Text.pack . unesc . tail . Data.Text.unpack
   where
   unesc s = case s of
     '\\':c:cs | elem c ['\"', '\\', '\''] -> c : unesc cs
     '\\':'n':cs  -> '\n' : unesc cs
     '\\':'t':cs  -> '\t' : unesc cs
     '\\':'r':cs  -> '\r' : unesc cs
     '\\':'f':cs  -> '\f' : unesc cs
     '"':[]    -> []
+    '\'':[]   -> []
     c:cs      -> c : unesc cs
     _         -> []
 
 -------------------------------------------------------------------
 -- Alex wrapper code.
 -- A modified "posn" wrapper.
 -------------------------------------------------------------------
 
 data Posn = Pn !Int !Int !Int
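
The patch above hard-codes the single quote next to the double quote. As a possible variation, the closing quote could be passed in as a parameter, so one helper would serve backticks or slashes as well. The sketch below works on String rather than Text to stay dependency-free; the name unescapeQuoted is hypothetical and not part of the generated lexer.

```haskell
-- Drop the opening quote, unescape the body, and stop at the closing
-- quote when it is the last character. The quote character is a parameter.
unescapeQuoted :: Char -> String -> String
unescapeQuoted q = unesc . drop 1
  where
    unesc s = case s of
      '\\' : c : cs | c `elem` [q, '\\'] -> c : unesc cs
      '\\' : 'n' : cs -> '\n' : unesc cs
      '\\' : 't' : cs -> '\t' : unesc cs
      '\\' : 'r' : cs -> '\r' : unesc cs
      '\\' : 'f' : cs -> '\f' : unesc cs
      [c] | c == q    -> []          -- closing quote
      c : cs          -> c : unesc cs
      []              -> []
```

In a patched Lex.x, the action for a custom token might then call something like unescapeQuoted '\'' after converting from Text, leaving the String rule untouched.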

@ScottFreeCode
Author

Judging from playing with the Test program, the printer might also need adjusting to restore the quotes around MyToken. I'll have to take a look at that at whatever point I need the printer.
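
As a rough idea of what that adjustment might involve, the inverse of the unescaping step is sketched below: escape the content and wrap it in the chosen quote character. This is a standalone helper under the assumption that the printer can be patched to call it for MyToken; the name requote is hypothetical and the generated printer is not shown.

```haskell
-- Escape backslashes, the quote character, and the control characters
-- the lexer rule accepts as escapes, then wrap the result in quotes.
requote :: Char -> String -> String
requote q s = q : concatMap esc s ++ [q]
  where
    esc c
      | c == q || c == '\\' = ['\\', c]
      | c == '\n' = "\\n"
      | c == '\t' = "\\t"
      | c == '\r' = "\\r"
      | c == '\f' = "\\f"
      | otherwise = [c]
```

For example, requote '\'' applied to the content y gives back 'y', so printed output would lex again as MyToken rather than as the bare keyword.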

2 participants