Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative AST tool #4

Open
alsoeric opened this issue Nov 20, 2019 · 3 comments
Open

Alternative AST tool #4

alsoeric opened this issue Nov 20, 2019 · 3 comments

Comments

@alsoeric
Copy link

http://tree-sitter.github.io/tree-sitter/

This is an incremental AST parsing tool used by atom for its parsing. It may solve some of your AST parsing problems you hinted at in the plug-in description.

I was saying very excited by this development, and it may even get me to buy a sublime license :-)

@mpourmpoulis
Copy link
Owner

Thank you very much for your interest and for your suggestion!!! I definitely will look into it!

Glad you liked it! This project still needs a lot of work but I will try to do my best to keep improving it!

@mpourmpoulis
Copy link
Owner

firstly, O(+oo) thanks for your interest and for your suggestion, I really appreciate it:) in the past couple of weeks
I have been looking into your suggestion , though I haven't been able to devote as much time as I would like to it. Eitherway,here is a small update.

This looks really powerful and has so many features but I absolutely like and make it very strong candidate for after I do a rewrite. I find it very elegant that by means of a unified interface they make it so simple to switch between languages and without any dependencies other than the parsers for your language of interest. I also really like the whole incremental stuff,as having constantly an up to date syntax tree background could be helpful with some ideas I have. And S-expressions can be pretty cool:)

However, the way errors are handled gives me some concern. don't want to bother you with details but it seems that depending on where and what kind of error is present, the S expression and the tree hierarchy can be broken. It seems that for some single errors things are not that bad, and the recovery should be possible, but when you have two or three (or the error is on the right-hand side of an assignment) things can start breaking really bad. not so small sections of tree can become flat losing a lot of information, nodes/tokens can be split and become children of sibling/parent/grandparent nodes, subsequent errors can be ignored and and so can be ; terminating logical lines and so on... The other day I was able to construct test cases where two different pieces of code seem to produce the same tree even though they are semantically different and I expected an error node in one of them:) furthermore,there are cases where it does not seem to hint anything about indentation errors but that is a little bit of a double-edged knife.

in general I like the whole idea and I really hope that there is a neat workaround these issues or that is just me doing something wrong or missing something here, but I fear error recovery might give me more trouble than it currently does. I intent to keep investigating and experimenting as otherwise I really like it! So thanks a lot and sorry for the late reply!

@mpourmpoulis
Copy link
Owner

As there was some discussion in the dictation toolbox general chat Regarding error recovery of tree sitter, I am posting my comment from there for future reference

I first came into contact with tree sitter from an issue at my PythonVoiceCodingPlugin. At first glance it appeared pretty awesome(and it still does) but After playing around a little bit quickly realized that there are some fluctuations in the performance of its error recovery: while generally up to the task, sometimes it felt less than satisfying. without going into full detail

  • while it does its best to produce a tree no matter what, sometimes trees produced are just broken

  • recovery sometimes happens in a way that can lose quite a bit of information compared to what a human would infer from the code at hand even in not so complicated cases. Of course there is a bit of subjectivity involved regarding this one but I hope my following examples can give you an idea of what I'm talking about

  • there is some fragility, there are quite a few cases where tiny modifications can make all the difference whether solving or worsening the problem

  • it does not have much of a problem treating keywords just like variable names

I will post some quick examples which you can also give them a try online

For instance,

if in b:
	x
print( in b ,f)
[x for  in b]

produces

module [0, 0] - [6, 0])
  if_statement [0, 0] - [1, 2])
    condition: identifier [0, 3] - [0, 5])
    ERROR [0, 6] - [0, 7])
      identifier [0, 6] - [0, 7])
    consequence: block [1, 1] - [1, 2])
      expression_statement [1, 1] - [1, 2])
        identifier [1, 1] - [1, 2])
  expression_statement [2, 0] - [2, 15])
    call [2, 0] - [2, 15])
      function: identifier [2, 0] - [2, 5])
      arguments: argument_list [2, 5] - [2, 15])
        identifier [2, 7] - [2, 9])
        ERROR [2, 10] - [2, 11])
          identifier [2, 10] - [2, 11])
        identifier [2, 13] - [2, 14])
  expression_statement [3, 0] - [3, 13])
    list [3, 0] - [3, 13])
      identifier [3, 1] - [3, 2])
      ERROR [3, 3] - [3, 12])
        identifier [3, 8] - [3, 10])
        identifier [3, 11] - [3, 12])

which is not fully satisfying for me at least.Still all of these examples are kinda individual issues and have a localized impact, so they're not really the end of the world. More frustrating are cases where the structure of the tree gets kinda crippled

if x in :
    print(y in)
    b = x
module [0, 0] - [3, 0])
  ERROR [0, 0] - [2, 9])
    identifier [0, 3] - [0, 4])
    ERROR [0, 8] - [0, 9])
    identifier [1, 4] - [1, 9])
    comparison_operator [1, 10] - [2, 5])
      identifier [1, 10] - [1, 11])
      ERROR [1, 14] - [1, 15])
      identifier [2, 4] - [2, 5])
    identifier [2, 8] - [2, 9])

which can sometimes be caused by multiple or carefully placed single errors( by the way please notice that it believes that y in and b are one thing).but then again missing some edge cases that may not appear frequently in user code or failing to handle adversarial input is not really that much of a problem to be honest. What is a problem, is there exists frequently appearing pattern that can do that: missing or incomplete right hand side of an assignment!

x = 
y = 4

x = y + 
z = 0

x = 
if i ==0:
    print(i)
module [0, 0] - [9, 0])
  expression_statement [0, 0] - [1, 5])
    assignment [0, 0] - [1, 5])
      left: expression_list [0, 0] - [0, 1])
        identifier [0, 0] - [0, 1])
      right: assignment [1, 0] - [1, 5])
        left: expression_list [1, 0] - [1, 1])
          identifier [1, 0] - [1, 1])
        right: expression_list [1, 4] - [1, 5])
          integer [1, 4] - [1, 5])
  expression_statement [3, 0] - [4, 5])
    assignment [3, 0] - [4, 5])
      left: expression_list [3, 0] - [3, 1])
        identifier [3, 0] - [3, 1])
      right: assignment [3, 4] - [4, 5])
        left: expression_list [3, 4] - [4, 1])
          binary_operator [3, 4] - [4, 1])
            left: identifier [3, 4] - [3, 5])
            right: identifier [4, 0] - [4, 1])
        right: expression_list [4, 4] - [4, 5])
          integer [4, 4] - [4, 5])
  expression_statement [6, 0] - [8, 12])
    assignment [6, 0] - [8, 12])
      left: expression_list [6, 0] - [6, 1])
        identifier [6, 0] - [6, 1])
      right: assignment [7, 0] - [8, 12])
        left: expression_list [7, 0] - [7, 8])
          comparison_operator [7, 0] - [7, 8])
            identifier [7, 0] - [7, 2])
            ERROR [7, 3] - [7, 4])
              identifier [7, 3] - [7, 4])
            integer [7, 7] - [7, 8])
        type: type [8, 4] - [8, 12])
          call [8, 4] - [8, 12])
            function: identifier [8, 4] - [8, 9])
            arguments: argument_list [8, 9] - [8, 12])
              identifier [8, 10] - [8, 11])

as it then believes that what ever comes after it is a continuation of that assignment, even though they are separated by newline tokens.These can especially be a problem if you enjoy using a workflow like "delete right, paste something else" like me. There was also a major issue with not allowing for empty blocks which meant that anything coming right after an if,for would be considered to belong in its block regardless of the indentation(!) but these fortunately got batched recently

As a final note, please do not interpret these post as a discouragement attempt against the usege of tree sitter. end of the day I have only carried out a handful of small-scale experiments and I still think it is a pretty powerful and more importantly general-purpose library that has much to offer. I would just like to point out that there are some issues one may not expect at first glance, some of which may require us to search for workarounds depending on what we want our final program to be capable of and what we deem to be important for the end user experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants