Alternative AST tool #4

alsoeric · 2019-11-20T19:41:19Z

http://tree-sitter.github.io/tree-sitter/

This is an incremental AST parsing tool used by atom for its parsing. It may solve some of your AST parsing problems you hinted at in the plug-in description.

I was saying very excited by this development, and it may even get me to buy a sublime license :-)

mpourmpoulis · 2019-11-20T23:19:50Z

Thank you very much for your interest and for your suggestion!!! I definitely will look into it!

Glad you liked it! This project still needs a lot of work but I will try to do my best to keep improving it!

mpourmpoulis · 2019-12-06T13:25:01Z

firstly, O(+oo) thanks for your interest and for your suggestion, I really appreciate it:) in the past couple of weeks
I have been looking into your suggestion , though I haven't been able to devote as much time as I would like to it. Eitherway,here is a small update.

This looks really powerful and has so many features but I absolutely like and make it very strong candidate for after I do a rewrite. I find it very elegant that by means of a unified interface they make it so simple to switch between languages and without any dependencies other than the parsers for your language of interest. I also really like the whole incremental stuff,as having constantly an up to date syntax tree background could be helpful with some ideas I have. And S-expressions can be pretty cool:)

However, the way errors are handled gives me some concern. don't want to bother you with details but it seems that depending on where and what kind of error is present, the S expression and the tree hierarchy can be broken. It seems that for some single errors things are not that bad, and the recovery should be possible, but when you have two or three (or the error is on the right-hand side of an assignment) things can start breaking really bad. not so small sections of tree can become flat losing a lot of information, nodes/tokens can be split and become children of sibling/parent/grandparent nodes, subsequent errors can be ignored and and so can be ; terminating logical lines and so on... The other day I was able to construct test cases where two different pieces of code seem to produce the same tree even though they are semantically different and I expected an error node in one of them:) furthermore,there are cases where it does not seem to hint anything about indentation errors but that is a little bit of a double-edged knife.

in general I like the whole idea and I really hope that there is a neat workaround these issues or that is just me doing something wrong or missing something here, but I fear error recovery might give me more trouble than it currently does. I intent to keep investigating and experimenting as otherwise I really like it! So thanks a lot and sorry for the late reply!

mpourmpoulis · 2020-06-01T09:55:09Z

As there was some discussion in the dictation toolbox general chat Regarding error recovery of tree sitter, I am posting my comment from there for future reference

I first came into contact with tree sitter from an issue at my PythonVoiceCodingPlugin. At first glance it appeared pretty awesome(and it still does) but After playing around a little bit quickly realized that there are some fluctuations in the performance of its error recovery: while generally up to the task, sometimes it felt less than satisfying. without going into full detail

while it does its best to produce a tree no matter what, sometimes trees produced are just broken
recovery sometimes happens in a way that can lose quite a bit of information compared to what a human would infer from the code at hand even in not so complicated cases. Of course there is a bit of subjectivity involved regarding this one but I hope my following examples can give you an idea of what I'm talking about
there is some fragility, there are quite a few cases where tiny modifications can make all the difference whether solving or worsening the problem
it does not have much of a problem treating keywords just like variable names

I will post some quick examples which you can also give them a try online

For instance,

if in b:
	x
print( in b ,f)
[x for  in b]

produces

module [0, 0] - [6, 0])
  if_statement [0, 0] - [1, 2])
    condition: identifier [0, 3] - [0, 5])
    ERROR [0, 6] - [0, 7])
      identifier [0, 6] - [0, 7])
    consequence: block [1, 1] - [1, 2])
      expression_statement [1, 1] - [1, 2])
        identifier [1, 1] - [1, 2])
  expression_statement [2, 0] - [2, 15])
    call [2, 0] - [2, 15])
      function: identifier [2, 0] - [2, 5])
      arguments: argument_list [2, 5] - [2, 15])
        identifier [2, 7] - [2, 9])
        ERROR [2, 10] - [2, 11])
          identifier [2, 10] - [2, 11])
        identifier [2, 13] - [2, 14])
  expression_statement [3, 0] - [3, 13])
    list [3, 0] - [3, 13])
      identifier [3, 1] - [3, 2])
      ERROR [3, 3] - [3, 12])
        identifier [3, 8] - [3, 10])
        identifier [3, 11] - [3, 12])

which is not fully satisfying for me at least.Still all of these examples are kinda individual issues and have a localized impact, so they're not really the end of the world. More frustrating are cases where the structure of the tree gets kinda crippled

if x in :
    print(y in)
    b = x

module [0, 0] - [3, 0])
  ERROR [0, 0] - [2, 9])
    identifier [0, 3] - [0, 4])
    ERROR [0, 8] - [0, 9])
    identifier [1, 4] - [1, 9])
    comparison_operator [1, 10] - [2, 5])
      identifier [1, 10] - [1, 11])
      ERROR [1, 14] - [1, 15])
      identifier [2, 4] - [2, 5])
    identifier [2, 8] - [2, 9])

which can sometimes be caused by multiple or carefully placed single errors( by the way please notice that it believes that y in and b are one thing).but then again missing some edge cases that may not appear frequently in user code or failing to handle adversarial input is not really that much of a problem to be honest. What is a problem, is there exists frequently appearing pattern that can do that: missing or incomplete right hand side of an assignment!

x = 
y = 4

x = y + 
z = 0

x = 
if i ==0:
    print(i)

module [0, 0] - [9, 0])
  expression_statement [0, 0] - [1, 5])
    assignment [0, 0] - [1, 5])
      left: expression_list [0, 0] - [0, 1])
        identifier [0, 0] - [0, 1])
      right: assignment [1, 0] - [1, 5])
        left: expression_list [1, 0] - [1, 1])
          identifier [1, 0] - [1, 1])
        right: expression_list [1, 4] - [1, 5])
          integer [1, 4] - [1, 5])
  expression_statement [3, 0] - [4, 5])
    assignment [3, 0] - [4, 5])
      left: expression_list [3, 0] - [3, 1])
        identifier [3, 0] - [3, 1])
      right: assignment [3, 4] - [4, 5])
        left: expression_list [3, 4] - [4, 1])
          binary_operator [3, 4] - [4, 1])
            left: identifier [3, 4] - [3, 5])
            right: identifier [4, 0] - [4, 1])
        right: expression_list [4, 4] - [4, 5])
          integer [4, 4] - [4, 5])
  expression_statement [6, 0] - [8, 12])
    assignment [6, 0] - [8, 12])
      left: expression_list [6, 0] - [6, 1])
        identifier [6, 0] - [6, 1])
      right: assignment [7, 0] - [8, 12])
        left: expression_list [7, 0] - [7, 8])
          comparison_operator [7, 0] - [7, 8])
            identifier [7, 0] - [7, 2])
            ERROR [7, 3] - [7, 4])
              identifier [7, 3] - [7, 4])
            integer [7, 7] - [7, 8])
        type: type [8, 4] - [8, 12])
          call [8, 4] - [8, 12])
            function: identifier [8, 4] - [8, 9])
            arguments: argument_list [8, 9] - [8, 12])
              identifier [8, 10] - [8, 11])

as it then believes that what ever comes after it is a continuation of that assignment, even though they are separated by newline tokens.These can especially be a problem if you enjoy using a workflow like "delete right, paste something else" like me. There was also a major issue with not allowing for empty blocks which meant that anything coming right after an if,for would be considered to belong in its block regardless of the indentation(!) but these fortunately got batched recently

As a final note, please do not interpret these post as a discouragement attempt against the usege of tree sitter. end of the day I have only carried out a handful of small-scale experiments and I still think it is a pretty powerful and more importantly general-purpose library that has much to offer. I would just like to point out that there are some issues one may not expect at first glance, some of which may require us to search for workarounds depending on what we want our final program to be capable of and what we deem to be important for the end user experience.

mpourmpoulis added the LibrarySuggestion label Nov 20, 2019

mpourmpoulis self-assigned this Nov 21, 2019

mpourmpoulis added code coverage Under Investigation labels Dec 6, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative AST tool #4

Alternative AST tool #4

alsoeric commented Nov 20, 2019

mpourmpoulis commented Nov 20, 2019

mpourmpoulis commented Dec 6, 2019

mpourmpoulis commented Jun 1, 2020

Alternative AST tool #4

Alternative AST tool #4

Comments

alsoeric commented Nov 20, 2019

mpourmpoulis commented Nov 20, 2019

mpourmpoulis commented Dec 6, 2019

mpourmpoulis commented Jun 1, 2020