-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor / performance: Track scope information while traversing #609
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #609 +/- ##
============================================
+ Coverage 81.63% 82.02% +0.38%
+ Complexity 910 882 -28
============================================
Files 61 65 +4
Lines 2075 2097 +22
============================================
+ Hits 1694 1720 +26
+ Misses 381 377 -4
|
} | ||
|
||
if ($node instanceof ClassLike | ||
&& (in_array($childName, ['classMembers', 'interfaceMembers','traitMembers'], true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to strictly
compare here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really - I thought it would be good form, just like using ===
instead of ==
- do you think there's a reason to change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow my benchmarks are showing different results:
https://gist.github.com/jens1o/621c1307ee0b9d618839211688d46dba
PS C:\nginx\html\projects\php7.2-playground> php .\in_array_strict_benchmark.php
Took 0.42s with strict compare
Took 0.35s without strict compare
Took 5.47s with strict compare
Took 0.43s without strict compare
Took 0.54s with strict compare
Took 0.47s without strict compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your benchmark shows searching for an empty string which does not is the case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nikic is it expexcted that strictly searching for any empty string via in_array is slower then when doing it non-strict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, the in_array strict path is slightly less optimized than the non-strict one. Though if you want to micro-optimize this, the way to do it is to use \in_array
(with strict), which will transform to an HT lookup. Or even better, just directly implement it as an HT lookup so you're not dependent on optimization behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't actually mean performance-wise but style-wise.
I think it's better because the code does not perform convoluted type casts under the hood. It's not that the code would be more incorrect that way, just that it feels that way - when I'm searching for a bug, I don't have to wonder if this comparison here does something funny.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... although I would have expected the strict version also to be faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that specific benchmark is rather useless, as the second array contains 0 as first element, which is (non-strict) equal to the empty string, so what you see there is the difference between matching on the first element in the array vs scanning through all 10000 elements and not finding anything.
Good point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that strict comparison is always easier to reason about.
src/Scope/TreeTraverser.php
Outdated
|
||
|
||
// TODO: Handle use (&$x) when $x is not defined in scope. | ||
// TODO: Handle list(...) = $a; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and PHP7.1+ [...] = $a;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I didn't even know about that construct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I added a TODO comment for that and foreach ($a as [...])
@@ -716,6 +603,7 @@ public function resolveExpressionNodeToType($expr) | |||
if ($token === PhpParser\TokenKind::NullReservedWord) { | |||
return new Types\Null_; | |||
} | |||
return new Types\Mixed_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you expect this case to ever get hit? Or is it for future-proofing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were added accidentally (same goes the Types\Mixed_
below) - I tried various things in the same branch if they improved performance, including bailing early here. It did not really affect performance, but I apparently forgot to revert some changes. I can revert this.
$traverser = new TreeTraverser($definitionResolver); | ||
$resultScope = null; | ||
$traverser->traverse( | ||
$sourceFile, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand why you would want to traverse the whole source file from the root if you already have the node and every node is linked to its parent. This seems very inefficient, as you will visit a lot of nodes that you don't care about.
Intuitively, a scope is anything enclosed in by the closest function node, so only that should the nodes inside that boundary should be visited for any needed operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps Scope
is a misnomer, since it also contains information about other names in effect - in particular the resolved name cache depends on the current namespace. Maybe ParsingContext
could be more appropriate.
However, after reading your comment, I now realize that scanning the whole file is not necessary, since the code is not interested in what the current namespace is, only when it changes.
Also, the scope contains $this
variable and $currentSelf
, these depend on the class. These could be handled separately and start the scanning from the function like you said. It might be more efficient on large files.
class Scope | ||
{ | ||
/** | ||
* @var Variable|null "Variable" representing this/self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...or null
if the scope is not inside a class
use Microsoft\PhpParser\Node\QualifiedName; | ||
|
||
/** | ||
* Contains information about variables at a point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate what "point" means here? How is the scope boundary defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scope when evaluating the expression at the node being traversed, if it were an expression.
... perhaps? I'm not sure how to express it clearly.
public function getResolvedName(QualifiedName $name) | ||
{ | ||
$nameStr = (string)$name; | ||
if (array_key_exists($nameStr, $this->resolvedNameCache)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming there are no null
values in the array so isset
would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I'll fix it. I just copy-pasted code from the DocBlock thingie.
/** | ||
* Traversers AST with Scope information. | ||
*/ | ||
class TreeTraverser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser already exposes utilities to traverse the tree (and uses the iterator pattern instead of the visitor pattern, which is superior imo). Given that this is a fair amount of code, could you explain why this is needed? What is the difference between this scope-tree traverser and a general traverser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to look into the parser code and see if we could use it - my initial guess is that we can't though, since the traversing needs to happen in the following fashion:
Scope = [] // Scope 1 - Root
Scope = [$this] // Scope 2 - Class entered
Scope = [$this, $a] // Scope 3 - Function entered
Scope = [$this] // !! Scope 2 - Same scope again
Scope = [] // !! Scope 1 - Same scope again
The lines marked with !!
- if we only get a callback per Node
or Token
, we cannot know when a scope is exited. Additionally, we'd need to manage a stack manually instead of relying on the call stack.
Further, we need information on the context of the element - if we have a BracedStatementList
(or whatever it's called), we need to know if it is a functionBodyOrSemicolon
to start a new Scope
, so the Node
alone is not sufficient.
If the parser does not provide a traverser that fits those two needs, I think that working around them might make the code more confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this, and getDescendantNodesAndTokens
or getDescendantNodes
cannot really be used, since we won't know when a node is exited and the scope should be restored.
I think that TreeTraverser
could be converted into a generator yielding something - e.g. rename Scope
to TraversingContext
which contains both $node
and $variables
etc. Do you think that's worth pursuing?
(I accidentally posted this with a wrong GitHub account first - sorry about the noise)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So first of all I still think parent traversal is the better way to go.
And then getDescendentNodes
afaik takes a callback to decide whether a node should be entered or not.
For traversal that needs full control over the recursive aspect I did a PR a while ago to tolerant-php-parser with RecursiveIterator support: microsoft/tolerant-php-parser#139
I didn't need it at the time though and it turned out it was a tad slower than the generators. It would be interesting to see how it compares to this TreeTraverser and the previous implementation, especially on latest PHP 7.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RecursiveIterator
thing certainly looks interesting. It would require manual handling of a stack though, if I understand it correctly.
I think that traversing the code backwards to find types has a few problems, which I tried to solve with this:
1. Duplicated work:
Find references on line 2:
1 $a = new A; // 2. Ok, `$a is A`
2 $b = $a->foo(); // 1. Start traversing - need to find what $a is
3 $b->bar();
After that, find references on line 3:
1 $a = new A; // 3. Duplicated work - Ok, `$a is A`
2 $b = $a->foo(); // 2. Duplicated work - Need to find out what $a is
3 $b->bar(); // 1. Start Traversing - need to find what $b is to get reference to bar
This could of course be solved with some sort of memoization. It would probably end up building a scope, but in reverse order.
2. Slowness of parsing backwards in Tolerant PHP Parser
Getting a previous sibling gets its parent, enumerates its children, and when it finds the original node, it returns the previously met one.
(3. Harder to understand - this is mostly a matter of preference maybe)
Since code is evaluated from top to bottom, an approach evaluating code from top to bottom is easier to understand, at least for me.
I'm going to sleep now, so I won't reply for a while. I might take a look at the other PRs and comments tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Duplicated work:
I don't think this problem has anything to do with how to traverse. The difference is that this PR introduces a Scope object, where before this was always computed JIT. I never had any performance problems with find-refs on variables though because their scope is naturally small in PHP, so it's a CPU vs memory usage trade off.
- Slowness of parsing backwards in Tolerant PHP Parser
Getting a previous sibling gets its parent, enumerates its children, and when it finds the original node, it returns the previously met one.
This is a good point. With PHPParser I added the siblings as properties so that was fast. We could do the same for tolerant-parser.
Are you maintaining the same behaviour as before with the new traverser?
I believe before, I would always look for the closest assignment to the variable, in case it got overridden:
1 $a = 123;
2 $a = 'abc';
3 $a // should jump to L2
For that I find it very natural to search backwards and confusing to search top-down. If you intend to get to L1, then I would agree that top-down is more natural, but perf-wise (leaving aside 2.) the assumption is that variables are defined close to the reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never had any performance problems with find-refs on variables though because their scope is naturally small in PHP, so it's a CPU vs memory usage trade off.
It might be that the performance benefits here come from caching getResolvedName
results in the Scope
object. I started working on this because XDebug profiling showed finding variable types as a significant cost. XDebug has very high overhead though, so it might be completely inaccurate. It might be worth trying to cache the getResolvedName
results alone, especially if you don't like the Scope
stuff.
Are you maintaining the same behaviour as before with the new traverser?
I believe so - the variables in Scope
are overwritten when a new assignment is met, so on line 3 $a
would be a string.
} | ||
|
||
if ($node instanceof ClassLike | ||
&& (in_array($childName, ['classMembers', 'interfaceMembers','traitMembers'], true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that strict comparison is always easier to reason about.
@@ -175,10 +177,15 @@ private function getDocBlock(Node $node) | |||
* | |||
* @param Node $node | |||
* @param string $fqn | |||
* @param Scope|null $scope Scope at the point of Node. If not provided, will be computed from $node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you ensure that the scope passed in is the scope of the node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to think about this a bit more.
One thing that comes to mind is that the Scope
could contain information about the Node
or FunctionLike
it is defined at? The added check would have a performance penalty (which I guess might be significant), since you'd need to find the closest FunctionLike
, ClassLike
, or SourceFileNode
ancestor every time you wanted to check if the scope belongs to a node.
@@ -726,6 +614,7 @@ public function resolveExpressionNodeToType($expr) | |||
if ($def !== null) { | |||
return $def->type; | |||
} | |||
return new Types\Mixed_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixfbecker This is not related to this PR, I left it here by accident. Should I revert it or let it stay?
This PR makes the resolution of variable types go forwards instead of backwards. It keeps an active scope. In my opinion, this makes the code more understandable (although the number of lines increased.) Code duplication is removed between completion for variables and resolving them in
DefinitionResolver
.Additionally, the scope is used to cache results of
getResolvedName
, since it is quite slow, at least with the current Tolerant PHP Parser. This cache is invalidated when a namespace declaration is met.There are no tests for
Scope
yet, in case there are severe issues with the general approach.Edit: This increases performance of
Performance.php
by ~10% on my machine.