-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Option to Turn Off _fill_perfdata #168
Add Option to Turn Off _fill_perfdata #168
Conversation
Need to add unit test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few comments/changes, but this mostly looks good.
The one thing I'd mention is that this will have a notable impact in performance in some cases. As you said in the initial PR comment, this functionality can substantially increase the number of rows in the performance dataframe. Given how slow pandas is and given that pandas uses a minimum of 2x the memory that a single DataFrame
should take up, adding a substantial number of nodes will slow down all of Thicket.
Given that, my one remaining comment (which we'll need to discuss) is that it might be better to keep this off by default. Right now in this PR, users have to opt out of this feature, which could destroy Thicket's performance if used in the wrong situation. I'd argue that it's better for users to opt into this feature in situations where they know the performance won't be destroyed. Addressed in the discussion below. This is the default behavior of Thicket right now, so this PR shouldn't change that. However, IMO, this will cause issues in the future.
unique_perf_indices = set(new_thicket.dataframe.index.droplevel("node")) | ||
full_idx = all([idx in unique_perf_indices for idx in new_thicket.profile]) | ||
if not full_idx: | ||
new_thicket = new_thicket.squash() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this is trying to do. From what I can tell, this is trying to see if the profile
list is a subset of the unique profile IDs in the performance dataframe index. If that's true, then the dataframe can store profile IDs that aren't in the Thicket.profile
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is detecting if fill_perfdata
is off. To make it more clear I thought of a couple options:
- I could make it a helper function with a docstring to make it more clear.
- Have a boolean thicket attribute for
fill_perfdata
that can be checked here.
The purpose of this is that there is a case when the index is not full, and a filter is performed on the metadata that nodes disappear from the performance data, if the metadata for those profiles was filtered out. The code above will update the graph using squash if this happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I'm understanding right, what you're trying to do is 1) check if any nodes were removed from the performance data as a side effect of filtering profiles from the metadata and 2) fix the graph with squash
if any nodes were removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If what I wrote above is correct, then this code will not do what you're wanting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code (assuming the set(Series)
thing works) only checks if Thicket.profiles
is a subset of the profile IDs in the performance dataframe index. That tells you nothing about the nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It tells you if there are profiles in the performance dataframe that do not exist in tk.profile
. For example if I simulate removing a profile from the dataframe, it will be false which will trigger a squash:
The actual issue I noticed with this is that they will both always be in sync due to _sync_profile_components
. So what I should check here instead is if len(tk.graph) != len(tk.dataframe.index.get_level_values("node").unique())
which is more explicit anyway. These sort of things will get caught when I add unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code is not checking if there are profiles in the dataframe that don't exist in tk.profile
. It's checking if there are profiles in tk.profile
that don't exist in the dataframe. It's a subtle difference, but it would allow e.g., the dataframe to contain extra profiles that don't exist in tk.profiles
.
I actually just encountered this same type of thing in Befikir's query_stats
PR (#157), and I realized there's a better way to do this check:
sorted(tk.profile) == sorted(unique_perf_indices.tolist())
This will check that both the dataframe and tk.profile
agree in terms of contents and length. Node
objects are comparable via their _hatchet_nid
fields. As a result, the Node
objects within different parts of the same Thicket
object can be compared in this way. This comparison will not work across Thicket
objects though because there's no guarantee that equivalent Node
objects will have the same _hatchet_nid
values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's checking if there are profiles in tk.profile that don't exist in the dataframe.
But that is what I was trying to accomplish with that code. That would indicate that there are missing profiles in the performance data, which is what I was saying. However, it doesn't address
it would allow e.g., the dataframe to contain extra profiles that don't exist in tk.profiles.
which your correct would be an issue, if not for _sync_profile_components
being called earlier. Regardless, what I was actually trying to do is check nodes so I'm using a different check now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, if we are trying to update the state of the profiles in a Thicket we should be using _sync_profile_components
. And if we are trying to validate properties about the state of a Thicket we should use validate_profile
or validate_dataframe
in utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're 100% positive that you've ensured the graph and dataframe contain the same node objects, then this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidentally forget to set "Request changes" on my main review. Refer to the main review for comments.
It's on by default because the default behavior in Thicket has been to fill the index in the performance table. Actually, the default behavior originally was to not fill the index, then there was a design decision made at some point to fill it all the time. My goal with this PR is to at least give the user the option, based on their data. |
Wait, really? In that case, what you have in this PR is fine, but I will second guess that decision. I suspect we haven't run into any major performance issues because we've just used effectively N copies of the same program. But, once we run into less homogeneous data (e.g., MPMD-style data like what my benchmark will create), we will likely start running into major performance issues with this decision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, so long as we are positive that the Node
objects are the same across thicket components.
They should be unless there is a bug. Edit: I just realized we had a basically identical discussion in #173 (comment) so I am referencing it here. I agree that this is a broader issue than both of these PRs. |
_fill_perfdata
adds empty rows for profiles that do not contain data for a node, such that each node has a row entry for every profile. This can be obnoxious in some cases, as in the performance algorithms dataset, 2,531,150 rows in the performance data table with_fill_perfdata
and 163,778 rows with_fill_perfdata
off.This is a small optimization for the reader, as
_fill_perdata
is only called once, most time is spent in_unify
. However, when working with a large amount of files with missing rows, a Thicket with_fill_perfdata
off will be easier and faster to work with.