
direct lake - columns into memory #115

Open
m-kovalsky opened this issue Feb 19, 2024 · 21 comments · Fixed by #118

Comments

@m-kovalsky

When VertiPaq Analyzer is run (inside DAX Studio) against a Direct Lake model, it fires off a bunch of DAX queries which essentially put all the model's columns into memory. This defeats the purpose of a Direct Lake model, where only the columns a user actually queries are loaded into memory. There should be some sort of configuration in VertiPaq Analyzer so that it does not do this by default for Direct Lake models, and so that it obtains as many properties as possible from DMVs and as few as possible from DAX statements. IMO, the default setting for Direct Lake should be to not run any DAX queries against the model, so as not to put any unnecessary columns into memory. This is highly problematic for Direct Lake.

@marcosqlbi
Collaborator

You can already disable all the queries to DirectLake by disabling the "Read statistics from data" setting.
We could add a setting similar to "Read statistics from DirectQuery tables" for DirectLake, but I would keep the default setting enabled (in that case) because the purpose of VertiPaq Analyzer is to provide detailed information about the data model and not to provide information about the transient state of data (e.g., segment temperature and so on). I would like another tool/library to do that.
What do you think about that?

[screenshot: the "Read statistics from data" setting]

@m-kovalsky
Author

If I turn off 'Read statistics from data', I don't get back the column cardinality (which is quite important), because that was being calculated by running a DISTINCTCOUNT against every column in the model. Why isn't this using Statistics_DistinctStates from TMSCHEMA_COLUMN_STORAGES? Then you wouldn't need to run all that DAX and put the columns into memory. That seems like it could solve the problem in a simpler fashion.

Also, I'm not sure if VertiPaq Analyzer considers only the columns where IsResident=True for the memory usage (for Direct Lake models). That would be useful (if it's not done already).
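
For illustration, both of those values can be read straight from the DMVs without issuing any DAX, roughly along these lines (a sketch only, not VertiPaq Analyzer's code; the endpoint below is a placeholder and authentication is omitted):

```csharp
using System;
using Microsoft.AnalysisServices.AdomdClient;

class DirectLakeDmvSketch
{
    static void Main()
    {
        // Placeholder XMLA endpoint and database name - replace with your own.
        const string connectionString =
            "Data Source=powerbi://api.powerbi.com/v1.0/myorg/MyWorkspace;Initial Catalog=MyDirectLakeModel";

        using var connection = new AdomdConnection(connectionString);
        connection.Open();

        // Approximate per-column cardinality without running any DAX.
        // Note: these statistics may be missing or stale for columns that have
        // never been paged into memory (see the discussion further down).
        using (var cmd = new AdomdCommand(
            "SELECT * FROM $SYSTEM.TMSCHEMA_COLUMN_STORAGES", connection))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                Console.WriteLine(
                    $"Storage {reader["ID"]}: ~{reader["Statistics_DistinctStates"]} distinct values");
        }

        // Per-segment residency and size: summing USED_SIZE only where ISRESIDENT
        // is true gives the "warm" memory footprint instead of the full model size.
        using (var cmd = new AdomdCommand(
            "SELECT TABLE_ID, COLUMN_ID, USED_SIZE, ISRESIDENT FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS",
            connection))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                if (Convert.ToBoolean(reader["ISRESIDENT"]))
                    Console.WriteLine(
                        $"{reader["TABLE_ID"]}/{reader["COLUMN_ID"]}: {reader["USED_SIZE"]} bytes resident");
        }
    }
}
```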

@marcosqlbi
Collaborator

Good point - I have to investigate. Originally, we were using the dictionary size as the column cardinality (before we had the TMSCHEMA DMVs), and the DISTINCTCOUNT calculation was the only way to get the right number, as the dictionary can contain more entries than the actual unique values in the data (it happens with incremental refresh or when you refresh a partition).
I will investigate it and keep you posted (not very soon, unfortunately).

@m-kovalsky
Author

What should we do in the meantime? I've had customers complain about this, and I've told them not to use VertiPaq Analyzer for now. There are many others who simply don't know this is happening behind the scenes, and I feel they should know. Large models are more likely to be checked with VertiPaq Analyzer, and those are exactly the ones for which this is problematic.

@marcosqlbi
Collaborator

marcosqlbi commented Feb 19, 2024

Well, the code is open source, if you have time for a pull request... :)

@dgosbell
Collaborator

I don't think any of the DMVs will have cardinality information until the column is paged into memory.

I think a better option might be to treat DL models like DQ models and assume that reading the cardinality statistics is potentially very expensive. We could change the "Read statistics from Direct Query Tables" option to be "Read statistics from Direct Query / Direct Lake Tables". This is already disabled by default, but if people want to take the performance hit to get the cardinality they can choose to do so.

A better option for getting the cardinality of DL models is probably for people to run a notebook against the source delta tables.

@marcosqlbi
Collaborator

@dgosbell I would use two separate settings because, depending on the size of DirectLake databases, the approach could be very different, and I might have scenarios where I use different defaults for the two cases. I am not sure about the default, though. There are reasons why it should also be enabled for DirectLake, but if we skip the statistics collection for all the columns, then at least it would be clear that something is missing and users could act accordingly.

@dgosbell
Collaborator

Yes, a separate setting would also work. And you are right that it should be very rare that it's needed for DQ models, but there may be situations where someone wants to force a full scan of their DL model.

I'm thinking that it might make sense for client tools to have some visual indicator, like an icon or a different background color, when a column has isResident=false, as that would indicate that the cardinality and size columns are probably not accurate.

@marcosqlbi
Collaborator

Yep, I want to spend time on it to improve VertiPaq Analyzer for these cases. I don't have time soon, unfortunately, but it's on the list.

@dgosbell
Collaborator

I'll have a look at doing a pull request for this.

@otykier
Collaborator

otykier commented Feb 22, 2024

This is also relevant for Tabular Editor 3, so I'm just posting here to be notified of any updates. Thanks @dgosbell!

@dgosbell
Collaborator

I'm currently testing adding an extra parameter, tentatively called analyzeDirectLake, to the GetDaxModel function, which defaults to false (similar to the analyzeDirectQuery parameter). But the difference with the DirectLake parameter is that I always run the column and relationship statistics when a column is resident in memory, even if analyzeDirectLake is set to false.

The thinking here is that people will open one or more reports and navigate through all the pages, then run an analysis to see what the typical memory usage is to support viewing those reports. They could use this information to see how close they are to the premium SKU limits, or to identify unused columns to drop from the ETL.

If analyzeDirectLake is set to true, I run the distinct count queries across all columns and then re-run the DMV queries to re-collect the memory usage, since the memory values will have changed for any columns that were not resident during the initial DMV run.
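
In rough terms, the flow I'm testing looks like this (a sketch of the idea only; apart from analyzeDirectLake and analyzeDirectQuery, the type and helper names are placeholders, not the actual library code):

```csharp
using System;
using System.Collections.Generic;

// Placeholder stand-in for the real column metadata.
record ColumnInfo(string Name, bool IsDirectLake, bool IsDirectQuery, bool IsResident);

static class StatsCollectionSketch
{
    public static void UpdateStatistics(IEnumerable<ColumnInfo> columns,
                                        bool analyzeDirectQuery,
                                        bool analyzeDirectLake)
    {
        CollectDmvStatistics();   // cheap: sizes, residency, temperature from DMVs

        foreach (var column in columns)
        {
            // Direct Lake default: only query columns already resident in memory,
            // so the analysis itself never pages anything in.
            bool skipDirectLake  = column.IsDirectLake  && !column.IsResident && !analyzeDirectLake;
            bool skipDirectQuery = column.IsDirectQuery && !analyzeDirectQuery;
            if (skipDirectLake || skipDirectQuery) continue;

            CollectColumnStatistics(column);   // DISTINCTCOUNT-style DAX query
        }

        // If a full Direct Lake scan was forced, the memory picture has changed:
        // re-read the DMVs so the sizes reflect the now-resident columns.
        if (analyzeDirectLake)
            CollectDmvStatistics();
    }

    static void CollectDmvStatistics() =>
        Console.WriteLine("reading DMV statistics...");

    static void CollectColumnStatistics(ColumnInfo column) =>
        Console.WriteLine($"running DAX statistics query for {column.Name}");
}
```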

I'm looking at maybe adding a warning symbol, like this mock-up, for columns that are not resident:
[mock-up screenshot: warning icon shown next to non-resident columns]

I might even see if I can add a toggle switch to the ribbon in DAX Studio to make it easier to toggle this setting on, since I can imagine there may be some scenarios where people will want to just see the current "warm" cache size and others where they will want to force all columns into memory to see the peak size.

@marcosqlbi
Collaborator

I'm not sure this is the right way to go long-term.
Probably, a faster and simplified analysis of the columns and their temperature, with a scheduled refresh (seconds/minutes?), is what people need for "real-time" evaluation. VertiPaq Analyzer in its current form is designed for a complete analysis, and I'd say that saving the VPAX should be enabled only when a full analysis has been completed, or we should display a big warning if a user wants to save the VPAX in an "incomplete" state (DAX Optimizer should probably not accept such "incomplete" files, for example).
If the warning is visible in the VPAX visualization, there should be a button to run a complete analysis without changing the setting in the Options for all future analyses.

@dgosbell
Collaborator

I don't think people would need "real-time" analysis. It would be more of a specific point-in-time snapshot. I don't really like the idea of doing an automated refresh.

> I'd say that saving the VPAX should be enabled only when a full analysis has been completed, or we should display a big warning if a user wants to save the VPAX in an "incomplete" state (DAX Optimizer should probably not accept such "incomplete" files, for example).

The problem here is that users can currently turn off "Read statistics from data" and generate an "incomplete" file like this today.

The problem I see with doing a "complete" analysis with Direct Lake models is the definition of "complete". Is it:

  • every column in the model
  • every column used by reports
  • every column referenced by explicit measures

The problem with querying every column is that you could have a hidden column, like a fact table ID, which normally would not cause issues in a DL model because it is never queried, but VPA querying it could cause issues by either evicting other columns from memory or forcing a fallback to DQ mode.

And obviously, with reports in other workspaces, external tools, and Excel reports, it's tricky to accurately cover the columns used in reports. We'd have to ask the user to try to do this manually, which is not ideal.

One possible compromise might be to use DISCOVER_CALC_DEPENDENCY to get columns referenced in measures and then collect stats for those plus any columns involved in relationships. Would that collect enough information for something like DAX Optimizer?
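
For reference, pulling the measure-to-column dependencies is a single DMV query, so the "minimal" set could be built along these lines (a rough sketch with simplified connection handling, not a complete implementation):

```csharp
using System.Collections.Generic;
using Microsoft.AnalysisServices.AdomdClient;

static class MeasureDependencySketch
{
    // Returns the (table, column) pairs referenced directly by measures.
    // Only direct references are captured here; measure-to-measure chains would
    // need to be walked, and calculated columns show up under a different
    // REFERENCED_OBJECT_TYPE, so treat this as a starting point only.
    public static HashSet<(string Table, string Column)> GetColumnsUsedByMeasures(AdomdConnection connection)
    {
        var columns = new HashSet<(string, string)>();
        using var cmd = new AdomdCommand(
            "SELECT REFERENCED_TABLE, REFERENCED_OBJECT " +
            "FROM $SYSTEM.DISCOVER_CALC_DEPENDENCY " +
            "WHERE OBJECT_TYPE = 'MEASURE' AND REFERENCED_OBJECT_TYPE = 'COLUMN'",
            connection);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            columns.Add((reader.GetString(0), reader.GetString(1)));
        return columns;
    }
}
```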

Then we could have an enum rather than a boolean for the Direct Lake options. And we could do something like the following:

  • ResidentOnly - only collect stats on columns already resident
  • Minimal - only collect stats on columns already resident or those referenced by measures or relationships
  • Full - collect stats on every column
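
As a rough shape (names and where this lives in the libraries are still to be decided), something like:

```csharp
// Sketch of the proposed option - final naming and placement in the VPAX libraries TBD.
public enum DirectLakeExtractionMode
{
    ResidentOnly,   // only collect stats on columns already resident in memory
    Minimal,        // resident columns plus those referenced by measures
                    // (e.g. via DISCOVER_CALC_DEPENDENCY) or used in relationships
    Full            // collect stats on every column, paging everything in
}

public static class DirectLakeScanPolicy
{
    // Illustrative predicate deciding whether a column gets a DISTINCTCOUNT-style
    // DAX query; isResident/isReferenced stand in for the real metadata lookups.
    public static bool ShouldAnalyzeColumn(bool isResident, bool isReferenced, DirectLakeExtractionMode mode) =>
        mode switch
        {
            DirectLakeExtractionMode.ResidentOnly => isResident,
            DirectLakeExtractionMode.Minimal      => isResident || isReferenced,
            DirectLakeExtractionMode.Full         => true,
            _                                     => false
        };
}
```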

It would probably also make sense to store this in the VPAX somewhere so that we can easily tell which option was selected when generating the file, and tools like DAX Optimizer can then treat such files differently if they need to. And a client tool like DAX Studio could display a warning with a button to trigger a higher level of scan if a full scan was not performed.

What do you think of this as an approach?

@otykier
Collaborator

otykier commented Feb 27, 2024

Tagging a few more people for visibility, @DBojsen @greggyb @mlonsk

@marcosqlbi
Collaborator

Good points.

We are already working on an extension of VPAX called TCDX - Tabular Consumer Dependencies eXchange libraries. It's not public yet. It keeps track of dependencies in different ways, from static analysis to log processing. The Tcdx could be the "minimal" tracking. From my point of view, limiting the analysis to measure dependencies within the model is not very useful unless you assume that any visible column is used - and it's still incorrect in several cases (users can use hidden columns anyway).

Moreover, DirectLake has the same behavior as Large data models for VertiPaq, so it's not really a separate use case - I guess that it's more likely you want a Minimal or ResidentOnly scan with DirectLake rather than with VertiPaq, but there are benefits in doing that also for regular imported models.

I think that we should:

  1. Extend the VPAX libraries to support all the scenarios beyond the features we'd like to have in DAX Studio in the short term. This should include Tcdx references, which we'll make public soon.
  2. Implement more options in DAX Studio and/or provide a "smart" option that automatically chooses the most likely combination depending on the model and allows the user to change the behavior of the analysis for the current model without changing the global settings.

Thoughts?

@dgosbell
Collaborator

> From my point of view, limiting the analysis to measure dependencies within the model is not very useful unless you assume that any visible column is used - and it's still incorrect in several cases (users can use hidden columns anyway).

My thinking behind the minimal option was providing just enough information for someone or some tool (i.e., DAX Optimizer) to suggest options for explicit measure optimization. If we could add regularly used columns using information from the TCDX, I think that might also meet the goal of finding a happy middle ground between ResidentOnly and Full.

What do you think - is this an achievable goal? Or am I oversimplifying, and the minimal option is never going to be of much value?

> Moreover, DirectLake has the same behavior as Large data models for VertiPaq, so it's not really a separate use case - I guess that it's more likely you want a Minimal or ResidentOnly scan with DirectLake rather than with VertiPaq, but there are benefits in doing that also for regular imported models.

I did consider whether having something like ResidentOnly would make sense for an import model in the Large storage format. It would not be hard to do, but the large format does not have as much overhead for paging in a non-resident column since there is no transcoding. And an import model has to fit within the memory limits for the premium SKU it is running on. So, I was thinking the little bit of extra time to bring in non-resident columns for a large format import model was probably worth it.

The challenge with Direct Lake is that the total data size can be larger than the premium SKU memory limits. I can see a situation where it might be impossible to do a "complete" scan, since by the time we've finished scanning the last batch of columns for accurate cardinality, the first batch of columns has been paged out.

> 1. Extend the VPAX libraries to support all the scenarios beyond the features we'd like to have in DAX Studio in the short term. This should include Tcdx references, which we'll make public soon.

Agreed. I'm very aware that DAX Studio is only one client of these libraries. I've started experimenting in my own fork to see what's possible, but I want to make sure we reach a consensus on the best strategic implementation before doing a pull request that will change the behaviour. I was not aware of the Tcdx work, but if it makes sense for the broader use cases for VPAX, I don't have any objections to including it.

But I'm also aware that more and more people are trying out Direct Lake, so I'd like to try and get some smarter defaults in place sooner rather than later.

> 2. Implement more options in DAX Studio and/or provide a "smart" option that automatically chooses the most likely combination depending on the model and allows the user to change the behavior of the analysis for the current model without changing the global settings.

Agreed. Currently I'm turning off the "Read statistics from data" option globally if I only want to look at resident columns, and it's a pain. There are a number of different options: I could add a dialog where the default settings could be overridden for the current run, or, if something less than a complete scan was done, I could display a warning bar with buttons to run one of the more intensive scans.

It probably makes sense for @otykier and me to use a similar approach across both TE and DS if we can.

@marcosqlbi
Collaborator

Ok, I'm traveling this week, but I want to make sure we make Tcdx visible in 1-2 weeks so we can start working on a long-term plan for VPAX according to the considerations we made in this thread. I'll keep you posted!

@albertospelta
Collaborator

The new DirectLakeExtractionMode parameter for StatExtractor.UpdateStatisticsModel() is available in version 1.5.0.
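
For reference, a call could look roughly like the following (illustrative only: the parameter names, order, and enum value shown here are assumptions beyond what is stated above, so check the actual 1.5.0 signature and enum in the library source):

```csharp
using Dax.Metadata;           // Model
using Dax.Model.Extractor;    // StatExtractor, DirectLakeExtractionMode

static class UpdateStatisticsExample
{
    // NOTE: parameter names and defaults are assumed for illustration only;
    // verify against the actual StatExtractor.UpdateStatisticsModel signature in 1.5.0.
    public static void CollectResidentOnlyStats(Model daxModel, System.Data.IDbConnection connection)
    {
        // Resident-only scan: collect statistics without paging extra columns in.
        StatExtractor.UpdateStatisticsModel(
            daxModel,
            connection,
            analyzeDirectLake: DirectLakeExtractionMode.ResidentOnly);
    }
}
```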

@otykier
Collaborator

otykier commented Apr 10, 2024

I'm planning to add these options to TE3, like so (with "Resident only" being the default):

[screenshot: proposed TE3 options, with "Resident only" selected as the default]

When "Read statistics from data" is unchecked, the options are disabled:

[screenshot: the Direct Lake options disabled when "Read statistics from data" is unchecked]

Does this align with your expectations and what you have planned for DAX Studio?

@marcosqlbi
Collaborator

Yes, we should use the same default in DAX Studio once we implement that - @dgosbell do you agree?
