
Limit Cache Size on Disk #18045

Open
yyilong335 opened this issue Nov 20, 2024 · 9 comments
Labels
question Further information is requested

Comments

@yyilong335

Dear Developers,

I am using CodeQL to analyze my database. In the command, I pass --max-disk-cache=200000 to specify the maximum disk space the cache may take during the query. However, when the query finishes, the cache has grown to 1.3 TB.

Shouldn't --max-disk-cache limit the disk usage? Or is there another option to resolve this issue?

Thank you so much.

@yyilong335 yyilong335 added the question Further information is requested label Nov 20, 2024
@aibaars
Contributor

aibaars commented Nov 20, 2024

Thanks for reporting. I would have expected the disk cache to be limited to 200 GB. Could you share which folders of the codeql database and cache are larger than expected? If you are using Linux or macOS, the du command can be used to print some reports. For example (replace my-database with the folder of your codeql database and, if needed, replace cpp with the language you are analyzing):

du -sh my-database/db-cpp/default/*
du -sh my-database/db-cpp/default/cache/*

@yyilong335
Author

Thank you for your attention to this issue! The page directory is abnormally large.

I get:
2.0T default/cache/page

Thanks.

@aibaars
Contributor

aibaars commented Nov 20, 2024

Could you get a more detailed du report for the page folder? For example du -sh default/cache/page/*, or without the -s: du -h default/cache/page. Just to figure out whether all page files are large, or whether only a couple of them take up all the space.

@yyilong335
Author

Sure.

ls -l | wc -l shows there are 579 items.

du -sh * | sort -hr | head -n 10 shows the top 10 big items:

8.3G    30
8.2G    fc
8.2G    f2
8.2G    ee
8.2G    e8
8.2G    e2
8.2G    e1
8.2G    df
8.2G    de
8.2G    d7

I checked 30 and fc. They are directories containing more than 500 items each, organized just like the page/ directory itself. Could this be an issue with pages being created recursively?

@yyilong335
Author

I did a quick estimate myself. It seems that about half of the items under page/ are directories that take a lot of space, while the other half are .pack files, which are small. If each such directory is about 8 GB and there are more than 200 of them, that comes to nearly 2 TB, which matches the size of the page directory. So I believe the space is being consumed by a large number of these big directories.
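The estimate above can be sanity-checked with a quick sketch (the numbers are the rough figures from this comment, not measured values):

```python
# Rough check of the estimate: ~250 big directories of ~8 GB each.
big_dirs = 250           # roughly half of the 579 entries under page/
size_per_dir_gb = 8      # observed size of each big directory
total_tb = big_dirs * size_per_dir_gb / 1024
print(f"{total_tb:.1f} TB")
```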

@aibaars
Contributor

aibaars commented Nov 21, 2024

I checked with some of the CodeQL developers and they said:

The --max-disk-cache value is not really a hard limit, more a firm wish. The evaluator will try to stay under the indicated size by removing pages that were kept "because they may be useful later". However, if using more than that is the only way it can actually complete the evaluation, that's what it will do.

The word "cache" here is actually a bit of a misnomer -- in production analyses, this space is mostly used for intermediate results that we had to spill out from RAM because there were too many of them to fit there.

If the results take up that much space on disk, it's probably a symptom that CodeQL is doing far too much computation, so I'd presume the query also takes far too long.

@yyilong335 If you are willing to share the database and the query, we can try to determine whether there's a performance problem with one of our own supported queries.

@hmakholm
Contributor

hmakholm commented Nov 21, 2024

The organization is just like the page/ directory itself. Could this be an issue with pages being created recursively?

For what it's worth, your observations of the file structure are consistent with the cache subsystem working as designed.

The individual items stored in the cache are each identified by a hash. The evaluator starts by storing them in files named aa.pack and aa.pack.d, where aa are just the first two hex digits of the hash. But when those files begin to grow large (I think around 10 MB per .pack.d file, but don't quote me on that), it will create a subdirectory and store further items in files called aa/bb.pack and aa/bb.pack.d, where bb are the next two hex digits. And so on, recursively, as long as the amount of data to store keeps increasing.
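A minimal sketch of the hash-prefix layout described above (the hash function, the helper name cache_path, and the fixed two-digits-per-level scheme are illustrative assumptions, not the evaluator's actual implementation):

```python
import hashlib

def cache_path(key: str, levels: int = 1) -> str:
    """Map a cache key to a nested .pack path, consuming two hex
    digits of the key's hash per directory level."""
    h = hashlib.sha256(key.encode()).hexdigest()
    # Each level except the last becomes a two-hex-digit directory.
    dirs = [h[2 * i:2 * i + 2] for i in range(levels - 1)]
    # The final two digits name the .pack file itself.
    name = h[2 * (levels - 1):2 * levels] + ".pack"
    return "/".join(dirs + [name])

print(cache_path("some-item", levels=1))  # e.g. aa.pack
print(cache_path("some-item", levels=2))  # e.g. aa/bb.pack
```

In this scheme a deeper level is only used once the shallower file grows past a size threshold, which is why a heavily loaded cache ends up with many nested two-hex-digit directories.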

@yyilong335
Author

@aibaars Thank you. The database I am analyzing is very large, the pattern I am looking for is complex, and I don't know how to optimize my query code either.

Basically, I am trying to find a store-store-load pattern followed by an indirect branch, in a large code database such as httpd. The query has been running for days and still has not finished.

To be more specific, here are two queries, ssl_v4.ql and ssl_v5.ql (v4 and v5 for short). v5 only changes some "small" details: v4 only finds loads whose line number is the last store's line number +1, while v5 allows +5, provided no meaningful expr/stmt appears in between. v5 also adds two more control-flow constraints, but removes one.

ssl_v5.ql.txt
ssl_v4.ql.txt

v4 takes only minutes, but v5 never finishes and its cache size is huge.

I am a CodeQL newbie, and I understand that asking for debugging or optimization help may be out of scope for this issue. But I would like to say thank you very much.

@hmakholm Thank you for the clarification. I now understand that the cache subsystem is working fine.

@aibaars
Contributor

aibaars commented Nov 22, 2024

@yyilong335 Debugging QL query performance can be quite tricky. Make sure to read this page first: https://codeql.github.com/docs/writing-codeql-queries/troubleshooting-query-performance/ . It lists the most common causes of performance problems. At a glance, I think hasDisallowedContentBetween could be a problem. That seemingly simple predicate computes a relation between pairs of Locations. The Location table is normally the largest in the database, so the number of Location pairs that fulfill the conditions can be very high. You even include Location pairs from unrelated files, although I think the predicate would still be too large if you restricted it to Locations in the same file as Element e. A potentially large relation is not always a problem; the CodeQL compiler is pretty "smart" and will avoid computing the whole thing if possible, but if it can't, the entire relation may be materialized and end up in the cache.
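A back-of-the-envelope sketch of why a predicate over pairs of Locations can explode (all numbers here are hypothetical, not taken from this database):

```python
# Unrestricted cross product over the Location table.
locations = 5_000_000            # assumed size of the Location table
all_pairs = locations ** 2
print(f"{all_pairs:.2e} candidate pairs")

# Restricting both Locations to the same file shrinks the relation,
# but it can still be far too large to materialize in full.
files = 10_000                   # assumed number of files
per_file = locations // files    # assumed even spread of locations
same_file_pairs = files * per_file ** 2
print(f"{same_file_pairs:.2e} same-file pairs")
```

Even the same-file restriction leaves billions of pairs under these assumptions, which is consistent with the observation that the restricted predicate may still be too large.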

If looking for the common performance problems does not help, the next step would be to run the query with evaluator logging and tuple counting switched on. The resulting logs should give insight into which predicates take a lot of time and produce a lot of (intermediate) results. You may want to test on a smaller code base first, though: get performance logs, identify and fix the slowest predicates, and gradually move to larger projects.
