Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java: add support for alert location restrictions #17190

Merged
merged 3 commits into from
Sep 23, 2024

Conversation

cklin
Copy link
Contributor

@cklin cklin commented Aug 9, 2024

This PR modifies Java queries in the Code Scanning suite to support restricting alerts based on source location, with the restrictions configured through extensible predicates.

@github-actions github-actions bot added the Java label Aug 9, 2024
@cklin cklin force-pushed the cklin/diff-informed-java-queries branch 2 times, most recently from e087562 to cf44894 Compare August 9, 2024 21:21
@cklin cklin force-pushed the cklin/diff-informed-java-queries branch from cf44894 to 3c68245 Compare August 12, 2024 18:53
@cklin cklin force-pushed the cklin/diff-informed-java-queries branch from 3c68245 to 29ca5c3 Compare August 13, 2024 21:46
@aschackmull
Copy link
Contributor

aschackmull commented Aug 14, 2024

Generally looks reasonable, but I do have several stylistic and algorithmic comments.
I don't like the code duplication introduced in the queries and the way that they shadow the flow computation from the *Query.qll files - in those cases it's probably better just to modify the *Query.qll files instead.
There's some confusion about whether to calculate the restriction for only relevant locations or for all locations, which results in both things happening at the moment. It might be fine to calculate for all locations, in which case the shared lib should be modified to do that in a better way.
The ad-hoc'ness in how the locations are matched with the diff should probably be fixed (the somewhat arbitrary cutoff where diff-ranges of size less than 1000 gets expanded seems like a poor algorithm, we can do much better with some rank-tricks).

@cklin cklin force-pushed the cklin/diff-informed-java-queries branch from 29ca5c3 to 3f4f18e Compare August 14, 2024 13:17
@cklin
Copy link
Contributor Author

cklin commented Aug 14, 2024

Generally looks reasonable, but I do have several stylistic and algorithmic comments. I don't like the code duplication introduced in the queries and the way that they shadow the flow computation from the *Query.qll files - in those cases it's probably better just to modify the *Query.qll files instead.

I will give that a try. The right way to apply alert restrictions on a query depends on what goes into the alert. For a dataflow query, that means whether the alert contains only the sources, the sinks, both the sources and the sinks, or with additional locations in addition to sources or sinks. That context is available only in the .ql files, and I was originally concerned that performing alert restrictions in .qll files could lead to the query and the alert restrictions diverging in the future.

I was originally also concerned about alert-restricted predicates and flow configurations from .qll files being accidentally reused in other improper contexts. But perhaps that is not a serious concern, especially for *Query.qll files.

There's some confusion about whether to calculate the restriction for only relevant locations or for all locations, which results in both things happening at the moment. It might be fine to calculate for all locations, in which case the shared lib should be modified to do that in a better way.

I am not sure I understand. Can you say more?

The ad-hoc'ness in how the locations are matched with the diff should probably be fixed (the somewhat arbitrary cutoff where diff-ranges of size less than 1000 gets expanded seems like a poor algorithm, we can do much better with some rank-tricks).

Sounds interesting! Where can I find out more?

@cklin
Copy link
Contributor Author

cklin commented Aug 14, 2024

Generally looks reasonable, but I do have several stylistic and algorithmic comments.
I don't like the code duplication introduced in the queries and the way that they shadow the flow computation from the *Query.qll files - in those cases it's probably better just to modify the *Query.qll files instead.

This is done. Notes:

  • The predicate CommandLineQuery::execIsTainted, which calculates flow path on InputToArgumentToExecFlow, is used in two queries.
    • ExecTainted.ql uses execIsTainted as a positive term. Since the query returns both source and sink (execArg is part of sink), for this query we want to apply "source or sink" restrictions to InputToArgumentToExecFlow.
    • ExecUnescaped.ql uses execIsTainted as a negative conjunct, to exclude exec calls associated with InputToArgumentToExecFlow paths. For this query we need to ensure that execIsTainted compute all flows associated with StringArgumentToExec argument, as excessive filtering would lead to false positives (inclusion of alerts that should have been excluded).
    • For ExecUnescaped.ql, as long as we apply alert filtering on StringArgumentToExec argument, it is also safe to apply alert filtering to InputToArgumentToExecFlow sink. (With InputToArgumentToExecFlow sink restrictions being no stronger than alert filtering on argument—so restrictAlertsTo needs to be matched to the whole [startLine .. endLine] range instead of just to startLine.)
    • Since "source or sink" restrictions are no stronger than sink restrictions, it is safe to apply "source or sink" restrictions to InputToArgumentToExecFlow, for both ExecTainted.ql and ExecUnescaped.ql.

For now I am keeping it a separate commit in case we want to further refine the approach. I will fold it into the previous commit before merge.

@cklin
Copy link
Contributor Author

cklin commented Aug 15, 2024

I am able to reproduce the CWE-927/ImplicitPendingIntentsTest.ql test failure locally and it is due to 1b0c1d6. I will track down the problem and come up with a fix.

@aschackmull
Copy link
Contributor

There's some confusion about whether to calculate the restriction for only relevant locations or for all locations, which results in both things happening at the moment. It might be fine to calculate for all locations, in which case the shared lib should be modified to do that in a better way.

I am not sure I understand. Can you say more?

The filterByLocation predicate has a bindingset, which indicates that you're attempting to restrict the computation to those locations that are relevant. But then there's also filterByLocatable, which will get a substantial body due to the inlining of filterByLocation and hence is likely to be materialised in full, which in turn causes us to do the computation for all Locations. In general, relying on inlining for getting proper context is a brittle design as it's very easy to break without noticing.
But it's probably worth it to challenge the notion that we need to inline filterByLocation for performance - doing a little bit of global computation per Location is likely completely fine. We should probably just inspect the performance of these predicates regardless of the design we end up with.

The ad-hoc'ness in how the locations are matched with the diff should probably be fixed (the somewhat arbitrary cutoff where diff-ranges of size less than 1000 gets expanded seems like a poor algorithm, we can do much better with some rank-tricks).

Sounds interesting! Where can I find out more?

Have you noticed any performance issue if you simply drop the 1000 limit? It might be completely fine without it. If not then there are tricks to apply. The one thing that I alluded to was that we can skip the materialisation of irrelevant line numbers if we replace the [startLine .. endLine] range generation with the corresponding range for some rank-indices, i.e. we could potentially rank all relevant line numbers, which would allow us to skip lines without Locations. But now that I've written that, I realise that that's likely not that relevant if we operate on all Locations as there'll likely be Locations starting on most lines. But again, run some numbers and check.

If we do somehow run into performance issues dealing with all Locations then the safe solution is to expose the filtering as a parameterised module capable of taking e.g. a locatable as input in order to ensure that we only consider interesting locations. Combining that with rank to renumber the lines should allow us to do the least amount of computation (but constant factors may be higher than simply calculating the entire set of allowed locations).

@cklin cklin force-pushed the cklin/diff-informed-java-queries branch from 1b0c1d6 to 888db84 Compare August 16, 2024 18:24
@cklin
Copy link
Contributor Author

cklin commented Aug 16, 2024

I am able to reproduce the CWE-927/ImplicitPendingIntentsTest.ql test failure locally and it is due to 1b0c1d6. I will track down the problem and come up with a fix.

The problem was due to FilteredStateConfig missing pass-through aliases for some default predicates in StateConfigSig. I added the missing predicate pass-throughs and also verified that FilteredConfig has the appropriate pass-throughs. And I added reminders to the end of ConfigSig and StateConfigSig that newly added predicates need corresponding pass-through aliases in the filter wrappers.

The filterByLocation predicate has a bindingset, which indicates that you're attempting to restrict the computation to those locations that are relevant. But then there's also filterByLocatable, which will get a substantial body due to the inlining of filterByLocation and hence is likely to be materialised in full, which in turn causes us to do the computation for all Locations. In general, relying on inlining for getting proper context is a brittle design as it's very easy to break without noticing.
But it's probably worth it to challenge the notion that we need to inline filterByLocation for performance - doing a little bit of global computation per Location is likely completely fine. We should probably just inspect the performance of these predicates regardless of the design we end up with.

Have you noticed any performance issue if you simply drop the 1000 limit? It might be completely fine without it.

What I am hearing is that we should not spend too much effort on premature optimization right now. Instead, we can defer improvements until we do observe poor performance, at which point we will have concrete data on exactly what needs to be optimized. That sounds like a good approach.

Also, thanks for the edit suggestions! I have incorporated them in the latest push.

@aschackmull
Copy link
Contributor

The problem was due to FilteredStateConfig missing pass-through aliases for some default predicates in StateConfigSig. I added the missing predicate pass-throughs and also verified that FilteredConfig has the appropriate pass-throughs. And I added reminders to the end of ConfigSig and StateConfigSig that newly added predicates need corresponding pass-through aliases in the filter wrappers.

That seems too brittle - and also not very nice to have all those pass-throughs. Perhaps filtering should merely be a flag on the existing configuration instead of a configuration-transforming module. I.e. something like:

default predicate filterAlerts() { none() }

@cklin
Copy link
Contributor Author

cklin commented Aug 19, 2024

The problem was due to FilteredStateConfig missing pass-through aliases for some default predicates in StateConfigSig. I added the missing predicate pass-throughs and also verified that FilteredConfig has the appropriate pass-throughs. And I added reminders to the end of ConfigSig and StateConfigSig that newly added predicates need corresponding pass-through aliases in the filter wrappers.

That seems too brittle - and also not very nice to have all those pass-throughs. Perhaps filtering should merely be a flag on the existing configuration instead of a configuration-transforming module. I.e. something like:

default predicate filterAlerts() { none() }

Thanks for the suggestion! Yes, this approach would be cleaner. I will try to restructure the PR using this new approach in the next few weeks.

@cklin cklin force-pushed the cklin/diff-informed-java-queries branch from 93ba79a to 30bbd12 Compare September 11, 2024 20:18
@cklin
Copy link
Contributor Author

cklin commented Sep 11, 2024

That seems too brittle - and also not very nice to have all those pass-throughs. Perhaps filtering should merely be a flag on the existing configuration instead of a configuration-transforming module. I.e. something like:

default predicate filterAlerts() { none() }

Hi @aschackmull — the switch for dataflow source+sink filtering is now done via a flag as suggested. Please take another look. Thanks!

@cklin cklin force-pushed the cklin/diff-informed-java-queries branch from 30bbd12 to 2e6f34d Compare September 12, 2024 20:27
@aschackmull
Copy link
Contributor

I had a lot of comments about things to tweak and simplify, so I ended up just pushing a commit.

@aschackmull aschackmull force-pushed the cklin/diff-informed-java-queries branch 2 times, most recently from a999980 to 2970da0 Compare September 19, 2024 13:25
Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, provided that this passes the usual performance testing in dca.

Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should limit ourselves to filtering that affects data flow, as the other query result filtering cannot be expected to improve performance.

@aschackmull
Copy link
Contributor

Dca looks fine, although there is a slight indication that something might be up with java/ql/src/Security/CWE/CWE-078/ExecUnescaped.ql. But if we simply revert the change to that query (and others like it), then there's no reason to look into it.

@cklin cklin force-pushed the cklin/diff-informed-java-queries branch from 2970da0 to 75ec8ce Compare September 20, 2024 14:58
@cklin
Copy link
Contributor Author

cklin commented Sep 20, 2024

I think we should limit ourselves to filtering that affects data flow, as the other query result filtering cannot be expected to improve performance.

I have updated the PR accordingly, with a little bit of commit cleanup in the latest force push.

@cklin cklin marked this pull request as ready for review September 20, 2024 15:03
@cklin cklin requested review from a team as code owners September 20, 2024 15:03
@cklin cklin added the no-change-note-required This PR does not need a change note label Sep 20, 2024
Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cklin cklin merged commit 1cd8af5 into main Sep 23, 2024
62 of 63 checks passed
@cklin cklin deleted the cklin/diff-informed-java-queries branch September 23, 2024 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants