Java: add support for alert location restrictions #17190

cklin · 2024-08-09T15:57:53Z

This PR modifies Java queries in the Code Scanning suite to support restricting alerts based on source location, with the restrictions configured through extensible predicates.

java/ql/lib/semmle/code/java/dataflow/DataFlowFiltering.qll

aschackmull · 2024-08-14T07:44:43Z

Generally looks reasonable, but I do have several stylistic and algorithmic comments.
I don't like the code duplication introduced in the queries and the way that they shadow the flow computation from the *Query.qll files - in those cases it's probably better just to modify the *Query.qll files instead.
There's some confusion about whether to calculate the restriction for only relevant locations or for all locations, which results in both things happening at the moment. It might be fine to calculate for all locations, in which case the shared lib should be modified to do that in a better way.
The ad-hoc'ness in how the locations are matched with the diff should probably be fixed (the somewhat arbitrary cutoff where diff-ranges of size less than 1000 get expanded seems like a poor algorithm; we can do much better with some rank-tricks).

cklin · 2024-08-14T13:38:25Z

> Generally looks reasonable, but I do have several stylistic and algorithmic comments. I don't like the code duplication introduced in the queries and the way that they shadow the flow computation from the *Query.qll files - in those cases it's probably better just to modify the *Query.qll files instead.

I will give that a try. The right way to apply alert restrictions to a query depends on what goes into the alert. For a dataflow query, that means whether the alert contains only the sources, only the sinks, both the sources and the sinks, or additional locations beyond the sources and sinks. That context is available only in the .ql files, and I was originally concerned that performing alert restrictions in .qll files could lead to the query and the alert restrictions diverging in the future.

I was originally also concerned about alert-restricted predicates and flow configurations from .qll files being accidentally reused in other improper contexts. But perhaps that is not a serious concern, especially for *Query.qll files.

> There's some confusion about whether to calculate the restriction for only relevant locations or for all locations, which results in both things happening at the moment. It might be fine to calculate for all locations, in which case the shared lib should be modified to do that in a better way.

I am not sure I understand. Can you say more?

> The ad-hoc'ness in how the locations are matched with the diff should probably be fixed (the somewhat arbitrary cutoff where diff-ranges of size less than 1000 get expanded seems like a poor algorithm; we can do much better with some rank-tricks).

Sounds interesting! Where can I find out more?

cklin · 2024-08-14T21:46:04Z

> Generally looks reasonable, but I do have several stylistic and algorithmic comments.
> I don't like the code duplication introduced in the queries and the way that they shadow the flow computation from the *Query.qll files - in those cases it's probably better just to modify the *Query.qll files instead.

This is done. Notes:

The predicate CommandLineQuery::execIsTainted, which calculates flow paths on InputToArgumentToExecFlow, is used in two queries.
- ExecTainted.ql uses execIsTainted as a positive term. Since the query returns both source and sink (execArg is part of sink), for this query we want to apply "source or sink" restrictions to InputToArgumentToExecFlow.
- ExecUnescaped.ql uses execIsTainted as a negative conjunct, to exclude exec calls associated with InputToArgumentToExecFlow paths. For this query we need to ensure that execIsTainted computes all flows associated with the StringArgumentToExec argument, as excessive filtering would lead to false positives (inclusion of alerts that should have been excluded).
- For ExecUnescaped.ql, as long as we apply alert filtering on the StringArgumentToExec argument, it is also safe to apply alert filtering to the InputToArgumentToExecFlow sink. (This holds provided the InputToArgumentToExecFlow sink restrictions are no stronger than the alert filtering on the argument, so restrictAlertsTo needs to be matched against the whole [startLine .. endLine] range instead of just startLine.)
- Since "source or sink" restrictions are no stronger than sink restrictions, it is safe to apply "source or sink" restrictions to InputToArgumentToExecFlow, for both ExecTainted.ql and ExecUnescaped.ql.
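
The safety argument in the list above can be checked on a toy model. The following is only a sketch in Python, not CodeQL: flows are modeled as (source line, sink line) pairs, and every name in it (ALL_FLOWS, EXEC_ARG_LINES, DIFF_LINES, the helper functions) is hypothetical. It suggests that a restriction no stronger than the sink restriction leaves the negated use unchanged, while a stronger one lets a spurious alert through.

```python
# Toy model, not CodeQL. All names and line numbers are made up for illustration.
ALL_FLOWS = {(10, 50), (12, 80), (70, 80)}  # (source_line, sink_line) pairs
EXEC_ARG_LINES = {50, 80, 90}               # lines of string arguments passed to exec
DIFF_LINES = {80, 90}                       # lines that alerts are restricted to


def flows(restriction):
    """Flows that survive a restriction predicate over (source, sink)."""
    return {f for f in ALL_FLOWS if restriction(*f)}


def sink_only(src, snk):
    return snk in DIFF_LINES


def source_or_sink(src, snk):
    return src in DIFF_LINES or snk in DIFF_LINES


def source_and_sink(src, snk):
    return src in DIFF_LINES and snk in DIFF_LINES


def exec_unescaped(restriction):
    """Restricted exec arguments that are NOT the sink of a surviving flow."""
    tainted_sinks = {snk for _, snk in flows(restriction)}
    return {a for a in EXEC_ARG_LINES if a in DIFF_LINES and a not in tainted_sinks}


unrestricted = exec_unescaped(lambda src, snk: True)
print(unrestricted == exec_unescaped(source_or_sink))   # same alerts as without filtering
print(exec_unescaped(source_and_sink) - unrestricted)   # the extra, spurious alert(s)
```

In this model the "source and sink" restriction drops both flows into line 80, so line 80 is no longer excluded by the negation and shows up as an alert that the unrestricted query would not report.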

For now I am keeping it a separate commit in case we want to further refine the approach. I will fold it into the previous commit before merge.

cklin · 2024-08-15T14:21:22Z

I am able to reproduce the CWE-927/ImplicitPendingIntentsTest.ql test failure locally and it is due to 1b0c1d6. I will track down the problem and come up with a fix.

aschackmull · 2024-08-16T11:53:41Z

> > There's some confusion about whether to calculate the restriction for only relevant locations or for all locations, which results in both things happening at the moment. It might be fine to calculate for all locations, in which case the shared lib should be modified to do that in a better way.

> I am not sure I understand. Can you say more?

The filterByLocation predicate has a bindingset, which indicates that you're attempting to restrict the computation to those locations that are relevant. But then there's also filterByLocatable, which will get a substantial body due to the inlining of filterByLocation and hence is likely to be materialised in full, which in turn causes us to do the computation for all Locations. In general, relying on inlining for getting proper context is a brittle design as it's very easy to break without noticing.
But it's probably worth it to challenge the notion that we need to inline filterByLocation for performance - doing a little bit of global computation per Location is likely completely fine. We should probably just inspect the performance of these predicates regardless of the design we end up with.

> > The ad-hoc'ness in how the locations are matched with the diff should probably be fixed (the somewhat arbitrary cutoff where diff-ranges of size less than 1000 get expanded seems like a poor algorithm; we can do much better with some rank-tricks).

> Sounds interesting! Where can I find out more?

Have you noticed any performance issue if you simply drop the 1000 limit? It might be completely fine without it. If not then there are tricks to apply. The one thing that I alluded to was that we can skip the materialisation of irrelevant line numbers if we replace the [startLine .. endLine] range generation with the corresponding range for some rank-indices, i.e. we could potentially rank all relevant line numbers, which would allow us to skip lines without Locations. But now that I've written that, I realise that that's likely not that relevant if we operate on all Locations as there'll likely be Locations starting on most lines. But again, run some numbers and check.

If we do somehow run into performance issues dealing with all Locations then the safe solution is to expose the filtering as a parameterised module capable of taking e.g. a locatable as input in order to ensure that we only consider interesting locations. Combining that with rank to renumber the lines should allow us to do the least amount of computation (but constant factors may be higher than simply calculating the entire set of allowed locations).
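
To make the rank idea concrete, here is a minimal Python sketch. It is not the shared library's implementation, and all function names are hypothetical. It assumes the relevant line numbers (those carrying at least one Location) are known up front; the diff range [startLine .. endLine] is then translated into a range of rank indices, so lines without any Location are never enumerated.

```python
from bisect import bisect_left, bisect_right


def rank_lines(location_lines):
    """Sorted distinct line numbers that carry at least one location."""
    return sorted(set(location_lines))


def ranks_in_range(ranked, start_line, end_line):
    """Rank indices of the relevant lines inside [start_line .. end_line]."""
    lo = bisect_left(ranked, start_line)
    hi = bisect_right(ranked, end_line)
    return range(lo, hi)


def allowed_lines(ranked, diff_ranges):
    """Relevant lines covered by any diff range, enumerated by rank only."""
    allowed = set()
    for start_line, end_line in diff_ranges:
        for r in ranks_in_range(ranked, start_line, end_line):
            allowed.add(ranked[r])
    return allowed


ranked = rank_lines([3, 3, 40, 1_000_000])
# A two-million-line diff range touches only three rank indices here.
print(len(ranks_in_range(ranked, 1, 2_000_000)))
print(allowed_lines(ranked, [(1, 10), (35, 45)]))
```

With this shape, no size cutoff on diff ranges is needed in the model: the work per range is bounded by the number of relevant lines it contains, not by its width. Whether that matters in practice is exactly the measurement question raised above.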

java/ql/src/Security/CWE/CWE-020/OverlyLargeRange.ql

java/ql/src/Security/CWE/CWE-730/ReDoS.ql

shared/util/codeql/util/AlertFiltering.qll

cklin · 2024-08-16T19:01:10Z

> I am able to reproduce the CWE-927/ImplicitPendingIntentsTest.ql test failure locally and it is due to 1b0c1d6. I will track down the problem and come up with a fix.

The problem was due to FilteredStateConfig missing pass-through aliases for some default predicates in StateConfigSig. I added the missing predicate pass-throughs and also verified that FilteredConfig has the appropriate pass-throughs. And I added reminders to the end of ConfigSig and StateConfigSig that newly added predicates need corresponding pass-through aliases in the filter wrappers.

> The filterByLocation predicate has a bindingset, which indicates that you're attempting to restrict the computation to those locations that are relevant. But then there's also filterByLocatable, which will get a substantial body due to the inlining of filterByLocation and hence is likely to be materialised in full, which in turn causes us to do the computation for all Locations. In general, relying on inlining for getting proper context is a brittle design as it's very easy to break without noticing.
> But it's probably worth it to challenge the notion that we need to inline filterByLocation for performance - doing a little bit of global computation per Location is likely completely fine. We should probably just inspect the performance of these predicates regardless of the design we end up with.

> Have you noticed any performance issue if you simply drop the 1000 limit? It might be completely fine without it.

What I am hearing is that we should not spend too much effort on premature optimization right now. Instead, we can defer improvements until we do observe poor performance, at which point we will have concrete data on exactly what needs to be optimized. That sounds like a good approach.

Also, thanks for the edit suggestions! I have incorporated them in the latest push.

aschackmull · 2024-08-19T07:17:14Z

> The problem was due to FilteredStateConfig missing pass-through aliases for some default predicates in StateConfigSig. I added the missing predicate pass-throughs and also verified that FilteredConfig has the appropriate pass-throughs. And I added reminders to the end of ConfigSig and StateConfigSig that newly added predicates need corresponding pass-through aliases in the filter wrappers.

That seems too brittle - and also not very nice to have all those pass-throughs. Perhaps filtering should merely be a flag on the existing configuration instead of a configuration-transforming module. I.e. something like:

default predicate filterAlerts() { none() }
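
The difference between the two designs can be modeled outside of QL. The Python sketch below is an analogy only: Config stands in for a configuration signature with default members, FilteredWrapper for a configuration-transforming module, and filter_alerts for the proposed flag. All class, method, and node names are hypothetical.

```python
ALLOWED = {"exec_a", "exec_sanitized"}  # stand-in for the allowed alert locations


class Config:
    """Stand-in for a flow configuration signature with default members."""

    def is_sink(self, node):
        return False

    def is_barrier(self, node):  # default member
        return False

    def filter_alerts(self):  # the proposed flag, off by default
        return False


class MyConfig(Config):
    def is_sink(self, node):
        return node.startswith("exec")

    def is_barrier(self, node):
        return node == "exec_sanitized"


class FilteredWrapper(Config):
    """Wrapper design: every member must be forwarded by hand."""

    def __init__(self, inner):
        self.inner = inner

    def is_sink(self, node):
        return self.inner.is_sink(node) and node in ALLOWED

    # is_barrier is NOT forwarded, so the default silently applies.


class MyFilteredConfig(MyConfig):
    """Flag design: nothing to forward, only the flag is switched on."""

    def filter_alerts(self):
        return True


def sinks(config, nodes):
    """Library side: consults the flag itself instead of relying on a wrapper."""
    result = [n for n in nodes if config.is_sink(n) and not config.is_barrier(n)]
    if config.filter_alerts():
        result = [n for n in result if n in ALLOWED]
    return result


NODES = ["exec_a", "exec_b", "exec_sanitized", "other"]
print(sinks(FilteredWrapper(MyConfig()), NODES))  # barrier lost by the wrapper
print(sinks(MyFilteredConfig(), NODES))           # barrier kept with the flag
```

In the wrapper version, forgetting one forward changes results without any error, which mirrors the kind of test failure described earlier in the thread; with the flag, newly added members need no extra plumbing.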

cklin · 2024-08-19T15:31:42Z

> > The problem was due to FilteredStateConfig missing pass-through aliases for some default predicates in StateConfigSig. I added the missing predicate pass-throughs and also verified that FilteredConfig has the appropriate pass-throughs. And I added reminders to the end of ConfigSig and StateConfigSig that newly added predicates need corresponding pass-through aliases in the filter wrappers.

> That seems too brittle - and also not very nice to have all those pass-throughs. Perhaps filtering should merely be a flag on the existing configuration instead of a configuration-transforming module. I.e. something like:
> default predicate filterAlerts() { none() }

Thanks for the suggestion! Yes, this approach would be cleaner. I will try to restructure the PR using this new approach in the next few weeks.

shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll

cklin · 2024-09-11T21:26:06Z

> That seems too brittle - and also not very nice to have all those pass-throughs. Perhaps filtering should merely be a flag on the existing configuration instead of a configuration-transforming module. I.e. something like:
> default predicate filterAlerts() { none() }

Hi @aschackmull, the switch for dataflow source+sink filtering is now done via a flag as suggested. Please take another look. Thanks!

aschackmull · 2024-09-19T11:31:56Z

I had a lot of comments about things to tweak and simplify, so I ended up just pushing a commit.

aschackmull

LGTM now, provided that this passes the usual performance testing in dca.

java/ql/src/Security/CWE/CWE-020/OverlyLargeRange.ql

aschackmull

I think we should limit ourselves to filtering that affects data flow, as the other query result filtering cannot be expected to improve performance.
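
A rough way to see the performance point, sketched in Python on a made-up graph (none of these names come from the PR): restricting the sources before the reachability computation reduces the work done, whereas filtering the finished result set does the full computation first and only discards rows afterwards.

```python
from collections import deque


def reachable(graph, sources):
    """Nodes reachable from sources, plus the number of edges examined."""
    seen = set(sources)
    queue = deque(sources)
    edges_examined = 0
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            edges_examined += 1
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen, edges_examined


# Hypothetical flow graph with two independent source-to-sink chains.
GRAPH = {"s1": ["a"], "a": ["sink1"], "s2": ["b"], "b": ["c"], "c": ["sink2"]}
ALL_SOURCES = ["s1", "s2"]
ALLOWED_SOURCES = {"s1"}

# Post-hoc filtering: full computation, then discard results.
_, full_cost = reachable(GRAPH, ALL_SOURCES)

# Filtering that affects data flow: prune the sources first.
pruned = [s for s in ALL_SOURCES if s in ALLOWED_SOURCES]
_, pruned_cost = reachable(GRAPH, pruned)

print(full_cost, pruned_cost)
```

For a query with no flow computation to prune, only the post-hoc variant is available, which is why filtering such queries would not be expected to save time.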

aschackmull · 2024-09-20T07:44:11Z

Dca looks fine, although there is a slight indication that something might be up with java/ql/src/Security/CWE/CWE-078/ExecUnescaped.ql. But if we simply revert the change to that query (and others like it), then there's no reason to look into it.

cklin · 2024-09-20T15:02:47Z

> I think we should limit ourselves to filtering that affects data flow, as the other query result filtering cannot be expected to improve performance.

I have updated the PR accordingly, with a little bit of commit cleanup in the latest force push.

aschackmull

LGTM

github-actions bot added the Java label Aug 9, 2024

cklin force-pushed the cklin/diff-informed-java-queries branch 2 times, most recently from e087562 to cf44894 on August 9, 2024 21:21

hvitved reviewed Aug 12, 2024

java/ql/lib/semmle/code/java/dataflow/DataFlowFiltering.qll

cklin force-pushed the cklin/diff-informed-java-queries branch from cf44894 to 3c68245 on August 12, 2024 18:53

github-actions bot added the DataFlow Library label Aug 12, 2024

cklin force-pushed the cklin/diff-informed-java-queries branch from 3c68245 to 29ca5c3 on August 13, 2024 21:46

cklin force-pushed the cklin/diff-informed-java-queries branch from 29ca5c3 to 3f4f18e on August 14, 2024 13:17