Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Internal Execution Errors in Apex Classes while running Graph Scan #1171

Closed
AnkitBhatnagar123 opened this issue Sep 5, 2023 · 31 comments
Labels
BUG P3 Rarely Malfunction duplicate This issue or pull request already exists SFGE Issues related to the Salesforce Graph Engine

Comments

@AnkitBhatnagar123
Copy link

AnkitBhatnagar123 commented Sep 5, 2023

Describe the bug
We are getting error in different apex classes :
Graph Engine identified your source and sink, but you must manually verify that you have a sanitizer in this path. Then, add an engine directive to skip the path. Next, create a Github issue for the Code Analyzer team that includes the error and stack trace. After we fix this issue, check the Code Analyzer release notes for more info. Error and stacktrace: UnexpectedException: null: com.salesforce.graph.vertex.SFVertexFactory.load(SFVertexFactory.java:245);com.salesforce.graph.vertex.SFVertexFactory.access$000(SFVertexFactory.java:56);com.salesforce.graph.vertex.SFVertexFactory$1.apply(SFVertexFactory.java:186);com.salesforce.graph.vertex.SFVertexFactory$1.apply(SFVertexFactory.java:183);com.salesforce.graph.cache.AbstractVertexCacheImpl.get(AbstractVertexCacheImpl.java:101);com.salesforce.graph.vertex.SFVertexFactory.loadVertices(SFVertexFactory.java:180)

To Reproduce
Getting these errors on the first line of method Definitions. Eg: Getting the error on public static void revokeUserAccess(String userId) in code :

public static void revokeUserAccess(String userId) {
String access_token = NDAuthController.getAccessToken(userId);

    //revoke api access token
    HttpResponse res = NDAuthUtility.basicAuthCallout(access_token, 'POST', '/v1/oauth/revoke', 'token='+access_token, 'application/x-www-form-urlencoded');
    
    // delete saved user auth info
    delete NDAuthController.getOAuthInfo(userId);
}

"Urgency": We have to resolve this issues for our package's security review.

@jfeingold35
Copy link
Collaborator

@AnkitBhatnagar123 , are NetDocumentsUtility and NetDocumentsAuthController custom classes?
Also, could you please delete your ~/.sfdx-scanner/sfge.log, re-run the command, and then post the resulting logs here? That would give us more to go on.

@AnkitBhatnagar123
Copy link
Author

AnkitBhatnagar123 commented Sep 5, 2023

Deleted the previous logs. This is the new log. I have also updated the issue description with the new class names
CodeAnalyzerDFA.csv

@jfeingold35
Copy link
Collaborator

@AnkitBhatnagar123 , that's actually the results file. I need the logs, which will live in ~/.sfdx-scanner/sfge.log.

@AnkitBhatnagar123
Copy link
Author

sfge.log

@jfeingold35
Copy link
Collaborator

@AnkitBhatnagar123 , please provide the following information:

  1. What's the exact command you're running?
  2. What's happening at line 50 of NDUtility.cls? That's one of several locations that crops up pretty consistently before an error, so maybe there's something notable happening there.
    Thanks!

@vc4u
Copy link

vc4u commented Sep 5, 2023

Josh- I'll wait for my colleague to respond to your Ques #1. Though for Ques. #2, I can provide some insights.

Under that class at line 50- we're returning a ternary operator based statement that either returns a New instance of Custom metadata (when called under Apex Testing context) OR return Custom Metadata record via SOQL.

Format of that line 50 statement looks like this:

public static ND_Config__mdt getNDConfig() {
**LINE 50 --->**	return Test.isRunningTest() ? new ND_Config__mdt(
		DeveloperName = 'test',
		Label = 'Test',
		Location__c = getNDAppConfigsList()[0].Id,
		Location__r = getNDAppConfigsList()[0]
	) : [SELECT id, Location__c FROM ND_Config__mdt WHERE DeveloperName = 'My_Config'];
}
public static ND_AppConfigs__mdt[] getNDAppConfigsList() {
	return Test.isRunningTest() ? new List<ND_AppConfigs__mdt> { new ND_AppConfigs__mdt(
		DeveloperName = 'test',
		Label = 'Test'
	)} : [SELECT  Id, MasterLabel, QualifiedApiName, DeveloperName FROM ND_AppConfigs__mdt ORDER BY MasterLabel, DeveloperName];
}

@jfeingold35
Copy link
Collaborator

@AnkitBhatnagar123 , @vc4u , I've done some local testing, and I believe I've isolated the problem to the Location__r = getNDAppConfigsList()[0].
The following caused the same error you're seeing:

public class MyClass {
    @AuraEnabled
    public static CustomObj__c entrypointMethod() {
        CustomObj__c obj = new CustomObj__c(Name = 'asdf', Child__r = getChildArray()[0]);
        insert obj;
        return obj;
    }

    public static ChildObj__c[] getChildArray() {
        return [SELECT Name FROM ChildObj__c];
    }
}

After adjusting it to the following, I stopped seeing the error:

...
        CustomChild__c child = getChildArray()[0];
        CustomObj__c obj = new CustomObj__c(Name = 'asdf', Child__r = child);
...

Please let me know if making a similar adjustment in your code resolves the error.
Examining the logs indicates that the overwhelming majority of the exceptions are happening at lines 9, 25, and 50 of NDUtility.cls. If fixing those files doesn't resolve all the problems, I can examine for others.
Thanks for your patience.

@AnkitBhatnagar123
Copy link
Author

Hi @jfeingold35
We made the changes suggested by you. Then our error count reduced. We also fixed some timout errors by adding params to our command. Now, we're getting :Path evaluation timed out after 900000 ms even after adding timout params.
sfge.log
Command that we're using now is : sfdx scanner:run:dfa --format=csv --outfile=CodeAnalyzerDFA.csv --target="./" --projectdir="./" --category="Security" --pathexplimit -1 --rule-thread-timeout=300000

@jfeingold35
Copy link
Collaborator

Recommendations:

  • You probably don't need to use --pathexplimit, and definitely shouldn't set it to 1 as this will cause errors basically immediately.
  • Your value for --rule-thread-timeout appears to be 300,000, which is shorter than the default value of 900,000. Try using a larger value (e.g., 1,800,000, which would be 30 minutes).

@jfeingold35 jfeingold35 added the BUG P3 Rarely Malfunction label Sep 6, 2023
@git2gus
Copy link

git2gus bot commented Sep 6, 2023

This issue has been linked to a new work item: W-14080002

@jfeingold35
Copy link
Collaborator

@AnkitBhatnagar123 , @vc4u , let me know if the recommendations I provided work.
In the meantime, the issue with Location__r is a genuine bug, so I'm adding it to our backlog to fix, along with the note that it has a workaround.

@AnkitBhatnagar123
Copy link
Author

After Using : sfdx scanner:run:dfa --format=csv --outfile=CodeAnalyzerDFA.csv --target="./" --projectdir="./" --category="Security" --rule-thread-timeout=1800000

"Graph Engine reached the path expansion upper limit (743). The analysis preemptively stopped running on this path to prevent an OutOfMemory error. Rerun Graph Engine and target this entry method with a larger heap space." comes back.

@jfeingold35
Copy link
Collaborator

@AnkitBhatnagar123 , our documentation has some information about this.
Basically, that error means your memory isn't great enough to hold all the paths being expanded. This limit is dynamically calculated based on memory, and in your case it seems to be 743.
You can either:

  1. Use method-level targeting to scan individual entrypoints, thereby decreasing the number of paths being expanded at once.
  2. Use --sfgejvmargs to set a larger heap size, thereby allowing more paths to fit in memory.
  3. Use --pathexplimit to increase the limit to some number greater than 743.

@AnkitBhatnagar123
Copy link
Author

Hi @jfeingold35
We made some changes in the code and also ran the cmd as per your suggestions :
sfdx scanner:run:dfa --format=csv --outfile=CodeAnalyzerDFA.csv --target="./force-app" --projectdir="./force-app" --category="Security" --rule-thread-timeout=1800000 --sfgejvmargs "-Xmx60g"
This is the max heap size limit we are able to go with.
Still we are getting a couple of Internal Execution errors.

sfge (1).log.zip

@AnkitBhatnagar123
Copy link
Author

Hey @jfeingold35
Did you get a chance to look into it?
Thanks!

@jfeingold35
Copy link
Collaborator

@AnkitBhatnagar123 , if you've maxed out your heap size and you're still seeing errors about path expansion limit, I recommend either using --pathexplimit specify a greater upper limit than the dynamically calculated one (25081), or else breaking the scan into multiple smaller scans of individual files (or even individual methods, using the method-level targeting syntax, eg --target "path/to/someFIle.cls#yourMethod").

@vc4u
Copy link

vc4u commented Sep 12, 2023

@jfeingold35 We tried with targeted approach to scan just the said method only. But we're running into errors still. Either it runs into Timeout error or it runs into OutOfMemory error. We raised the Heap memory limit to as high as 80G-100G and increased Timeout to 3-4 hours as well.

Latest error we just received is this (log file attached):

ERROR running scanner:run:dfa:  java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError, Caused by:
java.lang.OutOfMemoryError

We've tested by raising the Heap memory max Limit to 80G, which seems to be a lot in our opinion. Is there any way we can understand why a single method would consume this much memory. How could be structure our method or code to maybe not to generate that many paths or may be something is causing a loop in graph paths?

My CLI command was:
sf scanner run dfa -f html -o CodeAnalyzerDFA.html -p ./force-app/main/default -t "./force-app/main/default/classes/NDConnectController.cls#initConfig" --category "Security" --rule-thread-count 16 --sfgejvmargs "-Xmx80g" --pathexplimit "-1" --rule-thread-timeout 8100000

Log:
sfge.zip

SFDX Scanner v3.16.0

@jfeingold35
Copy link
Collaborator

jfeingold35 commented Sep 12, 2023

@vc4u , a few things.
First, it's generally a bad idea to allocate more heap than your machine has RAM, or more threads than your machine can run at once. Exceeding these limits can actually slow things down, as it forces the machine to simulate resources that don't exist.

Second, the logs have tons of references to response.auth.refresh_token and response.auth.access_token at lines 280-281 and 293-294 of NDAuthController. Can you replace these length-3 references with length-2 references instead?
E.g., instead of response.auth.refresh_token, can you do auth = response.auth and then auth.refresh_token? I'm not sure if that will fix the OutOfMemory problem, but it will resolve many of the warnings I'm seeing in the logs.

@vc4u
Copy link

vc4u commented Sep 12, 2023

And yes, that is understood :-), we're not allocating resources more than what we presently have as hardware. We're using a dedicated machine with 128GB RAM and a 10 Core CPU with hyperthreading support so we are just trying various combination as and when the error occurs. Sometimes it's a Timeout so we increase timeout but then it ends up sometimes with OutOfMemory so we increase the Heap memory a bump, but not more than what our machine have.

We'll try your suggestions, probably need some time to change the core code, but will get back to you soon, at the end of the day, this tool is really helpful, it's just we're kind of stuck in this singular error and would really like our code to be clean without any violations so any guidance is appreciated and helpful.

@jfeingold35
Copy link
Collaborator

@vc4u , dang, that's a powerful machine!

@vc4u
Copy link

vc4u commented Sep 12, 2023

Yes, and thus our assumption that either our code is really in bad shape that scanner not able to work it out or some other factors like Java version might be incompatible. We noticed this Info/Warning a lot repeatedly in logs but did not think it was coming from 3-level deep class structure. So we’ll work something on this and see if some traction is removed from logs and hopefully function scans completes.

the underlying method we re getting issues with is hardly 20-30 lines deep but yes it does call different class methods based on certain input params.

@vc4u
Copy link

vc4u commented Sep 13, 2023

@jfeingold35 Here are our findings with variables converted from length-3 to length-2. The command we ran this time was this:

sf scanner run dfa -f html -o CodeAnalyzerDFA.html -p ./force-app/main/default -t "./force-app/main/default/classes/NDConnectController.cls#initConfig" --category "Security" --rule-thread-count 8 --sfgejvmargs "-Xmx80g" --pathexplimit "-1" --rule-thread-timeout 9000000

With this the log generated is this:
sfge.zip

After taking a look at it, it seems the scanner is still somehow stuck in a loop of sort?

Because the log keeps repeating following statement:

WARN ApexPathExpanderUtil:282 - expand-NullAccess. ex=NullValueAccessedException{apexValue=ApexValue(ApexSingleValue) {status=INITIALIZED, declarationVertex=null, valueVertex=LiteralExpression{properties={FirstChild=true, BeginLine=336, DefiningType_CaseSafe=ndutility, LastChild=true, DefiningType=NDUtility, EndLine=336, childIdx=0, LiteralType=NULL, BeginColumn=16}}, resolvedValues={}, returnedFrom=null, invocableExpression=null, method=null}, vertex=MethodCallExpressionVertex{fullMethodName=r.getBody, referenceVertex=LazyVertex{result=ReferenceExpression{properties={FirstChild=true, Names=[r], BeginLine=16, DefiningType_CaseSafe=ndapi, LastChild=true, DefiningType=NDAPI, EndLine=16, Name_CaseSafe=r, childIdx=0, BeginColumn=31, ReferenceType=METHOD, Name=r}}}, chainedNames=[r], properties={FirstChild=true, FullMethodName=r.getBody, BeginLine=16, FullMethodName_CaseSafe=r.getbody, DefiningType_CaseSafe=ndapi, LastChild=false, DefiningType=NDAPI, EndLine=16, MethodName_CaseSafe=getbody, childIdx=0, BeginColumn=33, MethodName=getBody}}}

There are two Line Numbers mentioned in this statement from two classes - (NDAPI.cls) Line 16 & (NDUtility.cls) Line 336.

So somehow the scanner is on a loop, where first NDUtility's method is called, which at Line 16 calls the NDUtility method (provided below) and in this method at line 336 is this return null statement. I'm wondering why would the scanner keeps getting null return and why it would not go into the method code to actually execute other statements, but simply go to null. Or may be Graph scan is suppose to work like this and we need to patch this method as well?

public static HttpResponse basicAuthCallout(String access_token, string method, string actionUrl, string payload, string contentType) {
	// Get Read-only mode status
	ApplicationReadWriteMode mode = System.getApplicationReadWriteMode();
	if (mode == ApplicationReadWriteMode.READ_ONLY) {
		// Prevent the callout
		throw new CustomException('Unable to connect. Your Salesforce instance is currently in Read-only/Maintenance mode!');
	} else if (mode == ApplicationReadWriteMode.DEFAULT) {
		HttpRequest req = new HttpRequest();
		req.setEndpoint( NDConfig.API_URL + actionUrl );
		req.setMethod( method );
		req.setTimeout( 120000 );
		
		req.setHeader('Content-Type', contentType);
		req.setHeader('Accept', 'application/json');
		if (String.isNotBlank(access_token)) {
			req.setHeader('Authorization', 'Bearer '+access_token);
		}
		
		if (!String.isBlank(payload)) {
			payload = sanitizeSystemKeywords(payload, false);
			req.setBody( payload );
		}
		
		return new Http().send(req);
	}
	return null;
}

@vc4u
Copy link

vc4u commented Sep 13, 2023

@johnbelosf New update:

We updated our method the method to not call ApplicationReadWriteMode verification (for testing purposes here):

public static HttpResponse basicAuthCallout(String access_token, string method, string actionUrl, string payload, string contentType) {
	HttpRequest req = new HttpRequest();
	req.setEndpoint( NDConfig.API_URL + actionUrl );
	req.setMethod( method );
	req.setTimeout( 120000 );

	req.setHeader('Content-Type', contentType);
	req.setHeader('Accept', 'application/json');
	if (String.isNotBlank(access_token)) {
		req.setHeader('Authorization', 'Bearer '+access_token);
	}

	if (!String.isBlank(payload)) {
		payload = sanitizeSystemKeywords(payload, false);
		req.setBody( payload );
	}

	return new Http().send(req);
}

Then ran following command:
sf scanner run dfa -f html -o CodeAnalyzerDFA.html -p ./force-app/main/default -t "./force-app/main/default/classes/NDConnectController.cls#initConfig" --category "Security" --rule-thread-count 10 --sfgejvmargs "-Xmx80g" --pathexplimit "-1" --rule-thread-timeout 9000000

Still the OutOfMemory limit error generated. And this time the Log file seems too clean, no Error or warning for any method, just one line repeated infinitely till the execution error:

2023-09-13 21:18:53 847a37e4-d0ba-467e-9f30-9da3a75bf732 WARN ApexPathExpanderUtil:233 - expand-Finished.

Error generated in console output:

ERROR running scanner:run:dfa: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError, Caused by:
java.lang.OutOfMemoryError

Log File:
sfge.log

@jfeingold35
Copy link
Collaborator

@vc4u , I can't reproduce from just the one method you sent, so my suspicion is that the problem isn't with this specific method, but rather with how the method is being invoked.
So we really need to narrow it down to a particular code flow, so we can figure out what about that flow is causing the problem and hopefully find a way to reproduce internally without needing your exact codebase.
Since the logs aren't giving us the information we need, the best option I can think of is for you guys to do some manual debugging.

I recommend using the following code snippet for this:

String s = null;
System.debug(s.length());

Since this code snippet always throws a NullPointerException (which prevents further expansion of the current path), if placing this code snippet somewhere in your codebase causes you to get a bunch of ApexNullPointerExceptionRule violations instead of the OutOfMemory error, then that tells us imporant information.
For example, if putting the snippet at line 5 of a method resolves the error but the error returns if you move the snippet to line 7, then that means the path explosion happened at line 6. And if the error resolves when you put the snippet in an if but not the else, that means the explosion is happening in the if.

@vc4u
Copy link

vc4u commented Sep 13, 2023

@jfeingold35 We tried by putting this statement in two different positions inside the method we're trying to scan here.

String s = null;
System.debug(s.length());
  1. When we placed this at Line 1 inside the method:
  • Scan finished in about 10-20 seconds of execution with result in command line as Executed engines: sfge. No rule violations found.
  • log file:
    sfge-1.log
  1. When we placed this at the some random position (15 lines within the method) inside the method:
  • Scan did not finish, it continued running, so we hard stopped it after few minutes.
  • log file:
    sfge-2.log

@jfeingold35
Copy link
Collaborator

@vc4u , questions:

  1. Which method did you put the snippet in? Can you post the body of the method, along with an indicator of where you put the snippet?
  2. 15 lines is still a pretty wide range. Can you narrow it down any further than that?
    a. Note that if those 15 lines include calls to other methods, you can put the snippet in those methods to further narrow down where the problem might be arising.

@vc4u
Copy link

vc4u commented Sep 17, 2023

@jfeingold35 So we tried, and tried hard. So many different combinations tried to work out limit. The worst is, we have to wait this long like hours for a try to finish things properly or with an error to find where the bottleneck is.

Ultimately, we decided to re-write the method completely, and we broken down the method to multiple small subs. With this change, ultimately we're able to get past the errors and generate report. We succeeded only with timeout of 7200000.

The current situation of our report preparation is that we're blocked due to following 1 bugs - #1175 & #1180 - Can you plz take a look at these and suggest workaround for them?

@jfeingold35
Copy link
Collaborator

@vc4u Glad you could at least resolve the timeout!
We'll leave this bug open until we fix the __r issue, but for now, neither this issue nor the other two you indicated should stop you from submitting for security review, as part of the process involves documenting false positives.

@vc4u
Copy link

vc4u commented Sep 18, 2023

@jfeingold35 Yes, I agree, with the __r logged already as a bug at your end, we don't need to keep this open, as we do have our path forward on our app security review submission, and we're not blocked as of now.

There was one more thing that came out in our course of discussion on this bug. We had modified some return parameter data types to not use inner-class structures so that we don't get Warnings in scanner output for 2-length structures to be enforced while we had some inner class variables that made our programming a bit clean, as the scanner seems to not like the variables that generate a JSON structure which may contain variables that makes overall variable data types to evaluate with more than 2-length structures say 3 or 4-length structures. Do you think, it should have allowed us to sue that kind of programming structure without us having to actually modify and make all variables limit to 2-length/level structures and that made us not to use inner classes. Thoughts? (I'm thinking may be that should either be an idea or a bug to support for many ISVs like us)

And thank you so much for assistance on this, from your end. Really appreciated your responses and timely workarounds.

@jfeingold35
Copy link
Collaborator

@vc4u , in the fullness of time, we intend to add support for reference chains longer than 2. For now, it's a known limitation.
Happy to help, and glad you guys are unblocked now.

@johnbelosf johnbelosf changed the title Internal Execution Errors in Apex Classes while running Graph Scan [BUG] Internal Execution Errors in Apex Classes while running Graph Scan Feb 20, 2024
@rmohan20 rmohan20 marked this as a duplicate and then as not a duplicate of #1294 Mar 12, 2024
@stephen-carter-at-sf stephen-carter-at-sf added the SFGE Issues related to the Salesforce Graph Engine label May 23, 2024
@stephen-carter-at-sf stephen-carter-at-sf added the duplicate This issue or pull request already exists label Jun 3, 2024
@stephen-carter-at-sf
Copy link
Collaborator

Marking this as a duplicate of #1497

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUG P3 Rarely Malfunction duplicate This issue or pull request already exists SFGE Issues related to the Salesforce Graph Engine
Projects
None yet
Development

No branches or pull requests

4 participants