Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to make functional tests more reliable #5874

Merged
merged 6 commits into from
Jul 2, 2024

Conversation

jeffkl
Copy link
Contributor

@jeffkl jeffkl commented Jun 26, 2024

Bug

Fixes: https://github.com/NuGet/Client.Engineering/issues/2501
Fixes: https://github.com/NuGet/Client.Engineering/issues/2866

Regression? Last working version:

Description

The Functional tests seem to very unreliable with a few symptoms that this pull request hopefully addresses:

  1. Test host process crashing
  2. A particular command times out
  3. A particular test or set of tests hang, causing ADO to cancel the test run

To attempt to address the issues, I've tweaked the parallelism in the test runner. There are three settings in the .runsettings for parallelism:

Setting Purprose Old value New Value
MaxCpuCount The number or processes that vstest.console.exe spawns to run tests in 0 4
MaxParallelThreads The number of threads that the xunit runner can use at a time 0 4
ParallelizeAssembly Whether or not the xunit test runner should run tests in assemblies in parallel true false

Since we are currently using the default values for MaxCpuCount and MaxParallelThreads, as well as enabling ParallelizeAssembly, I suspect that vstest.console.exe is launching 12 process, each one then runs the tests for each assembly with 12 threads, and if one process is running multiple assemblies, they're happening in parallel. My suspicion is that we're over parallelizing tests, leading to crashes. By changing the values to lower, I do not see any increase in test run time and in fact, sometimes the runs are faster. However, in my testing, I rarely see a test host process crashing anymore.

I also deleted build\TestShared\xunit.runner.json and moved the settings to build\xunit.runsettings so that everything is defined in one place. Some projects were opting into test parallelization with the UseParallelXunit property which would copy the xunit.runner.json to the output directory, while other projects were copying it themselves. I found it confusing to determine why some projects were allowed to run in parallel and others weren't so I changed how we run tests in parallel and made it uniform across the repo.

Finally, I've disabled the crash dump helper when running tests since all of the memory dumps it provided were always for the main vstest.console.exe which wasn't helpful since the tests themselves are running in a different process. Instead, I've added the --diag switch to the call to dotnet test and now there's a diagnostic log for the main process and all of the child processes with valuable information. These logs go to the binlog location and get uploaded when a test run fails or if you enable diagnostics on a particular pipeline run.

PR Checklist

  • PR has a meaningful title

  • PR has a linked issue.

  • Described changes

  • Tests

    • Automated tests added
    • OR
    • Test exception
    • OR
    • N/A
  • Documentation

    • Documentation PR or issue filled
    • OR
    • N/A

@jeffkl jeffkl added the Engineering Changes related to the build infrastructure of the repo and that do not change product code label Jun 26, 2024
@jeffkl jeffkl self-assigned this Jun 26, 2024
@jeffkl jeffkl requested a review from a team as a code owner June 26, 2024 21:57
@kartheekp-ms kartheekp-ms added the Learning Opportunity Triggers an in-depth conversation in the weekly PR review meeting so that reviewers can learn more label Jun 27, 2024
<MaxParallelThreads>4</MaxParallelThreads>

<!-- Set this to true if this assembly is willing to participate in parallelization with other assemblies. Test runners can use this information to automatically enable parallelization across assemblies if all the assemblies agree to it. -->
<ParallelizeAssembly>false</ParallelizeAssembly>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this setting nullifies the MaxcpuCount (4) property because as per the comment it looks like MaxcpuCount specifies the number of distinct DLLs in the run determines the actual number of testhosts started.. Given that the multiple assemblies are not running parallel then my understanding is that only one test host process is started. I guess we should change to true so that 4 DLLs can run in parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since MaxCpuCount is set to 4 (here), there will be 4 vstest.console.exe processes spawned. I think it then launches each test assembly in those processes in parallel. Then with MaxParallelThreads set to 4, each test assembly runs with 4 threads (if there are enough CPUs).

This is really confusing since there are multiple test runners. You can run tests under xunit.runner.exe, vstest.console.exe, etc. I think the ParallelizeAssembly tells xunit specifically that it should run tests from multiple assemblies at the same time when running under certain test runners. But since we're running the tests under vstest.console.exe, it seems to me that we should not be using xunit's parallelization. I spent a lot of time read https://xunit.net/docs/runsettings but only experimenting with these numbers did I find a good balance between speed and reliability. Do you know more about how all of this works?

Copy link
Contributor

@kartheekp-ms kartheekp-ms Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a similar issue has been discussed here. I think we are good because it appears that xUnit parallelization is ignored by the dotnet test command. One interesting thing I learned from this link is that executing dotnet test without specifying assemblies runs test assemblies in parallel, whereas dotnet test assembly1.dll assembly2.dll does not run test assemblies in parallel. I am not sure if this is accurate in .NET 8, but at least that was the case earlier because the linked comment explained the different behaviors with an example.

@jeffkl jeffkl force-pushed the dev-jeffkl-test-parallelism branch from 965bbb3 to e6aabcc Compare June 30, 2024 20:01
@jeffkl jeffkl force-pushed the dev-jeffkl-test-parallelism branch from e6aabcc to ce678f8 Compare July 1, 2024 21:52
zivkan
zivkan previously approved these changes Jul 2, 2024
@@ -53,6 +53,9 @@ public static CommandRunnerResult Run(string filename, string workingDirectory =
process.StartInfo.Environment["NUGET_SHOW_STACK"] = bool.TrueString;
process.StartInfo.Environment["NuGetTestModeEnabled"] = bool.TrueString;
process.StartInfo.Environment["UseSharedCompilation"] = bool.FalseString;
process.StartInfo.Environment["DOTNET_SKIP_FIRST_TIME_EXPERIENCE"] = bool.TrueString;
process.StartInfo.Environment["DOTNET_CLI_TELEMETRY_OPTOUT"] = bool.TrueString;
process.StartInfo.Environment["SuppressNETCoreSdkPreviewMessage"] = bool.FalseString;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a mistake? Why wouldn't we want to suppress the .NET (Core) preview message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 Yes, thank you for catching that!

@@ -78,7 +78,7 @@ jobs:
condition: succeeded()
inputs:
command: test
arguments: --no-restore --restore:false --no-build --framework $(TestTargetFramework) --blame --logger trx --binarylogger:"$(BinlogDirectory)02-RunTests.binlog"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I've disabled the crash dump helper when running tests since all of the memory dumps it provided were always for the main vstest.console.exe which wasn't helpful since the tests themselves are running in a different process

Have you talked to the team who owns dotnet test? This sounds very surprising so sounds like something that should be fixed for all customers, or if we're doing something wrong, they can help us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a few issues in the vstest repo:
microsoft/vstest#3936
microsoft/vstest#2593
microsoft/vstest#4376
microsoft/vstest#2952

Nothing specific to the problem I was seeing. Supposedly the blame stuff also doesn't supply enough command-line arguments to capture other exceptions than StackOverflowException and AccessViolationException

https://github.com/microsoft/vstest/blob/main/src/Microsoft.TestPlatform.Extensions.BlameDataCollector/ProcDumpArgsBuilder.cs#L33

Although now there's an env var we can set to include more arguments to proc dump. The main point in all of this is it seems like adding --diag to the vstest.console.exe execution is better. I don't think we have that many "test crashes". But if we feel like we need the memory dumps back, we can do so at a later time in my opinion.

<ParallelizeAssembly>false</ParallelizeAssembly>

<!-- Set this to true if the assembly is willing to run tests inside this assembly in parallel against each other. Tests in the same test collection will be run sequentially against each other, but tests in different test collections will be run in parallel against each other. Set this to false to disable all parallelization within this test assembly. -->
<ParallelizeTestCollections>false</ParallelizeTestCollections>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compared the CI run for this PR to a different CI run, because I would have expected that setting ParallelizeTestCollections and ParallelizeAssembly both to false would make the tests run very slowly. But somehow this PR's CI run was margionally faster then the dev branch run.

It's totally not intuitive to me why, but I didn't see this mentioned in the PR description or anywhere else, so I wanted to point out that apparently the CI is not going to take a long longer with this PR merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggle to understand it as well. Its still running tests assemblies in parallel with each other, but I guess not running tests themselves in parallel? A machine with only 2 cores would only be able to run two tests at the same time anyway so I think the adjustments don't affect the maximum parallelization we could already achieve on CI agents. It really just lowers the amount of parallelization that's queued up waiting for CPU.

@jeffkl jeffkl merged commit 7a1e5b1 into dev Jul 2, 2024
27 of 28 checks passed
@jeffkl jeffkl deleted the dev-jeffkl-test-parallelism branch July 2, 2024 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Engineering Changes related to the build infrastructure of the repo and that do not change product code Learning Opportunity Triggers an in-depth conversation in the weekly PR review meeting so that reviewers can learn more
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants