Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QEMUv7: xtest pkcs11_1007 sporadic failure #6952

Open
etienne-lms opened this issue Jul 18, 2024 · 14 comments
Open

QEMUv7: xtest pkcs11_1007 sporadic failure #6952

etienne-lms opened this issue Jul 18, 2024 · 14 comments

Comments

@etienne-lms
Copy link
Contributor

xtest pkcs11_1007 (requires to build with CFG_PKCS11_TA=y) sporadically fails with unexpected error code CKR_HOST_MEMORY on various platforms, including qemu_virt, qeu_armv8a, stm32mp1 (for I one i've tests).
References found in Github optee_os forum:
#5395 (comment) (July 2022, maybe unrelated)
#5359 (comment) (May 2023)
#6125 (comment) (June 2023)
#6922 (comment) (June 2024)
#6881 (comment) (July 2024)

I open this Issue to track it. I've started to look closer at this and will post my current findings (that are still weak).

@etienne-lms
Copy link
Contributor Author

I was able to reproduce the issue. Traces indeed seen show a "host memory" allocation failure, while we would expect that this test exhausts the "device memory" (aka OP-TEE core or TA memory) .

.../out-br/build/optee_test_ext-1.0/host/xtest/pkcs11_1000.c:1792: rv has an unexpected value: 0x2 = CKR_HOST_MEMORY, expected 0xb3 = CKR_SESSION_HANDLE_INVALID
.../out-br/build/optee_test_ext-1.0/host/xtest/pkcs11_1000.c:1794: rv has an unexpected value: 0x2 = CKR_HOST_MEMORY, expected 0xb3 = CKR_SESSION_HANDLE_INVALID
.../out-br/build/optee_test_ext-1.0/host/xtest/pkcs11_1000.c:1796: rv has an unexpected value: 0x2 = CKR_HOST_MEMORY, expected 0xb3 = CKR_SESSION_HANDLE_INVALID
...

They relate to pkcs11_1000.c source file tests:

line 1769:	for (n = 0; n < ARRAY_SIZE(sessions); n++) {
			...
line 1783	}
		...

line 1791:	/* Closing session with out bound and invalid IDs (or negative ID) */
line 1792:	rv = C_CloseSession(sessions[n - 1] + 1024);
line 1793:	ADBG_EXPECT_CK_RESULT(c, CKR_SESSION_HANDLE_INVALID, rv);
line 1794:	rv = C_CloseSession(CK_INVALID_HANDLE);
line 1795:	ADBG_EXPECT_CK_RESULT(c, CKR_SESSION_HANDLE_INVALID, rv);
line 1796:	rv = C_CloseSession(~0);
line 1797:	ADBG_EXPECT_CK_RESULT(c, CKR_SESSION_HANDLE_INVALID, rv);

I saw also some unexpected error traces from OP-TEE core that are emitted in the loop (that attempts to open more than a hundred sessions) before line 1792 above is called, so I think the issue is deeper than a libckteec (PKCS#11 crypoki client lib) issue:

E/TC:1 0 std_entry_with_parg:234 Bad arg address 0xc2451000
E/TC:1 0 std_entry_with_parg:234 Bad arg address 0xc266c000

When the issue occurs, that loop stops on a "host memory" error, whereas when pkcs11_1007 test succeeds (at it most often does), the loop stops on a "device memory" error, as expected.

I'll continue to investigate. I think they are some Linux OS resources that are not properly released in xtest.

@jenswi-linaro
Copy link
Contributor

E/TC:1 0 std_entry_with_parg:234 Bad arg address 0xc2451000

This error suggests that the normal world uses a non-secure address that OP-TEE isn't aware of.

@etienne-lms
Copy link
Contributor Author

E/TC:1 0 std_entry_with_parg:234 Bad arg address 0xc2451000

This error suggests that the normal world uses a non-secure address that OP-TEE isn't aware of.

I looks like client is invoking OP-TEE with an shm buffer that failed to be registered, but I can't find where this can happen.

Regarding pkcs11_1007 sporadic failures, i think it comes from that fact we try to invoke OP-TEE for closing invalid session handlers (lines 1792 to 1797 shown in the comment above) while OP-TEE core heap memory is already exhausted. I'll propose a change to release some resources before we use make these attempts.

etienne-lms added a commit to etienne-lms/optee_test that referenced this issue Sep 9, 2024
Close opened session after secure service memory is exhausted before
testing closure on invalid session. This change attempts to fix a
non systematic issue found as described in [1].

A test case is slightly changed: where a likely invalid session
ID (valid ID + 1024) was tried to be closed, this change now
tries to close a session that has been already closed.

Link: OP-TEE/optee_os#6952 [1]
Signed-off-by: Etienne Carriere <[email protected]>
etienne-lms added a commit to etienne-lms/optee_test that referenced this issue Sep 9, 2024
Close opened session after secure service memory is exhausted before
testing closure on invalid session. This change attempts to fix a
non systematic issue found as described in [1].

A test case is slightly changed: where a likely invalid session
ID (valid ID + 1024) was tried to be closed, this change now
tries to close a session that has been already closed.

Link: OP-TEE/optee_os#6952 [1]
Signed-off-by: Etienne Carriere <[email protected]>
Acked-by: Jens Wiklander <[email protected]>
etienne-lms added a commit to etienne-lms/optee_test that referenced this issue Sep 9, 2024
Close opened sessions after secure service memory is exhausted before
testing closure on invalid session. This change attempts to fix a
non systematic issue found as described in [1].

A test case is slightly changed: where a likely invalid session
ID (valid ID + 1024) was tried to be closed, this change now
tries to close a session that has been already closed.

Link: OP-TEE/optee_os#6952 [1]
Signed-off-by: Etienne Carriere <[email protected]>
Acked-by: Jens Wiklander <[email protected]>
jforissier pushed a commit to OP-TEE/optee_test that referenced this issue Sep 9, 2024
Close opened sessions after secure service memory is exhausted before
testing closure on invalid session. This change attempts to fix a
non systematic issue found as described in [1].

A test case is slightly changed: where a likely invalid session
ID (valid ID + 1024) was tried to be closed, this change now
tries to close a session that has been already closed.

Link: OP-TEE/optee_os#6952 [1]
Signed-off-by: Etienne Carriere <[email protected]>
Acked-by: Jens Wiklander <[email protected]>
Copy link

This issue has been marked as a stale issue because it has been open (more than) 30 days with no activity. Remove the stale label or add a comment, otherwise this issue will automatically be closed in 5 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Sep 21, 2024
@etienne-lms
Copy link
Contributor Author

FYI, P-R OP-TEE/optee_test#758 attempts to address the issue but it's not obvious to me it will really fix it. I'll let this Issue be closed is no such failure is reported by new P-Rs' CI tests.

@github-actions github-actions bot removed the Stale label Sep 24, 2024
@GseoC
Copy link
Contributor

GseoC commented Sep 27, 2024

Log: Reproduced today on https://github.com/OP-TEE/optee_os/pull/7014. However, I resubmitted my patches so I lost the CI logs :/

@jforissier
Copy link
Contributor

Log: Reproduced today on https://github.com/OP-TEE/optee_os/pull/7014. However, I resubmitted my patches so I lost the CI logs :/

Still visible here: https://github.com/OP-TEE/optee_os/actions/runs/11071874169/job/30764736296
Previous runs for PRs can be found on the GitHub Actions page: https://github.com/OP-TEE/optee_os/actions
But the logs are cleaned up after some time, that is true.

@etienne-lms
Copy link
Contributor Author

Thanks for reporting the issue. So it was not addressed.
The CI Test logs do not provide much information: CI test is not very verbose, I understand that. Only says "pkcs11_1007 FAIL". I'll respend time on this issue.

In the mean time, to not bother contributors with this sporadic issue, @jforissier would you be okay to specifically disable xtest pkcs11_1007. It does not cover any pkcs11 crypto or login tests, only session open/close specific cases.

@jforissier
Copy link
Contributor

Thanks for reporting the issue. So it was not addressed.
The CI Test logs do not provide much information: CI test is not very verbose, I understand that. Only says "pkcs11_1007 FAIL". I'll respend time on this issue.

We could make the CI show the full console output of the container (out/bin/serial0.log and out/bin/serial1.log) by setting DUMP_LOGS_ON_ERROR to a non-empty value. Not sure it would help much though.

In the mean time, to not bother contributors with this sporadic issue, @jforissier would you be okay to specifically disable xtest pkcs11_1007. It does not cover any pkcs11 crypto or login tests, only session open/close specific cases.

Fine with me. Adding XTEST_ARGS="-x pkcs11_1007" should be enough to skip that one.

etienne-lms added a commit to etienne-lms/optee_os that referenced this issue Sep 30, 2024
Exclude xtest case pkcs11_1007 from qemuv7 CI tests since we're facing
sporadic failures on this test on PKCS#11 sessions opening and release.
Occurrences of this issue have been found only on this Armv7-A platform.
Once the issue is solved, we be able to restore this test.

Link: OP-TEE#6952
Signed-off-by: Etienne Carriere <[email protected]>
@etienne-lms
Copy link
Contributor Author

Created #7061.
FYI, digging into CI test failures in https://github.com/OP-TEE/optee_os/actions history, I only saw occurrences for qemuv7 platform.

@jforissier
Copy link
Contributor

I don't recall seeing this failure on other platforms than qemuv7 indeed. Let's add QEMUv7 to the title.

@jforissier jforissier changed the title xtest pkcs11_1007 sporadic failure QEMUv7: xtest pkcs11_1007 sporadic failure Sep 30, 2024
etienne-lms added a commit to etienne-lms/optee_os that referenced this issue Sep 30, 2024
Exclude xtest case pkcs11_1007 from qemuv7 CI tests since we're facing
sporadic failures on this test on PKCS#11 sessions opening and release.
Occurrences of this issue have been found only on this Armv7-A platform.
Once the issue is solved, we be able to restore this test.

Link: OP-TEE#6952
Signed-off-by: Etienne Carriere <[email protected]>
Acked-by: Jens Wiklander <[email protected]>
Acked-by: Jerome Forissier <[email protected]>
jforissier pushed a commit that referenced this issue Oct 2, 2024
Exclude xtest case pkcs11_1007 from qemuv7 CI tests since we're facing
sporadic failures on this test on PKCS#11 sessions opening and release.
Occurrences of this issue have been found only on this Armv7-A platform.
Once the issue is solved, we be able to restore this test.

Link: #6952
Signed-off-by: Etienne Carriere <[email protected]>
Acked-by: Jens Wiklander <[email protected]>
Acked-by: Jerome Forissier <[email protected]>
Copy link

This issue has been marked as a stale issue because it has been open (more than) 30 days with no activity. Remove the stale label or add a comment, otherwise this issue will automatically be closed in 5 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Oct 31, 2024
@etienne-lms
Copy link
Contributor Author

The issue seems to have disappeared. However, I suggest we keep this entry opened, at least until next release tag (jan'25).

@jforissier
Copy link
Contributor

The issue seems to have disappeared. However, I suggest we keep this entry opened, at least until next release tag (jan'25).

If I'm not mistaken, 1007 has been disabled in CI, so what makes you think that the issue has disappeared?

@jforissier jforissier removed the Stale label Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants