[Question]: Error when testing distributed training of llama3 on Hygon DCU #9653

Open
zx214 opened this issue Dec 17, 2024 · 0 comments
zx214 commented Dec 17, 2024

1. Environment:
Z100, dtk-24.04.2
Paddle-related package versions inside the container:
paddle2onnx 1.3.1
paddlefsl 1.1.0
paddlenlp 3.0.0b2.post20241217
paddlenlp-ops 0.0.0
paddlepaddle-dcu 3.0.0.dev20241215
2. Single-node multi-card training works normally, but the distributed test fails with the following error:
python: /paddle/third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h:612: static void Eigen::internal::TensorExecutor<const Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1>, 0>, const Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op, const Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op>, const Eigen::TensorMap<Eigen::Tensor<const float, 1, 1>, 0>>, const Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_pow_op<const float, const float>>, const Eigen::TensorMap<Eigen::Tensor<const float, 1, 1>, 0>>>>, Eigen::GpuDevice, false, Eigen::internal::Off>::run(const Expression &, const Eigen::GpuDevice &) [Expression = const Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1>, 0>, const Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op, const Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op>, const Eigen::TensorMap<Eigen::Tensor<const float, 1, 1>, 0>>, const Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_pow_op<const float, const float>>, const Eigen::TensorMap<Eigen::Tensor<const float, 1, 1>, 0>>>>, Device = Eigen::GpuDevice, Vectorizable = false, Tiling = Eigen::internal::Off]: Assertion `hipGetLastError() == hipSuccess' failed.
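
Because single-node runs pass and only the distributed run fails, it may help to confirm that every node reports the same package versions as listed in item 1. The following is a minimal sketch using only the Python standard library. It assumes the distribution names match the ones in the environment list; `PACKAGES`, `collect_versions`, and `format_report` are hypothetical helper names introduced here, not part of PaddlePaddle or PaddleNLP.

```python
from importlib import metadata

# Distribution names copied from the environment list in this report.
PACKAGES = [
    "paddle2onnx",
    "paddlefsl",
    "paddlenlp",
    "paddlenlp-ops",
    "paddlepaddle-dcu",
]


def collect_versions(names):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def format_report(versions):
    """Render the versions as aligned 'name version' lines."""
    width = max(len(name) for name in versions)
    return "\n".join(f"{name.ljust(width)} {ver}" for name, ver in versions.items())


if __name__ == "__main__":
    # Run this inside the container on each node and compare the output.
    print(format_report(collect_versions(PACKAGES)))
```

Running the script on each node and diffing the output is one way to rule out a version mismatch between machines before looking further into the HIP assertion.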

zx214 added the question label Dec 17, 2024