[BUG] Deploying 3.11.7: after a GPU host joins as a node, the host container fails to start #21463
Comments
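@khw934 Turn on host-agent debug logging, restart the service, and see what it shows.
How do I turn on the debug log?
kubectl edit cm -n onecloud default-host
Should I be looking at the logs of the sdnagent container inside the host pod?
Look at the host container logs.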
root@test1024:~# kubectl logs -f default-host-6ggnn -n onecloud -c host
[error 2024-10-29 06:34:05 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:768)] exit status 1
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-sdnagent]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status yunion-host-sdnagent]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-sdnagent]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status openvswitch-switch]
[info 2024-10-29 06:34:06 hostinfo.(*SHostInfo).Init(hostinfo.go:205)] Start parseConfig
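These are the host container logs. Could you take a look and tell me what the problem is?
@khw934 Does dmesg show anything? What device is 0a:00.0? It looks like the host may be stuck binding a driver to this device.
root@hnrsjia-node:~# dmesg | grep '0a:00.0'
@khw934 Is 0a:00.0 already bound to the nvidia driver? You need to uninstall the nvidia driver first.
So the NVIDIA driver has to be uninstalled first? The host can't have the driver installed beforehand?
Right, the device has to be bound to the vfio driver.
Every device you want to pass through needs the nvidia driver removed.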
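The advice above comes down to checking which kernel driver currently owns the GPU. Below is a minimal sketch of that check, reading only the standard sysfs PCI layout. It does not unbind or rebind anything, and it is not taken from the Cloudpods code base. The function names and the `SYSFS_ROOT` override are hypothetical, introduced here for illustration, and the full address `0000:0a:00.0` assumes the device from the thread sits in PCI domain `0000`.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names) for inspecting PCI driver bindings.
# SYSFS_ROOT is an assumed override that lets the functions run against a fake tree.

# Print the kernel driver bound to a PCI device, or "none" if it is unbound.
pci_driver() {
    local addr="$1"
    local root="${SYSFS_ROOT:-/sys}"
    local link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last path component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Succeed only when the device is bound to vfio-pci, the state the maintainer
# says a passthrough device must be in.
is_passthrough_ready() {
    [ "$(pci_driver "$1")" = "vfio-pci" ]
}

# List the PCI addresses currently bound to a given driver, one per line.
devices_bound_to() {
    local driver="$1"
    local root="${SYSFS_ROOT:-/sys}"
    local entry
    # Device entries contain colons; control files such as bind/unbind do not.
    for entry in "$root/bus/pci/drivers/$driver"/*:*; do
        [ -e "$entry" ] || continue
        basename "$entry"
    done
}
```

After sourcing the file on the GPU host, `pci_driver 0000:0a:00.0` would print `nvidia` if the NVIDIA driver still holds the card, and `devices_bound_to nvidia` lists every device that driver currently owns.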
What happened:
Deployed 3.11.7. After the GPU host joined as a node, the host container fails to start.
Environment:
cat /etc/os-release:
root@hnrsjia-node:~# ovs-vsctl list
ovs-vsctl: 'list' command requires at least 1 arguments
root@hnrsjia-node:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@hnrsjia-node:~#
uname -a:
root@hnrsjia-node:~# uname -a
Linux hnrsjia-node 5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@hnrsjia-node:~#
dmidecode | egrep -i 'manufacturer|product' |sort -u:
root@hnrsjia-node:~# dmidecode | egrep -i 'manufacturer|product' |sort -u
Manufacturer: 3C0A
Manufacturer: DELTA
Manufacturer: Giga Computing
Manufacturer: Intel(R) Corporation
Manufacturer: Micron
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Module Manufacturer ID: Bank 1, Hex 0x2C
Module Product ID: 0x2C00
Product Name: G593-SD2-AAX1-000
Product Name: MSB3-G40-000
idProduct: 0xffb0
root@hnrsjia-node:~#
kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list