
[BUG] After deploying 3.11.7 and adding a GPU host as a node, the host container fails to start #21463

Open

khw934 opened this issue Oct 25, 2024 · 14 comments
khw934 commented Oct 25, 2024

Problem description / What happened:

After deploying 3.11.7 and adding a GPU host machine as a node, the host container fails to start.

Environment:

  • OS (e.g. cat /etc/os-release):

root@hnrsjia-node:~# ovs-vsctl list
ovs-vsctl: 'list' command requires at least 1 arguments
root@hnrsjia-node:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@hnrsjia-node:~#

  • Kernel (e.g. uname -a):

root@hnrsjia-node:~# uname -a
Linux hnrsjia-node 5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@hnrsjia-node:~#

  • Host: (e.g. dmidecode | egrep -i 'manufacturer|product' |sort -u)

root@hnrsjia-node:~# dmidecode | egrep -i 'manufacturer|product' | sort -u
Manufacturer: 3C0A
Manufacturer: DELTA
Manufacturer: Giga Computing
Manufacturer: Intel(R) Corporation
Manufacturer: Micron
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Module Manufacturer ID: Bank 1, Hex 0x2C
Module Product ID: 0x2C00
Product Name: G593-SD2-AAX1-000
Product Name: MSB3-G40-000
idProduct: 0xffb0
root@hnrsjia-node:~#

  • Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):

(screenshots of the climc version-list output attached)

khw934 added the bug (Something isn't working) label on Oct 25, 2024
wanyaoqi (Member) commented

@khw934 Enable the host-agent debug log, restart the service, and take a look.

khw934 (Author) commented Oct 28, 2024

@khw934 Enable the host-agent debug log, restart the service, and take a look.

How do I enable the debug log?

wanyaoqi (Member) commented Oct 28, 2024

kubectl edit cm -n onecloud default-host
Set log_level: debug, then restart the host service.
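
For reference, a minimal sketch of the whole sequence from the control node. The pod name default-host-6ggnn is taken from the logs below; the assumption that deleting the pod is enough to restart it (i.e. that it is managed by a DaemonSet) may not hold in every deployment:

# open the configmap and set log_level: debug
kubectl edit cm -n onecloud default-host

# recreate the host pod on the affected node so it picks up the new level
kubectl -n onecloud delete pod default-host-6ggnn

# follow the host container logs of the recreated pod (it will get a new name suffix)
kubectl -n onecloud logs -f <default-host-pod> -c host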

khw934 (Author) commented Oct 28, 2024

kubectl edit cm -n onecloud default-host, set log_level: debug, then restart the host service.

Should I look at the logs of the sdnagent container inside the host pod?

wanyaoqi (Member) commented

Check the host container logs.

khw934 (Author) commented Oct 29, 2024

Check the host container logs.

root@test1024:~# kubectl logs -f default-host-6ggnn -n onecloud -c host
[info 241029 06:34:04 procutils.WaitZombieLoop(zombie_others.go:36)] My pid is not 1 and no need to wait zombies
[info 241029 06:34:04 options.parseOptions(options.go:336)] Use configuration file: /etc/yunion/host.conf
[warning 241029 06:34:04 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1215)] Cannot find argument enable-rbac
[warning 241029 06:34:04 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1215)] Cannot find argument health-driver
[warning 241029 06:34:04 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1215)] Cannot find argument start-host-ignore-sys-error
[warning 241029 06:34:04 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1215)] Cannot find argument enable-health-checker
[warning 241029 06:34:04 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1215)] Cannot find argument enable-qmp-monitor
[warning 241029 06:34:04 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1215)] Cannot find argument disk-is-ssd
[info 241029 06:34:04 options.parseOptions(options.go:359)] Set log level to "info"
[info 2024-10-29 06:34:04 options.parseOptions(options.go:336)] Use configuration file: /etc/yunion/common/common.conf
[info 2024-10-29 06:34:04 options.parseOptions(options.go:359)] Set log level to "debug"
[info 2024-10-29 06:34:04 hostman.(*SHostService).InitService(host_services.go:64)] exec socket path: /var/run/onecloud/exec.sock
[info 2024-10-29 06:34:04 app.InitApp(app.go:32)] RequestWorkerCount: 8
[info 2024-10-29 06:34:04 appsrv.NewApplication(appsrv.go:121)] App hostId: 0sDXtS_7MHXsPwLc5O45HUYhuBA= (host,hnrsjia-node,10.21.8.23)
2024/10/29 06:34:04 Allow hosts []
[info 2024-10-29 06:34:04 appsrv.(*Application).SetDefaultTimeout(appsrv.go:137)] adjust application default timeout to 60.000000 seconds
[info 2024-10-29 06:34:04 hostinfo.DetectCpuInfo(hostinfohelper.go:78)] cpuinfo freq 2000
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: dmidecode [-t 4]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: uname [-m]
[info 2024-10-29 06:34:04 hostinfo.NewHostInfo(hostinfo.go:2445)] CPU Model Intel(R) Xeon(R) Platinum 8480+ Microcode 0x2b0005c0
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: dmidecode [-t 17]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: cat [/var/lib/kubelet/config.yaml]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: docker [info --format {{json .}}]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: findmnt [-n -o SOURCE --target /opt/docker]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: findmnt [-n -o SOURCE --target /]
[info 2024-10-29 06:34:04 hostinfo.NewHostInfo(hostinfo.go:2465)] Get kubelet container image Fs: /opt/docker, eviction config: {"evictionHard":{"imagefs.available":{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}},"memory.available":{"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0}},"nodefs.available":{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}},"nodefs.inodesFree":{"Signal":"nodefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}}}}
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: mkdir [-p /opt/cloud/workspace/servers]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: mkdir [-p /opt/cloud/workspace/memory_snapshots]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: mkdir [-p /opt/cloud/workspace/run/backups]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: ethtool [-h]
[debug 2024-10-29 06:34:04 procutils.(*Command).Output(procutils.go:95)] Exec command: tuned-adm [profile virtual-host]
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[info 2024-10-29 06:34:05 hostinfo.(*SHostInfo).prepareEnv(hostinfo.go:411)] I/O Scheduler switch to none
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: mount []
[info 2024-10-29 06:34:05 fileutils2.ChangeBlkdevParameter(fileutils.go:203)] Set queue/scheduler of sda to none
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: modprobe [tun]
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: modprobe [vhost_net]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/cpuset]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/cpu,cpuacct]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/cpu,cpuacct]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/blkio]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/memory]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/devices]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/freezer]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/net_cls,net_prio]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/perf_event]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/net_cls,net_prio]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/hugetlb]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/pids]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/rdma]
[debug 2024-10-29 06:34:05 procutils.(*Command).Run(procutils.go:86)] Exec command: mountpoint [/sys/fs/cgroup/misc]
[info 2024-10-29 06:34:05 hostinfo.(*SHostInfo).getKubeReservedMemMb(hostinfo.go:1572)] Kubelet memory threshold subtracted: 100MB
[info 2024-10-29 06:34:05 hostinfo.(*SHostInfo).Init(hostinfo.go:196)] Start detectHostInfo
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: dmidecode [-t 1]
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: dmidecode [-t 2]
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: lsmod []
[info 2024-10-29 06:34:05 hostinfo.(*SHostInfo).detectKVMMaxCpus(hostinfo.go:885)] KVM API VERSION 12
[info 2024-10-29 06:34:05 hostinfo.(*SHostInfo).detectKVMMaxCpus(hostinfo.go:890)] KVM CAP MAX VCPUS: 1024
[info 2024-10-29 06:34:05 hostinfo.(*SHostInfo).detectKVMMaxCpus(hostinfo.go:898)] KVM CAP NR VCPUS: 710
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: modinfo [kvm-intel]
[info 2024-10-29 06:34:05 sysutils.detectNestSupport(kvm.go:146)] Host is support kvm nest ...
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: lsmod []
[info 2024-10-29 06:34:05 sysutils.detectNestSupport(kvm.go:151)] Host kvm nest is enabled ...
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: sh [-c ls /etc/*elease]
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: cat [/etc/lsb-release]
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: cat [/etc/os-release]
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: cat []
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:98)] Execute command "cat " , error: exit status 1 , output: cat: '': No such file or directory

[error 2024-10-29 06:34:05 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:768)] exit status 1
[info 2024-10-29 06:34:05 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:778)] DetectOsDist
[error 2024-10-29 06:34:05 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:780)] Failed to detect distribution info
[debug 2024-10-29 06:34:05 procutils.(*Command).Output(procutils.go:95)] Exec command: cat [/etc/os-release]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [cat -- openvswitch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:89)] Execute command "systemctl cat -- openvswitch" , error: exit status 1
[warning 2024-10-29 06:34:06 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:799)] system_service.SetOpenvswitchName to openvswitch-switch
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: uname [-r]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ls [-la1n --full-time /usr/local]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: stat [-c {"file_size":%s,"file_name":"%n","file_type":"%F"} /usr/local/qemu-4.2.0/bin/qemu-system-x86_64]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ls [-la1n --full-time /usr/local]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: stat [-c {"file_size":%s,"file_name":"%n","file_type":"%F"} /usr/local/qemu-4.2.0/bin/qemu-system-x86_64]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: /usr/local/qemu-4.2.0/bin/qemu-system-x86_64 [-version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: stat [-c {"file_size":%s,"file_name":"%n","file_type":"%F"} /usr/local/qemu-4.2.0/bin/qemu-system-x86_64]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: /usr/local/qemu-4.2.0/bin/qemu-system-x86_64 [--version]
[info 2024-10-29 06:34:06 hostinfo.(*SHostInfo).detectQemuVersion(hostinfo.go:852)] Detect qemu version is 4.2.0
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ovs-vsctl [--version]
[info 2024-10-29 06:34:06 hostinfo.(*SHostInfo).detectOvsVersion(hostinfo.go:993)] Detect OVS version is 2.12.4
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: modinfo [openvswitch]
[info 2024-10-29 06:34:06 hostinfo.(*SHostInfo).detectOvsKOVersion(hostinfo.go:1010)] kernel module openvswitch vermagic: 5.15.0-25-generic SMP mod_unload modversions
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status yunion-host-sdnagent]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl status yunion-host-sdnagent" , error: exit status 4 , output: Unit yunion-host-sdnagent.service could not be found.

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-sdnagent]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled yunion-host-sdnagent" , error: exit status 1 , output: Failed to get unit file state for yunion-host-sdnagent.service: No such file or directory

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status yunion-host-sdnagent]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl status yunion-host-sdnagent" , error: exit status 4 , output: Unit yunion-host-sdnagent.service could not be found.

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-sdnagent]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled yunion-host-sdnagent" , error: exit status 1 , output: Failed to get unit file state for yunion-host-sdnagent.service: No such file or directory

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl status yunion-host-deployer" , error: exit status 4 , output: Unit yunion-host-deployer.service could not be found.

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled yunion-host-deployer" , error: exit status 1 , output: Failed to get unit file state for yunion-host-deployer.service: No such file or directory

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl status yunion-host-deployer" , error: exit status 4 , output: Unit yunion-host-deployer.service could not be found.

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled yunion-host-deployer]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled yunion-host-deployer" , error: exit status 1 , output: Failed to get unit file state for yunion-host-deployer.service: No such file or directory

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl status telegraf" , error: exit status 4 , output: Unit telegraf.service could not be found.

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled telegraf" , error: exit status 1 , output: Failed to get unit file state for telegraf.service: No such file or directory

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl status telegraf" , error: exit status 4 , output: Unit telegraf.service could not be found.

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled telegraf]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled telegraf" , error: exit status 1 , output: Failed to get unit file state for telegraf.service: No such file or directory

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled openvswitch-switch" , error: exit status 1 , output: disabled

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled openvswitch-switch" , error: exit status 1 , output: disabled

[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [status openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: systemctl [is-enabled openvswitch-switch]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "systemctl is-enabled openvswitch-switch" , error: exit status 1 , output: disabled

[info 2024-10-29 06:34:06 hostinfo.(*SHostInfo).Init(hostinfo.go:205)] Start parseConfig
[info 2024-10-29 06:34:06 hostinfo.NewNIC(hostinfohelper.go:241)] IP 10.21.8.23/br0/bond0
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: ethtool [-K bond0 tso on gso on ufo on lro on gro on tx on rx on sg on]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ovs-vsctl [list-br]
[info 2024-10-29 06:34:06 hostbridge.(*SBaseBridgeDriver).ConfirmToConfig(hostbridge.go:180)] bridge br0 already has ip 10.21.8.23
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ovs-vsctl [list-ifaces br0]
[info 2024-10-29 06:34:06 hostinfo.NewNIC(hostinfohelper.go:291)] Confirm to configuration!!
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ovs-vsctl [set Bridge br0 other-config:hwaddr=e8:eb:d3:55:66:42]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ovs-vsctl [set Interface br0 mtu_request=1450]
[info 2024-10-29 06:34:06 hostinfo.(*SNIC).SetupDhcpRelay(hostinfohelper.go:203)] Not enable dhcp relay on nic: &hostinfo.SNIC{Inter:"bond0", Bridge:"br0", Ip:"10.21.8.23", Wire:"", WireId:"", Mask:24, Bandwidth:1000, BridgeDev:(*hostbridge.SOVSBridgeDriver)(0xc0016fd920), dhcpServer:(*hostdhcp.SGuestDHCPServer)(0xc0018bc750)}
[info 2024-10-29 06:34:06 hostinfo.(*SHostInfo).setupOvnChassis(hostinfo.go:223)] Start setting up ovn chassis
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: systemctl [--version]
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: ovs-vsctl [set Open_vSwitch . external_ids:ovn-bridge=brvpc external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=10.21.8.23 external_ids:ovn-remote=tcp:10.102.135.143:32242]
[error 2024-10-29 06:34:06 auth.(*authManager).startRefreshRevokeTokens(auth.go:193)] refreshRevokeTokens: No valid admin token credential
[info 2024-10-29 06:34:06 hostman.(*SHostService).RunService.func1(host_services.go:85)] Auth complete!!
[debug 2024-10-29 06:34:06 policy.(*SPolicyManager).init(policy.go:148)] DefaultPolicyFetcher: b58000 RemotePolicyFetcher: b58000
[info 2024-10-29 06:34:06 policy.(*SPolicyManager).init(policy.go:160)] policy fetch worker count 1
[info 2024-10-29 06:34:06 consts.SetNonDefaultDomainProjects(consts.go:109)] set non_default_domain_projects to false
[debug 2024-10-29 06:34:06 syncman.(*SSyncManager).SyncOnce(sync.go:80)]AuthManager: SyncOnce isFirst false isTimeout false
[info 2024-10-29 06:34:06 options.StartOptionManagerWithSessionDriver(manager.go:68)] OptionManager start to fetch service configs with interval 30m0s ...
[info 2024-10-29 06:34:06 watcher.(*SInformerSyncManager).startWatcher(watcher.go:83)]EndpointChangeManager: Start resource informer watcher for endpoint
[info 2024-10-29 06:34:06 options.optionsEquals(manager.go:120)] Options added: {"api_server":"https://10.21.8.17"}
[debug 2024-10-29 06:34:06 options.OnBaseOptionsChange(changes.go:63)] api_server changed from to https://10.21.8.17
[info 2024-10-29 06:34:06 watcher.(*SInformerSyncManager).startWatcher(watcher.go:83)]ServiceConfigManager: Start resource informer watcher for service
[debug 2024-10-29 06:34:06 procutils.(*Command).Run(procutils.go:86)] Exec command: mkdir [-p /opt/cloud/workspace/servers/logs]
[info 2024-10-29 06:34:06 guestman.(*SGuestManager).InitQemuMaxCpus(guestman.go:147)] KVM max cpus count: 710
[info 2024-10-29 06:34:06 guestman.(*SGuestManager).InitQemuMaxCpus(guestman.go:165)] Machine type pc max cpus: 240
[info 2024-10-29 06:34:06 guestman.(*SGuestManager).InitQemuMaxCpus(guestman.go:165)] Machine type q35 max cpus: 240
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: which [python]
[debug 2024-10-29 06:34:06 etcd.(*SEtcdClient).Unwatch(etcd.go:369)] prefix / not watched!!
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:98)] Execute command "which python" , error: exit status 1 , output:
[info 2024-10-29 06:34:06 guestman.(*SGuestManager).InitPythonPath(guestman.go:187)] No python found : exit status 1
[debug 2024-10-29 06:34:06 procutils.(*Command).Output(procutils.go:95)] Exec command: which [python3]
[info 2024-10-29 06:34:06 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched
[info 2024-10-29 06:34:06 guestman.(*SGuestManager).InitPythonPath.func1(guestman.go:180)] Python path /usr/bin/python3
[info 2024-10-29 06:34:06 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success.
[debug 2024-10-29 06:34:06 etcd.(*SEtcdClient).Unwatch(etcd.go:369)] prefix / not watched!!
[info 2024-10-29 06:34:06 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched
[info 2024-10-29 06:34:06 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success.
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).ensureMasterNetworks(hostinfo.go:1208)] Master ip 10.21.8.23 to fetch wire
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).initZoneInfo(hostinfo.go:1252)] Start GetZoneInfo e176c0ea-6728-4801-8b4c-fd3d1f14f957
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).ensureHostRecord(hostinfo.go:1294)] Master MAC: e8:eb:d3:55:66:42
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).initHostRecord(hostinfo.go:1153)] host health manager on host down
[warning 2024-10-29 06:34:08 hostinfo.(*SHostInfo).isVirtualFunction(hostinfo.go:1650)] failed get nic eno1 phys_port_name: read /sys/class/net/eno1/phys_port_name: operation not supported
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1730)] upload physical nic: eno1(10:ff:e0:30:5e:4e)
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1747)] Upload NIC br: if:eno1
[warning 2024-10-29 06:34:08 hostinfo.(*SHostInfo).isVirtualFunction(hostinfo.go:1650)] failed get nic eno2 phys_port_name: read /sys/class/net/eno2/phys_port_name: operation not supported
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1730)] upload physical nic: eno2(10:ff:e0:30:5e:4f)
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1747)] Upload NIC br: if:eno2
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).isVirtualFunction(hostinfo.go:1657)] nic ens9f0np0 is not virtual function
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).isVirtualFunction(hostinfo.go:1657)] nic ens9f1np1 is not virtual function
[info 2024-10-29 06:34:08 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1747)] Upload NIC br:br0 if:bond0
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: modprobe [vfio]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: modprobe [vfio_iommu_type1]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: modprobe [vfio-pci]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -nnmm]
[info 2024-10-29 06:34:08 isolated_device.getPassthroughGPUs(gpu.go:86)] filter address [], enableWhiteList: false
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:00.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:00.1]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:00.2]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:00.4]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:0f.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:14.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:14.2]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:14.4]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:15.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:16.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:16.1]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:16.4]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:17.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:18.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:1a.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:1f.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:1f.4]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 00:1f.5]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 01:00.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 02:00.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -nnmm -s 01:00.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 01:00.0]
[debug 2024-10-29 06:34:08 procutils.(*Command).Output(procutils.go:95)] Exec command: find [/sys/devices -name boot_vga]
[info 2024-10-29 06:34:09 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:385)] PCI address 02:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:0f.0/0000:01:00.0/0000:02:00.0/boot_vga
[info 2024-10-29 06:34:09 isolated_device.getPassthroughGPUs(gpu.go:118)] skip boot vga device 02:00.0
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 03:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 03:00.1]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 04:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 04:00.1]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 04:00.2]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 04:00.4]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 04:01.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 05:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 06:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 06:04.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 06:08.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 06:0c.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 06:10.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 07:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 08:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 08:10.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 09:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -vvv -s 0a:00.0]
[debug 2024-10-29 06:34:09 procutils.(*Command).Output(procutils.go:95)] Exec command: find [/sys/devices -name boot_vga]
[debug 2024-10-29 06:34:10 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -k -s 0a:00.0]
[debug 2024-10-29 06:34:10 procutils.(*Command).Output(procutils.go:95)] Exec command: bash [-c lspci -k -s 0a:00.0]

khw934 (Author) commented Oct 29, 2024

These are the host container logs. Could you take a look and tell me what the problem is?

wanyaoqi (Member) commented

@khw934 Does dmesg show anything? What device is 0a:00.0? It looks like it may be stuck binding a driver to that device.
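
For reference, two standard commands that answer both questions at once; the PCI address comes from the last lines of the log above:

# show the device's vendor/device IDs and which kernel driver currently claims it
lspci -nnk -s 0a:00.0

# check kernel messages mentioning that address
dmesg | grep '0a:00.0'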

khw934 (Author) commented Oct 30, 2024

@khw934 Does dmesg show anything? What device is 0a:00.0? It looks like it may be stuck binding a driver to that device.

root@hnrsjia-node:~# dmesg | grep '0a:00.0'
[ 19.690812] pci 0000:0a:00.0: [10de:2330] type 00 class 0x030200
[ 19.690829] pci 0000:0a:00.0: reg 0x10: [mem 0x5e042000000-0x5e042ffffff 64bit pref]
[ 19.690840] pci 0000:0a:00.0: reg 0x18: [mem 0x5a000000000-0x5bfffffffff 64bit pref]
[ 19.690851] pci 0000:0a:00.0: reg 0x20: [mem 0x5e040000000-0x5e041ffffff 64bit pref]
[ 19.690864] pci 0000:0a:00.0: enabling Extended Tags
[ 19.690896] pci 0000:0a:00.0: Enabling HDA controller
[ 19.691006] pci 0000:0a:00.0: reg 0x274: [mem 0x9e900000-0x9e93ffff]
[ 19.691008] pci 0000:0a:00.0: VF(n) BAR0 space: [mem 0x9e900000-0x9f0fffff] (contains BAR0 for 32 VFs)
[ 19.691017] pci 0000:0a:00.0: reg 0x278: [mem 0x5c000000000-0x5c0ffffffff 64bit pref]
[ 19.691018] pci 0000:0a:00.0: VF(n) BAR1 space: [mem 0x5c000000000-0x5dfffffffff 64bit pref] (contains BAR1 for 32 VFs)
[ 19.691027] pci 0000:0a:00.0: reg 0x280: [mem 0x5e000000000-0x5e001ffffff 64bit pref]
[ 19.691028] pci 0000:0a:00.0: VF(n) BAR3 space: [mem 0x5e000000000-0x5e03fffffff 64bit pref] (contains BAR3 for 32 VFs)
[ 19.691186] pci 0000:0a:00.0: 64.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x16 link at 0000:06:00.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[ 20.091881] pci 0000:0a:00.0: Adding to iommu group 31
[ 359.513847] nvidia 0000:0a:00.0: enabling device (0100 -> 0102)
[ 2201.566019] NVRM: Attempting to remove device 0000:0a:00.0 with non-zero usage count!
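
The last NVRM line suggests the nvidia driver could not release the device because something still had it open. Two generic checks for what is holding the GPU (both only make sense while the NVIDIA driver is still loaded):

# list processes currently using the GPU
nvidia-smi

# list processes holding the NVIDIA device nodes open
fuser -v /dev/nvidia*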

wanyaoqi (Member) commented

@khw934 Is 0a:00.0 already bound to the nvidia driver? You need to unload the nvidia driver first.

khw934 (Author) commented Oct 30, 2024

@khw934 Is 0a:00.0 already bound to the nvidia driver? You need to unload the nvidia driver first.

So the NVIDIA driver has to be unloaded first? The host cannot have the driver installed beforehand?

wanyaoqi (Member) commented Oct 30, 2024

Correct, the device needs to be bound to the vfio driver.
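
For reference, a minimal sketch of handing the GPU from the nvidia driver to vfio-pci on the host. The device ID 10de:2330 and address 0a:00.0 are taken from the dmesg output above, and the host agent already loads the vfio modules itself (see the modprobe vfio lines in the log), so this only covers the part it cannot do while nvidia still holds the card:

# stop all GPU consumers, then unload the NVIDIA kernel modules
modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia

# let vfio-pci claim the now-unbound device
modprobe vfio-pci
echo "10de 2330" > /sys/bus/pci/drivers/vfio-pci/new_id

# verify: should report "Kernel driver in use: vfio-pci"
lspci -nnk -s 0a:00.0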

khw934 (Author) commented Oct 30, 2024

After unloading the GPU driver, it works. One more question: does the NVLink/NVSwitch driver also need to be unloaded? When selecting cards for a GPU VM, is NVLink taken into account?

(screenshot attached)

wanyaoqi (Member) commented

Any device you want to pass through must have its driver unloaded.
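
For reference, one common way on Ubuntu to keep passthrough GPUs away from the NVIDIA/nouveau drivers across reboots is to blacklist those modules and let vfio-pci claim the device IDs at boot. This is generic Linux configuration, not something specific to this project, and depending on the setup vfio-pci may also need to be added to the initramfs so it loads early enough:

# /etc/modprobe.d/vfio-passthrough.conf (hypothetical file name)
blacklist nouveau
blacklist nvidia
options vfio-pci ids=10de:2330

# rebuild the initramfs and reboot
update-initramfs -u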
