
Unable to replicate mgmt01 #533

Open
marcheschi opened this issue Jan 7, 2021 · 18 comments

@marcheschi
Contributor

marcheschi commented Jan 7, 2021

Hi
I tried several ways to do a replication from the DC web interface, but I got the error:

Got bad return code (1). Error: msg='Command failed: cannot create 'zones/mgmtreplica-disk0': 'refreservation' is greater than current volume size

', time_elapsed=2, rc=1, repname='mgmtreplica'  

It is strange that the error says 'refreservation' is greater than current volume size, yet:
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0 volsize 10G
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0 refreservation 8G

This is the output of the zfs get all command:

NAME                                              PROPERTY              VALUE                                             SOURCE
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  type                  volume                                            -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  creation              Fri Aug 14 10:04 2020                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  used                  19.3G                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  available             1.30T                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  referenced            7.11G                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  compressratio         1.24x                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  origin                zones/cc4c3755-5615-4f2d-9b37-a286c17bb66b@final  -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  reservation           none                                              default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  volsize               10G                                               local
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  volblocksize          4K                                                -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  checksum              on                                                default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  compression           lz4                                               local
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  readonly              off                                               default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  createtxg             449                                               -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  copies                1                                                 default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  refreservation        9G                                                local
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  guid                  16702331967044663069                              -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  primarycache          all                                               default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  secondarycache        all                                               default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  usedbysnapshots       4.83G                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  usedbydataset         5.46G                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  usedbychildren        0                                                 -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  usedbyrefreservation  8.98G                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  logbias               latency                                           default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  dedup                 off                                               default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  mlslabel              none                                              default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  sync                  standard                                          default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  refcompressratio      1.20x                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  written               20.7M                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  logicalused           12.5G                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  logicalreferenced     8.46G                                             -
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  snapshot_limit        none                                              default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  snapshot_count        none                                              default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  redundant_metadata    all                                               default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  encryption            off                                               default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  keylocation           none                                              default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  keyformat             none                                              default
zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0  pbkdf2iters           0                                                 default
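
A quick way to compare just the two properties involved in that error (dataset name taken from the output above) is, for example:

# -p prints exact (parsable) values, which makes the comparison unambiguous
zfs get -p volsize,refreservation zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0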

I also tried rebooting mgmt01 and reloading the new VM settings from the node:

es put /vm/mgmt01.local/status/current -force

as well as setting refreservation to auto.
If I try to create another CentOS 7 Linux VM, the replica command works fine.
The DC version is 4.4.
Any help is appreciated
Thank you
Paolo

@marcheschi marcheschi reopened this Apr 1, 2021
@marcheschi
Contributor Author

I tried again and got the same error:
Got bad return code (1). Error: msg='Command failed: cannot create 'zones/replica1-disk0': 'refreservation' is greater than current volume size

', time_elapsed=3, rc=1, repname='replica1'

Also with mon01:

Got bad return code (1). Error: msg='cannot receive: local origin for clone zones/replica1-disk0@is-22075 does not exist ', time_elapsed=15, rc=1, repname='replica1'

@marcheschi
Contributor Author

Replication works with other VMs; I tried with a KVM VM and it worked.

@marcheschi
Contributor Author

Tried again today for mon01:
[2021-04-02 07:17:12,671: ERROR/MainProcess] Task api.vm.replica.tasks.vm_replica_cb[9e1d2-228bf285-843c-4d0e-8812] raised unexpected: TaskException({u'message': 'Got bad return code (1)', u'returncode': 1, u'meta': {u'exec_time': u'2021-04-02T07:16:58.803556', u'slave_vm_uuid': u'87b9b90d-94d6-41d9-a021-5aecc5dfa207', 'caller': u'9e1d2-b3bdb940-2170-4726-9fdb', u'apiview': {u'hostname': u'mon01.local', u'view': u'vm_replica', u'method': u'POST', u'repname': u'replica1'}, u'nolog': False, u'finish_time': u'2021-04-02T07:17:08.168644', u'vm_uuid': u'a28faa4d-d0ee-4593-938a-f0d062022b02', u'msg': u'Create server replica'}, 'detail': u"Got bad return code (1). Error: msg='cannot receive: local origin for clone zones/replica1-disk0@is-22075 does not exist\n', time_elapsed=10, rc=1, repname='replica1'"},)
Traceback (most recent call last):
File "/opt/erigones/envs/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/erigones/que/mgmt.py", line 48, in call
return super(MgmtCallbackTask, self).call(*args, **kwargs) # run()
File "/opt/erigones/envs/lib/python2.7/site-packages/celery/app/trace.py", line 438, in protected_call
return self.run(*args, **kwargs)
File "/opt/erigones/api/task/utils.py", line 284, in inner
raise e
TaskException: {u'message': 'Got bad return code (1)', u'returncode': 1, u'meta': {u'exec_time': u'2021-04-02T07:16:58.803556', u'slave_vm_uuid': u'87b9b90d-94d6-41d9-a021-5aecc5dfa207', 'caller': u'9e1d2-b3bdb940-2170-4726-9fdb', u'apiview': {u'hostname': u'mon01.local', u'view': u'vm_replica', u'method': u'POST', u'repname': u'replica1'}, u'nolog': False, u'finish_time': u'2021-04-02T07:17:08.168644', u'vm_uuid': u'a28faa4d-d0ee-4593-938a-f0d062022b02', u'msg': u'Create server replica'}, 'detail': u"Got bad return code (1). Error: msg='cannot receive: local origin for clone zones/replica1-disk0@is-22075 does not exist\n', time_elapsed=10, rc=1, repname='replica1'"}

cannot receive: local origin for clone zones/replica1-disk0@is-22075 does not exist\n'
Do you know how to fix this?
Paolo

@YanChii
Contributor

YanChii commented Apr 2, 2021

I suspect it really has something to do with refreservation. When you create a fresh VM, what is the refreservation setting? Try to set the same number for the mgmt01 refreservation as well, and reload the VM using the API as you did before.
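
A minimal sketch of that, assuming the dataset name and the reload command already shown earlier in this thread (adjust the pool/UUID and value to your environment):

# on the compute node: make refreservation match the 10G volsize
zfs set refreservation=10G zones/f7860689-c435-4964-9f7d-2d2d70cfe389-disk0
# then reload the VM settings from the node, as done before
es put /vm/mgmt01.local/status/current -force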

@marcheschi
Contributor Author

The refreservation of a newly created VM is the same as the disk size, and it is the same for mon01 and mgmt01.
The migration of mgmt01 works fine, though.

@YanChii
Contributor

YanChii commented Apr 2, 2021

Those would be the ideal settings. Is it possible to set such a refreservation value for the existing mgmt01?

@marcheschi
Contributor Author

I found out that it works on certain nodes, and after it has worked once, it works again even on the node where it was previously failing.
I think there is a bug or a misconfiguration.
Paolo

@marcheschi
Contributor Author

I tried to replicate mon01 in the same way, but after an hour it was stuck in pending status, so I had to kill the task and restart the VM.
Now it appears to have a replica, but it is impossible to delete it:
Got bad return code (1). Error: msg='zoneadm: replica1: No such zone configured
', time_elapsed=1, rc=1, repname='replica1'

How can I clear it?

@marcheschi
Contributor Author

marcheschi commented Apr 7, 2021

I tried to delete via es:
./es DELETE /vm/a28faa4d-d0ee-4593-938a-f0d062022b02/replica/replica1 -force
but I got the same error:

 "result": {
            "message": "Got bad return code (1)", 
            "returncode": 1, 
            "detail": "Got bad return code (1). Error: msg='zoneadm: replica1: No such zone configured\n', time_elapsed=1, rc=1, repname='replica1'"
        }, 

Maybe restoring the VM from a snapshot or backup would clear it, but I do not want to make things worse.
Paolo

@YanChii
Contributor

YanChii commented Apr 7, 2021

If you are sure the actual destination replica VM was not created, it should be enough to delete the replica VM entry in the DC database. But we need to find it first.

Log in to mgmt01.local from the 1st compute node and run:

ctl.sh shell
> from vms.models import Vm, SlaveVm
> Vm.objects.all()
> vm = Vm.objects.get(hostname='mon01.local')
> vm.slave_vms      # see if the replica VM is there
# if it's there, select it
> ghost_vm = SlaveVm.get_by_uuid(vm.slave_vms[0])
> ghost_vm.delete()    # and delete

# if it's not there, search the replica DB
> SlaveVm.objects.all()
> ghost_vm = SlaveVm.objects.all()[0]       # "0" means select the first VM. If you have multiple slave VMs, select the appropriate one
> ghost_vm      # see if it's the right one
> ghost_vm.delete()
> exit()

Then check GUI and see if the VM is unlocked.

Jan

@YanChii
Contributor

YanChii commented Apr 7, 2021

I was running some tests, and the 'refreservation' is greater than current volume size issue goes away when you grow the disk of mon01 or mgmt01. That sets the correct refreservation.

Either way, an incorrectly set refreservation after install is a bug, and we'll fix it for new installations.
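
After growing the disk, one way to verify that the numbers were actually recalculated on the node (UUID and dataset names taken from this thread) is:

# size and refreservation are in MB and should now be equal
vmadm get a28faa4d-d0ee-4593-938a-f0d062022b02 | json disks.0.size
vmadm get a28faa4d-d0ee-4593-938a-f0d062022b02 | json disks.0.refreservation
# the same check at the ZFS level
zfs get volsize,refreservation zones/a28faa4d-d0ee-4593-938a-f0d062022b02-disk0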

Jan

@marcheschi
Contributor Author

OK, I did it; it was there:

[root@mgmt01 ~]# ctl.sh shell
Python 2.7.5 (default, Aug  7 2019, 00:51:29) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from vms.models import Vm, SlaveVm
>>> Vm.objects.all()
[<Vm: para2>, <Vm: _replica-1-mgmt01.local>, <Vm: fallot>, <Vm: dcm4cheiehr>, <Vm: bwapp>, <Vm: openrom>, <Vm: dc4drivedb>, <Vm: prova2>, <Vm: sunbuild>, <Vm: mgmt01.local>, <Vm: medivisor>, <Vm: _replica-1-mirto391lx>, <Vm: gazzella>, <Vm: _replica-1-sunbuild>, <Vm: qaz1>, <Vm: smart1>, <Vm: python>, <Vm: imgapi>, <Vm: esdc-factory>, <Vm: hybrid>, '...(remaining elements truncated)...']
>>> vm = Vm.objects.get(hostname='mon01.local')
>>> vm.slave_vms
[UUID('d12b8b22-88b7-4970-81ec-5655a27b0a7f')]
>>> ghost_vm = SlaveVm.get_by_uuid(vm.slave_vms[0])
>>> ghost_vm.delete() 
>>> vm = Vm.objects.get(hostname='mon01.local')
>>> vm.slave_vms
[]

Doing that unlocked the DC interface.
But then I grew the disk from 20 GB to 25 GB on mon01 and tried again to create a replica, and got the same error:

Got bad return code (1). Error: msg='cannot receive: local origin for clone zones/replica1-disk0@is-22461 does not exist ', time_elapsed=10, rc=1, repname='replica1'

Replication is still not working.
Are you sure this is a problem of refreservation settings?
Thank you
Paolo

@YanChii
Contributor

YanChii commented Apr 8, 2021

I was able to fix it on a fresh installation. Check this:

vmadm get a28faa4d-d0ee-4593-938a-f0d062022b02 | json disks.0.size
vmadm get a28faa4d-d0ee-4593-938a-f0d062022b02 | json disks.0.refreservation

They should be the same.
If they are not, set refreservation equal to the size:

zfs set refreservation=10G "zones/a28faa4d-d0ee-4593-938a-f0d062022b02-disk0"

But the problem is still propagated in the DC database:

[root@mgmt01 ~]# ctl.sh shell
>>> from vms.models import Vm, SlaveVm
>>> vm = Vm.objects.get(hostname='mon01.local')
>>> vm.json_get_disks()

If the refreservation is bigger than the size, replication will still not work, because the slave VM is created from the database data. This should be fixed by growing the disk from the GUI (a few hundred MB is enough, just to recalculate the numbers).

After that, exit the ctl shell and start it again to check the new numbers.

Jan

@marcheschi
Contributor Author

They are the same:

[root@node01 (pacs) ~]# vmadm get a28faa4d-d0ee-4593-938a-f0d062022b02 | json disks.0.size
25600
[root@node01 (pacs) ~]# vmadm get a28faa4d-d0ee-4593-938a-f0d062022b02 | json disks.0.refreservation
25600

and

>>> from vms.models import Vm, SlaveVm
>>> vm = Vm.objects.get(hostname='mon01.local')
>>> vm.json_get_disks()
[{u'compression': u'lz4', u'zfs_filesystem': u'zones/a28faa4d-d0ee-4593-938a-f0d062022b02-disk0', u'image_size': 10240, u'boot': True, u'image_uuid': u'024e2cc9-9050-4e15-b708-1506699046a8', u'refreservation': 25600, u'media': u'disk', u'zpool': u'zones', u'path': u'/dev/zvol/rdsk/zones/a28faa4d-d0ee-4593-938a-f0d062022b02-disk0', u'block_size': 4096, u'model': u'virtio', u'size': 25600}]

But it is not working.
Paolo

@YanChii
Contributor

YanChii commented Apr 8, 2021

This is strange. Please look into /opt/erigones/log/main.log on mgmt01. You should see there the full JSON that is used to create the replica VM. These data are taken from the DC database, so I don't understand why it would still retain the old refreservation value.

@marcheschi
Contributor Author

I sent the full JSON on the Gitter chat; there is no refreservation setting.

@YanChii
Contributor

YanChii commented Apr 8, 2021

Just to sum up: there were two problems. The refreservation problem is fixed by the commands above.
The other problem was the clone relationship of the mon01 VM's volume, which had been preventing proper manipulation of the volume.
zfs promote zones/a28faa4d-d0ee-4593-938a-f0d062022b02-disk0 was the solution for that.
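
For reference, one way to check for such a clone relationship before promoting (dataset name taken from this thread; origin is a standard ZFS property):

# if origin prints anything other than "-", the volume is still a clone of that snapshot
zfs get -H -o value origin zones/a28faa4d-d0ee-4593-938a-f0d062022b02-disk0
# promoting reverses the dependency so the volume no longer depends on its origin
zfs promote zones/a28faa4d-d0ee-4593-938a-f0d062022b02-disk0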

I'm leaving this issue open until we fix the refreservation for new installs.

@YanChii YanChii self-assigned this Apr 13, 2021
@YanChii YanChii added this to the 4.5bhyve milestone Apr 13, 2021
@YanChii YanChii added the bug label Apr 13, 2021
@YanChii YanChii modified the milestones: 4.5bhyve, 5.0 Sep 30, 2021
@marcheschi
Contributor Author

Now I can create replicas of all the admin VMs.
It seems to work.
I had a problem because I had 3 VMs stuck in the provisioning state:

[root@node01 (pacs) ~]# vmadm list
UUID                                  TYPE  RAM      STATE             ALIAS
69913851-c65b-40a2-bfd4-ffb558049000  OS    256      provisioning      dns01.local
a4231cee-7417-4c4b-8a37-4d3751887318  OS    256      provisioning      cfgdb01.local
a465f29a-a933-4a50-b263-622b48f6527d  OS    256      provisioning      img01.local
d0bd78f3-8efd-4cd1-bd94-d46147b57b75  KVM   2048     stopped           mgmt01.local
2eadb150-cea2-43e9-ac79-f56bbf905004  KVM   4096     stopped           mon01.local

After deleting them, I successfully created the replica.
