Ceph: Detect small DB partition sizes and unused partitions #974

Open
lathiat opened this issue Sep 26, 2024 · 5 comments

Comments

@lathiat
Contributor

lathiat commented Sep 26, 2024

A common fault in Ceph deployments is that the DB devices are incorrectly configured (missing, or allocated from the wrong device) or not big enough. The majority of the time these faults would be picked up by checking for the following (a rough sketch of such checks appears after the list):

  • DB partitions that are obviously far too small, e.g. the default 1GB; ideally we'd also report the DB-to-OSD size ratio informationally
  • Empty partitions that have not been used
  • Empty space on a disk that is not partitioned
  • Volume groups that are mostly unused (effectively the same as empty space on a disk)
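
For illustration, here is a minimal sketch of the first two checks. It assumes `ceph-volume lvm list --format json` returns a dict mapping OSD id to a list of LVs, each carrying a "type" ("block", "db" or "wal") and an "lv_size" in bytes; the 2% ratio floor is an arbitrary placeholder, not an official recommendation:

```python
#!/usr/bin/env python3
"""Sketch: flag OSDs whose DB LV is suspiciously small or missing.

Assumes `ceph-volume lvm list --format json` output: a dict of
OSD id -> list of LVs, each with a "type" ("block"/"db"/"wal")
and an "lv_size" in bytes (older ceph-volume versions may report
human-readable sizes instead).
"""
import json
import subprocess

MIN_DB_RATIO = 0.02  # assumed floor: flag DBs under 2% of the data LV


def lvm_report():
    out = subprocess.check_output(['ceph-volume', 'lvm', 'list',
                                   '--format', 'json'])
    return json.loads(out)


def check_db_sizes(report):
    for osd_id, lvs in report.items():
        # collect the data ("block") and DB LV sizes for this OSD
        sizes = {lv['type']: int(lv['lv_size']) for lv in lvs
                 if lv.get('type') in ('block', 'db')}
        if 'block' not in sizes:
            continue
        if 'db' not in sizes:
            print(f"osd.{osd_id}: no dedicated DB volume")
        elif sizes['db'] / sizes['block'] < MIN_DB_RATIO:
            print(f"osd.{osd_id}: DB/data ratio is only "
                  f"{sizes['db'] / sizes['block']:.2%}")


if __name__ == '__main__':
    check_db_sizes(lvm_report())
```

A similar pass over `vgs --reportformat json`, comparing `vg_free` against `vg_size`, could cover the unpartitioned-space and mostly-unused-VG cases.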
@pponnuvel
Member

Shouldn't this be handled by the deployer, i.e. the Ceph charms and/or the FE team? A hotsos check might be useful, but it's probably not the most effective place for this.

@lathiat
Contributor Author

lathiat commented Sep 26, 2024

Yes, ideally, but in practice it keeps getting missed, so we need to catch it: both when analysing new deployments and when detecting the issue on existing ones.

@lathiat
Contributor Author

lathiat commented Sep 26, 2024

It can also happen because the charm will create OSDs with no DB device if it can't find any free space. So if new OSDs are added while the existing DB devices are full, a customer could silently end up in this state. It can also occur in a field deployment due to an unrelated issue, even when the layout was designed correctly. A rough check for the no-DB case is sketched below.
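
As a minimal sketch of that check, assuming `ceph osd metadata` reports a `bluefs_dedicated_db` field set to "1" when a separate DB device is in use (field name taken from typical BlueStore OSD metadata, worth verifying against the target Ceph release):

```python
#!/usr/bin/env python3
"""Sketch: report OSDs that ended up without a dedicated DB device,
based on the per-OSD metadata exposed by the monitors."""
import json
import subprocess


def osds_without_db():
    out = subprocess.check_output(['ceph', 'osd', 'metadata',
                                   '--format', 'json'])
    for osd in json.loads(out):
        # "bluefs_dedicated_db" is assumed to be "1" only when the
        # OSD was built with a separate DB device
        if osd.get('bluefs_dedicated_db') != '1':
            yield osd['id']


if __name__ == '__main__':
    for osd_id in osds_without_db():
        print(f"osd.{osd_id} has no dedicated DB device")
```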

@dosaboy
Member

dosaboy commented Sep 26, 2024

@pponnuvel I agree that the charm should be doing this as a first port of call, and we should open a bug on the charm to get that done. In the interim, if it is a small enough addition to the checks, we could add this to cover the cases the charm does not yet handle: it has been cropping up repeatedly in deployments, and flagging the issue at the start will help reduce analysis time.

@dosaboy
Member

dosaboy commented Sep 26, 2024

@lathiat this looks like several distinct checks; it might make sense to break it into smaller chunks to make it easier to implement.
