incorrect volume size in direct mode #933
@okartau: this namespace alignment logic is your area of expertise. Can you investigate this?
We never had unit tests for that code either. Would it perhaps now be possible to write some, given that I added mocking code for ndctl? It is not merged yet, though; see pkg/ndctl/fake in #919
I did some more experiments and I think I found the root cause. First, I tried to create a 1500GiB fsdax namespace manually using ndctl (version 67):
The namespace size is 1476.56 GiB instead of the requested 1500GiB - similar to what we are observing in pmem-csi. After some investigation I found the answer in the ndctl documentation: there is a --map option that controls where the per-page metadata (the page map) is stored, either on the device itself (--map=dev, the default) or in system memory (--map=mem).
I created another namespace using --map=mem:
So in the default (dev) map mode, the space is taken from the namespace itself and the resulting namespace size is lower than requested. This behavior is counter-intuitive - one would expect the additional space for the map to be allocated from the free space in the region, not from the requested namespace - but I don't think this will be fixed anytime soon (if ever), so you need to compensate for this in pmem-csi: increase the request size by the map overhead, then align properly. Also, I have found in the code that you are checking for available space in the region before adjusting the size (alignment), which is incorrect - this should be done after all alignments.
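A minimal sketch of that compensation (hypothetical helper, assuming the 64 bytes per 4 KiB from the ndctl documentation and a 1 GiB namespace alignment; a real implementation would also need the safety margin discussed further down):

```go
package main

import "fmt"

const (
	pageSize      = 4096 // fsdax namespaces are managed in 4 KiB pages
	mapBytesPer4K = 64   // documented page-map overhead per 4 KiB page in --map=dev mode
)

// compensateSize grows the requested size by the expected page-map overhead
// and only then rounds up to the namespace alignment; region free space
// should be checked against the returned value, not the original request.
func compensateSize(requested, alignment uint64) uint64 {
	size := requested + (requested/pageSize)*mapBytesPer4K
	if r := size % alignment; r != 0 {
		size += alignment - r
	}
	return size
}

func main() {
	const gib = uint64(1) << 30
	// 1636382539776 bytes (1524 GiB) requested to end up near 1500 GiB usable
	fmt.Println(compensateSize(1500*gib, gib))
}
```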
One more thing - I found that the space for the map is also somehow aligned. For a 1500GiB request the map should be 24000MiB (= 23.4375GiB), but if I request a 1635778560000-byte namespace (1500GiB + 24000MiB) I'm getting only 1499.63 GiB usable space. Unfortunately I don't have enough time / knowledge to analyse the libndctl code to find the correct way to compute this overhead. After several tries I found that to get an actual 1500GiB volume I had to request 1636181213184 bytes - it's still 2MiB larger than needed (1636174921728 bytes instead of 1610612736000), but this might be actual alignment, so in the end I had to request 1500GiB+24384MiB to actually get the 1500GiB namespace.
Thanks for the good analysis and proposals!
You mean the code in pkg/ndctl/region.go? Overall, as you already found out, there is space overhead in the ndctl lower layers which always gives us a smaller volume than we ask for, so we compensate by asking for more. That also causes the side issue I tried to document in #399: it is difficult to satisfy the CSI spec requirement about exact allocation, because we can't - the result is typically bigger, or we have to report failure. About mapping: yes, we use the default mapping. I did notice the libndctl mapping argument during development, but when I tried it, I was hitting new errors, so I gave up and reverted to the default mode, which worked reliably. Your case is probably the first that runs on real DIMMs of such size; we had smaller ones during development. BTW, is it possible for you (if not too much effort) to extract the log output from the pmem-csi driver around the code where these namespaces are created? It would be interesting to see the logged values when Kubernetes asks for a volume, the alignment decisions made by pmem-csi, and so on, from an actual run. I can't reproduce this right away, as I don't have such big devices.
Here are some logs I grabbed previously with a 1500GiB request and log level 4 (correction: it's actually log level 5):
We don't need exact allocations. Kubernetes only requests a minimal size, never a maximum size, so creating a larger volume than requested is always fine in practice.
Actually, the CSI spec says the request has two fields: required_bytes and limit_bytes, but I think that in practice the latter is rarely used, so the driver has to provide at least required_bytes or more. The spec also says that for an exact volume size both parameters should be equal.
That's exactly the point I recorded in #399: if a test writer wants to follow the specification exactly, there will be lots of sizes for which an allocation test would prove pmem-csi not spec-compliant, for example all non-aligned sizes. In practice we are saved by the fact that exact allocation is not used. The current issue of creating a smaller volume is of course clearly a bug. I looked up the values in the posted log (thanks, very helpful!): our "size compensate" method is not scalable - we always add a static 1GiB there, regardless of request size. It was introduced based on the false hope that 1GiB is quite big and the overhead is small (that's written in a comment), and then it was tested with small sizes only. Actually (and it was known that the overhead is caused by the mapping), the size of the request affects the overhead - a point which I missed at the time.
For reference, I have also opened an issue on pmem/ndctl: pmem/ndctl#167. As for the algorithm, according to the documentation there is namespace configuration data (not sure about the size), then the map with PFNs (should be: metadata_bytes = (size_bytes / 4096) * 64), probably aligned, then the actual space for user data (aligned again?). But the numbers do not match reality - when I request a 1500GiB+24000MiB volume, I'm still getting less than 1500GiB usable space, so either the structure size is larger than the specified 64 bytes per 4KiB, or there is additional alignment involved which is not documented there.
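To make the mismatch concrete, here is the raw arithmetic from the documented 64-bytes-per-4KiB formula next to the experimentally determined request size quoted above (plain Go, nothing driver-specific):

```go
package main

import "fmt"

func main() {
	const (
		gib = uint64(1) << 30
		mib = uint64(1) << 20
	)
	requested := 1500 * gib            // 1610612736000 bytes
	mapBytes := requested / 4096 * 64  // 25165824000 bytes per the documented formula
	fmt.Println(mapBytes / mib)        // 24000 MiB expected map size
	fmt.Println(requested + mapBytes)  // 1635778560000 bytes - still gives < 1500 GiB usable in the experiment
	fmt.Println(requested + 24384*mib) // 1636181213184 bytes - what was actually needed to get 1500 GiB
}
```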
The math from the traced case (1506 GiB asked, result 1482 GiB) shows the actual overhead per 4Ki block is approximately 65.3 bytes (1506GiB has 394788864 4Ki blocks). Still open: does the extra margin come per 4Ki unit, per something else, or is it a single fixed amount? In practice, can we use the above "additional margin" formula with a value of 64+safety_margin, like 66 or 68? The dynamics of the extra margin can be determined with a few experiments using various sizes. I think the trials can be made with command-line ndctl, which should show the same results as pmem-csi allocations via libndctl.
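A quick reproduction of that per-block estimate (assuming GiB-exact values for both the requested and resulting sizes, as in the log):

```go
package main

import "fmt"

func main() {
	const gib = float64(1 << 30)
	asked, got := 1506*gib, 1482*gib
	blocks := asked / 4096             // 394788864 blocks of 4 KiB in the requested size
	perBlock := (asked - got) / blocks // bytes of overhead charged per 4 KiB block
	fmt.Printf("%.0f blocks, %.1f bytes/block\n", blocks, perBlock) // 394788864 blocks, 65.3 bytes/block
}
```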
I made a few example allocations with the overhead calculation, using a host I have readily available. My trials remain smaller, but we want to cover the lower end as well, so it should help. On a host with 4x128 DIMMs, interleaved to form 2x256 regions: 2M-aligned cases show an overhead of exactly 64 bytes (or very little over 64) per 4Ki block, while 1G-aligned cases show 66..69 bytes of overhead per 4Ki block.
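For illustration, this is what such a 64+safety_margin value would mean for a large request (hypothetical choice of 68 bytes per 4Ki block, i.e. within the measured 66..69 range):

```go
package main

import "fmt"

func main() {
	const (
		gib = uint64(1) << 30
		mib = uint64(1) << 20
	)
	requested := 1500 * gib
	// Hypothetical safety value: 68 bytes per 4 KiB block instead of the documented 64.
	overhead := requested / 4096 * 68
	fmt.Println(overhead / mib)       // 25500 MiB of compensation instead of 24000 MiB
	fmt.Println(requested + overhead) // 1637351424000 bytes to request
}
```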
Instead of using trial-and-error, can you work with Dan Williams in pmem/ndctl#167 to come up with an exact specification of the overhead? Ideally, they provide an implementation in libndctl itself.
Here's a perhaps controversial thought: when someone asks for a volume of size 1480GiB, what do they expect? Being able to memory-map a file of size 1480GiB? Or a block device of size 1480GiB? My point is this: there's always some overhead that leads to less usable space than requested. Getting a block device that is smaller than requested is perhaps unexpected, but not that different from the extra overhead caused by the filesystem. Whoever deploys an app must be prepared for this. They might even be in a better position to estimate that overhead beforehand than PMEM-CSI. For example, they might have run experiments with the exact cluster configuration and kernel that they will use and thus might know what the overhead will be. So here's food for thought: what if we simply document that volumes in direct mode have this additional overhead and not try to compensate for it? The size that we report for the volume can be the size of the namespace. That's not wrong. Nothing specifies that it has to be the size of the block device that is later going to be mounted.
Where this thought experiment might break down is if the namespace size is really just the size of the block device. I'm not sure how ndctl then handles the overhead if it is not part of the namespace itself. Suppose I have a region of 10GiB: can I create 10 namespaces of 1GiB each?
So the reported namespace size is indeed "requested size minus overhead". I tried with a 64GiB region and requested 64 namespaces of size 1GiB. That worked (i.e. I now have 64 namespaces), but each namespace is said to have 1006MiB. My plan could still work. PMEM-CSI simply would have to report the requested volume size as the actual volume size instead of reading the size from the created namespace.
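A hedged sanity check of that number against the 64-bytes-per-4KiB formula (the remaining ~2 MiB presumably goes to additional alignment or the namespace info block, which is exactly the undocumented part discussed above):

```go
package main

import "fmt"

func main() {
	const (
		gib = uint64(1) << 30
		mib = uint64(1) << 20
	)
	namespace := 1 * gib
	mapBytes := namespace / 4096 * 64         // 16 MiB page map for a 1 GiB namespace
	fmt.Println((namespace - mapBytes) / mib) // 1008 MiB expected, 1006 MiB observed
}
```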
The advantage of not asking for a larger namespace is that volume creation becomes predictable ("a region of size XX can hold n volumes of size YY", at least when ignoring fragmentation and alignment). This is important for storage capacity tracking, where Kubernetes will compare reported free space and maximum volume size against the requested size of the PVC.
This is the outcome of the discussion in intel#933 (comment)
I now consider this a feature change and don't intend to backport to 0.9. Therefore we are done with this.
My expectation (which is consistent with the CSI spec) was that the block device will not be smaller than requested, no matter how much overhead the "storage subsystem" requires internally. It's the same situation as when you request a volume from an enterprise storage system, where you can have some RAID protection, or maybe even compression and deduplication. The user doesn't care how much raw space needs to be used to provision such a volume, and with compression / deduplication it is not even possible to guess, and it can change based on the actual data. If the subsystem is unable to satisfy the request, it should just return an error stating there is not enough space available. You should never expect that you can fit a specific number of logical volumes into a particular amount of raw space. Does it make sense? Since we are using the same API (CSI) as the other storage types, the behaviour should be consistent, even though applications might consume PMEM differently in app direct mode.
Where in the CSI spec is it described how capacity is measured? I don't think it is defined anywhere, and it is therefore up to the CSI driver implementers to define. I understand that the namespace overhead is unexpected, both in PMEM-CSI and in ndctl. But that's a matter of setting expectations, which can be done by clearly documenting how "size" is defined. Both ndctl and PMEM-CSI do that now. In the RAID example, the "size" is the number of usable blocks. That's one way of doing it, but not required by the spec. Counting the used PMEM is another. I prefer to be consistent with ndctl instead of some different technology.
The filesystem is created by PMEM-CSI as part of provisioning. That there is a block device involved is an implementation detail.
CSI spec: https://github.com/container-storage-interface/spec/blob/master/spec.md
The CreateVolumeRequest contains a CapacityRange, which is defined this way:
Since most requests contain only the required_bytes field, it means that the "volume MUST be at least this big". The common meaning is that you should be able to write this many bytes to the volume in block mode (unless it is a read-only volume or snapshot). I don't know of any other CSI driver which implements this differently. I know about the ndctl limitations and understand there is no easy way to address this issue, but ndctl is just an internal implementation detail and it shouldn't matter how it deals with the allocations. The tool is not even consistent: if you create a namespace with metadata stored in memory instead of on the media, the available volume size will be different despite the same request size, so it is clear this implementation is broken and should not be used as a reference. I don't understand the part regarding RAID and blocks. When dealing with an enterprise storage array you are not requesting a number of blocks - from the CSI spec perspective you are always requesting a specified number of bytes. How it is translated internally by the driver and storage backend can vary and is not relevant. The driver can translate the request to a specific number of blocks, and the array can do RAID / compression / deduplication, allocating the necessary space on multiple drives - all this is invisible and does not matter. You can either provision a logical volume with the requested size, or not. If you are reusing an existing volume or snapshot which is smaller than requested, you can either resize it properly or fail the request.
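For illustration, this is roughly how that range maps onto the Go bindings generated from the CSI spec (a minimal sketch; satisfiesRange and chosenSize are hypothetical names standing in for whatever size a driver ends up provisioning):

```go
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// satisfiesRange reports whether a provisioned size fits the CSI CapacityRange:
// at least RequiredBytes and, if a limit is set, at most LimitBytes.
func satisfiesRange(chosenSize int64, r *csi.CapacityRange) bool {
	if r == nil {
		return true
	}
	if chosenSize < r.GetRequiredBytes() {
		return false
	}
	if limit := r.GetLimitBytes(); limit > 0 && chosenSize > limit {
		return false
	}
	return true
}

func main() {
	// Exact-size request per the spec: required_bytes == limit_bytes.
	exact := &csi.CapacityRange{RequiredBytes: 1 << 30, LimitBytes: 1 << 30}
	fmt.Println(satisfiesRange(1<<30, exact))     // true
	fmt.Println(satisfiesRange(2*(1<<30), exact)) // false: larger than the limit
}
```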
That's the point: it's "common", but not "required".
The same can be said about the underlying block device...
My argument was that for such a driver it makes sense to define "volume size" as "number of usable blocks times block size" and hide the overhead, but that this does not mean that PMEM-CSI must do the same. The difference is that the underlying storage for PMEM-CSI is limited compared to a more dynamic storage pool where disks might be added or removed as needed. That limitation is exposed to the users and admins. To me it makes more sense to use "amount of PMEM" as the definition of "capacity", because then "free space" and "volume size" become comparable. It's consistent and predictable. The alternative would be to define "capacity" as "usable bytes". We then would have to report free space differently to avoid surprises like "16GB free, but cannot create a 16GB volume". The problem with that is that we can't implement it unless we get more information from libndctl and the currently running kernel. If pmem/ndctl#167 gets resolved, then we can reconsider the approach in PMEM-CSI, but until then any solution in PMEM-CSI would just be a band-aid that will break down occasionally - not what I want in a driver that is supposed to reach 1.0. I understand that we now might not meet the initial user expectations and have to explain how PMEM-CSI works to adapt those expectations, but the same is true if we do it the other way around and then have to explain why volumes can only be smaller than expected. Ignoring expectations for a second, does the current approach lead to any real usability issues when deploying applications? In the end, that is what matters.
I have a 5-node cluster of dual-socket machines fully populated with 512GB PMem modules (two regions of 6x512GB in App Direct interleaved mode per node) running OpenShift 4.6.6 (Kubernetes 1.19) and pmem-csi 0.9.0.
I'm requesting four 1480GiB volumes per node in direct mode, but I'm getting 1458GiB volumes instead.
When I try to request 1500GiB volumes, I'm getting two 1482GiB volumes in one region and two 1481GiB volumes in the other one. Not only are the requests not fulfilled properly, but the allocated space is also not consistent. On top of that, the reported volume size is not the actual size.
The empty regions look like this:
The PVC spec and status for 1480GiB volumes:
The status says it's 1480Gi, but in reality it's 1565515579392 bytes = 1458GiB:
And confirmation from ndctl (four volumes per node):
PmemCSIDeployment:
Storage class:
Notice that I had to disable 'eraseafter', since on large volumes like ~1.5TB the shred is extremely slow.