Is the String kind conceptually a series of bytes or characters? #363
There are more details about strings in the data model document, in the String kind section. There has been a long discussion about it, and the conclusion was that strings should be valid UTF-8.
UTF-8 is a specific binary encoding though, which seems out of scope for a conceptual, serialization-agnostic data model. In my case I was planning on storing it in a native text type. I will of course serialize it to UTF-8 when encoding it to dag-cbor, but that's because of the cbor/dag-cbor specs and has nothing to do with the ipld-data-model spec. Should I be storing it as a raw byte string instead?
Agreed. Sadly there isn't a sharp boundary between the conceptual level and the encoding. The intention is to say that strings should be something that can be converted to UTF-8 easily, as this is what all major serialization formats I know of (like CBOR, Protocol Buffers, Ion, etc.) use for their text representation. I think the current wording (explicitly saying UTF-8) makes it easier for people who are not deeply into the text encoding business.
"storing" seems to refer to some intermediate step of your tooling. I don't think the spec does (or should) dictate which representation you use. If you follow the "SHOULD" about strings being Unicode, you can encode things internally in any way you want, it is only important that you can get valid UTF-8 in and out. Given you have UTF-16 has that property, you can keep them internally as UTF-16. |
Strings should be considered as sequences of 8-bit bytes. UTF-8 is a subset of that definition, and so is UTF-16 :) so defining the Data Model as a superset of these is quite helpful. There are also situations where people have used arbitrary bytes as map keys, and since we define map keys as "String", it's important to support arbitrary sequences of 8-bit bytes wherever we talk about "String".

You can use whatever in-memory representation you like in a program. If all bytes can be escaped into that internal representation, and round-trip, it's fine. I'd probably recommend choosing a type that's as close to raw bytes as possible, though, for sheer efficiency reasons -- flipping string encodings back and forth and back again is costly.
But if a protocol/codec for some reason wanted a non-UTF-8 encoding, then that wouldn't be a problem if the abstraction layer were "unicode characters". However, with the byte abstraction you're more or less locked into a single text encoding for all codecs indefinitely.
IMO something like "unicode string" or "sequence of unicode characters" would be a clearer abstraction, and would make it clear that encoding is totally out of scope and the responsibility of codecs.
On the contrary, I would actually say that it will make programs more buggy in practice. This thread is a perfect example: one person tells me that my opaque text type is fine, as it can round-trip UTF-8 just fine. Another person tells me that UTF-16 being possible is a good thing; however, my program will completely mangle any incoming UTF-16 due to the first person's advice.
Doesn't this more or less contradict the following?
For example, as far as I'm aware, using arbitrary bytestrings as keys in JavaScript objects/dicts isn't possible.
So would you recommend not using the built-in string type in Python, JavaScript, or Haskell? All three of those would not round-trip bytes successfully. It's also worth noting that CBOR specifically requires text strings to be UTF-8, so if non-UTF-8 bytes get converted to CBOR via DAG-CBOR, they would be in violation of the CBOR spec.
Hang on right there. That would be a problem, actually. People often misunderstand this about unicode (and I say that without judgement, because I also misunderstood this about unicode until very recently), but: unicode does not guarantee all bytes are encodable. Unicode has significantly less expressiveness than the definition we get with "sequences of 8-bit bytes". It is trivial to fit any of the unicode encodings inside "sequences of 8-bit bytes". The reverse is not true. If this seems like an incredible claim, my favourite fixture to demonstrate it is a short byte sequence, easily written out in escaped hex, that is not valid UTF-8 at all.
Yes. It sucks, but yes. We've had to discuss this a lot around Rust, too. And there, we actually have some very interesting illustrations available, because Rust, being extremely diligent, also ran into this issue when defining their filesystem APIs. Filesystems, it turns out, do not generally guarantee UTF-8. (We all act like they do. Many filesystems do. But some don't. And if you try to enforce this, You Will Find That Out when you have Problems.)

So, what was the Rust solution around filesystems? Make a "raw" "string" type that doesn't enforce UTF-8. It's just a sequence of 8-bit bytes. And if you want Rust's other String type that is UTF-8, well... you use this: https://doc.rust-lang.org/std/path/struct.Path.html#method.to_string_lossy . Notice how the method name itself even says "lossy", there. Aye. What Rust is doing there is actually the honest truth at the bottom of the world.

It's unfortunate that, yes, the "String" types in some languages' standard libraries do not make this easy. The answer is, as Rust did with Path, to make a String type for this purpose that contains sequences of 8-bit bytes. And make conversion methods -- which will be lossy; this is unavoidable -- to other string types as needed.

I feel like I should bend over backwards one more time here to say: yes, this sucks. Unfortunately, computers.
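For illustration, a minimal Rust sketch of that lossy boundary (Unix-only, since `OsStrExt::from_bytes` is Unix-specific; the byte values are just a made-up example of invalid UTF-8):

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: lets us build an OsStr from raw bytes
use std::path::Path;

fn main() {
    // 0x80 is a lone UTF-8 continuation byte, so this is not valid UTF-8.
    let raw: &[u8] = b"report-\x80.txt";
    assert!(std::str::from_utf8(raw).is_err());

    // Path/OsStr happily hold the raw bytes as-is...
    let path = Path::new(OsStr::from_bytes(raw));

    // ...but converting to a Rust String is necessarily lossy:
    // the invalid byte is replaced with U+FFFD.
    println!("{}", path.to_string_lossy()); // prints "report-�.txt"
}
```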
Do you mean decodable? UTF-8/UTF-16 encode characters into bytes, but they operate on unicode characters, which are unrelated to any specific byte encoding.
Comparing the expressiveness of unicode and bytestrings seems pretty weird, as they have the exact same cardinality, and it's trivial to write injective functions in either direction (e.g. base64). The real question is what you are trying to model, characters or bytes.
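As a concrete illustration of the "injective in either direction" point, here is a hand-rolled escape scheme in Rust (hypothetical, not anything the spec prescribes) that maps arbitrary bytes into a plain Unicode string and back without loss:

```rust
// Escape arbitrary bytes into plain ASCII hex (a valid Unicode string), reversibly.
fn escape(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

// Inverse of `escape`: two hex digits per byte.
fn unescape(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    let raw = vec![0x66, 0x6f, 0x80, 0xff]; // includes bytes that are not valid UTF-8
    let escaped = escape(&raw);             // "666f80ff" -- an ordinary Unicode string
    assert_eq!(unescape(&escaped), Some(raw)); // round-trips exactly
}
```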
I think we're miscommunicating a bit here. I was talking about non-UTF-8 encodings of unicode strings (e.g. UTF-16).

For a concrete example of what I mean by problems with different unicode encodings: I type "foo" into my IPLD text editor, then transfer it over to an independent IPLD viewing program. If the abstraction is unicode characters, then as long as the codec defines the right encoding to use, there are no problems, as the two programs both have to agree on the same codec. If, on the other hand, the abstraction is bytes, then the editor might choose to store "foo" in UTF-8 as "\x66\x6f\x6f", and if the viewer were to read those bytes as UTF-16, it would garble the rendering.
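A small Rust sketch of that "foo" scenario (purely illustrative): the same three characters produce different byte sequences under UTF-8 and UTF-16LE, so a reader that guesses the wrong encoding does not get "foo" back.

```rust
fn main() {
    let text = "foo";

    // UTF-8 encoding: 0x66 0x6f 0x6f
    let utf8: &[u8] = text.as_bytes();

    // UTF-16LE encoding: 0x66 0x00 0x6f 0x00 0x6f 0x00
    let utf16le: Vec<u8> = text
        .encode_utf16()
        .flat_map(|u| u.to_le_bytes())
        .collect();

    assert_ne!(utf8, utf16le.as_slice());

    // A reader that assumes UTF-8 does not recover "foo" from the UTF-16LE bytes.
    let misread = String::from_utf8(utf16le).unwrap();
    assert_ne!(misread, "foo"); // it is "f\0o\0o\0"
}
```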
Given the existence of a true bytestring kind in the IPLD spec, it seems to me as though that should be utilized instead of making text not a sequence of unicode characters. I know that maps only accept text keys, but that seems to be a very intentional design decision ("...lowest common denominator...") which allowing arbitrary bytes directly fights against, since arbitrary byte keys are similarly non-universal (JavaScript, JSON). If the priority is universality, it seems like you should explicitly make sure maps have unicode-character (non-byte) keys; if flexibility is more important, then it seems like you should just allow arbitrary IPLD values as keys, similar to CBOR.

I agree that it generally makes sense to interact with file paths as sequences of bytes, but that does not mean you need to use the string type to do so. For example, in Haskell I would use a raw bytestring type for that rather than a text type.
One point I think is worth emphasizing quite heavily: if strings in IPLD are conceptually arbitrary sequences of bytes, then DAG-CBOR either violates the CBOR spec or does not support the full IPLD data model. Invalid UTF-8 is not a valid CBOR text string, and CBOR parsing programs are welcome to reject it.
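To make that concrete, here is a small sketch (the byte values are my own example, not taken from any real block): major type 3 in CBOR is a text string whose payload must be valid UTF-8, so a strict decoder is entitled to refuse the second item below.

```rust
fn main() {
    // CBOR text string of length 1: header 0x60 | 1 = 0x61, payload "a" -- valid.
    let ok = [0x61u8, b'a'];

    // Same header, but the payload 0x80 is a lone UTF-8 continuation byte,
    // so this "text string" carries invalid UTF-8. RFC 8949 requires the
    // payload of major type 3 to be valid UTF-8, so a strict CBOR decoder
    // may (and many do) reject it.
    let bad = [0x61u8, 0x80];

    assert!(std::str::from_utf8(&ok[1..]).is_ok());
    assert!(std::str::from_utf8(&bad[1..]).is_err());
}
```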
The state of play is this:
Therefore the recommendation goes something like this: if you value interop and less developer pain, then just use valid UTF-8 in your DAG-CBOR.

We could be having the same argument about the map key sorting rules, which were baked in because they arose out of an RFC 7049 recommendation and whose sanity was not considered much at the time (and which recommendation has since been overridden by RFC 8949). It is what it is, and we have to work with what's live unless we want to make a DAG-CBOR2.
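For readers who haven't hit the sorting issue: RFC 7049's suggested canonical ordering (which DAG-CBOR inherited) sorts map keys length-first and only then bytewise, whereas RFC 8949's core deterministic encoding sorts the encoded keys in plain bytewise lexicographic order. A tiny sketch of the length-first rule, with made-up keys:

```rust
fn main() {
    // DAG-CBOR inherited RFC 7049's canonical ordering: shorter keys sort
    // first, and ties are broken by plain bytewise comparison.
    let mut keys = vec!["hello", "a", "ab", "b"];
    keys.sort_by(|x, y| x.len().cmp(&y.len()).then(x.cmp(y)));
    assert_eq!(keys, ["a", "b", "ab", "hello"]);

    // RFC 8949's deterministic encoding instead sorts the *encoded* keys
    // bytewise (header byte included); DAG-CBOR keeps the older rule for
    // compatibility with data that is already out in the wild.
}
```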
Thanks for the information. It is rather unfortunate that a decent amount of DAG-CBOR in the wild isn't valid CBOR. I'll go with using a proper unicode text type, then. The CBOR library I use breaks on invalid UTF-8 CBOR, so it's difficult to go the other route anyway.
Being able to successfully read Filecoin blocks was, until recently, on my priority list, and I had thought up some ways to get around it by digging right into the decoder; see #364 (comment).
That is an understandable need. The one downside is that it may encourage others to put non-UTF-8 data in their DAG-CBOR as well, which will make generic CBOR libraries and services less useful for working with DAG-CBOR.

With regards to the map key ordering aspect, couldn't you specify that future usage should use the new ordering, but mention compatibility with the old ordering? The new RFC 8949 spec mentions such a thing. That way it avoids proliferating obsolete CBOR.

On the documentation side, what are your thoughts on something like:
I think that should make it clear that using an opaque unicode type (that may or may not be UTF-8 internally) is fine as long as codecs are properly followed when encoding it, while avoiding closing the door on the CBOR-non-compliant DAG-CBOR that exists in the wild.
@tysonzero This is what I was about to suggest. The current specification is heavily influenced by the Go implementation and the problem that the libraries used for Protocol Buffers and CBOR don't enforce the production of spec-compliant data. As long as you use only spec-compliant encoders/decoders for Protocol Buffers/CBOR/JSON, you won't have this problem and will only get strings which are a sequence of valid Unicode characters. This is also what other IPLD implementations (e.g. in JS or Rust) do, so they can use the native String types.
Yes, exactly.
That sounds good to me. I think the more we can emphasize that strings should really be a sequence of unicode characters, and that arbitrary bytes happen only due to non-spec-compliant encoders/decoders, the better.
It seems to me as though it should be a list of unicode code points, and serialization into bytes should be specified in the respective codecs.
However, the section seems to talk a lot about byte-specific concerns.