Clarify encoding of colon between scheme and type #361

Draft
wants to merge 2 commits into
base: master

pombredanne
Member

The colon separator between scheme and type must not be encoded.

This also uses a new, not yet adopted wording for MUST, according to its definition in RFC 2119 as clarified in RFC 8174.

Suggested-by: @gernot-h
Reference: #39 (comment)

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne pombredanne marked this pull request as draft December 5, 2024 18:26
@johnmhoran johnmhoran added the Ecma specification Work on the core specification label Dec 6, 2024
@pombredanne pombredanne mentioned this pull request Dec 8, 2024
@gernot-h
Contributor

Great, thx for addressing this! I think this makes very clear that special characters must not be encoded as separators.

@@ -247,8 +247,9 @@ Use these rules for percent-encoding and decoding ``purl`` components:
- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere
Contributor

@gernot-h gernot-h Dec 10, 2024

To fully address the confusion of #39, I would probably also change the last sentence in the para before:

Suggested change
separators. They may need to be encoded elsewhere
separators. Some of them need to be encoded elsewhere as specified in the rules below.

I think this would also make it clearer where they need to be encoded.

It is unambiguous unencoded everywhere
- The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded.
For example, in the PURL snippet ``pkg:npm`` the colon ':' MUST NOT be encoded,
and the PURL snippet ``pkg%3Anpm`` is invalid.

Member

@gernot-h @pombredanne Consider adding at the top of the file, perhaps as a new one-line paragraph following the current first paragraph, something along the lines of the following:

This specification uses RFC 2119 (https://datatracker.ietf.org/doc/html/rfc2119), as clarified in RFC 8174 (https://datatracker.ietf.org/doc/html/rfc8174), for the interpretation of certain terms, e.g., MUST NOT.

Or perhaps a slight modification to the example provided by RFC 2119:

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119, as clarified in RFC 8174.

(Note that the core spec currently contains a great deal of language that will need to be modified to implement RFC 2119/8174.)

@@ -247,8 +247,9 @@ Use these rules for percent-encoding and decoding ``purl`` components:
- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere

- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
It is unambiguous unencoded everywhere
- The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded.
Member Author

I guess that, following @JimFuller-RedHat's suggestion in today's triage call #360 ... this could be reworded as:

Suggested change
- The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded.
- The colon ':' separator between ``scheme`` and ``type`` MUST be used as-is, unencoded.

@JimFuller-RedHat ?

@JimFuller-RedHat JimFuller-RedHat Dec 18, 2024

Hehe, I was (badly) trying to say that specs typically define grammar, schema and behavior (e.g. processing rules) ... in this case I think how you had it is OK, though it is easier to define a 'thing' precisely, which has the side effect of defining all the things it cannot be ;)

From a grammar point of view, we could define it (in EBNF) as a literal colon ... any parser built with such an EBNF would bork on encoded input ... we might suggest defining preprocessing rules to normalise odd input into a canonical purl, or define a looser grammar allowing encoding, though that is a separate matter altogether.

I would propose:

"The scheme and type MUST be separated by a colon ':' "

Armed with a grammar (EBNF) and the spec one can discriminate exactly what is intended.
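To make the grammar point concrete, here is a minimal Python sketch of a parser prefix rule with a literal colon. The regular expression stands in for the EBNF and is a simplified assumption for illustration, not the spec's actual grammar; the name `parse_type` is hypothetical.

```python
import re

# Hypothetical, simplified rule: the scheme "pkg", a literal ':', then a type.
# The type pattern is an assumption for illustration, not the spec's full rule.
_SCHEME_AND_TYPE = re.compile(r"^pkg:(?P<type>[A-Za-z][A-Za-z0-9.+-]*)(?:/|$)")


def parse_type(purl: str) -> str:
    """Return the lowercased type, or raise if the prefix does not match."""
    match = _SCHEME_AND_TYPE.match(purl)
    if match is None:
        # An encoded colon ("pkg%3Anpm") never matches the literal ':'.
        raise ValueError(f"not a valid purl prefix: {purl!r}")
    return match.group("type").lower()
```

Because the colon is a literal in the rule, encoded input is rejected without the rule ever having to mention encoding.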

FWIW, wherever encoding is allowed in a pURL we can explicitly state that in the spec and adjust grammar to reflect that though I think the challenge is unpicking pURL set of encoding rules against the general set of encoding rules on URLs (because people use url stuff on pURL).

Maybe with v1 we are bound to explicitly define encoding behaviour, that is, a set of processing rules combined with a grammar to get what we have today. It might be that defining the grammar in a separate section from the processing rules in the spec would help untangle this, e.g. often a spec (with grammar, processing) is not very readable (and spawns an 'annotated spec' or tutorials).

Sorry for not being very helpful!

Member Author

@JimFuller-RedHat

I would propose:

"The scheme and type MUST be separated by a colon ':' "

clean and neat.

Yet someone would likely immediately bet a nickel that someone else will open an issue asking whether the colon can or must or must not be encoded. ;)

Here is the way out:
We can keep these negative assertions as comments in a new FAQ document. I think that we can create a leaner spec this way.

I think it makes sense for the spec to avoid specifying all the things that it does not support as you rightly suggest, because there is an infinity of these. And we can push "negative" support questions in that FAQ.

So here:

  • In the spec:

    • The scheme and type MUST be separated by a colon ':'
  • In the FAQ:

    • Q: Is the colon between scheme and type encoded? Or can it be encoded? If yes, how?
    • A: The spec has no mention of encoding there, so the colon should be used as-is: never encoded, and never requiring any decoding. It is a parsing error if the colon does not come directly after pkg. Lenient parsing tools are welcome to recover from this error to help process and sanitize damaged purls, but that's not a requirement, and not part of the spec.

That approach can become the standard. The spec will be leaner and crisper, and the FAQ comprehensive enough to address the questions that will come up for sure.
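As an illustration of the optional lenient recovery mentioned in the FAQ answer, a tool might normalise an encoded colon before strict parsing. This is only a sketch under assumptions: the function name `recover_scheme_separator` is hypothetical, and nothing like it is required by or part of the spec.

```python
def recover_scheme_separator(purl: str) -> str:
    """Rewrite an encoded colon directly after the scheme into a literal ':'.

    Hypothetical helper for lenient tools only; a strict parser would
    simply report an error for such input.
    """
    scheme = "pkg"
    head, rest = purl[:len(scheme)], purl[len(scheme):]
    # "%3A" and "%3a" are both percent-encodings of ':'.
    if head.lower() == scheme and rest[:3].lower() == "%3a":
        return scheme + ":" + rest[3:]
    # Anything else is returned untouched for the strict parser to judge.
    return purl
```

Keeping recovery as a separate preprocessing step leaves the strict grammar untouched, so a damaged purl is either repaired first or rejected.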

Cool! Fantastic suggestion: brevity and conciseness enable implementers to focus ... a good example is the JSON spec - https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf , which only contains normative text ;)

@johnmhoran
Member

I added a test for the : between scheme and type. More to follow tomorrow.
