Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on Fullwidth Exclamation Mark (U+FF01) in comment #134810

Open
dtolnay opened this issue Dec 27, 2024 · 6 comments
Open

Warn on Fullwidth Exclamation Mark (U+FF01) in comment #134810

dtolnay opened this issue Dec 27, 2024 · 6 comments
Labels
A-lints Area: Lints (warnings about flaws in source code) such as unused_mut. A-Unicode Area: Unicode T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

Comments

@dtolnay
Copy link
Member

dtolnay commented Dec 27, 2024

//! A
//!B
//! C

For the above Rust source code, rustdoc produces documentation containing only "A C". The line containing "B" is only a comment. The comment starts with https://www.compart.com/en/unicode/U+FF01.

I caught this in a real-world documentation PR from a Chinese contributor. See #134241 (review).

In GitHub, the distinction is practically invisible.

In my text editor, it is more obvious.

Would it be reasonable for rustc and/or rustdoc to have reported a lint on such code?

@dtolnay dtolnay added A-lints Area: Lints (warnings about flaws in source code) such as unused_mut. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 27, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 27, 2024
@dtolnay
Copy link
Member Author

dtolnay commented Dec 27, 2024

@workingjubilee
Copy link
Member

workingjubilee commented Dec 27, 2024

Yes, we should either treat the characters identically (normalizing both to match //!) or lint on this. This character is emitted because someone starts writing something like

//! それから

and then notices they're writing using a CJK input method which will preferentially emit fullwidth characters, even when using "English" punctuation inputs. They then switch linguistic gears (or at least, those of their HID/IME), back up, and write,

//! Therefore,

If they don't delete the entire line, they leave behind the original fullwidth character.

I'm not aware of a fullwidth /, or at least, common input methods don't use it, so the original comment token is left intact.

@workingjubilee workingjubilee added A-Unicode Area: Unicode T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Dec 27, 2024
@est31
Copy link
Member

est31 commented Dec 27, 2024

We do error for the fullwidth exclamation mark if we encounter it in normal Rust code:

fn foo() -> ! {
    let v = „hi“;
}
output (expand)
error: unknown start of token: \u{ff01}
 --> src/lib.rs:1:13
  |
1 | fn foo() -> ! {
  |             ^^
  |
help: Unicode character '!' (Fullwidth Exclamation Mark) looks like '!' (Exclamation Mark), but it is not
  |
1 | fn foo() -> ! {
  |             ~

error: unknown start of token: \u{201e}
 --> src/lib.rs:2:13
  |
2 |     let v = „hi“;
  |             ^

error: unknown start of token: \u{201c}
 --> src/lib.rs:2:16
  |
2 |     let v = „hi“;
  |                ^
  |
help: Unicode character '“' (Left Double Quotation Mark) looks like '"' (Quotation Mark), but it is not
  |
2 |     let v = „hi";
  |                ~

error[E0425]: cannot find value `hi` in this scope
 --> src/lib.rs:2:14
  |
2 |     let v = „hi“;
  |              ^^ not found in this scope

error[E0308]: mismatched types
 --> src/lib.rs:1:13
  |
1 | fn foo() -> ! {
  |    ---      ^^ expected `!`, found `()`
  |    |
  |    implicitly returns `()` as its body has no tail or `return` expression
  |
  = note:   expected type `!`
          found unit type `()`
But there is no error or lint here. I don't see any existing unicode lint have this in its scope (say the ones [here](https://doc.rust-lang.org/stable/nightly-rustc/src/rustc_lint/non_ascii_idents.rs.html)).

We could maybe make a specialized lint like how there is text_direction_codepoint_in_comment.

This also gives an error btw:

fn foo() {
    /∗∗/
}
output (expand)
error: unknown start of token: \u{2217}
 --> src/lib.rs:2:6
  |
2 |     /∗∗/
  |      ^^
  |
  = note: character appears once more
help: Unicode character '∗' (Asterisk Operator) looks like '*' (Asterisk), but it is not
  |
2 |     /**/
  |      ~~

This doesn't lint either and on a first glance it looks like hello isn't commented out (but it is).

/*
some comment here
∗/
fn hello() {}
/*
some comment here
*/

There is no lint here either (I don't have a chinese keyboard, maybe one can hit this when trying for `?):

/*!
```
assert_eq!(true, false);
```
*/

or here (less visible and occurs once in the wild):

/*!
ˋˋˋ
assert_eq!(true, false);
ˋˋˋ
*/

/ has a bunch of confusables too, so this is not a doc comment:

//⁄ hello
fn foo() {}

Maybe one could make a confusables_in_comments lint? Maybe doc_comment_confusables?

@workingjubilee
Copy link
Member

workingjubilee commented Dec 27, 2024

There is no lint here either (I don't have a chinese keyboard, maybe one can hit this when trying for `?):

While specialty keyboards for these languages do exist, using them is actually... er... a specialization? Like steganography keyboards are to English. These input methods are mostly software-driven, so people often use perfectly ordinary QWERTY keyboards, and the "switch input method" keys that are sometimes present (in e.g. the 109 key layout) can be replicated by chorded commands using some combination of the Shift, Ctrl, Alt, and Fn keys.

@jieyouxu
Copy link
Member

jieyouxu commented Dec 27, 2024

So while unlikely, doesn't doing so technically warn/lint on "valid" comments? E.g. what if I happen to intentionally write a non-ASCII exclamation mark? Likely more unsual (but putting on an adversarial user's hat)

//难道正常的感叹号就不存在吗?当然有
//!左边这个不就是吗?

(The punctuation being formatted like this is not usual, but it is "technically" valid)

I'd be personally in favor of adding such a warning, just thinking if it can be broken (specifically false positives).

@Noratrieb Noratrieb removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 27, 2024
@Noratrieb
Copy link
Member

"proper" formatting of comments puts a space after the //, which resolves the problem. I don't think this can have false positives, at worst it will complain about your comment formatting and you can fix it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-lints Area: Lints (warnings about flaws in source code) such as unused_mut. A-Unicode Area: Unicode T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants