This is a tracking issue for the rustdoc lint check_invalid_html_tags.
There is no feature gate for this lint, but it is currently nightly-only.
check_invalid_html_tagsThis lint warns about unclosed HTML tags and comments. See https://doc.rust-lang.org/nightly/rustdoc/lints.html#invalid_html_tags for more information.
Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.
Original issue
Rustdoc currently allows HTML in markdown, which is fine. The problem is that it doesn't sanitize the HTML at all, which means you can do wacky stuff like:
/// This is basically a Vec<Script>
pub enum ScriptExtension { /* ... */ }
impl ScriptExtension {
/// docs
pub fn some_func() {}
}
in which the Vec<Script> parses as an opening script tag, and the rest of the docs just ... disappear

This isn't great, and it's kind of surprising. Perhaps we should sanitize these, or at least warn when it happens? This may not be easy or cheap, though.
cc @GuillaumeGomez @kinnison
Historically, the decision was to do nothing. That doesn't mean that maybe changing it isn't the right choice, but this feels like internals thread / RFC territorry rather than a bug, personally.
@steveklabnik my understanding is that the historical discussion has been _mostly_ about whether rustdoc should allow arbitrary HTML at all, whereas this is about a narrower set of behavior -- allowing _unclosed_ tags (which basically will never do what you want, and are often the result of mistyped generics).
There are really only a few unclosed tags that can cause problems (namely, script, video, object, audio, and some other similar ones), a simpler check would be one that just scans for <script with no corresponding </script> or something
I'd personally do nothing. From my point of view, this is up to users to master what they're doing when using HTML since we're following markdown commonmark spec.
Again, the problem is _not_ users "using HTML", it's users _accidentally_ introducing HTML into their code. I am okay with users _intentionally_ using HTML in rustdoc. I know that has been a common complaint, but I am not making that complaint. I'm talking about something different.
Typing Vec<Script> without backticks is enough to break the docs. I've seen plenty of rustdocs that avoid religiously putting backticks on code, and I've made that mistake myself a bunch of times. Unclosed tags are an easy metric for this.
The problem is that a lot of HTML5 tags can be unclosed. Keeping this list up-to-date might be tricky. We can do it but in case a user is using an unclosed HTML5 tag, what should we do in this case? If it's a type we're supposed to not convert it to HTML but what if it's intended to be HTML? This is where I have an issue.
I also covered that in a previous comment. We can do this for specific tags only, as a start perhaps _only_ the tags that will make the following content completely disappear (script, object, audio, video, etc).
We can also keep a list of all closeable HTML tags. If it's out of date, it's no big deal, this can just be a lint.
Yes, it should. This is not the same as forbidding HTML or sanitizing what HTML features are allowed.
Unclosed tags can break the entire page, which is highly undesirable.
It might be used intentionally to open a tag in one doc comment and close it in another, but such uses would couple docs very tightly to a particular ordering of items and specific markup used by rustdoc itself. Allowing to rely on such a hacky thing seems like a bad idea.
Rustdoc could parse and reserialize the markup. It would guarantee the markup is syntactically correct without changing its meaning.
If we keep this as a warning, that could be a big plus. Does it sound OK?
@Manishearth: Do you mind listing the tags which need to be checked please?
Yes, I already suggested a warning, I'm fine with that.
script, object, video, audio are the most problematic. We can get an implementation in and figure out the full list later, it should be something that can be found in the HTML spec somewhere.
Ok, I'll try to send something in as soon as possible.
Mentoring instructions: In src/librustdoc/passes add a new pass (maybe named lint_unclosed_tags). For an example of a pass that looks at doc-comments, see unindent_comments. For an example of a pass that lints, see doc_test_lints.
The pass should check for the following unclosed tags:
As an extension, you could also look for other unclosed tags; any tag that isn't self-closing should warn here.
You can see the documentation by calling item.attrs.doc_strings.as_str().
For an initial implementation, you don't have to worry about <scri\npt> across newlines. However, you should definitely allow multi-line tags, such as
/// <script>
/// </script>
Unassigning @GuillaumeGomez since it's been 9 months.
Reassigning @GuillaumeGomez since he's working on it :)
I repurposed this to be a tracking issue for the lint.
@rust-lang/rustdoc after https://github.com/rust-lang/rust/pull/77753 lands, what do you think about making this warn by default? I think it would be really helpful for things like https://github.com/tokio-rs/tracing/pull/881.
Yes, it should be warn by default
I just tried to turn this on by default and noticed it interacts badly with #76934:
error: unresolved link to `Vec<Box<T>`
--> $DIR/intra-link-malformed-generics.rs:5:6
|
LL | //! [Vec<Box<T>]
| ^^^^^^^^^^ unbalanced angle brackets
warning: unclosed HTML tag `T`
--> $DIR/intra-link-malformed-generics.rs:5:13
|
LL | //! [Vec<Box<T>]
| ^^^
|
= note: `#[warn(invalid_html_tags)]` on by default
I'm not sure what we could do to make this better ... for now maybe it makes sense to gate generic parameters behind a feature gate and then this lint wouldn't need changes? Or even better, keep only the functionality from #76934 and remove all the warnings.
cc @camelid
Or we can prevent this check when we're inside a link?
@GuillaumeGomez we're _not_ inside a link, that's why the lint fires in the first place. [...] is not a link unless it has a corresponding reference link that's not broken.
I think it's marked as a broken link: https://docs.rs/pulldown-cmark/0.8.0/pulldown_cmark/enum.LinkType.html#variant.ShortcutUnknown
Shortcut without destination in the document, but resolved by the broken_link_callback
Vec<Box<T> is not resolved by the callback, that's why it gives the intra-doc error in the first place. This is just a Text node. https://docs.rs/pulldown-cmark/0.8.0/pulldown_cmark/enum.Event.html#variant.Text
I'm not sure if it's the generic errors that are the problem; it seems to me that the HTML lint needs to be changed so it doesn't get triggered by this. Or there needs to be something else outside of either of them. Of course, I am biased :)
Perhaps we sanitize the link text if it doesn't get resolved as an intra-doc link?
I think it should prevent the HTML lint from triggering and it will still allow people to have HTML in their link text because it will only trigger for unresolved links.
Perhaps we sanitize the link text if it doesn't get resolved as an intra-doc link?
We should not change the text that users have in their documentation (except in special cases, like #77277). That would break anyone with allow(broken_intra_doc_links) because they actually do want the text they have.
One possibility is to make these lints mutually exclusive: if the intra-doc link error is reported, silence the check_tags warnings; if intra-doc errors are silenced with allow(...), enable check_tags again. That seems _slightly_ inconsistent but it's better than a duplicate warning IMO.
Sorry, I meant _escape_ the link text; i.e., turn this
[Vec<Box<T>]
into this
[Vec<Box<T>]
. I know rustdoc tries to allow HTML, but maybe it would make sense to escape HTML tags in just this one case (where the link is unresolved)?
I think suppressing the duplicate warning makes sense. In terms of escaping the link text, I don't see why it would be an issue given that people can just escape the brackets if they meant them literally or resolve the link, which they should do anyway.
Sorry if I'm missing something!
turn
[Vec<Box>]into[Vec<Box<T>]
This still breaks; pulldown-cmark will escapes the & to &. And in any case I think this is getting a little side-tracked - I'm asking what the behavior _should_ be, not how we should implement it. It sounds like everyone agrees that if the intra-doc link warning fires check_tags shouldn't?
And in any case I think this is getting a little side-tracked - I'm asking what the behavior _should_ be, not how we should implement it.
Ah, sorry :)
It sounds like everyone agrees that if the intra-doc link warning fires
check_tagsshouldn't?
I think so, unless there are any corner cases we're not thinking of...