Describe the bug
Compiled pages contain an HTML text node at location document.body.childNodes[2] that is superfluous and looks like a mistake.
To Reproduce
Steps to reproduce the behavior:
Run git clone https://github.com/nathancarter/MA346-course-notes.git, then run jupyter-book build . in that folder and view the result locally.
Inspect document.body.childNodes[2].
Although I linked to a specific page, the problem exists on every page of that book.
Expected behavior
No extraneous elements in the page.
Environment (please complete the following information):
Output of jupyter-book --version:
Jupyter Book: 0.7.2
MyST-NB: 0.8.1
Sphinx Book Theme: 0.0.32
MyST-Parser: 0.8.1
Jupyter-Cache: 0.2.1
I have found the cause of this bug. View the HTML source for the page linked to above, and see source code line 55. The problem is that the meta tags are created without escaping HTML characters, and so the code <a href="...">...</a> inside the page content creates invalid HTML.
hmmm - we've dealt with issues like this before but thought they had gotten nailed down. When you say "meta tags created without escaping HTML characters", what do you mean? As in, tags in the head of the document?
The exact line in the HTML of the page looks like this:
<meta property="og:description" content="Introduction to Data Science <a href="../../_slides/chapter-1-slides.html">See also the slides that summarize a portion of this content.</a> What is data scie" />
As you can see, it's using content from the start of the page, which happens to contain HTML code, and it's placing it, unescaped, inside an HTML property value. This leads to the extraneous text element sitting in the body and outside any other elements. (See this page for an example. As noted above, widen the browser window or you won't see the mistake in the top left.)
I just tried this locally, and realized that you are including raw html in your page rather than using markdown syntax. That is the source of this problem. Is there a reason you're using raw html? As a general rule, it's discouraged to put raw HTML in there and might cause unpredictable behavior...
Yes, the reason is that I'm using a relative link to a page that's not part of the jupyter-book build process, but is in the same repo. (Each chapter of my book has a corresponding slide deck that's not part of the jupyter-book build, and the slide deck and chapter have links to one another for easy student navigation.) Is there a Markdown way that I should be doing so? When I tried the [...](...) syntax for links, it didn't work because the relative link didn't resolve the same from the source file as it does from the built file.
hmmmm - would it be possible to make the markdown-y link point directly to a (relative) .html file? That's what doesn't work?
The problem is that Jupyter Book knows how to extract the "text" from markdown syntax, but this is really hard to do with raw HTML, and so Jupyter Book brings that HTML with it into the page metadata, and that's what is causing this problem.
No, it's not possible to make a markdown link work, for the following reason.
Built page: project-root/_build/html/page-name.html
Source page: project-root/page-name.ipynb
Link target: project-root/_slides/example.html
So the link (from built page to target) has to be ../../_slides/example.html. But if you put this in the source page, from there it actually links outside the project folder entirely, and thus the build generates this error:
WARNING: None:any reference target not found: ../../_slides/chapter-1-slides.html
And the corresponding markdown code does not render into a link, but just pink fixed-width font, I guess to indicate that something went wrong.
I have also tried this with a relative reference from the location of the source file, that is, using ./_slides/example.html, thinking perhaps the build process would adjust the relative path, but I get the same error and result.
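The path arithmetic described above can be sketched in a few lines of Python. This is only an illustration using the placeholder paths from this comment (project-root, page-name, example.html); resolve is a hypothetical helper, not something Jupyter Book or Sphinx provides.

```python
import posixpath


def resolve(page_path, link):
    """Resolve a relative link against the folder that contains page_path."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(page_path), link))


built_page = "project-root/_build/html/page-name.html"
source_page = "project-root/page-name.ipynb"
link = "../../_slides/example.html"

# From the built page, the link lands on the slides inside the project.
from_built = resolve(built_page, link)

# From the source notebook, the same link climbs out of project-root.
from_source = resolve(source_page, link)

print(from_built)
print(from_source)
```

The second result starts with "..", which is why the same link text cannot be valid from both the source file and the built file.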
ahhh interesting. I wonder what @chrisjsewell thinks about this. We're over-riding the "relative" markdown links to assume they point to other sphinx docs, and that's why this error is happening. I wonder if there's some way we could avoid this behavior for "relative links but that don't point to Jupyter Book content"?
I wonder what @chrisjsewell thinks about this
Well the issue here IMO is that you have files being used for the site outside the _build folder.
The way sphinx, and thus jupyter-book, is implemented is that it expects the entire final site to be contained within the _build folder.
I would suggest adding _slides to the sphinx html_extra_path configuration, so that it copies this folder into _build. In _config.yml I guess it should be something like:
sphinx:
  config:
    html_extra_path: [_slides]
This seems like the right track; I realize I was kind of hacking this together, and a correct/approved way to do it would be great. But the above doesn't work for me. Here's what I tried and what happened:
I added the above to _config.yml and it successfully put all slides files into the book build folder. But then links like [example](chapter-1-slides.html) or [example](./chapter-1-slides.html) were still transformed by the book build process into non-links (just pink fixed-width font as if to indicate an error). Furthermore, an error is produced at build time:
/Users/nathan/OneDrive - Bentley University/Teaching/MA346 F20/MA346-course-notes/chapter-1-intro-to-data-science.ipynb:10004: WARNING: None:any reference target not found: chapter-1-slides.html
I thought this was perhaps because the slides source files weren't in the same folder as the book source files, so I tried something else:
I removed the _config.yml stuff and added my own post-build step in a shell script that copies my _slides folder to also live in _build/html/_slides, so that I could use links like [example](./_slides/chapter-1-slides.html), which would be valid at both build time and view time. But I get the same behavior, just with the new path in the error message.
I _think_ the problem here is that even if the file exists, it isn't something that Sphinx knows about, and because we over-ride []() with the :any: role, it's expecting to find something that is in the Sphinx cross-ref database.
ah yeh good point. I think this is maybe a configuration option we can make upstream in myst-parser, such that links with certain extensions (or regexes) are converted to external links, rather than attempting to resolve them as internal links
Right, so things like .md or .ipynb would be attempted to be found within The Sphinx Zone (among the pyramids?) and anything else would just be let straight through. Sounds good to me!
Yeh kind of lol; currently if the link does not match a URL scheme (e.g. 'http://...') then it is treated as an internal cross-reference (to a reference target or sphinx document, etc).
See: https://github.com/executablebooks/MyST-Parser/blob/3d5ae4f94c9d39435d76861b86dc5171ee23c9df/myst_parser/docutils_renderer.py#L412-L415
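To make the rule above concrete, here is a rough sketch of the current behaviour and of the extension-based idea floated in this thread. It is not MyST-Parser's actual code: current_rule, proposed_rule, and INTERNAL_EXTENSIONS are hypothetical names made up for illustration, and the proposal is only what was discussed here, not an existing option.

```python
from urllib.parse import urlparse

# Hypothetical list for the proposal: extensions that would still be
# looked up as Sphinx documents.
INTERNAL_EXTENSIONS = (".md", ".ipynb")


def current_rule(target):
    """Rule as described in this thread: no URL scheme means internal."""
    return "external" if urlparse(target).scheme else "internal"


def proposed_rule(target):
    """Proposal: only known source extensions are resolved internally."""
    parsed = urlparse(target)
    if parsed.scheme:
        return "external"
    # Look at the path only, so a trailing #fragment does not matter.
    if parsed.path.endswith(INTERNAL_EXTENSIONS):
        return "internal"
    return "external"


for target in ("http://example.com/a.html",
               "./_slides/chapter-1-slides.html",
               "chapter-2.ipynb"):
    print(target, current_rule(target), proposed_rule(target))
```

Under the current rule the slides link is treated as a cross-reference and fails to resolve; under the sketched proposal it would be passed through as an ordinary link.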
OK. Then if that issue has been created, I think we can close this one, right?
The only missing part might be to somewhere document (except that maybe you've done it and I just haven't seen it) that using raw HTML in Markdown in Jupyter cells isn't supported.
Well it is generally supported, but discouraged.
In this instance, actually we should probably raise an issue in sphinx; that the text of the content attribute of the meta description tag should be passed through html.escape.
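As a minimal sketch of what escaping would do here, the snippet below runs the description from the broken tag quoted earlier through Python's html.escape and then parses the resulting tag to check that the whole description survives as one attribute value. The MetaContent class is a hypothetical helper for this check only; this is not the code path Sphinx or the theme uses.

```python
import html
from html.parser import HTMLParser

description = (
    'Introduction to Data Science <a href="../../_slides/chapter-1-slides.html">'
    "See also the slides that summarize a portion of this content.</a> "
    "What is data scie"
)

# quote=True (the default) also escapes " and ', which is what matters
# inside a double-quoted attribute value.
escaped = html.escape(description, quote=True)
meta_tag = f'<meta property="og:description" content="{escaped}" />'


class MetaContent(HTMLParser):
    """Collect the content attribute of every meta tag seen."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.contents.append(dict(attrs).get("content"))


parser = MetaContent()
parser.feed(meta_tag)
parser.close()
print(meta_tag)
```

Once escaped, the attribute value no longer contains a raw double quote or angle bracket, so nothing can leak out of the tag into the page body.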
Actually I can't reproduce (with a basic sphinx build) how to get the opengraph meta data to be added to the page, or see where this occurs in the code (sphinx, docutils, etc):
<!-- Opengraph tags -->
<meta property="og:url" content="https://jupyterbook.org/intro.html" />
<meta property="og:type" content="article" />
<meta property="og:title" content="Books with Jupyter" />
<meta property="og:description" content="Books with Jupyter Jupyter Book is an open source project for building beautiful, publication-quality books and documents from computational material. Jupyter" />
<meta property="og:image" content="https://jupyterbook.org/_static/logo.png" />
@choldgraf?
so there are two issues we're discussing in this thread which might be confusing things:
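(1) Relative links to files that are not part of the Sphinx build don't resolve
(2) Raw HTML at the start of a page causes the meta tag to break
To trigger the original bug in this thread, put some raw HTML at the beginning of a page so that it is in the content= block of og:description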
Yes (1) is now encapsulated in https://github.com/executablebooks/MyST-Parser/issues/193, that would be the ideal way to fix this particular issue
More generally though (2) is a bug somewhere
To trigger the original bug in this thread, put some raw HTML at the beginning of a page so that it is in the content= block of og:description
But how does it end up in that block? what code is causing that and in what repo?
As mentioned, if I create a minimal sphinx folder (with sphinx-quickstart) and a page:
Header
======

.. raw:: html

   <a href="slides.html">See also the slides that summarize a portion of this content.</a>
I can't replicate it (I also tried adding the sphinx-book-theme). There must be one or more config values and/or extensions that can cause it to be added
Ah no I don't think it'll happen if you use the .. raw:: directive, only if you embed it directly:
# Header
<a href="slides.html">See also the slides that summarize a portion of this content.</a>
The problem is here: https://github.com/executablebooks/sphinx-book-theme/blob/master/sphinx_book_theme/layout.html#L18
page_description includes the HTML syntax that is there, and this bugs out the meta tags. I think we could just sanitize this somehow to fix this particular bug
Ah no I don't think it'll happen if you use the .. raw:: directive, only if you embed it directly
But that is not what myst-parser is doing, it is adding it as raw html: https://github.com/executablebooks/MyST-Parser/blob/3d5ae4f94c9d39435d76861b86dc5171ee23c9df/myst_parser/docutils_renderer.py#L443
I think we could just sanitize this somehow to fix this particular bug
You might be able to do: <meta property="og:description" content="{{ page_description|e }}" />
(this might also be relevant for some of the other jinja variables)
See: https://jinja.palletsprojects.com/en/2.11.x/templates/#working-with-manual-escaping
oh yeah I bet that'd totally fix it
yep - OK I think this is fixed in https://github.com/executablebooks/sphinx-book-theme/pull/142
If I do a fresh pip install from the git repo, will it get this latest change, or does there have to be a release first? (Still not sure what pip install git+git://... does...)
TBH I am still not sure what pip install git+ does either anymore haha. What you'll need to upgrade though is the sphinx-book-theme to master. When a new release is made it'll "automatically" update
Glad I'm not the only one. :) I just type things and press enter and hope they don't take over my computer.
You mean pip install git+git://github.com/executablebooks/sphinx-book-theme.git@master? Tried that just now but it didn't solve the problem in my HTML build.
Aha! Uninstalling sphinx-book-theme and then running the above command fixed it!