Jupyter-book: Extra HTML text node created in output

Created on 29 Jul 2020 · 29 Comments · Source: executablebooks/jupyter-book

Describe the bug
Compiled pages contain an HTML text node at location document.body.childNodes[2] that is superfluous and looks like a mistake.

To Reproduce
Steps to reproduce the behavior:

Visit one of the pages of my project, such as this one. (Or check that book project out with git clone https://github.com/nathancarter/MA346-course-notes.git and then run jupyter-book build . in that folder and view it locally.)
Widen your browser view beyond the minimum required to fit the page.
See the erroneous HTML text element in the top left.
Optional: Open your browser's dev tools and find the extra HTML node at location document.body.childNodes[2].

Although I linked to a specific page, the problem exists on every page of that book.

Expected behavior
No extraneous elements in the page.

Environment (please complete the following information):

Python 3.7.3
Output of jupyter-book --version

Jupyter Book: 0.7.2
MyST-NB: 0.8.1
Sphinx Book Theme: 0.0.32
MyST-Parser: 0.8.1
Jupyter-Cache: 0.2.1

bug

nathancarter

All 29 comments

I have found the cause of this bug. View the HTML source for the page linked to above, and see source code line 55. The problem is that the meta tags are created without escaping HTML characters, and so the code <a href="...">...</a> inside the page content creates invalid HTML.

nathancarter on 29 Jul 2020

hmmm - we've dealt with issues like this before but thought they had gotten nailed down. When you say "meta tags created without escaping HTML characters", what do you mean? As in, tags in the head of the document?

choldgraf on 29 Jul 2020

The exact line in the HTML of the page looks like this:

<meta property="og:description" content="Introduction to Data Science  <a href="../../_slides/chapter-1-slides.html">See also the slides that summarize a portion of this content.</a>  What is data scie" />

As you can see, it's using content from the start of the page, which happens to contain HTML code, and it's placing it, unescaped, inside an HTML property value. This leads to the extraneous text element sitting in the body and outside any other elements. (See this page for an example. As noted above, widen the browser window or you won't see the mistake in the top left.)
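
To see why the tag breaks, here is a rough sketch of where the quoted attribute value ends. The regular expression only approximates attribute parsing (real browsers apply fuller error recovery), and `first_attr_value` is a hypothetical helper name, not part of any library.

```python
import re

# The og:description tag from the page, copied from the line quoted above.
META = (
    '<meta property="og:description" content="Introduction to Data Science  '
    '<a href="../../_slides/chapter-1-slides.html">See also the slides that '
    'summarize a portion of this content.</a>  What is data scie" />'
)

def first_attr_value(tag: str, name: str):
    """Return the text between name=" and the next double quote, or None."""
    match = re.search(re.escape(name) + r'="([^"]*)"', tag)
    return match.group(1) if match else None

# The value stops at the quote that opens the inner href.
content_value = first_attr_value(META, "content")
print(repr(content_value))
```

Under this approximation the attribute value ends just before the link target, so everything after it is no longer inside the `content` attribute.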

nathancarter on 30 Jul 2020

I just tried this locally, and realized that you are including raw HTML in your page rather than using markdown syntax. That is the source of this problem. Is there a reason you're using raw HTML? As a general rule, putting raw HTML in there is discouraged, and it might cause unpredictable behavior...

choldgraf on 30 Jul 2020

Yes, the reason is that I'm using a relative link to a page that's not part of the jupyter-book build process, but is in the same repo. (Each chapter of my book has a corresponding slide deck that's not part of the jupyter-book build, and the slide deck and chapter have links to one another for easy student navigation.) Is there a Markdown way that I should be doing so? When I tried the [...](...) syntax for links, it didn't work because the relative link didn't resolve the same from the source file as it does from the built file.

nathancarter on 31 Jul 2020

hmmmm - would it be possible to make the markdown-y link point directly to a (relative) .html file? That's what doesn't work?

The problem is that Jupyter Book knows how to extract the "text" from markdown syntax, but this is really hard to do with raw HTML, and so Jupyter Book brings that HTML with it into the page metadata, and that's what is causing this problem.

choldgraf on 31 Jul 2020

No, it's not possible to make a markdown link work, for the following reason.

My built page will be at project-root/_build/html/page-name.html
My page source is at project-root/page-name.ipynb
The target of the link is at project-root/_slides/example.html

So the link (from built page to target) has to be to ../../_slides/example.html. But if you put this in the source page, from there it actually links outside the project folder entirely, and thus the build generates this error:

WARNING: None:any reference target not found: ../../_slides/chapter-1-slides.html

And the corresponding markdown code does not render into a link, but just pink fixed-width font, I guess to indicate that something went wrong.

I have also tried this with a relative reference from the location of the source file, that is, using ./_slides/example.html, thinking perhaps the build process would adjust the relative path, but I get the same error and result.
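
The path arithmetic described in this comment can be checked with a small sketch. It uses only the example paths listed above; `resolve` is a hypothetical helper that mimics how a relative link is resolved against the folder of the file containing it.

```python
import posixpath

def resolve(from_file: str, link: str) -> str:
    """Resolve a relative link against the folder that contains from_file."""
    folder = posixpath.dirname(from_file)
    return posixpath.normpath(posixpath.join(folder, link))

BUILT = "project-root/_build/html/page-name.html"
SOURCE = "project-root/page-name.ipynb"

# From the built page, two levels up lands back in project-root.
from_built = resolve(BUILT, "../../_slides/example.html")
# From the source page, the same link climbs out of project-root.
from_source = resolve(SOURCE, "../../_slides/example.html")
# A link written relative to the source file is valid there.
from_source_local = resolve(SOURCE, "./_slides/example.html")

print(from_built, from_source, from_source_local, sep="\n")
```

So no single relative path is correct from both locations, and even the path that is correct from the source file is still reported as a missing reference target.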

nathancarter on 31 Jul 2020

ahhh interesting. I wonder what @chrisjsewell thinks about this. We're over-riding the "relative" markdown links to assume they point to other sphinx docs, and that's why this error is happening. I wonder if there's some way we could avoid this behavior for "relative links but that don't point to Jupyter Book content"?

choldgraf on 31 Jul 2020

> I wonder what @chrisjsewell thinks about this

Well the issue here IMO is that you have files being used for the site outside the _build folder.
The way sphinx, and thus jupyter-book, is implemented is that it expects the entire final site to be contained within the _build folder.
I would suggest adding _slides to the sphinx html_extra_path configuration, so that it copies this folder into _build. In _config.yml I guess it should be something like:

sphinx:
  config:
    html_extra_path: [_slides]
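
For a project configured through a plain Sphinx conf.py rather than _config.yml, the same setting would presumably look like the fragment below. This is only a sketch of a configuration line; the folder name _slides is the one used in this thread.

```python
# conf.py (Sphinx configuration fragment)
# Paths listed here are copied into the HTML output directory at build time.
html_extra_path = ["_slides"]
```

Note that this only copies files; it does not change how Markdown links are resolved.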

chrisjsewell on 31 Jul 2020

This seems like the right track; I realize I was kind of hacking this together, and a correct/approved way to do it would be great. But the above doesn't work for me. Here's what I tried and what happened:

Did what @chrisjsewell said in my _config.yml and it successfully put all slides files into the book build folder. But then links like [example](chapter-1-slides.html) or [example](./chapter-1-slides.html) were still transformed by the book build process into non-links (just pink fixed-width font as if to indicate an error). Furthermore, an error is produced at build time:

/Users/nathan/OneDrive - Bentley University/Teaching/MA346 F20/MA346-course-notes/chapter-1-intro-to-data-science.ipynb:10004: WARNING: None:any reference target not found: chapter-1-slides.html

I thought this was perhaps because the slides source files weren't in the same folder as the book source files, so I tried something else:

I removed the _config.yml stuff and added my own post-build step in a shell script that copies my _slides folder to also live in _build/html/_slides, so that I could use links like [example](./_slides/chapter-1-slides.html), which would be valid at both build time and view time. But I get the same behavior, just with the new path in the error message.

nathancarter on 3 Aug 2020

I _think_ the problem here is that even if the file exists, it isn't something that Sphinx knows about, and because we over-ride []() with the :any: role, it's expecting to find something that is in the Sphinx cross-ref database.

choldgraf on 3 Aug 2020

ah yeh good point. I think this is maybe a configuration option we can make upstream in myst-parser, such that links with certain extensions (or regexes) are converted to external links, rather than attempting to resolve them as internal links

chrisjsewell on 3 Aug 2020

Right, so things like .md or .ipynb would be attempted to be found within The Sphinx Zone (among the pyramids?) and anything else would just be let straight through. Sounds good to me!
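
A minimal sketch of the rule proposed here, assuming the decision is made purely on the file extension. `INTERNAL_EXTENSIONS` and `is_internal` are hypothetical names for illustration, not an existing MyST-Parser option.

```python
import posixpath

# Assumption: only these extensions would be looked up as Sphinx documents.
INTERNAL_EXTENSIONS = {".md", ".ipynb"}

def is_internal(link: str) -> bool:
    """Return True if Sphinx should resolve the link, False to pass it through."""
    path = link.split("#", 1)[0]  # ignore any #fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in INTERNAL_EXTENSIONS

for link in ["chapter-2.md", "notes.ipynb#setup", "./_slides/chapter-1-slides.html"]:
    print(link, "->", "internal" if is_internal(link) else "pass through")
```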

nathancarter on 3 Aug 2020

Yeh kind of lol; currently if the link does not match a URL scheme (e.g. 'http://...') then it is treated as an internal cross-reference (to a reference target or sphinx document, etc).
See: https://github.com/executablebooks/MyST-Parser/blob/3d5ae4f94c9d39435d76861b86dc5171ee23c9df/myst_parser/docutils_renderer.py#L412-L415
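
Roughly, the behavior described here can be approximated as below. This is not the code from the linked file, just an illustration using the standard library; `link_kind` is a hypothetical name.

```python
from urllib.parse import urlparse

def link_kind(target: str) -> str:
    """Classify a link target by whether it carries a URL scheme."""
    scheme = urlparse(target).scheme
    return "external" if scheme else "internal"

# A relative .html path has no scheme, so it is treated as a cross-reference.
for target in ["http://example.com/page.html", "./_slides/chapter-1-slides.html"]:
    print(target, "->", link_kind(target))
```

Under this rule a relative link to a slide deck falls into the "internal" branch, which matches the "reference target not found" warning reported earlier in the thread.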

chrisjsewell on 3 Aug 2020

OK. Then if that issue has been created, I think we can close this one, right?

The only missing part might be to document somewhere (unless you've already done it and I just haven't seen it) that using raw HTML in Markdown in Jupyter cells isn't supported.

nathancarter on 3 Aug 2020

Well it is generally supported, but discouraged.
In this instance, actually we should probably raise an issue in sphinx; that the text of the content attribute of the meta description tag should be parsed through html.escape.
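
As a sketch of what passing the description through html.escape would do, assuming the description string is available before the tag is written (the variable names are illustrative only):

```python
import html

# Start of the page text, including the raw HTML link that caused the problem.
description = (
    'Introduction to Data Science  '
    '<a href="../../_slides/chapter-1-slides.html">See also the slides'
)

# quote=True also escapes double quotes, which is what an attribute value needs.
escaped = html.escape(description, quote=True)
meta_tag = f'<meta property="og:description" content="{escaped}" />'
print(meta_tag)
```

The escaped value contains no raw `<` or `"` characters, so it can no longer end the attribute early.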

chrisjsewell on 3 Aug 2020

Actually I can't reproduce (with a basic sphinx build) how to get the opengraph meta data to be added to the page, or see where this occurs in the code (sphinx, docutils, etc):

<!-- Opengraph tags -->
<meta property="og:url"         content="https://jupyterbook.org/intro.html" />
<meta property="og:type"        content="article" />
<meta property="og:title"       content="Books with Jupyter" />
<meta property="og:description" content="Books with Jupyter  Jupyter Book is an open source project for building beautiful, publication-quality books and documents from computational material.  Jupyter" />
<meta property="og:image"       content="https://jupyterbook.org/_static/logo.png" />

@choldgraf?

chrisjsewell on 3 Aug 2020

so there are two issues we're discussing in this thread which might be confusing things:

1. Referencing relative files that _aren't_ part of Sphinx is not possible with markdown syntax in MyST
2. Because of this, @nathancarter was using a raw HTML link at the top of each page. Raw HTML links at the beginning of a page will cause the meta tag to break

To trigger the original bug in this thread, put some raw HTML at the beginning of a page so that it is in the content= block of og:description

choldgraf on 3 Aug 2020

Yes (1) is now encapsulated in https://github.com/executablebooks/MyST-Parser/issues/193, that would be the ideal way to fix this particular issue

More generally though (2) is a bug somewhere

> To trigger the original bug in this thread, put some raw HTML at the beginning of a page so that it is in the content= block of og:description

But how does it end up in that block? what code is causing that and in what repo?
As mentioned, if I create a minimal sphinx folder (with sphinx-quickstart) and a page:

Header
======

.. raw:: html

    <a href="slides.html">See also the slides that summarize a portion of this content.</a>

I can't replicate it (I also tried adding the sphinx-book-theme). There must be one or more config values and/or extensions that can cause it to be added

chrisjsewell on 3 Aug 2020

Ah no I don't think it'll happen if you use the .. raw:: directive, only if you embed it directly:

# Header

<a href="slides.html">See also the slides that summarize a portion of this content.</a>

choldgraf on 3 Aug 2020

The problem is here: https://github.com/executablebooks/sphinx-book-theme/blob/master/sphinx_book_theme/layout.html#L18

page_description includes the HTML syntax that is there, and this bugs out the meta tags. I think we could just sanitize this somehow to fix this particular bug
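
One possible reading of "sanitize" is to drop the tags and keep only the text. The sketch below shows that idea with the standard library; it is an alternative for illustration, not what the theme does, and `TextOnly` and `strip_tags` are hypothetical names.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text nodes and ignore all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(fragment: str) -> str:
    """Return the text content of an HTML fragment."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)

plain = strip_tags(
    '<a href="slides.html">See also the slides that summarize a portion of this content.</a>'
)
print(plain)
```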

choldgraf on 3 Aug 2020

> Ah no I don't think it'll happen if you use the .. raw:: directive, only if you embed it directly

But that is not what myst-parser is doing; it is adding it as raw html: https://github.com/executablebooks/MyST-Parser/blob/3d5ae4f94c9d39435d76861b86dc5171ee23c9df/myst_parser/docutils_renderer.py#L443

chrisjsewell on 3 Aug 2020

> I think we could just sanitize this somehow to fix this particular bug

You might be able to do: <meta property="og:description" content="{{ page_description|e }}" />
(this might also be relevant for some of the other jinja variables)
See: https://jinja.palletsprojects.com/en/2.11.x/templates/#working-with-manual-escaping
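
A small sketch of what the `e` filter does to such a value, assuming the jinja2 package is installed (it is a Sphinx dependency). The template string mirrors the line suggested above; the sample description is made up for illustration.

```python
import html
from jinja2 import Template  # third-party; assumed to be installed

template = Template(
    '<meta property="og:description" content="{{ page_description|e }}" />'
)

# Hypothetical page text that starts with a raw HTML link.
page_description = 'Intro  <a href="slides.html">See also the slides.</a>  What is'

rendered = template.render(page_description=page_description)
print(rendered)
```

With the filter applied, the angle brackets and quotes inside the description are emitted as character references, so the tag keeps exactly two attributes.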

chrisjsewell on 3 Aug 2020

oh yeah I bet that'd totally fix it

choldgraf on 3 Aug 2020

yep - OK I think this is fixed in https://github.com/executablebooks/sphinx-book-theme/pull/142

choldgraf on 4 Aug 2020

If I do a fresh pip install from the git repo, will it get this latest change, or does there have to be a release first? (Still not sure what pip install git+git://... does...)

nathancarter on 4 Aug 2020

TBH I am still not sure what pip install git+ does either anymore haha. What you'll need to upgrade though is the sphinx-book-theme to master. When a new release is made it'll "automatically" update

choldgraf on 4 Aug 2020

Glad I'm not the only one. :) I just type things and press enter and hope they don't take over my computer.

You mean pip install git+git://github.com/executablebooks/sphinx-book-theme.git@master? Tried that just now but it didn't solve the problem in my HTML build.

nathancarter on 4 Aug 2020

Aha! Uninstalling sphinx-book-theme and then running the above command fixed it!
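
One way to check which copy of a package Python actually sees after a reinstall is to ask for its installed version. This is a sketch that needs Python 3.8 or later (newer than the 3.7.3 reported above); `installed_version` is a hypothetical helper, and whether the version string differs between a release and a git install depends on the package.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("sphinx-book-theme"))
```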

nathancarter on 4 Aug 2020
