When a code block contains Unicode characters that are valid in the source language, and the block is otherwise accepted by pygments, the doc build refuses to highlight it.
I am trying to get syntax highlighting for the following code:
Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).
As you can see, this uses the unicode symbol ¬.
As a test case, I try to compile the following index.rst in an otherwise vanilla sphinx project:
This breaks:
.. code-block:: coq

   Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).
But this works:
.. code-block:: coq

   Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).
$ make html
Running Sphinx v1.6.5
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 1 source files that are out of date
updating environment: 0 added, 1 changed, 0 removed
reading sources... [100%] index
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] index
/tmp/test/index.rst:3: WARNING: Could not lex literal_block as "coq". Highlighting skipped.
generating indices... genindex
writing additional pages... search
copying static files... done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 1 warning.
Build finished. The HTML pages are in _build/html.
The original code is processed fine by pygments directly, so sphinx should handle it too.
And if sphinx insists on breaking, it should tell me why.
Additional info: the html produced by pygmentize -v -l coq -f html -o temp.coqout temp.coq looks like this:
<div class="highlight"><pre><span></span><span class="kn">Definition</span> <span class="n">logeq_both_false</span> <span class="o">{</span><span class="n">X</span> <span class="n">Y</span> <span class="o">:</span> <span class="n">UU</span><span class="o">}</span> <span class="o">:</span> <span class="err">¬</span><span class="n">X</span> <span class="o">-></span> <span class="err">¬</span><span class="n">Y</span> <span class="o">-></span> <span class="o">(</span><span class="n">X</span> <span class="o"><-></span> <span class="n">Y</span><span class="o">).</span>
</pre></div>
The bits <span class="err">¬</span> do not look encouraging... does "err" stand for "erroneous input"? I inserted this manually into some HTML produced by Sphinx (I took your working input and injected the above directly into the html) and here is what it gives:

Similarly, I tried the latex formatter of pygmentize: I took its produced code, replaced all \PY macros with the \PYG macros used by Sphinx, spliced it by hand into a Sphinx LaTeX document, and this gives this PDF output:

So it does seem that, even on the Pygments side, the Unicode code points are treated as some sort of error.
Even if Pygmentize 2.2.0's Coq lexer does not yet handle such input, the question remains why Sphinx aborts syntax highlighting altogether rather than keeping the "err" class tags. That may be better than nothing, since the user can override the CSS to suppress the visual display of a pygmentize error condition.
About the good first issue tag: I don't think we have a definitive policy here, so in the end I removed it, as I don't much like the idea anyway... ;-)
Currently, sphinx uses the raiseonerror filter of pygments. On my machine, pygments treats the unicode characters as an error.
bash-3.2$ cat failed.coq
Definition logeq_both_false {X Y : UU} : ¬X -> ¬Y -> (X <-> Y).
bash-3.2$ pygmentize -l coq -F raiseonerror < failed.coq
*** Error while highlighting:
ErrorToken: \xac
(file "/Users/tkomiya/work/sphinx/.tox/py27/lib/python2.7/site-packages/pygments/filters/__init__.py", line 196, in filter)
*** If this is a bug you want to report, please rerun with -v.
Definition logeq_both_false {X Y : UU} :
bash-3.2$ cat suceeded.coq
Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).
bash-3.2$ pygmentize -l coq -F raiseonerror < suceeded.coq
Definition logeq_both_false {X Y : UU} : X -> Y -> (X <-> Y).
Oh, I had missed the -F raiseonerror pygmentize option. If Sphinx has no way to instruct Pygments to distinguish an error caused by a Unicode character from other kinds of errors, then prospects are bleak here.
I don't know why pygments raises an error, because I don't know the coq language. Anyway, current Sphinx always uses the raiseonerror mode of pygments, so it would be nice to make this setting configurable through an option.
The code is valid coq, as it is accepted by the compiler. Does that mean this should additionally be reported to pygments?
@abooij it does look as being primarily a Pygments problem.
@jfbu, why do you say that this is "primarily" a pygments problem? After all, pygments' own output is usable, although indeed not perfect.
@abooij 1) the "raiseonerror" Pygments filter does not seem to offer much granularity: it raises an exception whenever the lexer generates an error token, and 2) if Unicode characters are OK in Coq input (I don't know), then why does the Coq lexer generate an error token? Unless something is wrong in the way Sphinx calls Pygments, this puts most of the problem on the Pygments side.
@jfbu Well indeed there is a problem on the pygments side. However, I highly doubt pygments will ever be able to get Coq syntax (or any other moderately complicated modern syntax, for that matter) completely right. In those cases, it would be preferable if sphinx generated something workable, rather than giving up completely as soon as some subtle aspect of a language's syntax is used.
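The "something workable" could be sketched with Pygments' own filter API: instead of raising on error tokens, downgrade them to plain text so every character survives with ordinary styling. The demote_errors filter below is a hypothetical illustration, not something Sphinx or Pygments ships:

```python
from pygments import highlight
from pygments.filter import simplefilter
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.token import Error, Text

@simplefilter
def demote_errors(self, lexer, stream, options):
    # Re-tag each Error token as plain Text, keeping its value, so that
    # characters the lexer does not understand are rendered unstyled
    # instead of aborting the highlight (or getting the red "err" box).
    for ttype, value in stream:
        yield (Text if ttype is Error else ttype), value

lexer = get_lexer_by_name('coq')
lexer.add_filter(demote_errors())
html = highlight('Definition f : ¬X -> X.\n', lexer, HtmlFormatter())
```

With this filter the ¬ characters survive in the output as unstyled text, while the rest of the line keeps its normal highlighting.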
(Unicode is OK in coq and widely used.)
I think @tk0miya concurred that an option could be added to Sphinx to let it not use the raiseonerror Pygments filter. But then the user still has to use CSS (for html) or a custom redefinition of the \PYG@tok@err macro in the LaTeX preamble to decide what to do with the error highlighting itself.
Issue #4249 explains that Pygments' LaTeX formatter has a bug in this respect: it uses a \strut inside an \fcolorbox to highlight error tokens, which increases the total height+depth beyond the normal baseline (the character width is also increased). That issue currently remains invisible because Sphinx renounces using the Pygments output whenever the lexer reports error tokens.