Weasyprint: skip_first_whitespace IndexError

Created on 1 Feb 2019 · 12Comments · Source: Kozea/WeasyPrint

Hi there, thanks for creating WeasyPrint, it does a really good job!
I've been using it with CKEditor which generates really ugly HTML (specially when someone pastes a Word document into it).
It worked fine until recently where I hit a bug, it seems the combination of style + characters in this HTML breaks it, I stripped it down all I could (original HTML was huge with lot of attributes).

Seems to be a corner case in skip_first_whitespace.
I've created a minimal test (style + html) which fails, if I change some letters/remove style it works, it seems to be an edge combination.
I'm not sure how to properly fix it, I modified the code to keep rendering (and seems fine so far) by catching the IndexError exception.

This a a test case to reproduce the bug
https://github.com/knixeur/WeasyPrint/commit/64b66f909ce9273388420e6c49c524250f57beb6

@assert_no_logs
def test_page_and_linebox_breaking_out_of_range():
    page_content = (
    '''
    <style>
    @page :right { margin-left: 4cm; }
    @page :left { margin-right: 4cm; }
    @page { size: legal; }
    </style>
    <p><span><span>'''
    + ('&nbsp;' * 42) + ''' </span>*</span><span>* ********* ******* *** ****'''
    + ''' ** ** ******* ***aaaaaaaa aaaaaaaa *** ** aallaala a *** aaalaaaaaa'''
    + '''aaaa aa Juplloia Naalaaai lAo. *** IAIAIAIAIA, aaaaaaa *** Ao. 66/99'''
    + ''') ** ** ******* *** <b><i>l</i></b></span><b><i><span>aa aaaaaaaaaa '''
    + '''aaaaaaaaaa <u>aa aaaaaaaaaaaaaaaaaa</u> aaaaaaa aaaaaa aa aaaaaaa aa'''
    + '''aaaaa aa lollo</span></i></b></p>
    ''')
    # It works with render_pages/FakeHTML, I guess it's because of the lighter
    # stylesheet
    # render_pages(page_content)
    # FakeHTML(string=page_content).render()
    from ... import HTML
    HTML(string=page_content).render()

The "fix"
https://github.com/knixeur/WeasyPrint/commit/340a8c7a4977792a13773233482003f941514f97

------------------------ weasyprint/layout/inlines.py -------------------------
index 240330b0..8a287b01 100644
@@ -204,5 +204,8 @@ def skip_first_whitespace(box, skip_stack):
         if index == 0 and not box.children:
             return None
-        result = skip_first_whitespace(box.children[index], next_skip_stack)
+        try:
+            result = skip_first_whitespace(box.children[index], next_skip_stack)
+        except IndexError:
+            return None
         if result == 'continue':
             index += 1

Stack trace of the test when ran against master

weasyprint/layout/inlines.py:56: in iter_line_boxes
    device_size, absolute_boxes, fixed_boxes, first_letter_style)
weasyprint/layout/inlines.py:73: in get_next_linebox
    skip_stack = skip_first_whitespace(linebox, skip_stack)
weasyprint/layout/inlines.py:206: in skip_first_whitespace
    result = skip_first_whitespace(box.children[index], next_skip_stack)
weasyprint/layout/inlines.py:206: in skip_first_whitespace
    result = skip_first_whitespace(box.children[index], next_skip_stack)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

box = <InlineBox b>, skip_stack = (113, None)

    def skip_first_whitespace(box, skip_stack):
        """Return the ``skip_stack`` to start just after the remove spaces
        at the beginning of the line.

        See http://www.w3.org/TR/CSS21/text.html#white-space-model
        """
        if skip_stack is None:
            index = 0
            next_skip_stack = None
        else:
            index, next_skip_stack = skip_stack

        if isinstance(box, boxes.TextBox):
            assert next_skip_stack is None
            white_space = box.style['white_space']
            length = len(box.text)
            if index == length:
                # Starting a the end of the TextBox, no text to see: Continue
                return 'continue'
            if white_space in ('normal', 'nowrap', 'pre-line'):
                while index < length and box.text[index] == ' ':
                    index += 1
            return (index, None) if index else None

        if isinstance(box, (boxes.LineBox, boxes.InlineBox)):
            if index == 0 and not box.children:
                return None
>           result = skip_first_whitespace(box.children[index], next_skip_stack)
E           IndexError: list index out of range

weasyprint/layout/inlines.py:206: IndexError

Let me know if I can help you in any way, I tried to follow the code to find the real cause but couldn't and have to keep going on other stuff.

Edit: fixed formatting
Edit2: inlined test and "fix"

crash

Source

knixeur

Most helpful comment

Thanks @Tontyna and @liZe !

knixeur on 18 Mar 2019

👍2

All 12 comments

Looks like in skip_first_whitespace the skip_stack and the boxe's children are discoordinated.

When exception happens, the given skip_stack ist (113, None), but the given InlineBox has only 1 child, and consequently in line 206

        result = skip_first_whitespace(box.children[index], next_skip_stack)

the recursive call with box.children[113] fails with IndexError.

The concerned InlineBox is the l. Don't know who and where 113 children could be established in the skip_stack...

BTW: 113 is a prime number :grimacing:

Tontyna on 1 Feb 2019

Lol, IIRC the real HTML crashed with (107, None) :laughing:

Edit: Confirmed, it's _the prime number bug_!

(Pdb) skip_stack
(107, None)

knixeur on 1 Feb 2019

Yep, let's call it the prime number bug

Tontyna on 1 Feb 2019

Another problem, probably related:

<p>*<span>****************************************** *** **** ** ** ******* *********** ************************************************************************* <b>l</b></span><b>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</b></p>

liZe on 1 Mar 2019

Encircling the issue reveals: The prime number isn't meant to be the index of a child box, but the index of the letter (space) in a TextBox's text where the overlong text should be/has been broken.

After splitting all the text snippets from the loooong text into separate LineBoxes, the resume_at aka skip_stack returned by split_inline_box() manages to point at the box after the TextBox, wich in the example given by @knixeur happens to be the bold <InlineBox b>.
But somehow/sadly it leaves the last entry of the skip_stack still pointing to the letter where the last text splitting happened. This last entry, containing (<textbreakpos>, None), should, of course be replaced with None. Dont (yet) know why it isn't.

Subsequently skip_first_whitespace() tries to accesses the inexistent child № textbreakpos of the InlineBox and raises IndexError.

Painstaking preparation is required to trigger this erroneous skip_stack -- particular nesting of InlineBoxes and TextBoxes in combination with the right (wrong?) text content, page width and font size. The TextBox, its text extending close to the right margin, and a following InlineBox () being enclosed in a InlineBox (), immediately followed by non-space (x) seems to be crucial.

This fine-tuned snippet

<p><span>
********************************* *******
********* *************** *** 
********* *************** *** 
********************************** ** ******* *** <b>l</b></span>x</p>

crashes with prime number 73 :smile:

@liZe Your snippet is another issue. It just doesn't break where it ought to. The skip_stack is ok, throughout. No leftovers from TextBoxes.

Tontyna on 15 Mar 2019

Found the bug but cannot fix it. Happens in the most ugly most dirty most abominable part of the inline layout code in split_inline_box:

https://github.com/Kozea/WeasyPrint/blob/fc089f570001cc1c3879e1e7e6552c396868d5f5/weasyprint/layout/inlines.py#L854-L859

BTW: The broken_child-detection was introduced to fix #580.

Under prime number bug conditions broken_child evaluates to true, the subsequent reconstruction of the skip_stack ("adding skip stacks is a bit complicated") produces the invalid stack that finally raises the IndexError.

By forcibly setting broken_child=False and skipping the reconstruction of the stack everything is fine. Nah, not everything, of course, #580 is broken again.

So obviously the algorithm to detect broken_child is incomplete/wrong/lacking something. But it's beyond my skills to tweak it.

This minimal snippet crashes with prime number 1:

<p><span>
**************************************************************
***********
<b>fits</b></span>xxx</p>

And yes, the three whitespaces (here: linebreaks) are vital to the IndexError. Ditto vital: no whitespace in front of the xxx.

Tontyna on 17 Mar 2019

Recipe to reproduce the IndexError:

<p><span title="span required to trigger the bug">
*breakable*text* *followed*by*withespace*
<b>+fit+</b></span>CrossingTheMarginWithoutBreakRaisesIndexError</p>

Though @liZe's snippet doesnt crash with IndexError, it's related -- of course! he was right!
It reads like this:

<p><span title="span required to trigger the bug">
*non*breaking*text**followed*by*withespace*
<b>+fit+</b></span>CrossingTheMarginWithoutBreakExtendsTheLineUntil next whitespace goes on the next line</p>

The crash is prevented because
https://github.com/Kozea/WeasyPrint/blob/fc089f570001cc1c3879e1e7e6552c396868d5f5/weasyprint/layout/inlines.py#L820
returns False for the *non*breaking*text**followed*by*withespace*, while for the *breakable*text* *followed*by*withespace* we get True and enter "The dirty solution".

Tontyna on 17 Mar 2019

BTW: Taking "The dirty solution" path (i.e. calling split_inline_level one more time), but avoiding the recreation of the skip_stack, seems to fix both, the IndexError and the CrossingTheMarginShouldWrap bug.
At least if the CrossingTheMarginWithoutBreak string doesnt cross the margin of the next line, too.

Tontyna on 17 Mar 2019

👍2

Happens in the most ugly most dirty most abominable part of the inline layout code in split_inline_box

At least it's documented now, it saves me hours each time I have to understand it again :wink:.

So obviously the algorithm to detect broken_child is incomplete/wrong/lacking something.

Exactly. We had to check that the split children is the same, not only at level 1 but for all nested children until we find the same text box. It's fixed now.

Though @liZe's snippet doesnt crash with IndexError, it's related -- of course! he was right!

8724bc3 fixes this bug too, but … there's another bug in can_break_inside that is unable to detect a line break in aaa bbb. I've added a failing test in e7fd37b.

liZe on 18 Mar 2019

👍2

Thanks @Tontyna and @liZe !

knixeur on 18 Mar 2019

👍2

bug in can_break_inside that is unable to detect a line break in aaa bbb

That's because can_break_text() from text.py returns False for aaa, i.e. a string that ends with a space.

That's because can_break_text() checks for the PangoLogAttr's is_line_break attribute. And indeed, a trailing whitespace isn't a line break. In fact: whitespace is never a line break.
When I get it right, pango.pango_get_log_attrs() sets this attribute on letters that could become the first letter of a new line, that is, the letter that follows a possible line-breaking (whitespace or punctuation or...) letter.

It's a pity that we must take box.style['lang'] and box.style['white_space'] into account. Otherwise we could simply collect the text from the child's TextBoxes, including all the intermediate whitespaces, and pass that as a single string to can_break_text...

Tontyna on 18 Mar 2019

Yes, I've learned a lot of things fixing #301, and Unicode is really fascinating. I'm glad to have Pango :smile:.

We already have (almost) the whole logic in WeasyPrint: split_inline_level is able to find correctly if we can split a line, and it takes care of line breaks with nested tags. We should use it instead of relying on can_break_inside (that's small and fast but definitely buggy).

liZe on 18 Mar 2019

Was this page helpful?

0 / 5 - 0 ratings