Weasyprint: Allow page breaks in floats, absolute blocks, inline-blocks, table-cells

Created on 14 Feb 2013 · 28 Comments · Source: Kozea/WeasyPrint

Floated elements that don't fit on the current page simply fall off the bottom, rather than being placed on the next page.

Here's a handy long list of floated elements to demonstrate the problem: http://www.stripey.com/demo/weasyprint/float_off_bottom.html

Look at it in Firefox and do ‘Print Preview’. You should see that there's a page break, with the list being continued on page 2. Similarly if you print from Chromium.

But WeasyPrint generates this file, where the elements simply run off the bottom of the first page: http://www.stripey.com/demo/weasyprint/float_off_bottom.pdf

bug

Smylers

👍4

All 28 comments

Yes, this is a known limitation: no page breaks are supported inside floats, absolute positioning, or table cells. Unfortunately right now I don’t have a better answer than “avoid using floats that way”.

I’d be happy to help anyone who wants to fix this, but this is a non-trivial change in the layout code. Otherwise this is something to be fixed eventually, but I don’t know when I’ll get to it.

SimonSapin on 14 Feb 2013

Thanks. From your description I'm not sure whether this is the known limitation or not.

In this case I'm not trying to have page breaks _inside_ a floated element, but _between_ floated elements. Each li is floated separately. My apologies for not making that clearer in the initial report.

Smylers on 14 Feb 2013

As reported in #375, we have the same problem with consecutive absolute/relative blocks.

liZe on 28 Oct 2016

I have just started using WeasyPrint, and I'm already a big fan. However, I have also quickly run into the float/break issue -- my users want Bootstrap and floated columns, and don't like what happens in the PDF document!

Can @SimonSapin or anyone else comment on the refactoring that would be necessary to fix this wartish problem? I haven't perused your codebase yet, but I know Python very well; so, I'm looking for high-level overview of the current layout model/algorithm and why it gets tripped up trying to put breaks in floats, and what would have to be changed.

Thanks,
-Hugh

hughsw on 27 Apr 2017

301 @liZe

SimonSapin on 27 Apr 2017

Let's go! I'll skip some details and lie a little bit to avoid useless complexity.

Web pages have mainly been created to be displayed on rectangles whose width is fixed and whose height is automatically calculated according to the content. That's what a "normal" browser does. But the problem is a bit different when you want to print these web pages: the height is fixed too, and you need to cut the content between different pages.

CSS defines how the layout must be done, how blocks and texts are displayed. "Normal" blocks are put one below the other and "normal" texts are broken between multiple lines put one below the other. The way the "normal" content is displayed is called the normal flow.

CSS gives the possibility to remove blocks from the normal flow of the page and make them behave in a different way. These blocks sometimes create their own flow, creating nested or parallel flows in the page. That's where it's becoming a bit hard.

When CSS 2 was written, floats and absolute/relative blocks (and somehow tables) were (almost) the only blocks creating parallel flows, and no one really defined how these parallel flows had to be broken between pages. That's why WeasyPrint's layout has only one flow that can be correctly broken, and the blocks that are outside this flow are seen as atomic blocks going below the bottom of the page if needed.

But now, many CSS specifications have added many ways to create strange flows, such as columns, regions, flexbox and grid. It was time to define how parallel and nested flows had to be broken between pages. It's now done in the fragmentation module. It's not clearly defined but it's much better than what we had in CSS 2.

Bad news: it was not written when we started WeasyPrint.

Really bad news: it's really different from what we have in WeasyPrint.

It's probably not that difficult to implement the parts of the fragmentation module that are needed to fix this issue (well, for really simple cases). But it will need to slightly change many functions and modules in a single atomic commit that will be huge. We can imagine that the work needed is something like #291: long, tiring and painful. But not impossible.

liZe on 28 Apr 2017

😄3 👍1

OK. Where should I be looking in the code to learn about the following (beyond the peephole insight of #291):

WeasyPrint's layout model and algorithm
Where to make fixes for this issue
What would have to change architecturally to address the fragmentation module spec

Thanks!

hughsw on 28 Apr 2017

WeasyPrint's layout model and algorithm

You'll find all the code you need in the layout folder. The layout.pages module has a make_all_pages function, which calls the make_page function, which calls the block_level_layout function, and so on.

Where to make fixes for this issue
What would have to change architecturally to address the fragmentation module spec

Nested flows (as defined by the fragmentation CSS module) are pretty well supported for block-level and inline-level boxes, using a variable called resume_at that keeps a kind of pointer to where the rendering is (the "current" position). resume_at contains nested tuples representing the nested boxes; you can see how it works in, for example, the block_container_layout function (in layout.blocks).
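To make the idea of a nested resume pointer concrete, here is a minimal toy sketch. It is not WeasyPrint's code: the box model (strings as one-line leaves, lists as containers) and the names layout and paginate are assumptions made up for illustration. It only shows how a nested (child_index, child_resume_at) tuple can record where a single flow stopped at the bottom of a page.

```python
def layout(box, room, resume_at=None):
    """Lay out `box` into at most `room` lines (toy model, room >= 1).

    `box` is a str (one unbreakable line) or a list of child boxes.
    Returns (lines, resume_at): resume_at is None when the box is finished,
    otherwise a nested (child_index, child_resume_at) tuple.
    """
    if isinstance(box, str):
        return [box], None
    start, child_resume = resume_at if resume_at is not None else (0, None)
    lines = []
    for index in range(start, len(box)):
        if len(lines) == room:
            # The page is full: restart from this child on the next page.
            return lines, (index, None)
        child_lines, child_resume = layout(
            box[index], room - len(lines), child_resume)
        lines.extend(child_lines)
        if child_resume is not None:
            # The child was cut: remember the position inside it.
            return lines, (index, child_resume)
    return lines, None


def paginate(root, page_height):
    """Call layout() page after page, feeding back the resume pointer."""
    if page_height < 1:
        raise ValueError("page_height must be at least 1")
    pages, resume_at = [], None
    while True:
        lines, resume_at = layout(root, page_height, resume_at)
        pages.append(lines)
        if resume_at is None:
            return pages


root = ["a", ["b", "c", "d"], "e"]
print(layout(root, 2))    # (['a', 'b'], (1, (1, None)))
print(paginate(root, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```

The pointer (1, (1, None)) reads as "resume in child 1 of the root, at child 1 of that container". A single pointer like this can only describe one position, which is why it is not enough for parallel flows.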

We need to add the support of parallel flows. Instead of one pointer pointing to one position in the flow, we need multiple pointers pointing to the "current" positions in the parallel flows. I imagine that resume_at can be changed into resume_at_list, containing one or more resume_at pointers.

To fix this issue, we basically need:

to change resume_at into resume_at_list almost everywhere, as rendering a box in the flow can return parallel positions where the parallel flows have reached the end of the page (one flow for itself and one for each child creating parallel flows such as floats, table cells, etc.; the list is in the fragmentation module),
to make floats, table cells, etc. take care of the bottom of the page and return their resume_at_list, instead of assuming that they have no limit for their vertical position.
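The two steps above can be sketched with a toy model. This is only an illustration under strong assumptions, not WeasyPrint's implementation: each parallel flow is a plain list of one-line strings, a resume position is an integer index, and the function names layout_flow, layout_page and paginate_parallel are hypothetical.

```python
def layout_flow(lines, room, resume_at):
    """Place up to `room` lines of one flow, starting at index `resume_at`.

    Returns (placed_lines, new_resume_at); new_resume_at is None when the
    flow is finished.
    """
    end = min(len(lines), resume_at + room)
    return lines[resume_at:end], (end if end < len(lines) else None)


def layout_page(flows, page_height, resume_at_list):
    """Lay out every parallel flow against the same page bottom."""
    page, next_resume_at_list = [], []
    for lines, resume_at in zip(flows, resume_at_list):
        if resume_at is None:
            # This flow already ended on an earlier page.
            page.append([])
            next_resume_at_list.append(None)
            continue
        placed, resume_at = layout_flow(lines, page_height, resume_at)
        page.append(placed)
        next_resume_at_list.append(resume_at)
    return page, next_resume_at_list


def paginate_parallel(flows, page_height):
    """Build pages until every pointer in resume_at_list is None."""
    if page_height < 1:
        raise ValueError("page_height must be at least 1")
    resume_at_list = [0] * len(flows)
    pages = []
    while any(resume_at is not None for resume_at in resume_at_list):
        page, resume_at_list = layout_page(flows, page_height, resume_at_list)
        pages.append(page)
    return pages


float_lines = ["float float", "float float", "float"]
flow_lines = ["flow flow", "flow flow", "flow"]
for number, page in enumerate(
        paginate_parallel([float_lines, flow_lines], 2), start=1):
    print("Page", number, page)
```

With a page height of two lines, the float and the normal flow each place two lines on page 1 and one line on page 2, because every flow stops at the page bottom and hands back its own position instead of overflowing.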

That's all :smile:! I think that not everything is correctly defined in the spec, so we'll have to make some stupid choices for stupid cases (how do you render floats whose top border is taller than the page?), but the "normal" use cases should be quite well described and easy (and long, and painful) to implement.

If you need anything, I'll be really happy to help!

liZe on 28 Apr 2017

👍1

Thanks. That's just the kind of overview I was looking for.

One last thing: test-driven development (okay, two last things):

What's the quickest way to run tests during development work?
Do you have instances of HTML/CSS tests for parallel flows, that, when passing, will indicate that the work is finished? I of course have the instance that got me here, but do you know of a reference test set?

hughsw on 28 Apr 2017

FYI, my habit is to do minor refactoring while I'm working to understand existing logic. So you can expect some PRs along those lines.

Also, I'm completely new to CSS implementation work ! ;-) However, my career has largely been developing scientific software, so I'm at home with specs and deep algorithms.

The code appears to have a good number of pointers to key CSS specs. However, if there are some spec documents that are so basic that you wouldn't mention them in code comments, they might actually be useful for me! So, I would appreciate pointers to key algorithmic starting points for CSS.

Thanks.

hughsw on 28 Apr 2017

What's the quickest way to run tests during development work?

./setup.py test (launch tests and check coding style).

Do you have instances of HTML/CSS tests for parallel flows.

<style>
  @page {
    font-family: monospace;
    height: 2.5em;
    line-height: 1em;
    margin: 0;
    width: 10em;
  }
  body {
    margin: 0;
  }
  div {
    background: red;
    float: left; 
    width: 50%; 
  }
</style>

<body>
  <div>
    float float float float float
  </div>
  flow flow flow flow flow
</body>

You need to get something like:

Page 1
+-------------------------+
| float float | flow flow |
| float float | flow flow |
+-------------------------+
Page 2
+-------------------------+
| float       | flow      |
|-------------+           |
+-------------------------+

However, my career has largely been developing scientific software, so I'm at home with specs and deep algorithms.

You'll need these skills!

So, I would appreciate pointers to key algorithmic starting points for CSS.

There's a very useful chapter in the documentation. In the CSS spec, the best starting point is probably the presentation of the normal flow and the implementation of 9.4.1 and 9.4.2 in layout.blocks and layout.inlines.

Good luck!

liZe on 28 Apr 2017

OK, you threw me in the deep end of CSS spec, and I'm floundering, but progressing, through prose like this:

Except for table boxes, which are described in a later chapter, and replaced elements, a block-level box is also a block container box. A block container box either contains only block-level boxes or establishes an inline formatting context and thus contains only inline-level boxes. Not all block container boxes are block-level boxes: non-replaced inline blocks and non-replaced table cells are block containers but not block-level boxes. Block-level boxes that are also block containers are called block boxes.

I don't yet have a solid mental-model of what it takes to do all the layout given multiple flows and page breaks, but I'm working on it, and the WeasyPrint code-base is very approachable and the focus on resume_at helps. To keep getting my hands dirty I intend to add a terse detail string to each assert, so at least I'll know e.g. box sizes when I'm breaking things...

hughsw on 4 May 2017

We solved the table split problem by placing <div style="clear:both;"></div> before the table.

polonat on 5 Oct 2017

❤1 👍1

This problem is really annoying, but I found a way to work around it.

The key point is to split one <tr> into several <tr> elements, e.g.:

<tr>
  <td>col1</td>
  <td>long lines1
long lines2
long lines3
long lines4
  </td>
 </tr>

will be changed to

<tr>
  <td class="top_border"></td>
  <td class="top_border">long lines1</td>
</tr>
<tr>
  <td class="no_border">col1</td>
  <td class="no_border">long lines2</td>
</tr>
<tr>
  <td class="no_border"></td>
  <td class="no_border">long lines3</td>
</tr>
<tr>
  <td class="no_border"></td>
  <td class="no_border">long lines4</td>
</tr>

css

        table tr .no_border {
            border-left: 1px solid #000000;
            border-right: 1px solid #000000;
            border-top: 0;
            border-bottom: 0;
        }

        table tr .top_border {
            border-left: 1px solid #000000;
            border-right: 1px solid #000000;
            border-top: 1px solid #000000;
            border-bottom: 0;
        }

This is just some sample code to explain the main idea; you need to change it to fit your situation. I hope this helps someone out.
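If the table is generated from data, the row splitting can be automated before the HTML is handed to WeasyPrint. The following is a minimal sketch, assuming a two-column row and the top_border / no_border classes from the CSS above; the function name split_row and its label_row parameter are hypothetical, and the HTML escaping is an addition that is not in the sample above.

```python
from html import escape


def split_row(label, text, label_row=1):
    """Turn one (label, multi-line text) row into one <tr> per text line.

    The first generated row gets the "top_border" class, the others get
    "no_border". The label is put in row `label_row` (second row by
    default, as in the sample above), clamped to the last row.
    """
    lines = text.splitlines() or [""]
    label_row = min(label_row, len(lines) - 1)
    rows = []
    for index, line in enumerate(lines):
        css_class = "top_border" if index == 0 else "no_border"
        first_cell = escape(label) if index == label_row else ""
        rows.append(
            f'<tr><td class="{css_class}">{first_cell}</td>'
            f'<td class="{css_class}">{escape(line)}</td></tr>')
    return "\n".join(rows)


print(split_row("col1", "long lines1\nlong lines2\nlong lines3\nlong lines4"))
```

Each generated row is short enough to fit on a page, so the break can fall between rows instead of inside a cell. Note that this splits on explicit newlines only; text that wraps automatically inside a cell would still need to be broken up by other means.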

wd on 29 May 2018

This is just some sample code to explain the main idea; you need to change it to fit your situation. I hope this helps someone out.

wd, thank you for your idea, but (at least in our case) it is not feasible.

Any news about this bug? It's critical if we intend to put this into production ... :(

RafaelLinux on 26 Nov 2018

Unfortunately, after getting beautiful results with WeasyPrint, we had a deadline for our project upgrade, which needs to create PDF files, and we had to choose a solution that didn't lose any text when generating them. This bug is a big problem in our case, so we finally had to adopt the mPDF library instead. It has other side issues, but the final output contains the same text as the original HTML page.

Anyway, thank you all for your help and understanding. I will visit this thread from time to time to see whether it has been closed, and then we will at last use WeasyPrint as our solution.

You are doing a great job!!!!! ;)

RafaelLinux on 28 Nov 2018

I've got the same problem, but with a list:

<ol>
  <li>a</li>
...
  <li>z</li>
</ol>

If some <li> elements should appear on page 2, they are not shown in the PDF, but the content after that code does appear on page 2.

Any help would be appreciated.

budimm on 8 Mar 2019

If some <li> elements should appear on page 2, they are not shown in the PDF

Page breaks are allowed in lists. You probably get this problem because your list has a position, display or float property different from the default values.

liZe on 8 Mar 2019

Yeah, it was solved a moment ago, sorry for the late reply.
I had class="input-group" in my Bootstrap layout. I removed it,
and now my ol and table work well.
Thanks for the reply. :+1:

budimm on 8 Mar 2019

So we ran into this issue as well.

@liZe I see your description above about how to solve this issue and you suggest turning resume_at into an array. This seems complicated and a bit error prone. I'm wondering if there isn't a simpler way to do this, using Python generators/coroutines. My idea is that each element is only responsible for rendering itself, yielding (unsplittable) layout blocks, with containers combining them. This way you don't have to track any kind of resume_at value. The algorithm for rendering a table over multiple pages would look something like (pseudocode!)

# Pseudocode: start_render() and LayoutContainer are hypothetical.
# start_render() is assumed to return an already primed generator.
def render_table(table, remaining):
    # remaining = remaining height in page
    for row in table:
        # Init cells
        iters = {}
        for cell in row:
            iters[cell] = cell.start_render(width=cell.width)
        # While any cell still has content
        while any(iters[c] is not None for c in row):
            this_row = LayoutContainer()
            for cell in row:
                if iters[cell] is None:
                    continue
                # Collect blocks from this cell until space is full
                remain_cell = remaining
                while remain_cell > 0:
                    # Calls generator to return a layout block
                    block = iters[cell].send(remain_cell)
                    if not block:  # This cell done
                        iters[cell] = None
                        break
                    this_row.add(block)  # Result block
                    remain_cell -= block.height
            # Note: if we filled a page, then remaining will become the height of the new page
            remaining = (yield this_row)

As you can see, this can split a table cell over multiple pages, without any of the contents of the cells actually being aware that they are being split over multiple pages. There are of course details to work out: if remaining is too small for your widget, who is responsible for adding the spacer? Also, floats need to be rendered first, with other things then rendered around them (possibly by passing a "current page" object around that widgets can inspect to see what they need to wrap around). This rendering method would allow you to render the float until the end of the current page so you know what it fills, then render the rest of the page. Then you can continue rendering the rest of the float on the next page.

I've not really looked at the code so I'm not sure if there is some reason why this couldn't work, but this does seem much simpler than tracking resume locations yourself, since the Python coroutine stacks hold the state for you implicitly.
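A runnable toy version of this coroutine idea might look as follows. It is a simplified sketch, not a proposal for WeasyPrint's actual code: cells are plain lists of one-line strings, every page has the same fixed height (the pseudocode above receives the new height through yield instead), and the names render_cell and render_row are made up for illustration.

```python
def render_cell(lines):
    """Coroutine for one cell: the caller sends the room left on the page
    and gets back the next chunk of lines that fits."""
    position = 0
    remaining = yield  # priming point: wait for the first request
    while position < len(lines):
        chunk = lines[position:position + remaining]
        position += len(chunk)
        remaining = yield chunk


def render_row(cells, page_height):
    """Yield one fragment (one chunk per cell) for each page the row spans."""
    if page_height < 1:
        raise ValueError("page_height must be at least 1")
    iters = []
    for lines in cells:
        cell_iter = render_cell(lines)
        next(cell_iter)  # prime the coroutine so that it accepts send()
        iters.append(cell_iter)
    while True:
        fragment = []
        for index, cell_iter in enumerate(iters):
            chunk = []
            if cell_iter is not None:
                try:
                    chunk = cell_iter.send(page_height)
                except StopIteration:
                    iters[index] = None  # this cell has no content left
            fragment.append(chunk)
        if not any(fragment):
            return  # every cell is finished
        yield fragment


cells = [["a1", "a2", "a3"], ["b1"]]
for fragment in render_row(cells, page_height=2):
    print(fragment)
# [['a1', 'a2'], ['b1']]
# [['a3'], []]
```

The position inside each cell lives in the suspended generator frame, so no explicit resume pointer is passed around. The trade-off is that a suspended generator cannot be copied or rewound, which may matter if the layout has to retry a page with different break choices.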

kleptog on 10 May 2019

Is anyone currently working on a PR for this? I see that multiple people have begun looking into it. As I'm looking at it now, I don't want to duplicate someone else's effort.

pytrumpeter on 21 Jul 2019

For expedience sake I have moved on. I'm using Puppeteer in a headless Chrome browser to turn HTML into PDF. And, I'm using Mozilla's pdf.js library to analyze PDFs. The heavyweight browser folks have solved all the common problems, and they keep up with the evolving specs... Yes, this means I'm using Typescript/Javascript a lot these days, and, I'm enjoying functional programming far more than I ever expected.

hughsw on 22 Jul 2019

I moved to another tool too, precisely because of this bug. No browser-based HTML-to-PDF tool comes close to the functionality of WeasyPrint. I'll use it again when this bug is solved.

RafaelLinux on 22 Jul 2019

Is anyone currently working on a PR for this? I see that multiple people have begun looking into it. As I'm looking at it now, I don't want to duplicate someone else's effort.

I don't think anyone is working on this right now. If you need help, please ask, I'll be happy to answer!

liZe on 22 Jul 2019

pls fix that problem. The workaround with clear: both is not working for me.
I have a table with a td that's larger than one page.

cymn on 25 Jul 2019

pls fix that problem.

Please, be kind. Everybody would like this issue to be closed, but there's no simple solution.

liZe on 25 Jul 2019

👍6

Any news on this thread, what about the support of tables?

I'm currently using wkHtmlToPdf and also have this issue with tables. Its current behavior is to cut anywhere (which is fine for me), but it also allows cutting through the middle of a line of text, which makes the library unusable for me.

Do we have a patch for this lib for my desired behavior?