Nextflow: The execution report cannot handle big workflows

Created on 7 Dec 2017 · 31Comments · Source: nextflow-io/nextflow

Currently all tasks metadata and metrics are embedded as a JSON object in the execution report HTML file.

For big workflow executions ie. spawning 50K tasks or more, the resulting become to big (>100MB) and browser are not able to handle it.

The goal of this feature is to implement a better strategy to handle a large number of tasks.

/cc @ewels we need to brainstorm a bit about this at some point.

Source

pditommaso

Most helpful comment

Wait, it seems it's possible to render a box plot just providing the Y data. Look for an example here. I guess this would give us much more control, including a way to fix #504.

What do you think?

pditommaso on 7 Dec 2017

😄1 👍1

All 31 comments

Hah, this is a familiar problem.. :) Can perhaps drop the table and generate the plots at run time to embed in the HTML?

ewels on 7 Dec 2017

It could be an option, but it should still parse the big JSON payload to compute the values displayed in the chart, shouldn't it? Maybe we could compute those values server side. Does it make sense ?

pditommaso on 7 Dec 2017

Yes exactly, that's what I meant by run time. So in MultiQC I have a max number of samples and after that I generate flat image plots, which scale to infinity without increasing file size. The Plotly library that we're using for the report has built-in options for exporting to flat images like this, at least in the Python version: https://plot.ly/python/static-image-export/

Phil

ewels on 7 Dec 2017

OK, but wouldn't be even easier moving this code NF side and leave the rendering on the client?

      for (var i = 0; i < window.data_byprocess[pname].length; i ++) {
        rc[i] = xround(parseInt(window.data_byprocess[pname][i]['%cpu']));
        pc[i] = xround((parseInt(window.data_byprocess[pname][i]['%cpu']) / (parseInt(window.data_byprocess[pname][i]['cpus']) * 100)) * 100);
        rm[i] = xround(parseInt(window.data_byprocess[pname][i]['vmem']) / 1000000000);
        pm[i] = xround((parseInt(window.data_byprocess[pname][i]['vmem']) / parseInt(window.data_byprocess[pname][i]['memory'])) * 100)
        rt[i] = xround(moment.duration( parseInt(window.data_byprocess[pname][i]['realtime']) ).asMinutes());
        pt[i] = xround((parseInt(window.data_byprocess[pname][i]['realtime']) / parseInt(window.data_byprocess[pname][i]['time'])) * 100)
        rd[i] = xround((parseInt(window.data_byprocess[pname][i]['read_bytes']) + parseInt(window.data_byprocess[pname][i]['write_bytes'])) / 1000000000);
        if (parseInt(window.data_byprocess[pname][i]['time']) > 0){ time_alloc_hasdata = true; }
        if (rd[i] > 0){ readwrite_hasdata = true; }
      }

pditommaso on 7 Dec 2017

Yes, if the raw plot data isn't too big I guess. If we have 10s or 100s thousand of tasks with four numbers each that could still be quite a bit of data and browser processor load. But yes maybe let's try that first!

Phil

ewels on 7 Dec 2017

Wait, basically the above code creates a separate series for each process name, but it still need to hold in memory and parse all tasks.

I was thinking it was calculating the min,max,median,etc values.

Is it not possible to give Plotly just the final values to render?

pditommaso on 7 Dec 2017

Apparently not 😞 https://github.com/plotly/plotly.js/issues/1059

ewels on 7 Dec 2017

Too bad. This only leave the NF side image rendering, but I can't find the Java client for Plotly. Maybe it's possible to use the JS one with the Java embedded JavaScript engine ..

pditommaso on 7 Dec 2017

😰

It might be worth trying with just the numbers first. I imagine that 99% of the filesize will be the other JSON stuff.

ewels on 7 Dec 2017

Not so sure. I will give a try to see if it's possible to render a chart with Nashorn.

pditommaso on 7 Dec 2017

Wait, it seems it's possible to render a box plot just providing the Y data. Look for an example here. I guess this would give us much more control, including a way to fix #504.

What do you think?

pditommaso on 7 Dec 2017

😄1 👍1

Brilliant! This is totally cheating but why not??

ewels on 11 Dec 2017

Good. Being so if I implement the computation of [min, q1, median, q3, max] for each process, would you be able to render it? does it sound a good plan?

pditommaso on 11 Dec 2017

Yes, sounds great 👍

I hesitated about what the whiskers show - the min / max or some other metric. Plotly has some documentation on this here: https://help.plot.ly/what-is-a-box-plot/#the-whiskers - I guess the current plots show the min / max? May be worth double checking though to be sure that we're doing the same thing as normal reports.

ewels on 11 Dec 2017

OK, I will check that.

pditommaso on 11 Dec 2017

So, i've double checked this and the whiskers show the min and the max. How would you need the json data structured to render these info?

pditommaso on 11 Jan 2018

Great! Doesn't matter too much what the JSON looks like, I can work around it. At the moment we have window.data['trace'] that contains all of the stats for each task. Could instead have window.data['process'] that has these summary stats for the processes? Then if the former isn't defined the javascript can use the latter instead.

But whatever is easiest for you really - as long as the task JSON isn't printed to the HTML (to avoid the filesize) and the new summaries are then we should be able to adapt the javascript accordingly.

Phil

ewels on 14 Jan 2018

Ok. Let's try to speedup this. Could you provide the javascript snippet to render that charts using a fake json structure? then I can generate it dynamically on NF side.

pditommaso on 14 Jan 2018

Is it not nicer to just print the data into the report as JSON and then use the embedded javascript file that we already have for the javascript code?

Nothing fancy required. From the top of my head:

{
    "process_summary": {
        "time": [
            [ "fastqc", 12, 24, 35, 46, 90 ],
            [ "bwa", 120, 230, 340, 450, 560 ],
        ],
        "cpu": [
            [ "fastqc", 20, 50, 76, 100, 140 ],
        ],
    }
}

Where the arrays are process name, min, lower quartile, median, upper quartile, max. Or whatever.

Once we have this in as JSON in the report (the above printed in the same way as the existing data) I can plug it into the existing plot code pretty easily.

Phil

ewels on 15 Jan 2018

What about a structure like this:

{
   "bwa-mem":{
      "cpu":{
         "min":1000.0,
         "minLabel":"bwa-mem-1",
         "max":11000.0,
         "maxLabel":"bwa-mem-3",
         "q1":3500.0,
         "q2":6000.0,
         "q3":8500.0,
         "mean":6000.0
      },
      "mem":{
         "min":2000.0,
         "minLabel":"bwa-mem-1",
         "max":12000.0,
         "maxLabel":"bwa-mem-3",
         "q1":4500.0,
         "q2":7000.0,
         "q3":9500.0,
         "mean":7000.0
      },
      "time":{
         "min":3000.0,
         "minLabel":"bwa-mem-1",
         "max":13000.0,
         "maxLabel":"bwa-mem-3",
         "q1":5500.0,
         "q2":8000.0,
         "q3":10500.0,
         "mean":8000.0
      },
      "reads":{
         "min":4000.0,
         "minLabel":"bwa-mem-1",
         "max":14000.0,
         "maxLabel":"bwa-mem-3",
         "q1":6500.0,
         "q2":9000.0,
         "q3":11500.0,
         "mean":9000.0
      },
      "writes":{
         "min":5000.0,
         "minLabel":"bwa-mem-1",
         "max":15000.0,
         "maxLabel":"bwa-mem-3",
         "q1":7500.0,
         "q2":10000.0,
         "q3":12500.0,
         "mean":10000.0
      }
   },
   "multiqc":{
      "cpu":{
         "min":16000.0,
         "minLabel":"multiqc-1",
         "max":21000.0,
         "maxLabel":"multiqc-2",
         "q1":17250.0,
         "q2":18500.0,
         "q3":19750.0,
         "mean":18500.0
      },
      "mem":{
         "min":17000.0,
         "minLabel":"multiqc-1",
         "max":22000.0,
         "maxLabel":"multiqc-2",
         "q1":18250.0,
         "q2":19500.0,
         "q3":20750.0,
         "mean":19500.0
      },
      "time":{
         "min":18000.0,
         "minLabel":"multiqc-1",
         "max":23000.0,
         "maxLabel":"multiqc-2",
         "q1":19250.0,
         "q2":20500.0,
         "q3":21750.0,
         "mean":20500.0
      },
      "reads":{
         "min":19000.0,
         "minLabel":"multiqc-1",
         "max":24000.0,
         "maxLabel":"multiqc-2",
         "q1":20250.0,
         "q2":21500.0,
         "q3":22750.0,
         "mean":21500.0
      },
      "writes":{
         "min":20000.0,
         "minLabel":"multiqc-1",
         "max":25000.0,
         "maxLabel":"multiqc-2",
         "q1":21250.0,
         "q2":22500.0,
         "q3":23750.0,
         "mean":22500.0
      }
   }
}

pditommaso on 23 Jan 2018

Perfect! Labels are maybe not really needed, but doesn't hurt.

ewels on 24 Jan 2018

Good. Let me add a few notes:

1) the label are meant to provide an useful information when passing the mouse over the chart #504
2) I've added also the mean which is not reported in the current chart.
3) I've split the storage data into reads and writes with the idea to render them in two separate charts, because I think the current report is not informative enough on disk usage.

pditommaso on 24 Jan 2018

Just pushed this change. Now the JSON payload includes a new summary data object structured as shown in the example in the previous comment.

pditommaso on 25 Jan 2018

I've fixed a couple of issues and added the logging of the summary JSON in the nextflow log.

A note more, there a could corner case in which a series eg. cpu, mem, etc it's empty because the metric are missing. Currently it's reported as a null. For example:

 {
     "foo": {
         "cpu": null,
         "mem": null,
         "time": {
             "mean": 82.25,
             "min": 54.0,
             "q1": 79.5,
             "q2": 90.0,
             "q3": 92.75,
             "max": 95.0,
             "minLabel": "foo (4)",
             "maxLabel": "foo (1)",
             "q1Label": "foo (4)",
             "q2Label": "foo (3)",
             "q3Label": "foo (2)"
         },
         "reads": null,
         "writes": null
     }
 }

pditommaso on 25 Jan 2018

👍1

Any progress here?

pditommaso on 31 Jan 2018

No sorry, haven't touched a keyboard for coding for a couple of weeks now. Next two weeks look packed too, will hopefully have some time after that (or will treat it as an evening / airplane project).

See you Monday! Maybe we can have a mini-hackathon during a coffee break ;)

ewels on 31 Jan 2018

It sounds good!

pditommaso on 31 Jan 2018

@ewels just to prevent any duplicated work, with @emi80 we have almost completed this.

pditommaso on 16 Feb 2018

Ah, you guys are heroes! Sorry for being so slow with this - I had a stab the other day at the airport back from Barcelona but couldn't get the testing environment running again for some reason so moved on to some of the MultiQC backlog instead.

ewels on 16 Feb 2018

OK. We refactored a bit the report to allow the rendering of large number of tasks. Main changes are the following:
1) task table rendering has been optimised to improve performance
2) trace data is retuned only when it contains < 10'000 tasks (default), for a higher number the tasks table is not shown
3) boxplots are rendered using the summary data produced by NF
4) as side effect the distribution is not showed any more (that's really a pity, but i was unable to find a way to render it at least when the trace records are available.

Please review it at your earliest convenience.

pditommaso on 16 Feb 2018

Sorry one more thing, now we have the information of the task name for each metric ie. minLabel, q1Label, etc

However I was unable to show it over the boxplot as it was suggesting #504. Any idea if it's plotty allows that?

pditommaso on 16 Feb 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

JSON output for `nextflow info`

ewels · 6Comments

Error when trying to use grep in Nextflow

Z-Zen · 5Comments

Println operator output is not shown with ansi logging

olgabot · 7Comments

Allow the `storeDir` directive to handle file names with "non-standard" characters

tinyheero · 6Comments

Creation of temp files in the /dev/shm directory can result in a permission error in some Linux installations

mfoll · 6Comments