Currently all tasks metadata and metrics are embedded as a JSON object in the execution report HTML file.
For big workflow executions ie. spawning 50K tasks or more, the resulting become to big (>100MB) and browser are not able to handle it.
The goal of this feature is to implement a better strategy to handle a large number of tasks.
/cc @ewels we need to brainstorm a bit about this at some point.
Hah, this is a familiar problem.. :) Can perhaps drop the table and generate the plots at run time to embed in the HTML?
It could be an option, but it should still parse the big JSON payload to compute the values displayed in the chart, shouldn't it? Maybe we could compute those values server side. Does it make sense ?
Yes exactly, that's what I meant by run time. So in MultiQC I have a max number of samples and after that I generate flat image plots, which scale to infinity without increasing file size. The Plotly library that we're using for the report has built-in options for exporting to flat images like this, at least in the Python version: https://plot.ly/python/static-image-export/
Phil
OK, but wouldn't be even easier moving this code NF side and leave the rendering on the client?
for (var i = 0; i < window.data_byprocess[pname].length; i ++) {
rc[i] = xround(parseInt(window.data_byprocess[pname][i]['%cpu']));
pc[i] = xround((parseInt(window.data_byprocess[pname][i]['%cpu']) / (parseInt(window.data_byprocess[pname][i]['cpus']) * 100)) * 100);
rm[i] = xround(parseInt(window.data_byprocess[pname][i]['vmem']) / 1000000000);
pm[i] = xround((parseInt(window.data_byprocess[pname][i]['vmem']) / parseInt(window.data_byprocess[pname][i]['memory'])) * 100)
rt[i] = xround(moment.duration( parseInt(window.data_byprocess[pname][i]['realtime']) ).asMinutes());
pt[i] = xround((parseInt(window.data_byprocess[pname][i]['realtime']) / parseInt(window.data_byprocess[pname][i]['time'])) * 100)
rd[i] = xround((parseInt(window.data_byprocess[pname][i]['read_bytes']) + parseInt(window.data_byprocess[pname][i]['write_bytes'])) / 1000000000);
if (parseInt(window.data_byprocess[pname][i]['time']) > 0){ time_alloc_hasdata = true; }
if (rd[i] > 0){ readwrite_hasdata = true; }
}
Yes, if the raw plot data isn't too big I guess. If we have 10s or 100s thousand of tasks with four numbers each that could still be quite a bit of data and browser processor load. But yes maybe let's try that first!
Phil
Wait, basically the above code creates a separate series for each process name, but it still need to hold in memory and parse all tasks.
I was thinking it was calculating the min,max,median,etc values.
Is it not possible to give Plotly just the final values to render?
Apparently not 馃槥 https://github.com/plotly/plotly.js/issues/1059
Too bad. This only leave the NF side image rendering, but I can't find the Java client for Plotly. Maybe it's possible to use the JS one with the Java embedded JavaScript engine ..
馃槹
It might be worth trying with just the numbers first. I imagine that 99% of the filesize will be the other JSON stuff.
Not so sure. I will give a try to see if it's possible to render a chart with Nashorn.
Wait, it seems it's possible to render a box plot just providing the Y data. Look for an example here. I guess this would give us much more control, including a way to fix #504.
What do you think?
Brilliant! This is totally cheating but why not??
Good. Being so if I implement the computation of [min, q1, median, q3, max] for each process, would you be able to render it? does it sound a good plan?
Yes, sounds great 馃憤
I hesitated about what the whiskers show - the min / max or some other metric. Plotly has some documentation on this here: https://help.plot.ly/what-is-a-box-plot/#the-whiskers - I guess the current plots show the min / max? May be worth double checking though to be sure that we're doing the same thing as normal reports.
OK, I will check that.
So, i've double checked this and the whiskers show the min and the max. How would you need the json data structured to render these info?
Great! Doesn't matter too much what the JSON looks like, I can work around it. At the moment we have window.data['trace'] that contains all of the stats for each task. Could instead have window.data['process'] that has these summary stats for the processes? Then if the former isn't defined the javascript can use the latter instead.
But whatever is easiest for you really - as long as the task JSON isn't printed to the HTML (to avoid the filesize) and the new summaries are then we should be able to adapt the javascript accordingly.
Phil
Ok. Let's try to speedup this. Could you provide the javascript snippet to render that charts using a fake json structure? then I can generate it dynamically on NF side.
Is it not nicer to just print the data into the report as JSON and then use the embedded javascript file that we already have for the javascript code?
Nothing fancy required. From the top of my head:
{
"process_summary": {
"time": [
[ "fastqc", 12, 24, 35, 46, 90 ],
[ "bwa", 120, 230, 340, 450, 560 ],
],
"cpu": [
[ "fastqc", 20, 50, 76, 100, 140 ],
],
}
}
Where the arrays are process name, min, lower quartile, median, upper quartile, max. Or whatever.
Once we have this in as JSON in the report (the above printed in the same way as the existing data) I can plug it into the existing plot code pretty easily.
Phil
What about a structure like this:
{
"bwa-mem":{
"cpu":{
"min":1000.0,
"minLabel":"bwa-mem-1",
"max":11000.0,
"maxLabel":"bwa-mem-3",
"q1":3500.0,
"q2":6000.0,
"q3":8500.0,
"mean":6000.0
},
"mem":{
"min":2000.0,
"minLabel":"bwa-mem-1",
"max":12000.0,
"maxLabel":"bwa-mem-3",
"q1":4500.0,
"q2":7000.0,
"q3":9500.0,
"mean":7000.0
},
"time":{
"min":3000.0,
"minLabel":"bwa-mem-1",
"max":13000.0,
"maxLabel":"bwa-mem-3",
"q1":5500.0,
"q2":8000.0,
"q3":10500.0,
"mean":8000.0
},
"reads":{
"min":4000.0,
"minLabel":"bwa-mem-1",
"max":14000.0,
"maxLabel":"bwa-mem-3",
"q1":6500.0,
"q2":9000.0,
"q3":11500.0,
"mean":9000.0
},
"writes":{
"min":5000.0,
"minLabel":"bwa-mem-1",
"max":15000.0,
"maxLabel":"bwa-mem-3",
"q1":7500.0,
"q2":10000.0,
"q3":12500.0,
"mean":10000.0
}
},
"multiqc":{
"cpu":{
"min":16000.0,
"minLabel":"multiqc-1",
"max":21000.0,
"maxLabel":"multiqc-2",
"q1":17250.0,
"q2":18500.0,
"q3":19750.0,
"mean":18500.0
},
"mem":{
"min":17000.0,
"minLabel":"multiqc-1",
"max":22000.0,
"maxLabel":"multiqc-2",
"q1":18250.0,
"q2":19500.0,
"q3":20750.0,
"mean":19500.0
},
"time":{
"min":18000.0,
"minLabel":"multiqc-1",
"max":23000.0,
"maxLabel":"multiqc-2",
"q1":19250.0,
"q2":20500.0,
"q3":21750.0,
"mean":20500.0
},
"reads":{
"min":19000.0,
"minLabel":"multiqc-1",
"max":24000.0,
"maxLabel":"multiqc-2",
"q1":20250.0,
"q2":21500.0,
"q3":22750.0,
"mean":21500.0
},
"writes":{
"min":20000.0,
"minLabel":"multiqc-1",
"max":25000.0,
"maxLabel":"multiqc-2",
"q1":21250.0,
"q2":22500.0,
"q3":23750.0,
"mean":22500.0
}
}
}
Perfect! Labels are maybe not really needed, but doesn't hurt.
Good. Let me add a few notes:
1) the label are meant to provide an useful information when passing the mouse over the chart #504
2) I've added also the mean which is not reported in the current chart.
3) I've split the storage data into reads and writes with the idea to render them in two separate charts, because I think the current report is not informative enough on disk usage.
Just pushed this change. Now the JSON payload includes a new summary data object structured as shown in the example in the previous comment.
I've fixed a couple of issues and added the logging of the summary JSON in the nextflow log.
A note more, there a could corner case in which a series eg. cpu, mem, etc it's empty because the metric are missing. Currently it's reported as a null. For example:
{
"foo": {
"cpu": null,
"mem": null,
"time": {
"mean": 82.25,
"min": 54.0,
"q1": 79.5,
"q2": 90.0,
"q3": 92.75,
"max": 95.0,
"minLabel": "foo (4)",
"maxLabel": "foo (1)",
"q1Label": "foo (4)",
"q2Label": "foo (3)",
"q3Label": "foo (2)"
},
"reads": null,
"writes": null
}
}
Any progress here?
No sorry, haven't touched a keyboard for coding for a couple of weeks now. Next two weeks look packed too, will hopefully have some time after that (or will treat it as an evening / airplane project).
See you Monday! Maybe we can have a mini-hackathon during a coffee break ;)
It sounds good!
@ewels just to prevent any duplicated work, with @emi80 we have almost completed this.
Ah, you guys are heroes! Sorry for being so slow with this - I had a stab the other day at the airport back from Barcelona but couldn't get the testing environment running again for some reason so moved on to some of the MultiQC backlog instead.
OK. We refactored a bit the report to allow the rendering of large number of tasks. Main changes are the following:
1) task table rendering has been optimised to improve performance
2) trace data is retuned only when it contains < 10'000 tasks (default), for a higher number the tasks table is not shown
3) boxplots are rendered using the summary data produced by NF
4) as side effect the distribution is not showed any more (that's really a pity, but i was unable to find a way to render it at least when the trace records are available.
Please review it at your earliest convenience.
Sorry one more thing, now we have the information of the task name for each metric ie. minLabel, q1Label, etc
However I was unable to show it over the boxplot as it was suggesting #504. Any idea if it's plotty allows that?
Most helpful comment
Wait, it seems it's possible to render a box plot just providing the Y data. Look for an example here. I guess this would give us much more control, including a way to fix #504.
What do you think?