Whatever we look at the Tasks table in _Raw values mode_ or _Human readable mode_ we expect to read the exact same values for the memory usage if the decimal system is used. We also expect these values to be the same in the Timeline.
in the Memory usage plot from the report, the memory value reported is 1051574272 (raw value extracted from Plotly) and displayed as 1.052G in the boxplot
in the Tasks table from the report in the Raw values mode, the memory value reported in the field vmem is 1051574272 which is consistent the previous value in the plot
in the Tasks table from the report in the Human readable mode, the memory value reported in the field vmem is 1.1GB seems to be ~ 1051574272 x (1.024 x 1.024). This value should be 0.97935GB = 1051574272/(1024 x 1024 x 1024) if the binary unit is used. Same behavior holds for peak_vmem and allocated memory fields but not for rss and peak_rss.
in the Timeline, the memory displayed is 1002.9 MB which is ~ 1051574272/(1024 x 1024) = 1002.859375. This is correct if the binary unit is used.
It should be mentioned whether you use decimal or binary unit to display the memory usage.
Install the program stress:
sudo apt-get install stress or sudo yum install stress
Create the following nextflow.nf script:
#!/usr/bin/env nextflow
process TwoCpus4mn {
cpus 1
memory '1 GB'
"""
/usr/bin/time -v stress -c 2 -t 15 -m 1 --vm-bytes 1000000000
"""
}
Launch:
nextflow nextflow.nf -with-report report.html -with-timeline timeline.html
reply to myself:
Actually, the memory value reported in the field vmem with the value 1.1GB is due to the rounding performed in line 154 of the file modules/nextflow/src/main/resources/nextflow/trace/assets/ReportTemplate.js: return bytes.toFixed(1)+' '+units[u];
Changing this line to return bytes.toFixed(3)+' '+units[u]; gives 1.052 GB which is consistent with what is displayed in the boxplot. I suggest to apply this modification in ReportTemplate.js.
It seems that the report uses decimal unit for the memory while the timeline uses the binary unit. Same unit should be used for both.
Related to this I've also noted that the fields read_bytes and write_bytes are not consistent with the activity of stress, it should be used rchar and wchar instead. http://man7.org/linux/man-pages/man5/proc.5.html
This is interesting, I started seeing issues in the reported memory usage on our new cluster as well (RHEL 7, Singularity 2.5.2), was not sure if its related. Example:

Obviously, nothing was using 1.5TB of RAM, but I have been seeing this in the reports. The trace from these tasks:
task_id hash process %mem rss vmem peak_rss peak_vmem rchar wchar syscr syscw
1129 89/416c95 custom_sample_report 0.0 82096128 1664303104 82231296 1576106508288 9044907 44077 1315 190
1135 bd/ea7a7c custom_sample_report 0.0 82849792 1674637312 82984960 1576106508288 9567085 20826 1194 158
1137 1c/b5b8bc custom_sample_report 0.0 84549632 1674637312 84684800 1576106508288 9566701 20809 1198 158
1120 9f/202012 custom_sample_report 0.0 82272256 1674637312 82407424 1576106508288 9567187 20977 1194 160
1124 7e/00070c custom_sample_report 0.0 81670144 1675296768 81805312 1576106512384 9567045 20743 1191 158
1127 33/c31a21 custom_sample_report 0.0 81707008 1674637312 81842176 1576106508288 9566915 20805 1196 158
1123 dc/d4da94 custom_sample_report 0.0 81707008 1674637312 81842176 1576106508288 9567153 20807 1194 158
1131 fb/eaaaf4 custom_sample_report 0.0 81702912 1666248704 81838080 1576106508288 9566945 20656 1194 156
1148 8c/5b4a56 custom_sample_report 0.0 81666048 1666101248 81801216 1576106512384 9566807 20591 1195 156
1136 b3/c74a1e custom_sample_report 0.0 81707008 1674637312 81842176 1576106508288 9566947 20809 1194 158
1138 80/677109 custom_sample_report 0.0 81707008 1674637312 81842176 1576106508288 9566767 20805 1195 158
1125 8c/795dc5 custom_sample_report 0.0 81707008 1674637312 81842176 1576106508288 9567329 20810 1194 158
1126 48/c6dae0 custom_sample_report 0.0 82714624 1674625024 82849792 1576106508288 9566919 20703 1194 158
1128 f0/309dc5 custom_sample_report 0.0 80650240 1664303104 80785408 1576106508288 9044899 44081 1312 190
1139 a9/3a2d55 custom_sample_report 0.0 81702912 1666248704 81838080 1576106508288 9566733 20658 1195 156
1130 d5/f3bbb7 custom_sample_report 0.0 81702912 1666248704 81838080 1576106508288 9567065 20657 1195 156
1132 43/19700d custom_sample_report 0.0 81707008 1674637312 81842176 1576106508288 9566913 20813 1195 158
1133 0f/e99f8b custom_sample_report 0.0 81702912 1666248704 81838080 1576106508288 9567085 20661 1195 156
1134 4b/800aee custom_sample_report 0.0 81707008 1674637312 81842176 1576106508288 9566921 20809 1198 158
1142 be/6004f2 custom_sample_report 0.0 73023488 1576062324736 73158656 1576106508288 8943090 10795 1085 124
1143 f5/254075 custom_sample_report 0.0 72916992 1576068993024 73052160 1576106508288 8943104 10805 1085 124
1144 2e/a49717 custom_sample_report 0.0 74350592 1576019144704 74485760 1576106508288 8943098 10795 1087 124
1141 e5/c12f4d custom_sample_report 0.0 72916992 1576062324736 73052160 1576106508288 8943090 10795 1085 124
1145 20/78905c custom_sample_report 0.0 72916992 1576068997120 73052160 1576106508288 8943090 10795 1085 124
1140 99/234a34 custom_sample_report 0.0 72916992 1576073678848 73052160 1576106508288 8943090 10795 1085 124
1149 59/feb997 custom_sample_report 0.0 72912896 1576065224704 73048064 1576106512384 8942978 10715 1085 124
1146 e7/bc0bca custom_sample_report 0.0 72916992 1576064217088 73052160 1576106508288 8943090 10795 1085 124
1147 59/70ae9f custom_sample_report 0.0 72916992 1576073678848 73052160 1576106508288 8943090 10795 1085 124
edit: this might be more relevant for #980 actually
@stevekm Work in progress, stay in touch.
@phupe I've slightly modified the RSS figure, adding as a third tab in the Memory chart instead of having a separate chart for RSS memory. I think this a bit more consistent, also because I don't think it has much sense to report the RSS as percentage of memory request. My understanding is that RSS is much smaller than allocated (virtual) memory.
Included in version 19.01.0-edge.
@pditommaso:
This disccusion is actually related to Issue #980.
I see your point but let me clarified why it is important to have the percentage of memory requested for the peak RSS. Thus I really think that 2 different charts are needed for both vmem and peak_rss.
The total amount of memory used be a processs is the Virtual memory (vmem). The vmem contains all memory areas whether they are in the physical memory (RAM), in the Swap space, on the disk or shared with other processes.
The resident set size (RSS) is the amount of space of physical memory (RAM) held by a process.
The relationship is: vmem >= RSS + Swap.
The following page gives an interesting example (https://docs.hpc.qmul.ac.uk/using/memory/) to illustrate the difference of vmem and RSS.
Let's assume you have a newtflow process with the memory directive set to memory '1 GB'.
Let's assume we submit a job using PBS/Torque. According to the PbsExecutor.groovy file in nextflow source code, the job will be submitted using the following command: qsub -l mem=1gb.
The mem option of PBS/Torque corresponds to the maximum amount of physical memory used by the job, in other words, it is the maximum RSS value allowed by the process. PBS/Torque is generally set to kill the job if the process uses more memory than the job requested in the mem option.
Therefore, it is really important to figure out what is the peak RSS of your process for 2 main objectives:
mem option that PBS/Torque will consider to submit or queued a job.A boxplot with the percentage of peak RSS / requested RSS is definitively important to calibrate what amount of memory in the nextflow directive to meet these 2 objectives.
Maybe this is the terminology % Allocated which is misleading since it refers to memory allocation. I suggest maybe to simply use % of requested or % of booked in the legend.
I see your point. The main issue is that, it's not defined what kind of memory the directory memory should specify: virtual - or - physical. PBS is physical, SGE is virtual, K8s? Batch? frankly I don't. I guess it depends on how docker limit the memory. Looking the docker docs it seems suggesting that's the physical memory.
If we decide that NF memory represents the physical memory, then we can use it to represent the % Allocated chart instead of the virtual mem.
Does make sense ?
To be honest I've never really understood what "% Allocated" means, not sure what allocation it's referring to. In all cases in my experience the physical memory in RAM is what you would expect to report and is the useful value, as @phupe describes, since it more directly impacts the success of your tasks and your workflow configuration. Whatever solutions you all come to, I hope it includes making this a more prominent part of the report
I see your point. The main issue is that, it's not defined what kind of memory the directory
memoryshould specify: virtual - or - physical.
Indeed.
If we decide that NF
memoryrepresents the physical memory, then we can use it to represent the% Allocatedchart instead of the virtual mem.Does make sense ?
In my opinion, it really makes sense that memory represents the physical memory, or at least this should be the default. This means that the right memory option for all executors must identified. This would be the scenario1 (By the way, virtual_free seems to represent the amount of unused virtual memory according to http://www.univa.com/resources/files/univa_user_guide_univa__grid_engine_854.pdf, but I am not familiar with Grid Engine, other options are h_rss/s_rss for resident set size, or h_vmem/s_vmem for virtual memory size). In this scenario the only one chart for memory resource usage would be displayed with the peak_rss.
We could imagine a scenario2 where a memory.type directive is offered by nextflow. The value could be either rss or vmem, the option of all executors would selected according to this directive. Only one chart for the memory resource usage in the report would be displayed using either the vmem or rss depending on the type of memory set by the user. This scenario may take some time to be developed and I have absolutely no idea how many users would be interested to use it.
As a straightforward scenario0, I would display the 2 charts for both vmem and peak_rss as proposed in my pull request #985. By the way, I would suggest that, even for vmem, you use the peak value used by the process. Only peak values seem interested to me. I think that only 2 lines should be modified as follows in the file ReportSummary.groovy:
mappers.mem = { TraceRecord record -> record.get('peak_vmem') as Double }
final vmem = record.get('peak_vmem') as Double
I think the best solution would be scenario 1.
sounds good!
Thanks
I have tested the following release: 845e6d1 [origin/master] [release 19.01.0-edge]
That's great you have added the RSS in the timeline html report.
However, I have still noticed some issues:
trace.csv file. See attached files below.To reproduce the issue, use the following NF script:
#!/usr/bin/env nextflow
process OneCpu2GBRAM {
cpus 1
"""
stress --vm 1 --vm-bytes 2000000000 -t 120
"""
}
process OneCpu200MBRAM {
cpus 1
"""
stress --vm 1 --vm-bytes 200000000 -t 120
"""
}
Umm, I'm not able to replicate this. Using your script I'm getting these numbers
task_id hash native_id name status exit submit duration realtime %cpu rss peak_rss peak_vmem vmem rchar wchar read_bytes write_bytes
2 b1/f21e08 13597 OneCpu2GBRAM COMPLETED 0 2019-01-14 11:22:28.266 2m 3s 2m 102.2% 1.9 GB 1.9 GB 1.9 GB 1.9 GB 39 KB 329 B 284 KB 0
1 a3/ecbc0c 13596 OneCpu200MBRAM COMPLETED 0 2019-01-14 11:22:28.256 2m 3s 2m 101.8% 194.3 MB 194.5 MB 220.8 MB 220.8 MB 39 KB 332 B 96 KB 0
Once executed, could you change in the work dir of one task and run NXF_DEBUG=1 bash .command.run and attach here the output (including the file .command.trace)?
find attached the requested information.
>> cat .command.trace
nextflow.trace/v2
realtime=120062
%cpu=1000
rchar=53284
wchar=324
syscr=120
syscw=14
read_bytes=36864
write_bytes=0
%mem=1
vmem=323108
rss=194096
peak_vmem=323204
peak_rss=197224
This was the 200 MB process, right? The peak_rss=197224 looks fine.
fyi, I reproduce the same problem on my side.
This was the 200 MB process, right? The
peak_rss=197224looks fine.
sorry for not mentioning it, but Yes, it was the 200MB process.
therefore where's the problem? in the timeline/html report files? In my case are fine (except the power of 10 vs power of 2 problem).
In the timeline html, I have : OneCpu200MBRAM 2m 1s / 3156 MB (1926 MB)
but I should have instead: OneCpu200MBRAM 2m 1s / 315.6 MB (192.6 MB)`
The values are 10 times bigger as they should be.
Also for the run reporting peak_rss=197224 in the trace file ?
I have the same problem with version 18.10.1, I obtained OneCpu200MBRAM 2m 14s / 3155 MB instead of 315.5 MB.
I removed the $HOME/.nextflow dir and restarted with 19.01.0-edge but still the same problem.
Could this be related to java? Here is what I have:
>> java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)