LoadImageTimeout should be set based on benchmarks of how long it actually takes to load a Docker image. See https://github.com/aws/amazon-ecs-agent/pull/841/files#r121754481 for details.
Currently, `LoadImageTimeout` is only used when loading the pause container image. I benchmarked the load time of the pause container image with the following script:
```bash
#!/bin/bash
# Benchmark `docker image load` for the pause container image over 100 cold-cache runs.
min=1000.0
max=0.0
total=0.0
for i in {0..99}
do
    # Drop the page cache so every load reads the tarball from disk.
    sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    start=$(date +%s.%N)
    docker image load < amazon-ecs-pause.tar &> /dev/null
    end=$(date +%s.%N)
    cnt=$((i + 1))
    loadtime=$(echo "$end - $start" | bc -l)
    total=$(echo "$total + $loadtime" | bc -l)
    avg=$(echo "$total / $cnt" | bc -l)
    if (( $(echo "$min > $loadtime" | bc -l) )); then
        min=$loadtime
    fi
    if (( $(echo "$max < $loadtime" | bc -l) )); then
        max=$loadtime
    fi
    printf "Image load time %d: %.4fs\n" "${cnt}" "${loadtime}"
    # Running summary; the first field is the number of runs so far.
    printf "Runs: %d, Avg: %.4fs, Min: %.4fs, Max: %.4fs\n" "${cnt}" "${avg}" "${min}" "${max}"
    # Remove the image so the next iteration performs a full load again.
    docker rmi amazon/amazon-ecs-pause:0.1.0 &> /dev/null
done
```
Tested on the latest ECS-optimized AMIs (agent version 1.32.1) for AL1/AL2/AL2 GPU/AL2 ARM, using the smallest instance types available (a1.medium for ARM, t2.nano for the others), with the EBS volume burst balance at 100 and at 0. The image was loaded 100 times in each configuration.
Result:
Burst balance = 100:
| AMI (instance type) | Avg (s) | Min (s) | Max (s) |
|:---------|:--------|:--------|:--------|
| AL1 (t2.nano) | 1.1994 | 0.6178 | 1.3431 |
| AL2 (t2.nano) | 0.7433 | 0.3992 | 0.7788 |
| AL2/GPU (t2.nano) | 0.8019 | 0.3935 | 1.1164 |
| AL2/ARM (a1.medium) | 0.4261 | 0.4128 | 0.5022 |
Burst balance = 0:
| AMI (instance type) | Avg (s) | Min (s) | Max (s) |
|:---------|:--------|:--------|:--------|
| AL1 (t2.nano) | 18.5915 | 17.6324 | 22.5326 |
| AL2 (t2.nano) | 18.6168 | 17.0224 | 21.5424 |
| AL2/GPU (t2.nano) | 23.6569 | 20.1130 | 29.4856 |
| AL2/ARM (a1.medium) | 12.5409 | 10.6842 | 14.9941 |
The worst case seems to be around half a minute (29.4856s max, on AL2/GPU with burst balance = 0).
So, there's scope to tighten the current value (10m). I'd say something like 3m should be good (adding some buffer to account for unexpected delays etc.). Would be interested to know what gets picked, though.
The LoadImageTimeout has been updated to 2m in https://github.com/aws/amazon-ecs-agent/pull/2269, leaving 1.5m as buffer time over the roughly 30s worst case. Closing this now.
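
For anyone who wants to redo the buffer arithmetic, here is a minimal sketch. It assumes the four Max values from the burst balance = 0 table and the 2m (120s) timeout; the variable names `max_times`, `timeout_s`, and `summary` are just illustrative and are not taken from the agent code.

```shell
#!/bin/bash
# Max load times (seconds) at burst balance = 0, copied from the results table.
max_times="22.5326 21.5424 29.4856 14.9941"
# Assumed timeout under review: 2m, expressed in seconds.
timeout_s=120

# Find the worst observed load time, then report the buffer the timeout
# leaves and how many times the worst case fits inside the timeout.
summary=$(echo "$max_times" | awk -v t="$timeout_s" '{
  worst = 0
  for (i = 1; i <= NF; i++) if ($i > worst) worst = $i
  printf "worst=%.4fs buffer=%.4fs ratio=%.2fx", worst, t - worst, t / worst
}')
echo "$summary"
```

With these inputs the buffer comes out to about 90.5s, which matches the 1.5m figure quoted above. Swapping in a different `timeout_s` (for example 180 for the suggested 3m) shows how much extra headroom that would have bought.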