We are getting lots of "CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s" errors.
docker -v
Docker version 1.9.1, build a34a1d5/1.9.1
AMI: amzn-ami-2015.09.d-amazon-ecs-optimized
ECS agent log:
Container change event module="TaskEngine" event="{TaskArn:arn:aws:ecs:us-west-2::task/5e91cf96-649e-4657-be99-6fa6132a5547 ContainerName:test-service Status:STOPPED Reason:CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s ExitCode:
2016-03-10T22:03:20Z [DEBUG] Container change event passed on module="TaskEngine" event="{TaskArn:arn:aws:ecs:us-west-2:
Could someone please suggest what is causing this issue?
+1. ECS is getting jankier by the day.
@hridyeshpant @bilalaslamseattle I've typically seen this happen when the Docker daemon gets very slow as the disk gets full. Do you see anything in the Docker daemon log in /var/log/docker? Can you share the output of sudo pvs, sudo vgs, sudo lvs, and dmesg on the affected instance?
Hi @samuelkarp, I am responding here for @bilalaslamseattle.
Context: We were upgrading all our machines to the latest AMI (amzn-ami-2015.09.g-amazon-ecs-optimized-4ce33fd9-63ff-4f35-8d3a-939b641f1931-ami-33b48a59.3) and we are running ecs-agent 1.8.1.
We have plenty of space in our docker-pool volume:
[root@ip-10-0-1-214 ~]# pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sdcz1 docker lvm2 a--  100.00g    0
[root@ip-10-0-1-214 ~]# vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  docker   1   1   0 wz--n- 100.00g    0
[root@ip-10-0-1-214 ~]# lvs
  LV          VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool docker twi-aot--- 99.79g             34.43  60.02
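As an aside for anyone repeating this check on many instances: the thin-pool data and metadata percentages can be pulled out with a few lines of Python. This is only a sketch. It assumes the input comes from something like `lvs --noheadings -o lv_name,data_percent,metadata_percent`, and the function names and the 80% threshold are illustrative choices, not anything taken from this thread.

```python
def parse_thin_pool_usage(lvs_output):
    """Parse 'name data% meta%' lines into {name: (data_pct, meta_pct)}."""
    usage = {}
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            # Blank lines, or volumes whose percentage columns are empty.
            continue
        name, data_pct, meta_pct = fields
        usage[name] = (float(data_pct), float(meta_pct))
    return usage


def pools_over_threshold(usage, threshold=80.0):
    """Return names of pools whose data OR metadata usage is at/above threshold."""
    return sorted(
        name for name, (data_pct, meta_pct) in usage.items()
        if max(data_pct, meta_pct) >= threshold
    )


# Values copied from the lvs output above (Data% 34.43, Meta% 60.02).
sample = "  docker-pool 34.43 60.02\n"
usage = parse_thin_pool_usage(sample)
print(usage)
print(pools_over_threshold(usage))
```

Checking metadata separately from data matters here because the two fill up independently: in the output above, metadata usage is noticeably higher than data usage.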
Here is our latest dmesg output (are we looking for disk-device-related issues? I didn't find anything like that):
[148598.000407] XFS (dm-4): Mounting V4 Filesystem
[148598.223849] XFS (dm-4): Ending clean mount
[148598.248095] XFS (dm-4): Unmounting Filesystem
[148598.382691] XFS (dm-4): Mounting V4 Filesystem
[148598.693982] XFS (dm-4): Ending clean mount
[148598.698291] device veth9179b2d entered promiscuous mode
[148598.700779] IPv6: ADDRCONF(NETDEV_UP): veth9179b2d: link is not ready
[148598.703262] docker0: port 3(veth9179b2d) entered forwarding state
[148598.705716] docker0: port 3(veth9179b2d) entered forwarding state
[148598.748298] docker0: port 3(veth9179b2d) entered disabled state
[148598.750980] eth0: renamed from vethab1b5b3
[148598.768437] IPv6: ADDRCONF(NETDEV_CHANGE): veth9179b2d: link becomes ready
[148598.771048] docker0: port 3(veth9179b2d) entered forwarding state
[148598.773322] docker0: port 3(veth9179b2d) entered forwarding state
[148600.590332] vethb28e6b2: renamed from eth0
[148600.612217] docker0: port 5(vethfd6e9d5) entered disabled state
[148600.649720] docker0: port 5(vethfd6e9d5) entered disabled state
[148600.655303] device vethfd6e9d5 left promiscuous mode
[148600.657479] docker0: port 5(vethfd6e9d5) entered disabled state
[148600.770529] XFS (dm-5): Unmounting Filesystem
[148603.118288] vetha84dc5c: renamed from eth0
[148603.132139] docker0: port 6(veth139e811) entered disabled state
[148603.156986] docker0: port 6(veth139e811) entered disabled state
[148603.162356] device veth139e811 left promiscuous mode
[148603.164594] docker0: port 6(veth139e811) entered disabled state
[148603.272042] XFS (dm-8): Unmounting Filesystem
[148604.252063] docker0: port 7(vethe35c9ba) entered forwarding state
[148606.742376] veth191b7a6: renamed from eth0
[148606.756246] docker0: port 10(vethb53e8e1) entered disabled state
[148606.789104] docker0: port 10(vethb53e8e1) entered disabled state
[148606.793692] device vethb53e8e1 left promiscuous mode
[148606.795567] docker0: port 10(vethb53e8e1) entered disabled state
[148606.895606] XFS (dm-12): Unmounting Filesystem
[148607.644069] docker0: port 8(veth81f571b) entered forwarding state
[148609.366559] veth79e46a7: renamed from eth0
[148609.388130] docker0: port 8(veth81f571b) entered disabled state
[148609.412588] docker0: port 8(veth81f571b) entered disabled state
[148609.417083] device veth81f571b left promiscuous mode
[148609.419377] docker0: port 8(veth81f571b) entered disabled state
[148609.519901] XFS (dm-10): Unmounting Filesystem
[148613.792075] docker0: port 3(veth9179b2d) entered forwarding state
[148615.322276] vethab1b5b3: renamed from eth0
[148615.344134] docker0: port 3(veth9179b2d) entered disabled state
[148615.372810] docker0: port 3(veth9179b2d) entered disabled state
[148615.377364] device veth9179b2d left promiscuous mode
[148615.379603] docker0: port 3(veth9179b2d) entered disabled state
[148615.484147] XFS (dm-4): Unmounting Filesystem
[148656.938239] vethfd23535: renamed from eth0
[148656.952144] docker0: port 14(veth20f8a9b) entered disabled state
[148656.992160] docker0: port 14(veth20f8a9b) entered disabled state
[148656.996866] device veth20f8a9b left promiscuous mode
[148656.998910] docker0: port 14(veth20f8a9b) entered disabled state
[148657.094089] XFS (dm-18): Unmounting Filesystem
[148698.794368] veth0c0039f: renamed from eth0
[148698.808248] docker0: port 7(vethe35c9ba) entered disabled state
[148698.832905] docker0: port 7(vethe35c9ba) entered disabled state
[148698.837463] device vethe35c9ba left promiscuous mode
[148698.839596] docker0: port 7(vethe35c9ba) entered disabled state
[148698.934218] XFS (dm-9): Unmounting Filesystem
[148760.490350] veth55c5eb5: renamed from eth0
[148760.500137] docker0: port 23(vethdb0ee4f) entered disabled state
[148760.532975] docker0: port 23(vethdb0ee4f) entered disabled state
[148760.537540] device vethdb0ee4f left promiscuous mode
[148760.539777] docker0: port 23(vethdb0ee4f) entered disabled state
[148760.620828] XFS (dm-25): Unmounting Filesystem
[148765.343538] XFS (dm-4): Mounting V4 Filesystem
[148765.432009] XFS (dm-4): Ending clean mount
[148765.459511] XFS (dm-4): Unmounting Filesystem
[148765.580933] XFS (dm-4): Mounting V4 Filesystem
[148765.707909] XFS (dm-4): Ending clean mount
[148765.724402] XFS (dm-4): Unmounting Filesystem
[148765.822841] XFS (dm-4): Mounting V4 Filesystem
[148765.907867] XFS (dm-4): Ending clean mount
[148765.912327] device veth1d1aea1 entered promiscuous mode
[148765.914662] IPv6: ADDRCONF(NETDEV_UP): veth1d1aea1: link is not ready
[148765.984366] eth0: renamed from veth228537c
[148766.004445] IPv6: ADDRCONF(NETDEV_CHANGE): veth1d1aea1: link becomes ready
[148766.006923] docker0: port 3(veth1d1aea1) entered forwarding state
[148766.009063] docker0: port 3(veth1d1aea1) entered forwarding state
[148768.250410] veth228537c: renamed from eth0
[148768.260140] docker0: port 3(veth1d1aea1) entered disabled state
[148768.280564] docker0: port 3(veth1d1aea1) entered disabled state
[148768.284735] device veth1d1aea1 left promiscuous mode
[148768.286477] docker0: port 3(veth1d1aea1) entered disabled state
[148768.383826] XFS (dm-4): Unmounting Filesystem
[148778.600429] XFS (dm-4): Mounting V4 Filesystem
[148778.690413] XFS (dm-4): Ending clean mount
[148778.711659] XFS (dm-4): Unmounting Filesystem
[148778.855437] XFS (dm-4): Mounting V4 Filesystem
[148778.966238] XFS (dm-4): Ending clean mount
[148778.985373] XFS (dm-4): Unmounting Filesystem
[148779.071083] XFS (dm-4): Mounting V4 Filesystem
[148779.167678] XFS (dm-4): Ending clean mount
[148779.171848] device vethea945a4 entered promiscuous mode
[148779.174717] IPv6: ADDRCONF(NETDEV_UP): vethea945a4: link is not ready
[148779.212435] eth0: renamed from vethe07e898
[148779.232526] IPv6: ADDRCONF(NETDEV_CHANGE): vethea945a4: link becomes ready
[148779.235638] docker0: port 3(vethea945a4) entered forwarding state
[148779.237822] docker0: port 3(vethea945a4) entered forwarding state
[148788.610572] vethe07e898: renamed from eth0
[148788.636258] docker0: port 3(vethea945a4) entered disabled state
[148788.668787] docker0: port 3(vethea945a4) entered disabled state
[148788.673903] device vethea945a4 left promiscuous mode
[148788.675876] docker0: port 3(vethea945a4) entered disabled state
[148788.779910] XFS (dm-4): Unmounting Filesystem
[148794.279414] XFS (dm-4): Mounting V4 Filesystem
[148794.368444] XFS (dm-4): Ending clean mount
[148794.399327] XFS (dm-4): Unmounting Filesystem
[148794.510381] XFS (dm-4): Mounting V4 Filesystem
[148794.645299] XFS (dm-4): Ending clean mount
[148794.671919] XFS (dm-4): Unmounting Filesystem
[148794.762757] XFS (dm-4): Mounting V4 Filesystem
[148794.844144] XFS (dm-4): Ending clean mount
[148794.848370] device vetha6e28bc entered promiscuous mode
[148794.850718] IPv6: ADDRCONF(NETDEV_UP): vetha6e28bc: link is not ready
[148794.892448] eth0: renamed from vethed5fdd0
[148794.916531] IPv6: ADDRCONF(NETDEV_CHANGE): vetha6e28bc: link becomes ready
[148794.919526] docker0: port 3(vetha6e28bc) entered forwarding state
[148794.921777] docker0: port 3(vetha6e28bc) entered forwarding state
[148808.210349] vethed5fdd0: renamed from eth0
[148808.228137] docker0: port 3(vetha6e28bc) entered disabled state
[148808.256748] docker0: port 3(vetha6e28bc) entered disabled state
[148808.261328] device vetha6e28bc left promiscuous mode
[148808.263068] docker0: port 3(vetha6e28bc) entered disabled state
[148808.356413] XFS (dm-4): Unmounting Filesystem
[148831.836999] XFS (dm-4): Mounting V4 Filesystem
[148831.928650] XFS (dm-4): Ending clean mount
[148831.951532] XFS (dm-4): Unmounting Filesystem
[148832.075715] XFS (dm-4): Mounting V4 Filesystem
[148832.201509] XFS (dm-4): Ending clean mount
[148832.216837] XFS (dm-4): Unmounting Filesystem
[148832.305603] XFS (dm-4): Mounting V4 Filesystem
[148832.401634] XFS (dm-4): Ending clean mount
[148832.405974] device veth0dcc7c2 entered promiscuous mode
[148832.408614] IPv6: ADDRCONF(NETDEV_UP): veth0dcc7c2: link is not ready
[148832.516329] eth0: renamed from vethfeef268
[148832.536519] IPv6: ADDRCONF(NETDEV_CHANGE): veth0dcc7c2: link becomes ready
[148832.539154] docker0: port 3(veth0dcc7c2) entered forwarding state
[148832.541706] docker0: port 3(veth0dcc7c2) entered forwarding state
[148835.564258] XFS (dm-5): Mounting V4 Filesystem
[148841.924312] XFS (dm-5): Ending clean mount
[148842.398973] XFS (dm-5): Unmounting Filesystem
[148844.393552] XFS (dm-5): Mounting V4 Filesystem
[148847.580085] docker0: port 3(veth0dcc7c2) entered forwarding state
[148850.548260] XFS (dm-5): Ending clean mount
[148850.624212] XFS (dm-5): Unmounting Filesystem
[148852.812985] XFS (dm-5): Mounting V4 Filesystem
[148858.530565] XFS (dm-5): Ending clean mount
[148858.534994] device veth3a99e11 entered promiscuous mode
[148858.537181] IPv6: ADDRCONF(NETDEV_UP): veth3a99e11: link is not ready
[148858.572456] eth0: renamed from veth63750f4
[148858.604592] IPv6: ADDRCONF(NETDEV_CHANGE): veth3a99e11: link becomes ready
[148858.607058] docker0: port 5(veth3a99e11) entered forwarding state
[148858.609223] docker0: port 5(veth3a99e11) entered forwarding state
[148873.628081] docker0: port 5(veth3a99e11) entered forwarding state
[148935.574289] veth63750f4: renamed from eth0
[148935.600134] docker0: port 5(veth3a99e11) entered disabled state
[148935.639834] docker0: port 5(veth3a99e11) entered disabled state
[148935.644606] device veth3a99e11 left promiscuous mode
[148935.646573] docker0: port 5(veth3a99e11) entered disabled state
[148935.771992] XFS (dm-5): Unmounting Filesystem
[148937.062075] XFS (dm-5): Mounting V4 Filesystem
[148938.804554] XFS (dm-5): Ending clean mount
[148939.234080] XFS (dm-8): Mounting V4 Filesystem
[148940.986489] XFS (dm-8): Ending clean mount
[148941.013282] XFS (dm-5): Unmounting Filesystem
[148942.027481] XFS (dm-5): Mounting V4 Filesystem
[148943.854810] XFS (dm-5): Ending clean mount
[148943.911279] XFS (dm-5): Unmounting Filesystem
[148944.113933] XFS (dm-8): Unmounting Filesystem
[148945.020340] XFS (dm-5): Mounting V4 Filesystem
[148946.837711] XFS (dm-5): Ending clean mount
[148946.890850] XFS (dm-5): Unmounting Filesystem
[148947.380962] XFS (dm-5): Mounting V4 Filesystem
[148949.165454] XFS (dm-5): Ending clean mount
[148949.584833] XFS (dm-8): Mounting V4 Filesystem
[148951.156715] XFS (dm-8): Ending clean mount
[148951.160826] device vethee3e81d entered promiscuous mode
[148951.163062] IPv6: ADDRCONF(NETDEV_UP): vethee3e81d: link is not ready
[148951.224396] eth0: renamed from veth008f71d
[148951.248540] IPv6: ADDRCONF(NETDEV_CHANGE): vethee3e81d: link becomes ready
[148951.251062] docker0: port 5(vethee3e81d) entered forwarding state
[148951.253346] docker0: port 5(vethee3e81d) entered forwarding state
[148951.768479] XFS (dm-9): Mounting V4 Filesystem
[148954.935612] XFS (dm-9): Ending clean mount
[148954.940273] device veth9cea596 entered promiscuous mode
[148954.943171] IPv6: ADDRCONF(NETDEV_UP): veth9cea596: link is not ready
[148954.996484] eth0: renamed from veth8254d1f
[148955.020507] IPv6: ADDRCONF(NETDEV_CHANGE): veth9cea596: link becomes ready
[148955.023199] docker0: port 6(veth9cea596) entered forwarding state
[148955.025674] docker0: port 6(veth9cea596) entered forwarding state
[148956.069941] XFS (dm-10): Mounting V4 Filesystem
[148960.903733] XFS (dm-10): Ending clean mount
[148961.027654] XFS (dm-5): Unmounting Filesystem
[148963.626409] XFS (dm-5): Mounting V4 Filesystem
[148966.300061] docker0: port 5(vethee3e81d) entered forwarding state
[148970.076065] docker0: port 6(veth9cea596) entered forwarding state
[148970.308383] XFS (dm-5): Ending clean mount
[148972.406098] XFS (dm-12): Mounting V4 Filesystem
[148979.204529] XFS (dm-12): Ending clean mount
[148979.310072] XFS (dm-12): Unmounting Filesystem
[148979.633569] XFS (dm-10): Unmounting Filesystem
[148983.123119] XFS (dm-10): Mounting V4 Filesystem
[148991.624617] XFS (dm-10): Ending clean mount
[148991.719102] XFS (dm-10): Unmounting Filesystem
[148992.273999] XFS (dm-5): Unmounting Filesystem
[148996.400852] XFS (dm-5): Mounting V4 Filesystem
[149006.362503] XFS (dm-5): Ending clean mount
[149006.366398] device veth4c17192 entered promiscuous mode
[149006.368944] IPv6: ADDRCONF(NETDEV_UP): veth4c17192: link is not ready
[149006.371481] docker0: port 7(veth4c17192) entered forwarding state
[149006.374193] docker0: port 7(veth4c17192) entered forwarding state
[149006.376485] docker0: port 7(veth4c17192) entered disabled state
[149006.488697] eth0: renamed from vethf1f16f6
[149006.512744] IPv6: ADDRCONF(NETDEV_CHANGE): veth4c17192: link becomes ready
[149006.515220] docker0: port 7(veth4c17192) entered forwarding state
[149006.517499] docker0: port 7(veth4c17192) entered forwarding state
[149009.640492] XFS (dm-10): Mounting V4 Filesystem
[149021.532083] docker0: port 7(veth4c17192) entered forwarding state
[149025.218276] veth008f71d: renamed from eth0
[149025.240136] docker0: port 5(vethee3e81d) entered disabled state
[149025.264561] docker0: port 5(vethee3e81d) entered disabled state
[149025.269322] device vethee3e81d left promiscuous mode
[149025.271411] docker0: port 5(vethee3e81d) entered disabled state
[149027.535403] XFS (dm-10): Ending clean mount
[149027.635720] XFS (dm-10): Unmounting Filesystem
[149030.095504] XFS (dm-10): Mounting V4 Filesystem
[149037.251238] XFS (dm-10): Ending clean mount
[149037.255680] device veth33fd42f entered promiscuous mode
[149037.258285] IPv6: ADDRCONF(NETDEV_UP): veth33fd42f: link is not ready
[149037.308295] eth0: renamed from veth5313aad
[149037.324240] IPv6: ADDRCONF(NETDEV_CHANGE): veth33fd42f: link becomes ready
[149037.326997] docker0: port 5(veth33fd42f) entered forwarding state
[149037.329532] docker0: port 5(veth33fd42f) entered forwarding state
[149039.817639] XFS (dm-12): Mounting V4 Filesystem
[149046.010233] veth8254d1f: renamed from eth0
[149046.040515] docker0: port 6(veth9cea596) entered disabled state
[149046.047601] docker0: port 6(veth9cea596) entered disabled state
[149046.052412] device veth9cea596 left promiscuous mode
[149046.054473] docker0: port 6(veth9cea596) entered disabled state
[149049.174165] XFS (dm-12): Ending clean mount
[149049.193840] XFS (dm-8): Unmounting Filesystem
[149049.362115] XFS (dm-9): Unmounting Filesystem
[149051.283358] XFS (dm-8): Mounting V4 Filesystem
[149052.380095] docker0: port 5(veth33fd42f) entered forwarding state
[149057.369324] XFS (dm-8): Ending clean mount
[149057.373851] device vethed38967 entered promiscuous mode
[149057.376259] IPv6: ADDRCONF(NETDEV_UP): vethed38967: link is not ready
[149057.416396] eth0: renamed from veth75b5f7e
[149057.420649] XFS (dm-12): Unmounting Filesystem
[149057.432477] IPv6: ADDRCONF(NETDEV_CHANGE): vethed38967: link becomes ready
[149057.434989] docker0: port 6(vethed38967) entered forwarding state
[149057.437407] docker0: port 6(vethed38967) entered forwarding state
[149060.549355] XFS (dm-9): Mounting V4 Filesystem
[149070.293835] XFS (dm-9): Ending clean mount
[149070.391627] XFS (dm-9): Unmounting Filesystem
[149072.292865] XFS (dm-9): Mounting V4 Filesystem
[149072.476064] docker0: port 6(vethed38967) entered forwarding state
[149078.183708] XFS (dm-9): Ending clean mount
[149078.188850] device vethb4d1e4c entered promiscuous mode
[149078.191233] IPv6: ADDRCONF(NETDEV_UP): vethb4d1e4c: link is not ready
[149078.244379] eth0: renamed from veth3e75d38
[149078.268413] IPv6: ADDRCONF(NETDEV_CHANGE): vethb4d1e4c: link becomes ready
[149078.271102] docker0: port 8(vethb4d1e4c) entered forwarding state
[149078.273546] docker0: port 8(vethb4d1e4c) entered forwarding state
[149085.046306] vethfeef268: renamed from eth0
[149085.068155] docker0: port 3(veth0dcc7c2) entered disabled state
[149085.100852] docker0: port 3(veth0dcc7c2) entered disabled state
[149085.105209] device veth0dcc7c2 left promiscuous mode
[149085.107396] docker0: port 3(veth0dcc7c2) entered disabled state
[149085.213981] XFS (dm-4): Unmounting Filesystem
[149087.414439] vethf1f16f6: renamed from eth0
[149087.432156] docker0: port 7(veth4c17192) entered disabled state
[149087.457176] docker0: port 7(veth4c17192) entered disabled state
[149087.462018] device veth4c17192 left promiscuous mode
[149087.464300] docker0: port 7(veth4c17192) entered disabled state
[149087.569520] XFS (dm-5): Unmounting Filesystem
[149093.276064] docker0: port 8(vethb4d1e4c) entered forwarding state
[149117.174419] veth75b5f7e: renamed from eth0
[149117.188177] docker0: port 6(vethed38967) entered disabled state
[149117.216693] docker0: port 6(vethed38967) entered disabled state
[149117.221916] device vethed38967 left promiscuous mode
[149117.224040] docker0: port 6(vethed38967) entered disabled state
[149117.359820] XFS (dm-8): Unmounting Filesystem
[149132.998406] veth3e75d38: renamed from eth0
[149133.012226] docker0: port 8(vethb4d1e4c) entered disabled state
[149133.036571] docker0: port 8(vethb4d1e4c) entered disabled state
[149133.041418] device vethb4d1e4c left promiscuous mode
[149133.043390] docker0: port 8(vethb4d1e4c) entered disabled state
[149133.192824] XFS (dm-9): Unmounting Filesystem
[149139.006335] veth44f57f7: renamed from eth0
[149139.024246] docker0: port 4(vethe231830) entered disabled state
[149139.053278] docker0: port 4(vethe231830) entered disabled state
[149139.057511] device vethe231830 left promiscuous mode
[149139.059856] docker0: port 4(vethe231830) entered disabled state
[149139.153957] XFS (dm-6): Unmounting Filesystem
[149143.314316] veth5313aad: renamed from eth0
[149143.328130] docker0: port 5(veth33fd42f) entered disabled state
[149143.356825] docker0: port 5(veth33fd42f) entered disabled state
[149143.361370] device veth33fd42f left promiscuous mode
[149143.363408] docker0: port 5(veth33fd42f) entered disabled state
[149143.465966] XFS (dm-10): Unmounting Filesystem
[149175.831770] XFS (dm-4): Mounting V4 Filesystem
[149175.930204] XFS (dm-4): Ending clean mount
[149175.959535] XFS (dm-4): Unmounting Filesystem
[149176.067605] XFS (dm-4): Mounting V4 Filesystem
[149176.196382] XFS (dm-4): Ending clean mount
[149176.213340] XFS (dm-4): Unmounting Filesystem
[149176.310475] XFS (dm-4): Mounting V4 Filesystem
[149176.397907] XFS (dm-4): Ending clean mount
[149176.402287] device veth9338a6a entered promiscuous mode
[149176.404542] IPv6: ADDRCONF(NETDEV_UP): veth9338a6a: link is not ready
[149176.452380] eth0: renamed from vethea1b473
[149176.472466] IPv6: ADDRCONF(NETDEV_CHANGE): veth9338a6a: link becomes ready
[149176.474950] docker0: port 3(veth9338a6a) entered forwarding state
[149176.477048] docker0: port 3(veth9338a6a) entered forwarding state
[149191.516079] docker0: port 3(veth9338a6a) entered forwarding state
[149254.617881] XFS (dm-5): Mounting V4 Filesystem
[149263.403141] XFS (dm-5): Ending clean mount
[149263.914423] XFS (dm-5): Unmounting Filesystem
[149267.864165] XFS (dm-5): Mounting V4 Filesystem
[149275.100959] XFS (dm-5): Ending clean mount
[149275.182856] XFS (dm-5): Unmounting Filesystem
[149277.596934] XFS (dm-5): Mounting V4 Filesystem
[149284.623996] XFS (dm-5): Ending clean mount
[149284.631655] device vethb3758cf entered promiscuous mode
[149284.634042] IPv6: ADDRCONF(NETDEV_UP): vethb3758cf: link is not ready
[149284.696315] eth0: renamed from vethd39df60
[149284.724486] IPv6: ADDRCONF(NETDEV_CHANGE): vethb3758cf: link becomes ready
[149284.727024] docker0: port 4(vethb3758cf) entered forwarding state
[149284.729253] docker0: port 4(vethb3758cf) entered forwarding state
[149287.533498] XFS (dm-6): Mounting V4 Filesystem
[149295.371425] XFS (dm-6): Ending clean mount
[149298.228503] XFS (dm-8): Mounting V4 Filesystem
[149299.740079] docker0: port 4(vethb3758cf) entered forwarding state
[149303.598461] vethf805a12: renamed from eth0
[149303.608136] docker0: port 9(veth12fa990) entered disabled state
[149303.628754] docker0: port 9(veth12fa990) entered disabled state
[149303.633505] device veth12fa990 left promiscuous mode
[149303.635528] docker0: port 9(veth12fa990) entered disabled state
[149306.406484] XFS (dm-8): Ending clean mount
[149308.456428] XFS (dm-9): Mounting V4 Filesystem
[149315.351190] XFS (dm-9): Ending clean mount
[149317.382990] XFS (dm-10): Mounting V4 Filesystem
[149324.241323] XFS (dm-10): Ending clean mount
[149326.322200] XFS (dm-12): Mounting V4 Filesystem
[149333.257038] XFS (dm-12): Ending clean mount
[149335.614014] XFS (dm-13): Mounting V4 Filesystem
[149342.630850] XFS (dm-13): Ending clean mount
[149342.646079] XFS (dm-11): Unmounting Filesystem
[149345.614473] XFS (dm-11): Mounting V4 Filesystem
[149352.419559] XFS (dm-11): Ending clean mount
[149354.606527] XFS (dm-14): Mounting V4 Filesystem
[149361.989438] XFS (dm-14): Ending clean mount
[149364.248298] XFS (dm-15): Mounting V4 Filesystem
[149371.521260] XFS (dm-15): Ending clean mount
[149373.659157] XFS (dm-16): Mounting V4 Filesystem
[149380.988747] XFS (dm-16): Ending clean mount
[149381.228193] XFS (dm-12): Unmounting Filesystem
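One thing that can be read out of the dmesg output above is how long each XFS mount takes: the gap between "Mounting V4 Filesystem" and "Ending clean mount" for the same device. Some mounts on dm-4 finish in a fraction of a second, while several on dm-5 take six seconds or more. Whether that indicates slow storage I/O is an inference, not something confirmed in this thread. A rough Python sketch for extracting those durations (the function name is made up for illustration):

```python
import re

# Matches the two dmesg events we care about; "Unmounting Filesystem" is ignored.
LINE_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] XFS \((dm-\d+)\): "
    r"(Mounting V4 Filesystem|Ending clean mount)"
)


def xfs_mount_durations(dmesg_text):
    """Return [(device, seconds)] for each mount that started and completed."""
    started = {}      # device -> timestamp of the pending "Mounting" event
    durations = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        timestamp = float(match.group(1))
        device = match.group(2)
        if match.group(3).startswith("Mounting"):
            started[device] = timestamp
        elif device in started:
            durations.append((device, round(timestamp - started.pop(device), 6)))
    return durations


# A few lines copied from the dmesg output above.
sample = """\
[148598.000407] XFS (dm-4): Mounting V4 Filesystem
[148598.223849] XFS (dm-4): Ending clean mount
[148598.248095] XFS (dm-4): Unmounting Filesystem
[148835.564258] XFS (dm-5): Mounting V4 Filesystem
[148841.924312] XFS (dm-5): Ending clean mount
"""

for device, seconds in sorted(xfs_mount_durations(sample), key=lambda d: -d[1]):
    print(f"{device}: {seconds:.3f}s")
```

Feeding it the full `dmesg` text instead of the sample sorts every completed mount by duration, which makes the slow ones easy to spot.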
Here are the ecs-agent logs from around the incident:
2016-03-13T01:23:20Z [INFO] Pulling container module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (NONE->RUNNING) Containers: [churn_drivers (NONE->RUNNING),]" container="churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (NONE->RUNNING)"
2016-03-13T01:34:46Z [INFO] Creating container module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (NONE->RUNNING) Containers: [churn_drivers (PULLED->RUNNING),]" container="churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (PULLED->RUNNING)"
2016-03-13T01:34:46Z [INFO] Created container name mapping for task Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (NONE->RUNNING) Containers: [churn_drivers (PULLED->RUNNING),] - churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (PULLED->RUNNING) -> ecs-Churn_Prediction_320-7-churndrivers-b8a792d9f0dc87fcca01
2016-03-13T01:35:12Z [INFO] Created docker container for task Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (NONE->RUNNING) Containers: [churn_drivers (PULLED->RUNNING),]: churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (PULLED->RUNNING) -> a42e60f08bf661333f736178a27ad81a028972b84697fc2f397fc2c1f64e0b02
2016-03-13T01:35:12Z [INFO] Starting container module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (CREATED->RUNNING) Containers: [churn_drivers (CREATED->RUNNING),]" container="churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (CREATED->RUNNING)"
2016-03-13T01:35:55Z [INFO] Error transitioning container module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (CREATED->RUNNING) Containers: [churn_drivers (CREATED->RUNNING),]" container="churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (CREATED->RUNNING)" state="RUNNING"
2016-03-13T01:35:55Z [WARN] Error with docker; stopping container module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (CREATED->RUNNING) Containers: [churn_drivers (RUNNING->RUNNING),]" container="churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (RUNNING->RUNNING)" err="Could not transition to inspecting; timed out after waiting 30s"
2016-03-13T01:35:55Z [INFO] Stopping container module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (RUNNING->STOPPED) Containers: [churn_drivers (RUNNING->STOPPED),]" container="churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (RUNNING->STOPPED)"
2016-03-13T01:36:20Z [INFO] Redundant container state change for task Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (RUNNING->STOPPED) Containers: [churn_drivers (RUNNING->STOPPED),]: churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (RUNNING->STOPPED) to RUNNING, but already RUNNING
2016-03-13T01:36:55Z [INFO] Error transitioning container module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (RUNNING->STOPPED) Containers: [churn_drivers (RUNNING->STOPPED),]" container="churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (RUNNING->STOPPED)" state="STOPPED"
2016-03-13T01:36:55Z [INFO] Error for 'docker stop' of container; assuming it's stopped anyways module="TaskEngine" task="Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (RUNNING->STOPPED) Containers: [churn_drivers (STOPPED->STOPPED),]"
2016-03-13T01:36:55Z [INFO] Task change event module="TaskEngine" event="{TaskArn:arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027 Status:STOPPED Reason: SentStatus:NONE}"
2016-03-13T01:36:55Z [INFO] Adding event module="eventhandler" change="ContainerChange: arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027 churn_drivers -> STOPPED, Reason CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s, Known Sent: NONE"
2016-03-13T01:36:55Z [INFO] Adding event module="eventhandler" change="TaskChange: arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027 -> STOPPED, Known Sent: NONE"
2016-03-13T01:36:55Z [INFO] Sending container change module="eventhandler" event="ContainerChange: arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027 churn_drivers -> STOPPED, Reason CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s, Known Sent: NONE" change="ContainerChange: arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027 churn_drivers -> STOPPED, Reason CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s, Known Sent: NONE"
2016-03-13T01:36:55Z [INFO] Sending task change module="eventhandler" event="TaskChange: arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027 -> STOPPED, Known Sent: NONE" change="TaskChange: arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027 -> STOPPED, Known Sent: NONE"
2016-03-13T01:48:49Z [INFO] Redundant container state change for task Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (STOPPED->STOPPED) Containers: [churn_drivers (STOPPED->STOPPED),]: churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (STOPPED->STOPPED) to STOPPED, but already STOPPED
2016-03-13T01:48:49Z [INFO] Redundant container state change for task Churn_Prediction_320:7 arn:aws:ecs:us-west-2:156473786033:task/da12924f-5f6f-4483-bcdc-644fa0169027, Status: (STOPPED->STOPPED) Containers: [churn_drivers (STOPPED->STOPPED),]: churn_drivers(quay.io/appuri/churn-drivers:aa9829e) (STOPPED->STOPPED) to STOPPED, but already STOPPED
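To see where the time goes in agent logs like these, it can help to list the gaps between consecutive entries. The Python sketch below does that. The 30-second default simply mirrors the timeout in the error message, and the function name is hypothetical. Note that a long gap is not always Docker being slow: the gap after "Pulling container" also includes the image download.

```python
import re
from datetime import datetime

# Agent log lines start with an ISO-style UTC timestamp and a level in brackets.
ENTRY_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z \[(\w+)\] (.*)")


def step_gaps(log_text, min_gap_seconds=30):
    """Return [(gap_seconds, previous_message, next_message)] for slow steps."""
    previous = None
    gaps = []
    for line in log_text.splitlines():
        match = ENTRY_RE.match(line)
        if not match:
            continue
        timestamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        message = match.group(3)[:40]  # keep the report readable
        if previous is not None:
            gap = (timestamp - previous[0]).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((gap, previous[1], message))
        previous = (timestamp, message)
    return gaps


# Timestamps and message prefixes taken from the log above (lines shortened).
sample = """\
2016-03-13T01:23:20Z [INFO] Pulling container module="TaskEngine"
2016-03-13T01:34:46Z [INFO] Creating container module="TaskEngine"
2016-03-13T01:35:12Z [INFO] Starting container module="TaskEngine"
2016-03-13T01:35:55Z [INFO] Error transitioning container
"""

for gap, before, after in step_gaps(sample):
    print(f"{gap:>6.0f}s  {before!r} -> {after!r}")
```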
There seems to be nothing important in the Docker logs:
time="2016-03-13T01:36:01.995075070Z" level=info msg="GET /v1.17/containers/87d1b17d023a8439128c406428d1aa72902b9c134303b84645129f904e5aa64a/json"
time="2016-03-13T01:36:10.964134870Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Ftable-flipper%3A414c315"
time="2016-03-13T01:36:12.412347684Z" level=info msg="GET /v1.17/containers/a42e60f08bf661333f736178a27ad81a028972b84697fc2f397fc2c1f64e0b02/json"
time="2016-03-13T01:36:12.562930603Z" level=info msg="GET /v1.17/containers/b3c8c148af6d0f255ed082c1b9e6ac09b75579642c1e5f3bd90d1b3484fd4524/json"
time="2016-03-13T01:36:12.709987581Z" level=info msg="POST /v1.17/containers/create?name=ecs-Refresh_Users_and_Segments_335-3-tableflippercontainer-b69198ec999ed7d0fd01"
time="2016-03-13T01:36:15.328778441Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Fsql-task-runner%3A0866c7f"
time="2016-03-13T01:36:17.285630820Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Ftable-flipper%3A414c315"
time="2016-03-13T01:36:17.343400360Z" level=info msg="POST /v1.17/containers/create?name=ecs-Generate_User_Event_Index_216-2-sqltaskrunner-90feaca3cf8b9fbd5500"
time="2016-03-13T01:36:18.199433459Z" level=info msg="POST /v1.17/containers/4183cf25197d310d9db930ecad95125eed1f1b271251d53643fd9cdabea7c105/start"
time="2016-03-13T01:36:18.797324550Z" level=info msg="POST /v1.17/containers/create?name=ecs-Refresh_Users_and_Segments_164-3-tableflippercontainer-aef8a582ec9e9c9b0400"
time="2016-03-13T01:36:20.477754710Z" level=info msg="GET /v1.17/containers/07ece9de53c7a63024b15358688ca2026f2f806e2414e22b19a5b209bf7d767b/json"
time="2016-03-13T01:36:20.836527734Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Ftable-flipper%3A414c315"
time="2016-03-13T01:36:22.572960232Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Ftable-flipper%3A414c315"
time="2016-03-13T01:36:22.629403330Z" level=info msg="POST /v1.17/containers/create?name=ecs-Refresh_Users_and_Segments_177-3-tableflippercontainer-ecbafdedf5b1839b9101"
time="2016-03-13T01:36:24.145977361Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Ftable-flipper%3A414c315"
time="2016-03-13T01:36:24.188707563Z" level=info msg="POST /v1.17/containers/create?name=ecs-Refresh_Users_and_Segments_143-3-tableflippercontainer-a2f7abffa3c2aebefc01"
time="2016-03-13T01:36:25.816666122Z" level=info msg="POST /v1.17/containers/create?name=ecs-Refresh_Users_and_Segments_190-3-tableflippercontainer-dc90f0b7f4f8bfd9e701"
time="2016-03-13T01:36:26.404627303Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Ftable-flipper%3A414c315"
time="2016-03-13T01:36:28.024001357Z" level=info msg="POST /v1.17/containers/create?name=ecs-Refresh_Users_and_Segments_199-3-tableflippercontainer-f4cf919cc8a087af5f00"
time="2016-03-13T01:36:29.954773671Z" level=info msg="GET /v1.17/containers/ec6b67268206ea5c1fcdc278e591221a5afca19391e49731e0e31971f07de83f/json"
time="2016-03-13T01:36:33.822792337Z" level=info msg="GET /v1.17/containers/c73932d8393c2d75c0edc72074363f113f38f014c85fed5eb5869b05dff3ac5a/json"
time="2016-03-13T01:36:39.267658752Z" level=info msg="POST /v1.17/containers/b3c8c148af6d0f255ed082c1b9e6ac09b75579642c1e5f3bd90d1b3484fd4524/start"
time="2016-03-13T01:36:39.268587000Z" level=info msg="GET /v1.17/containers/088421aba4f4b810dabde3f1520ad720319bd8aaff3f0e257a7ac2e430ac8544/json"
time="2016-03-13T01:36:42.412616709Z" level=info msg="GET /v1.17/containers/07ece9de53c7a63024b15358688ca2026f2f806e2414e22b19a5b209bf7d767b/json"
time="2016-03-13T01:36:46.939015988Z" level=info msg="GET /v1.17/containers/9a941f8592625c090faa7bbb29d5d051460695bac358ff77da8968a9401eda5e/json"
time="2016-03-13T01:36:50.478022196Z" level=info msg="GET /v1.17/containers/9be050a0c9e8844fa3fc41140ab3347a5ff9786b8b214e09d9f9d9a142b4fe5f/json"
time="2016-03-13T01:36:50.906360414Z" level=info msg="POST /v1.17/images/create?fromImage=quay.io%2Fappuri%2Ftable-flipper%3A414c315"
time="2016-03-13T01:36:52.611985364Z" level=info msg="POST /v1.17/containers/create?name=ecs-Refresh_Users_and_Segments_264-4-tableflippercontainer-aa9fa48bf1d98f999801"
time="2016-03-13T01:36:53.298984787Z" level=info msg="GET /v1.17/containers/c7ae4c0ac987b161f36e2afbaa02d50344f2a0ce7efcac1a1ff03723b22de34f/json"
I will set ecs-agent log level to debug to see if we can get more interesting info.
Jakub
@veverjak Thanks for providing that information. Yes, you assumed correctly: I was looking for disk-device-related issues in dmesg. The Agent logs do show timeouts, which leads me to believe that the Docker daemon is responding slowly. Can you see how long it takes to perform a docker inspect of a container? Also, can you share the output of docker info and how long it took for that to return?
It does take a lot of time indeed:
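A minimal sketch of how those timings could be collected, assuming bash and a working docker CLI on the instance. The helper name elapsed_seconds is hypothetical (it is not part of docker or the ECS agent); the 30-second figure it is compared against comes from the agent error message quoted at the top of this issue.

```shell
# elapsed_seconds CMD [ARGS...]
# Runs the command, discards its output, and prints how many whole
# seconds it took. Hypothetical helper for illustration only.
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Example use on an affected instance (commented out because it needs docker):
#   echo "docker info: $(elapsed_seconds docker info)s"
#   for id in $(docker ps -q | head -n 5); do
#       echo "$id: $(elapsed_seconds docker inspect "$id")s"
#   done
```

If a single docker inspect takes anywhere near 30 seconds, that would line up with the agent's "timed out after waiting 30s" error.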
time docker info
Containers: 3802
Images: 154
Server Version: 1.9.1
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 107.4 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 37.38 GB
Data Space Total: 107.2 GB
Data Space Available: 69.78 GB
Metadata Space Used: 65.97 MB
Metadata Space Total: 109.1 MB
Metadata Space Available: 43.09 MB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.17-22.30.amzn1.x86_64
Operating System: Amazon Linux AMI 2015.09
CPUs: 4
Total Memory: 7.308 GiB
Name: ip-10-0-1-214
ID: PBSV:AZVO:LRPJ:G55J:U4VW:L6YV:YMKU:WLMH:53UF:ZYD3:G7CI:L26L
real 1m35.782s
user 0m0.024s
sys 0m0.008s
It seems like it's related to this issue, which is solved in Docker 1.10.
Do you know of this issue? Is there any workaround?
I believe Docker 1.10 is not in the Amazon ECS AMI yet, right?
Jakub
It looks like you're already using the mitigation we attempted for the ECS-optimized AMI after discussion in https://github.com/docker/docker/issues/18314, but the Docker daemon is still being slow. Docker 1.10.x is not yet available in the Amazon Linux AMI repo (and unfortunately there is no RPM for the Amazon Linux AMI from get.docker.com).
A potential option here is raising the timeouts, which are defined here, but my worry is that making them higher will make detecting failure slower and that the daemon could still continue to exceed raised timeouts.
If you have reproduction steps for the behavior you're seeing, I'd really appreciate them so I can try and find another mitigation.
I don't have repro steps, unfortunately. I can't see the issue consistently. It seems to be working now, so it's a delight to debug, I suppose.
I will try to test a few scenarios tomorrow and let you know if I have repro steps.
We have our own custom scheduler and we burst a lot of tasks at a time, so high load could be the source of the issue, but right now I am just guessing.
Jakub
@samuelkarp to add some color to what @veverjak said - we spin up containers as part of a background job scheduling system. We can easily spin up 50-100 containers at the same time. These are small containers (typical CPU/mem is 32/32) and they should fit fine on the ECS instances we use. Unfortunately, we see the issues in this thread.
I have one question, @samuelkarp: you said we are using the mitigation you tried in https://github.com/docker/docker/issues/18314, but that means using dm.basesize=10G, right?
We currently have the default Base Device Size: 107.4 GB.
Can I change this setting during the cloud-init run? It seems like the LVM volume is already created by the time cloud-init runs.
Can you advise how to do this in an automated way during instance initialization?
@samuelkarp ping on this. Does the ECS team have a plan for moving to Docker 1.10? We're down to periodically restarting Docker and the ECS agent - which kills all running containers.
This issue is really driving us crazy. Each morning we need to terminate the ECS instances that have this error in the agent log, so that the ASG can spin up new instances.
@samuelkarp could you please help here? Sorry, we don't have any reproduction steps.
docker info
Containers: 33
Images: 382
Server Version: 1.9.1
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 107.4 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 10.16 GB
Data Space Total: 78.16 GB
Data Space Available: 68 GB
Metadata Space Used: 6.648 MB
Metadata Space Total: 25.17 MB
Metadata Space Available: 18.52 MB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.17-22.30.amzn1.x86_64
Operating System: Amazon Linux AMI 2015.09
CPUs: 16
Total Memory: 29.44 GiB
Name: ip-**
ID: FP62:I76P:IAUQ:LMFR:A2LP:HAKN:6U4W:VXGE:VMS5:OQHE:2FWB:22S5
@hridyeshpant we restart Docker and the agent ... I think this gets around the need to terminate instances wholesale. @veverjak if I'm right, can you share the instructions?
But we can't restart Docker and the agent in a production environment; it would make the service unavailable.
We deregister the faulty container instance so that all running tasks can migrate to other ECS instances, and then terminate that instance. @bilalaslamseattle
@hridyeshpant that sucks .. sorry .. we're running into this for batch (background) jobs where a restart is ok because the job will get kicked off again.
@samuelkarp @aws-dpt - can you please update this issue with next steps? As @hridyeshpant already noted, this is causing major disruptions in services that rely on ECS. I also believe the More Info Needed tag on the issue is incorrect, you have all the info you need.
@veverjak The mitigation we use in the ECS-optimized AMI is the direct-lvm setup (two disks, with the second disk being a dedicated thin pool for Docker). LVM is set up as a bootcmd, so it runs fairly early in the boot process (see /etc/cloud/cloud.cfg.d/90_ecs.cfg on your instance), ahead of most user-data. You can change dm.basesize if you want (and it's a reasonable thing to try) with the following user-data:
#cloud-boothook
cloud-init-per instance dm_basesize sh -c "echo 'OPTIONS=\"\${OPTIONS} --storage-opt dm.basesize=10G\"' >> /etc/sysconfig/docker"
Note that the user-data I provided here is not guaranteed to work in the future if we change the AMI configuration.
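As a rough way to confirm that the boothook above ran, one could check that the option landed in /etc/sysconfig/docker. This is only a sketch, assuming bash; the function name has_basesize_opt is hypothetical.

```shell
# has_basesize_opt [FILE]
# Succeeds if FILE (default: /etc/sysconfig/docker) contains a
# dm.basesize storage option. Hypothetical helper for illustration.
has_basesize_opt() {
    local file=${1:-/etc/sysconfig/docker}
    grep -q -- '--storage-opt dm.basesize=' "$file"
}

# Example use on the instance (commented out because it needs the real file
# and a running daemon):
#   has_basesize_opt && echo "dm.basesize is set in /etc/sysconfig/docker"
#   docker info | grep 'Base Device Size'
```

The "Base Device Size" line of docker info, as seen in the outputs earlier in this thread, shows the value the daemon is actually using.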
@hridyeshpant @bilalaslamseattle Any sort of information you can provide on reproduction steps would be a great help here. If you're not sure how to reproduce it, answering these questions might help:
Hi @samuelkarp,
Your last hint was correct in our case: we have one job that bursts a lot of reads from disk, and this caused device saturation.
Overloading the devicemapper device is obviously what is making Docker slow.
I don't think we are hitting any ecs or docker bug here.
Thank you for all your help though,
Jakub
@veverjak I'm glad to hear that we've come to what sounds like a root cause. Let me know if you continue to have problems.
@hridyeshpant If you're still running into problems, can you check the things I suggested: https://github.com/aws/amazon-ecs-agent/issues/336#issuecomment-198026978?
@samuelkarp can you tell me how to check what the disk activity on our EBS volumes looks like (IOPS rate, provisioned/burst IOPS, etc.)?
Is there a command to get such data?
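One possible way to pull this data is from the AWS/EBS CloudWatch namespace using the AWS CLI. The following is a sketch only: the function name ebs_metric and the volume ID in the example are made up, and it assumes GNU date and configured AWS CLI credentials.

```shell
# ebs_metric VOLUME_ID METRIC STATISTIC
# Prints 5-minute datapoints for the last hour for one EBS volume.
# Hypothetical wrapper around `aws cloudwatch get-metric-statistics`.
ebs_metric() {
    local volume_id=$1 metric=$2 statistic=$3
    aws cloudwatch get-metric-statistics \
        --namespace AWS/EBS \
        --metric-name "$metric" \
        --dimensions "Name=VolumeId,Value=$volume_id" \
        --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
        --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --period 300 \
        --statistics "$statistic"
}

# Example use (the volume ID is a placeholder, not a real volume):
#   ebs_metric vol-0123456789abcdef0 VolumeWriteOps Sum
#   ebs_metric vol-0123456789abcdef0 VolumeQueueLength Average
```

Dividing a VolumeWriteOps Sum by the period (300 seconds here) gives an approximate IOPS rate. For a live view on the instance itself, `iostat -dx 5` also works if the sysstat package is installed.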
@samuelkarp I am attaching some EBS metrics for the faulty machines, where we just had 56 CannotInspectContainerError events.





"c4.4xlarge",
3. The average number of tasks per container instance is 12.
I had this issue after I updated the AMI version to amzn-ami-2015.09.g on m3.medium instances with the default 22 GB SSD storage. I was able to resolve it by increasing the storage size to 120 GB, but I would like to know if there is any other workaround that avoids expanding the storage.
@hridyeshpant Based on the graphs you posted, it looks like you have fairly high sustained and burst write IOPS on that volume. The EBS documentation has a good section on how IOPS credits are accumulated and consumed.
@samuelkarp we are seeing a similar thing. Our use case is to run scheduled jobs (lots of them) on a timer. For example, we routinely launch 50+ containers on an instance at once. The thing is that most of these containers are instances of the same 4 or 5 images. I would assume that Docker is smart about caching these images to reduce IO, i.e. if I launch 10 instances of the same container, it doesn't read the same image 10 times. Do you know if that's the case?
@bilalaslamseattle Docker will create metadata and a new write layer for each container that starts. I'm not sure offhand how much IO Docker will perform during concurrent container creation; your best bet is probably measuring what you consume in this particular use-case.
@samuelkarp big thanks for your help here. We found that this problem went away when we ran more smaller instances e.g. c4.large instead of a few bigger ones.
@bilalaslamseattle @hridyeshpant Please let us know if you need more help here or continue to face issues. I am closing this issue for now, as it seems we have root-caused the issue and have a remediation as well.
Confirming that the problem reported by the OP was indeed related to exhausted EBS IOPS credits when using the default 22 GB /dev/xvdcz volume (using c4.4xlarge instances with up to 20 tasks running per instance). The problem is observable in CloudWatch by comparing VolumeWriteOps with VolumeQueueLength: we could see VolumeQueueLength jump to over 10 once IOPS credits were exhausted (about 8 hours after launch).
The problem was particularly reproducible in the scenario where there are a number of services in the cluster that are continually deploying tasks (i.e. due to failing containers).
As per ECS team's advice, we solved it by configuring /dev/xvdcz as a 1,000GiB gp2 volume during launch to ensure we don't exhaust IOPS credits.
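To illustrate why the volume size matters, here is a back-of-the-envelope sketch of the gp2 credit arithmetic. It assumes the gp2 figures as I understand them from the EBS documentation: a baseline of 3 IOPS per GiB with a 100 IOPS minimum, bursting up to 3,000 IOPS, and a bucket of 5.4 million I/O credits. The function names are hypothetical, and the upper cap on gp2 baseline IOPS is ignored for simplicity.

```shell
# gp2_baseline_iops SIZE_GIB
# Baseline IOPS for a gp2 volume: 3 per GiB, never below 100.
gp2_baseline_iops() {
    local size_gib=$1 iops
    iops=$(( size_gib * 3 ))
    if [ "$iops" -lt 100 ]; then
        iops=100
    fi
    echo "$iops"
}

# gp2_burst_seconds SIZE_GIB SUSTAINED_IOPS
# Seconds until a full 5,400,000-credit bucket drains at the given
# sustained rate, or "never" if the rate does not exceed the baseline.
gp2_burst_seconds() {
    local size_gib=$1 rate=$2 baseline
    baseline=$(gp2_baseline_iops "$size_gib")
    # A bursting volume serves at most 3000 IOPS.
    if [ "$rate" -gt 3000 ]; then
        rate=3000
    fi
    if [ "$rate" -le "$baseline" ]; then
        echo never
    else
        echo $(( 5400000 / (rate - baseline) ))
    fi
}
```

Under these assumptions, `gp2_baseline_iops 22` prints 100 and `gp2_baseline_iops 1000` prints 3000, so a 1,000 GiB volume has a baseline equal to the burst rate and `gp2_burst_seconds 1000 3000` prints never. For the 22 GiB volume, `gp2_burst_seconds 22 288` prints 28723 (roughly 8 hours), so a sustained rate of a little under 300 IOPS would be in line with the timing reported above, although the actual rate was not stated.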
Any pointers on how to change the size/type/IOPS of the /dev/xvdcz volume?
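One way this can be done is with a block device mapping at launch time, either in the launch configuration or directly on run-instances. The following is a sketch under that assumption; the function name xvdcz_mapping is hypothetical, and the names in angle brackets in the example are placeholders.

```shell
# xvdcz_mapping SIZE_GIB VOLUME_TYPE
# Prints a block device mapping (JSON) that overrides /dev/xvdcz.
# Hypothetical helper; the JSON keys are those of the EC2 API.
xvdcz_mapping() {
    local size_gib=$1 volume_type=$2
    printf '[{"DeviceName":"/dev/xvdcz","Ebs":{"VolumeSize":%d,"VolumeType":"%s","DeleteOnTermination":true}}]\n' \
        "$size_gib" "$volume_type"
}

# Example use (commented out; values in <...> are placeholders):
#   aws autoscaling create-launch-configuration \
#       --launch-configuration-name <name> \
#       --image-id <ecs-optimized-ami-id> \
#       --instance-type c4.large \
#       --block-device-mappings "$(xvdcz_mapping 1000 gp2)"
```

A provisioned IOPS volume would additionally need an "Iops" key inside "Ebs", which this sketch does not emit.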