Carla: Unexplained consistent random stalling/crash with latest release builds

Created on 9 Jan 2019 · 9 Comments · Source: carla-simulator/carla

Minimal setup to reproduce the issue:

  • Multiple (>=2) CARLA 0.9.2 release server binaries running on the same GPU,
  • 1 vehicle actor spawned with 1 RGB camera attached,
  • Using compatible PythonAPI build (0.9.2) for client communications

Observations/ Issues to be addressed:

  • Works as expected for a few thousand ticks
  • [ ] Consistently stalls (there is GPU activity though) after several thousand ticks (~2 hrs)
  • [x] Tries to send crash reports to Epic Games server which sometimes fails and causes stalling
  • [ ] The stderr/stdout messages are not informative when a SIGSEGV (signal 11) is received
  • [x] ERROR:tcp accept error: Too many open files (#1119)

Sample (short) trace of stdout when the stalling happens:

...
================================================================================
[2019.01.08-22.02.55:704][  0]LogInit: Display: Game Engine Initialized.
[2019.01.08-22.02.55:704][  0]LogGameplayTags: Display: UGameplayTagsManager::DoneAddingNativeTags. DelegateIsBound: 0
[2019.01.08-22.02.55:704][  0]LogStats: UGameplayTagsManager::ConstructGameplayTagTree: Construct from data asset -  0.000 s
[2019.01.08-22.02.55:704][  0]LogStats: UGameplayTagsManager::ConstructGameplayTagTree: GameplayTagTreeChangedEvent.Broadcast -  0.000 s
[2019.01.08-22.02.55:718][  0]LogInit: Display: Starting Game.
[2019.01.08-22.02.55:718][  0]LogNet: Browse: /Game/Carla/Maps/Town01?Name=Player
[2019.01.08-22.02.55:731][  0]LogLoad: LoadMap: /Game/Carla/Maps/Town01?Name=Player
[2019.01.08-22.02.58:337][  0]LogAIModule: Creating AISystem for world Town01
[2019.01.08-22.02.59:354][  0]LogLoad: Game class is 'TheNewCarlaGameMode_C'
[2019.01.08-22.02.59:768][  0]LogWorld: Bringing World /Game/Carla/Maps/Town01.Town01 up for play (max tick rate 0) at 2019.01.08-17.02.59
[2019.01.08-22.02.59:795][  0]LogWorld: Bringing up level for play took: 0.437414
[2019.01.08-22.02.59:813][  0]LogCarlaServer: Initializing rpc-server at port 37382
[2019.01.08-22.02.59:821][  0]LogCarlaServer: New episode 'Town01' started
[2019.01.08-22.02.59:826][  0]LogLoad: Took 4.095626 seconds to LoadMap(/Game/Carla/Maps/Town01)
[2019.01.08-22.03.00:880][  0]LogLoad: (Engine Initialization) Total time: 6.40 seconds
[2019.01.08-22.03.01:057][  0]LogRenderer: Reallocating scene render targets to support 856x640 Format 10 NumSamples 1 (Frame:1).
[2019.01.08-22.03.01:651][  1]LogLinux: Setting swap interval to 'Immediate'
[2019.01.08-22.03.01:651][  1]LogLinux: Warning: Unable to set desired swap interval 'Immediate'
[2019.01.08-22.03.01:652][  1]LogCarla: Starting AWorldObserver sensor
[2019.01.08-22.03.56:994][626]LogHttp: Warning: 0x7fbb68fd5c80: request failed, libcurl error: 6 (Couldn't resolve host name)
[2019.01.08-22.03.56:994][626]LogHttp: Warning: 0x7fbb68fd5c80: libcurl info message cache 0 (Could not resolve host: datarouter.ol.epicgames.com)
[2019.01.08-22.03.56:994][626]LogHttp: Warning: 0x7fbb68fd5c80: libcurl info message cache 1 (Closing connection 0)


Related Issues: #1
CARLA Version: 0.9.2
OS: Ubuntu

bug c++

All 9 comments

Another reason for a crash:

terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >: bind: Address already in use
Signal 6 caught.
Malloc Size=131076 LargeMemoryPoolOffset=131092 
Malloc Size=65535 LargeMemoryPoolOffset=196655 
Malloc Size=51811 LargeMemoryPoolOffset=248483 
Aborted (core dumped)

I noticed another run-time error (it may be happening after the server crashes). It looks related to #1119:
rpc::timeout: Timeout of 60000ms while calling RPC function 'destroy_actor'

Another one: RuntimeError: rpc::rpc_error during call in function spawn_actor
This seems unrecoverable once it happens. It would be better if the error were raised with some traceable information or a reason.

rpc::timeout: Timeout of 60000ms while calling RPC function 'destroy_actor'

This happens because our scripts try to destroy the actors on exit. If the server has crashed, they still try to connect, but the server is no longer there.

RuntimeError: rpc::rpc_error during call in function spawn_actor

This is usually due to a collision at the spawn position. If that's the case, it's safe to try to spawn again somewhere else or, alternatively, to use try_spawn_actor, which returns None instead of raising an exception. The lack of information is a known issue: the spawn function returns a string with the cause of failure, but there is a problem retrieving this message on the client side. I'll open another issue for this so that people looking for this same message can find it (#1095).
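The retry idea can be sketched as follows. This is only an illustration, not code from the CARLA repository: spawn_with_retry is a hypothetical helper, and it takes any callable that returns an actor or None so that it does not depend on a particular client version.

```python
def spawn_with_retry(try_spawn, spawn_points):
    """Return the first actor that spawns, or None if every point fails.

    try_spawn: callable that takes one spawn point and returns an actor,
               or None when the spawn fails (e.g. because of a collision).
    spawn_points: iterable of candidate spawn points, tried in order.
    """
    for point in spawn_points:
        actor = try_spawn(point)
        if actor is not None:
            return actor
    return None


# Possible use with a CARLA client (assumed names: world, blueprint):
#   actor = spawn_with_retry(
#       lambda transform: world.try_spawn_actor(blueprint, transform),
#       world.get_map().get_spawn_points(),
#   )
```

If the helper returns None, every candidate position was rejected, which the caller can handle without catching an exception.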

Another reason for a crash:

terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >: bind: Address already in use

The bind: Address already in use issue is still present in the 0.9.3 release. The process might just need to try to bind to an unused port or, even better, it can bind() to port 0, in which case the OS will allocate an unused port.
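The port 0 behaviour can be shown with a plain Python socket, independent of CARLA. This is a minimal sketch of what the OS does, not the server's actual code; bind_to_free_port is a name made up for the example.

```python
import socket


def bind_to_free_port(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, port chosen by the OS)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any unused port
    port = sock.getsockname()[1]  # read back the port that was assigned
    return sock, port


if __name__ == "__main__":
    first, first_port = bind_to_free_port()
    second, second_port = bind_to_free_port()
    print("OS-assigned ports:", first_port, second_port)
    first.close()
    second.close()
```

The socket is returned still bound on purpose: closing it and rebinding later would leave a window in which another process could take the same port.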

@nsubiron If you have a fix in place for the bind: Address already in use issue, could you please push it to a branch (any other branch, if not master)?

@praveen-palanisamy You can find the latest build here. You can now launch the simulator with -carla-streaming-port= to select the streaming port (and -carla-rpc-port= for the main port). If the streaming port is set to 0, a random available port is chosen.
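For the multi-server setup described in this issue, the two flags could be combined roughly like this. It is a sketch under assumptions: build_server_commands is a hypothetical helper, and the binary path and RPC port numbers are placeholders rather than values taken from this thread.

```python
def build_server_commands(binary, rpc_ports):
    """Return one argument list per server instance.

    Each instance gets its own RPC port, and the streaming port is set to 0
    so that a random available port is chosen for it.
    """
    return [
        [binary, "-carla-rpc-port={}".format(port), "-carla-streaming-port=0"]
        for port in rpc_ports
    ]


if __name__ == "__main__":
    # "./CarlaUE4.sh" and the port numbers are placeholders for illustration.
    for argv in build_server_commands("./CarlaUE4.sh", [2000, 3000]):
        print(" ".join(argv))
```

Each argument list can be passed to subprocess.Popen to start one server per entry.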

@nsubiron Nice! Thanks for the fix. It seems to be working well.
We can probably close this issue once it is merged into master.

Great :)
