Kolibri: Tasks don't complete in Device - Channels module

Created on 30 Jan 2020  Â·  10Comments  Â·  Source: learningequality/kolibri

Observed behavior


Tasks are no longer completing after an abrupt shutdown (power to the server unplugged) while an update task was ongoing.
The tasks were completing and had even cleared them prior to the abrupt shutdown.

Expected behavior


Tasks should complete.

User-facing consequences


Can't accomplish any task in the Device -> Channels module.

Errors and logs


Herein are attached the logs of the last few days ever since the problem started.
kolibri-2020-01-25.txt
kolibri-2020-01-26.txt
kolibri-2020-01-27.txt
kolibri-2020-01-23.txt
kolibri-2020-01-24.txt

Steps to reproduce

  1. Start a channel content update task
  2. Simulate an abrupt shutdown of Kolibri while the task is still underway (Unplug the device)
  3. Start device and Kolibri
  4. No new tasks shall complete thereafter

Context

Kolibri server - 0.3.6
Kolibri - 0.13.0
Ubuntu 18.04.3

bug

Most helpful comment

After some live debugging with the @intelliant01 installation, the problem was on this line https://github.com/learningequality/kolibri/blob/release-v0.13.x/kolibri/core/content/management/commands/importchannel.py#L179

importchannel tries to acquire a lock to do the importation and could not do it because the database said a lock was acquired by another process with a different pid. Diskcache does not return any error in this case, it just keeps the acquire routine in an infinite loop: https://github.com/grantjenks/python-diskcache/blob/master/diskcache/recipes.py#L145-L154 until the lock is released.

The problem appears because when the abrupt shutdown happened, kolibri was interrupted while doing the import and had a pid that was inserted in the cache.db to lock the process. After kolibri restarted the pid was not the same and we were in a deadlock.

I think the solution should be deleting cache.dbwhenever kolibri starts, to avoid this kind of problems, but I'd like to hear more opinions, specially from @rtibbles who implemented the locking here.

btw, this bug may happen with all the tasks using db_task_write_lock, not only importing channels, also deleting them, trying to vacuum, importing content, etc.

I am attaching here the bad db to help debugging

All 10 comments

Thanks for the report @intelliant01!

I'm a bit confused though. I believe the expected behavior for both 0.12.x and 0.13.0 is that any transfers that are outstanding during a hard shutdown will "fail" and need to be manually re-initiated.

Would you mind clarifying a couple things?

Tasks are no longer completing after an abrupt shutdown (power to the server unplugged) while an update task was ongoing.
The tasks were completing and had even cleared them prior to the abrupt shutdown.

Are you saying that in 0.12.x, content transfers would resume after a hard shutdown of Kolibri? I just tested on 0.12.9, and the behavior was that the transfer had disappeared completely, which I believe is the same behavior as in 0.13.0.

Can't accomplish any task in the Device -> Channels module.

Are you saying that after a hard shutdown, content transfers working altogether and cannot be re-initiated?

Are you saying that in 0.12.x, content transfers would resume after a hard shutdown of Kolibri?

No. They don't resume. They disappear.

Are you saying that after a hard shutdown, content transfers (stop) working altogether and cannot be re-initiated?

Yes, they stop working i.e. all content tasks can be re-initiated but they never complete.

If there is still any doubt then shall create a gif and share.

Thank you! Yes, a gif would definitely be helpful. It also sounds somewhat similar but slightly different to a problem reported in the forums here: https://community.learningequality.org/t/import-export-issues/1396

My experience is similar to point 1 therein, though I have not tried the uninstall and reinstall step -

When I import several resources at the same time the Task Manager gets stuck at 100%. Cancelling the import doesn’t work. Only after restarting the computer does the task disappear. But afterwards I get the same problem with every task. So completely unable to import anything. I have to uninstall (deleting all imports) and reinstall Kolibri to be able to import again.

However I do remember being able to start and complete multiple imports. They did execute sequentially but they did get enlisted in the tasks list simultaneously and did complete. It was only after a hard shutdown that I experienced the problem.

Shall try to provide a gif soon as I may be away from the problem system for the next two days.

Hi @intelliant01 -

We've tried unsuccessfully to reproduce. Are you still seeing this?

Pasting the main error I saw in the logs provided below:

ERROR 2020-01-23 17:04:02,347 exception Internal Server Error: /api/content/remotechannel/7426fe78415147308e81e2004b987c52/
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/connectionpool.py", line 839, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/connection.py", line 301, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fd42ef42898>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/kolibri/dist/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/kolibri/dist/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='studio.learningequality.org', port=443): Max retries exceeded with url: /api/public/v1/channels/lookup/7426fe78415147308e81e2004b987c52 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd42ef42898>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/dist/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/dist/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/lib/python3/dist-packages/kolibri/dist/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/lib/python3/dist-packages/kolibri/dist/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "./kolibri/core/content/api.py", line 986, in retrieve
    identifier=pk, baseurl=baseurl, keyword=keyword, language=language
  File "./kolibri/core/content/api.py", line 906, in _make_channel_endpoint_request
    resp = requests.get(url)
  File "/usr/lib/python3/dist-packages/kolibri/dist/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/dist/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/dist/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/dist/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/dist/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='studio.learningequality.org', port=443): Max retries exceeded with url: /api/public/v1/channels/lookup/7426fe78415147308e81e2004b987c52 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd42ef42898>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Hi @indirectlylit,

Yes am facing this issue even after updating to Kolibri 0.13.1rc1 . Rest of the environment is the same.

Herein are 2 gifs (Had hit pause in the middle of recording to avoid nagging wait time) -

tasks_not_completing

tasks_not_completing-1

Few changes noticed from 0.13.0 -

  1. The task cancellation has started working.
  2. A task's progress bar is no longer visible which used to quickly move to 100% earlier and never complete. Instead I just see a 'waiting' now.

After some live debugging with the @intelliant01 installation, the problem was on this line https://github.com/learningequality/kolibri/blob/release-v0.13.x/kolibri/core/content/management/commands/importchannel.py#L179

importchannel tries to acquire a lock to do the importation and could not do it because the database said a lock was acquired by another process with a different pid. Diskcache does not return any error in this case, it just keeps the acquire routine in an infinite loop: https://github.com/grantjenks/python-diskcache/blob/master/diskcache/recipes.py#L145-L154 until the lock is released.

The problem appears because when the abrupt shutdown happened, kolibri was interrupted while doing the import and had a pid that was inserted in the cache.db to lock the process. After kolibri restarted the pid was not the same and we were in a deadlock.

I think the solution should be deleting cache.dbwhenever kolibri starts, to avoid this kind of problems, but I'd like to hear more opinions, specially from @rtibbles who implemented the locking here.

btw, this bug may happen with all the tasks using db_task_write_lock, not only importing channels, also deleting them, trying to vacuum, importing content, etc.

I am attaching here the bad db to help debugging

Yes - I think regenerating it when Kolibri starts seems wise.

Hi @intelliant01, @jredrejo has a fix in #6567, which will be included in release 0.13.1.
I'm going to close this issue now, but please let us know if you still see the error after upgrading to 0.13.1. Thank you!

@jredrejo @lyw07 thanks for the update. Have removed the problem cache.db as of now. However, will keep an eye for recurrence of this issue.

Was this page helpful?
0 / 5 - 0 ratings