Kiwix-android: Current download process potentially breaks ZIM files

Created on 9 Apr 2018 · 33 Comments · Source: kiwix/kiwix-android

Bug Report

Environment

  • version of the software you use : git master
  • device / operating system : all

The Bug:

The download process automatically splits ZIM files into 4 GB chunks. This assumes that a ZIM file can be split at any position, which is no longer true. Because some content (video or full-text indexes, for example) is accessed directly by sharing the file descriptor, it is no longer possible to split content in the middle of a cluster.
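To make the failure mode concrete, here is a minimal sketch of the arithmetic behind a fixed-size split. The class and method names are hypothetical and not part of kiwix-android or kiwix-lib; the sketch only assumes that a piece of content occupies a contiguous byte range in the ZIM file.

```java
public final class ChunkMath {
    private ChunkMath() {}

    /** Index of the part file that holds the byte at the given absolute offset. */
    public static long partIndex(long offset, long chunkSize) {
        return offset / chunkSize;
    }

    /**
     * True if the byte range [start, start + length) crosses a part boundary,
     * i.e. its first and last bytes end up in different part files.
     */
    public static boolean straddlesBoundary(long start, long length, long chunkSize) {
        if (length <= 0) {
            return false; // an empty range cannot be cut
        }
        long lastByte = start + length - 1;
        return partIndex(start, chunkSize) != partIndex(lastByte, chunkSize);
    }
}
```

Any range for which `straddlesBoundary` returns true cannot be served from a single file descriptor after a fixed-size split, which is the situation described above.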

Steps to reproduce:

  1. Download WPFR ZIM
  2. Make a search, for example "petit antoine marc"
  3. No suggestion is given; the suggestion search does not use the Xapian index.

Expected behaviour:

The suggestion system should be far more effective. With the example above, kiwix-search (run against a non-split version of the ZIM) returns:

./kiwix-search -s wikipedia_fr_all_novid_2018-03.zim "petit antoine marc"
Marc-Antoine Petit

Other Comments:

Kiwix-lib will handle all of this in the future, but for now I would recommend not splitting files anymore. In practice: do not allow downloading content larger than 4GB if the filesystem cannot write a file larger than 4GB (this should be easy to test).

bug

All 33 comments

Currently we use 2GB chunks. We will need to move this to 3.X to accommodate this change.

@mhutti1 Could we, for 2.4:

  • Increase to 4GB (no reason to stick to 2GB IMO)
  • Do not split if the FS can handle 4GB+ files

For 3.X the kiwix-lib should be able to provide the download primitive.

So you don't want to split ZIMs at all anymore? No more Wikipedia on 90% of Android devices?

@mhutti1 We cannot split ZIM files at arbitrary positions anymore; this is a fact. Smart splitting can and will be done with the next versions of libkiwix. It does not make sense to build something smart on the Java side now. But we still need to do our best for the UX, so we have two options if the filesystem cannot handle files > 4GB:

  • Forbid downloading ZIM files > 4GB
  • Let the system split it... but then the FT system probably won't work and the suggestions will be poor (with a Xapian index).

Can we not fall back on the old search system in such a case? I really do not want to have to disable chunks of the library :(

@mhutti1 Should work but means:

  • no ft search
  • really limited suggestions

But this is basically a strong regression (although so far I can see no crash) in a core feature of Kiwix. It is quite difficult, IMO, to argue that this problem is not a blocker.

There has never been full-text search for the full Wikipedia on Android, right? The FT index was too large. With what I suggest, users can still use the files that they have. Maybe mark them as having limited search functionality?

@mhutti1 How about keeping the same download methodology and merging the zim files on supported devices?

@sakchhams That could work but that isn't the main issue. The vast majority of devices don't support it.

@mhutti1 Once we have the complete file (all its parts) on the file system, we can redistribute the content across the parts so that the video files, FT index, etc. each remain in a single part. This shouldn't take too long to implement and can work as a temporary solution until the download code is moved to kiwix-lib.

@RohanBh That would work for files overlapping breaks but not for files actually larger than 4GB. I also don't know how easy working out the relative file descriptor offsets would be with the current lib kiwix.

@mhutti1 the current implementation doesn't take a file descriptor. It takes a path to the file instead. If it would accept a file descriptor it would fix this issue and #56 both.

@sakchhams We use file descriptors for video content already. The issue is that if a single file is larger than 4GB, we have to split it whatever happens. If you think you can get a valid file descriptor from a file that doesn't get duplicated on install, then definitely bring that up in #56, but I don't think that it is possible.

@mhutti1 That's not what I meant. Assume we have a video in the zim file but it's split in two and is stored in the files file.zimaa and file.zimab of 2 GB each. What I meant is to read the bits from these files and start a new file where the blob for the video starts and then end the file when that blob ends. So that, in the end, we will have 3 files file1.zimaa, file1.zimab and file1.zimac. Then we will delete the original parts and rename the newly made parts. This way the video's data will be stored in a cluster in file1.zimab. We can then provide the file descriptor for video content.
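The re-splitting idea in this comment can be sketched as a planning step that only cuts at offsets known to be safe. This is a minimal sketch under a strong assumption: that a sorted list of safe cut offsets (for example, where a blob or cluster starts) is available. Nothing in this thread confirms the Java side can obtain such a list, and `SplitPlanner` and `planParts` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public final class SplitPlanner {
    private SplitPlanner() {}

    /**
     * Greedily chooses part start offsets so that every cut lands on a safe boundary.
     *
     * @param boundaries  ascending offsets in (0, totalSize) where cutting is safe
     *                    (hypothetical input, e.g. cluster start offsets)
     * @param totalSize   size of the whole ZIM file in bytes
     * @param maxPartSize largest file the filesystem accepts, in bytes
     * @return start offset of each part, beginning with 0
     */
    public static List<Long> planParts(long[] boundaries, long totalSize, long maxPartSize) {
        List<Long> starts = new ArrayList<>();
        starts.add(0L);
        long partStart = 0;
        int i = 0;
        while (totalSize - partStart > maxPartSize) {
            long best = -1;
            // Take the furthest safe boundary that keeps the current part within the limit.
            while (i < boundaries.length && boundaries[i] - partStart <= maxPartSize) {
                if (boundaries[i] > partStart) {
                    best = boundaries[i];
                }
                i++;
            }
            if (best < 0) {
                // A single unsplittable unit is larger than the filesystem limit.
                throw new IllegalArgumentException(
                        "No safe cut within " + maxPartSize + " bytes of offset " + partStart);
            }
            starts.add(best);
            partStart = best;
        }
        return starts;
    }
}
```

The exception branch is the case where one unsplittable unit is itself larger than the filesystem limit: no choice of cut points helps there.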

@RohanBh There is no need for this because we already do this but without needing to recreate files. We can open videos with a file descriptor and an offset. The issue is files larger than 4GB.

I understand that when files are larger than 4GB we can't store them as one file on Android devices because of FAT32 filesystem limitations. What I am saying is that, when this happens, we download the file in 2GB chunks until the whole file is downloaded. Once that happens, it is possible that a piece of video content has been split across chunks. Because of this, a valid file descriptor can't be provided for the whole video. The above solution aims to fix this problem (unless I don't understand the problem completely).

Yes but if the file is bigger than 4GB then it does not fix the problem and that is the main issue we are facing especially with ft indexes.

If we can get the split to happen so that the videos and ft indexes are not split between different chunks, then wouldn't that solve the problem?

@RohanBh There is no way - for now - to do that in Java. Java does not know where to split properly and can not know it.

We can add a class for this operation in kiwix-lib and make it available in Java through JNI.

@RohanBh At that point we might as well do it all in kiwix-lib which is what we are working on with the added benefit of being able to handle files larger than 4GB which your solution still does not fix.

@mhutti1 Can you please tell me why my solution doesn't fix this issue?

@RohanBh 2 premises

  1. kiwix-lib can't read files split between physical files.
  2. Files on the Android file system of the majority of devices can't be larger than 4GB.

Given premises 1 and 2, it is impossible to get kiwix-lib to read a >4GB full-text index in a ZIM file that has been split.

@mhutti1 If there is a ZIM file that is split such that none of its entries (like image, audio, video) is split between two physical files, will kiwix-lib be able to handle it successfully?
Because this is what I was proposing. I wanted to make a program that takes a ZIM file as input and outputs a split ZIM file that is not "broken", i.e. one that can be handled by kiwix-lib.

@RohanBh Yes, that is what it does now. Again, though, this would not fix the issue for the major cases we have, i.e. full-text indexes. We are already working on a solution that fixes the problem fully, not just part of it.

This is a bit hard for me to test. My emulator is playing up for large ZIM files and my phone doesn't have enough space. @kelson42 Have you seen this on any smaller ZIMs?

@mhutti1

The problem appears only with ZIM files over 4GB. All the others should simply work as they do today.

Kiwix Android should know whether it really needs to split a ZIM file into 4GB chunks. This can be done simply by trying to create a dummy file of more than 4GB and then removing it.

Then we have two cases:

  • If it works, there is no need to split the ZIM files and so no problem
  • Otherwise, you will have to split the file, but libkiwix should provide a legacy mode transparently. I have opened a ticket for that: https://github.com/kiwix/kiwix-lib/issues/154
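The dummy-file test mentioned in this comment might look like the sketch below. It is not the code that kiwix-android uses; `LargeFileProbe`, `canHold` and the probe file name are hypothetical. It relies on `RandomAccessFile.setLength`, which on many filesystems extends the file sparsely, and which is expected to throw an `IOException` when the filesystem rejects the size. On filesystems without sparse files the call may be slow or fail for lack of free space, so a `false` result does not necessarily point to a FAT32-style limit.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public final class LargeFileProbe {
    /** One byte past this size is the smallest file a 4GB-limited filesystem rejects. */
    public static final long FOUR_GIB = 4L * 1024 * 1024 * 1024;

    private LargeFileProbe() {}

    /**
     * Tries to create a file of the given size in dir and removes it again.
     * Returns false if the file cannot be created or extended to that size.
     */
    public static boolean canHold(File dir, long size) {
        File probe = new File(dir, "kiwix-large-file-probe.tmp"); // hypothetical name
        try (RandomAccessFile raf = new RandomAccessFile(probe, "rw")) {
            raf.setLength(size);
            return raf.length() == size;
        } catch (IOException e) {
            return false;
        } finally {
            // The stream is already closed here, so the probe can be deleted.
            probe.delete();
        }
    }
}
```

A caller would check something like `LargeFileProbe.canHold(storageDir, LargeFileProbe.FOUR_GIB + 1)` before deciding between the two cases listed above, where `storageDir` stands for whatever directory the download targets.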

@kelson42 Do we not already use legacy mode? Can you give an example of an article that you can't find by searching by title?

@mhutti1 Legacy mode works fine if there is no Xapian index in the ZIM file. Here (1) we have a Xapian FT index, (2) but it is split/corrupted. In that case it fails to give any suggestions.

Do you have an example of a ZIM file with this issue?

@kelson42 It doesn't. I can search on it.

We have decided to no longer split ZIM files on the device during download. The reason is that, with the information currently available, it is impossible for Kiwix-Android to know how to split files properly. With 2.5, downloading files that are not supported by the filesystem is no longer allowed. We will try to expose in the OPDS feed a way to know how to split files, see https://github.com/kiwix/kiwix-tools/issues/287. Then we could reconsider splitting files in Kiwix-Android.
