Mastodon: Data export as .zip instead of .tar.gz

Created on 20 Nov 2018 · 18 Comments · Source: tootsuite/mastodon

Pitch

Right now, the account archive can be obtained from /settings/export, which is great. I believe this export should be in a more common format, though. I would like to suggest using .zip instead of .tar.gz.

Motivation

Mastodon is no longer a tool just for geeks. Every major operating system comes with tools to unpack zip files. Tar, however, needs to be unpacked by a third-party tool, at least on Windows. I don't have stats, but I would guess that a large portion of Mastodon's user base is using Windows. On the other hand, I don't believe zip has any disadvantages over tar.gz, which makes the format a mere implementation detail.

suggestion

There is nothing in the archive that you could really make use of without a specialized tool of some sort. There are images and JSON files. So I don't see how zip vs tar makes a difference. A specialized tool can do tar.

Twitter does something very nice, in my opinion: they have this tiny html thing that can (I guess?) parse the JSON. This would be a separate feature request, though. It would add tremendous value to the export.

I think .zip is better than .tar.gz because a single file can be extracted from a .zip without extracting the whole archive.
To extract one photo from a .tar.gz, the archiver has to decompress the .gz stream and scan through the .tar until it finds the file. (.tar stands for Tape ARchive, so access is sequential.)
For a use case such as finding a toot in outbox.json and then pulling its attachment out of a large archive without extracting everything, I think .zip is faster than .tar.gz.
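The access-pattern difference described above can be sketched with Python's standard `zipfile` and `tarfile` modules. This is only an illustration: the member names and contents are hypothetical stand-ins, not the actual layout of a Mastodon export.

```python
import io
import tarfile
import zipfile

# Hypothetical archive contents (assumption: not the real export layout).
members = {
    "outbox.json": b'{"orderedItems": []}',
    "media/a.png": b"\x89PNG" + bytes(1000),
    "media/b.png": b"\x89PNG" + bytes(2000),
}

def build_zip(files):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

def build_tar_gz(files):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def read_from_zip(blob, name):
    # zipfile reads the central directory at the end of the archive,
    # then seeks straight to the requested member.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)

def read_from_tar_gz(blob, name):
    # Stream mode ("r|gz") makes the sequential access explicit:
    # members are visited in order until the wanted one turns up.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r|gz") as tf:
        for info in tf:
            if info.name == name:
                return tf.extractfile(info).read()
    raise KeyError(name)
```

Both functions return the same bytes; the difference is how much of the archive has to be read and decompressed first, which only starts to matter for large exports.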

+1 for zip.

zip has no encoding information.
CJK users have problems unzipping it.
I don't recommend zip.

.zip only has a filename encoding issue when a file has a non-ASCII name.
I think the Mastodon export archive only contains files with ASCII names (unless Mastodon is run by a system user with a non-ASCII username, which I think is problematic on a Linux system anyway).
So I think there is no problem with filename encoding.
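Whether an archive is affected could be checked along these lines. This is a rough sketch, assuming Python 3.7+ for `str.isascii()`; the file names are made up for illustration. It also reads general purpose bit 11 of each zip entry, which the ZIP format uses to mark a name as UTF-8 and which Python's `zipfile` sets when it writes a non-ASCII name.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11 in the ZIP format

def non_ascii_names(names):
    # Names that could be mangled by tools assuming a legacy code page.
    return [n for n in names if not n.isascii()]

def build_zip(names):
    # Helper for the demonstration: an in-memory zip with empty members.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()

def zip_name_report(blob):
    # Maps each member name to whether its entry carries the UTF-8 flag.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in zf.infolist()
        }

# Hypothetical names: one plain ASCII, one with CJK characters.
names = ["outbox.json", "media/写真.png"]
report = zip_name_report(build_zip(names))
```

If `non_ascii_names` comes back empty for a real export, the encoding concern does not apply to that archive. Older unzip tools may still ignore the UTF-8 flag, so this does not settle the question for every client.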

If I interpret this correctly, tar does not solve the encoding problem, either? https://superuser.com/a/60591/286021

The other answer on that thread suggests that CJK characters may be problematic on Windows, even when they come from a tar file.

IMHO you should just offer ZIP as an alternative for tar.gz. Just two buttons or so… :smile:

There is nothing wrong with just another option. Tar.gz also has its advantages (AFAIK better compression).

> Twitter does something very nice, in my opinion: they have this tiny html thing that can (I guess?) parse the JSON. This would be a separate feature request, though. It would add tremendous value to the export.

@ccoenen Did you open a new issue for that?

@rugk no, I did not add another request. I like the idea, but it's not _terribly_ important to me right now. I didn't want to add an issue that I (personally) wouldn't even use in the foreseeable future. Feel free to create an issue for it!

Well… as you can see at least two users are already interested in it. (And without an issue, such ideas can easily get lost, that's why I would always open an issue.) So I've opened one at https://github.com/tootsuite/mastodon/issues/9461.

> Twitter does something very nice, in my opinion: they have this tiny html thing that can (I guess?) parse the JSON. This would be a separate feature request, though. It would add tremendous value to the export.

FYI I'm using this tool to help slowly sanitise my Twitter account prior to deletion/dormancy

I think we should compare the sizes of both archives. If .tar.gz is significantly smaller, we should stick with it to save bandwidth.

Also, you can just use 7zip to open .tar.gz. If someone tells you they are not using 7zip, then they can't work with the data inside the archive anyway.

> I think we should compare the sizes of both archives. If .tar.gz is significantly smaller, we should stick with it to save bandwidth.
>
> Also, you can just use 7zip to open .tar.gz. If someone tells you they are not using 7zip, then they can't work with the data inside the archive anyway.

This is a basic usability question. Nobody knows what a tar.gz is. Most people know what a .zip is. That there is software (yes, I myself use 7zip as well) that can read a tar.gz is not the point. It is about everyone being able to open this. Making the contents easier to work with is already requested in #9461.

For a case that happens as rarely as download of an archive, I would _not_ use "how much bandwidth is used" as a metric. It makes no sense to me.

> Nobody knows what a tar.gz is

I know what it is. Most non-technical people don't.

> Making the contents easier to work with is already requested in #9461.

Adding an additional zip option would probably take some UI work, but it is probably worth it. zip doesn't compress pure text as well as tar.gz, e.g. if you download the latest Mastodon release, the difference is almost 4MB, which is almost 20%.

If I need to archive something fast and I have a lot of data, the few MB saved can make the difference. I am not against adding zip, just against replacing tar.gz.

No. You're literally optimizing for the wrong thing, here. You're optimizing for machines and technically minded people. What's worse is, you're doing so on a hunch without checking for the data in question.

I just downloaded an archive of one of my accounts. With the current way of doing things it exports as a 55.6MB tar.gz, which unpacks to 61.6MB. The same data in various formats:

  • 55.6MB tar.gz from export function (=90% of original)
  • 57.1MB in zip "normal" setting of 7zip (=93% of original)
  • 57.0MB in zip "maximum" setting of 7zip (=93% of original)
  • 56.6MB in 7z "maximum" setting of 7zip (=92% of original)

That's not a huge difference. It is true that text compresses much better by comparison, but this archive's bulk is _not_ the 7MB of JSON, it's the 54MB of media files, which compress pretty much equally badly in all formats.
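The effect can be reproduced on synthetic data. The sketch below is not a measurement of a real export: the file names are hypothetical, and seeded random bytes stand in for already-compressed media. It compares a text-heavy set of many small files (where tar.gz benefits from compressing across file boundaries, while zip compresses each member separately) with a media-heavy set.

```python
import io
import random
import tarfile
import zipfile

def tar_gz_size(files):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())

def zip_size(files):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())

def pseudo_media(n_bytes, seed):
    # Seeded random bytes: a stand-in for JPEG/PNG/MP4 payloads,
    # which are already compressed and barely shrink further.
    rng = random.Random(seed)
    return rng.getrandbits(8 * n_bytes).to_bytes(n_bytes, "big")

# Many small, similar text files (assumed layout, for illustration only).
text_heavy = {
    f"toots/{i}.json": ('{"id": %d, "type": "Note", "content": "hello"}' % i).encode()
    for i in range(200)
}

# A little JSON plus bulky, incompressible "media".
media_heavy = {
    "outbox.json": b'{"type": "Note", "content": "hello"}' * 500,
    "media/1.jpg": pseudo_media(200_000, seed=1),
    "media/2.jpg": pseudo_media(200_000, seed=2),
}

for label, files in (("text-heavy", text_heavy), ("media-heavy", media_heavy)):
    raw = sum(len(d) for d in files.values())
    print(label, "raw:", raw, "tar.gz:", tar_gz_size(files), "zip:", zip_size(files))
```

On the text-heavy set the tar.gz should come out clearly smaller, while on the media-heavy set both formats end up close to the raw media size, which matches the pattern in the numbers above.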

And, even if this were _not_ the case, even if it _was_ a 20% difference, then you'd still optimize for the wrong thing.

So why am I for replacing? With two options you have to make sure they are interchangeable. You have to make sure both export the same thing and do not forget anything. You will also have to factor in the burden on a user's end who now has to make a choice. I see no benefit that outweighs this.

Most people I deal with every day are probably very tech-savvy, so if I optimize for technically minded people, I am just optimizing for my community. Nothing wrong with that.

> I optimize for technically minded people I just optimize for my community

IMHO the community of Mastodon – which aims to be a client for the huge Fediverse – is not a "tech-savvy" one. It should be usable by anyone – just as your CoC states – “experience for everyone, regardless of […] level of experience”.
I know the CoC applies to the project/community, but IMHO it's not a bad idea to apply it to the software, too.

If you want, you could thus call this issue a matter of “accessibility”. I bet there are many people who don't know .tar.gz, and it's hard to open on Windows.
Also, as this concerns a right under the GDPR, it would really be good if anyone could make use of their right – regardless of how much they know about technology.
They should not need to be "tech-savvy". Mastodon should be an open place for all, including "not tech-savvy" people.

(Also, if that was not clear, Mastodon can and possibly should still offer .tar.gz as a download option, that is all fine – especially if already implemented – just offer .zip in addition for those who don't know what a .tar.gz is.)
