Go-ipfs: Can't access InRelease files

Created on 27 May 2020 · 21 Comments · Source: ipfs/go-ipfs

Version information:

go-ipfs version: 0.6.0-dev-413ab315b
Repo version: 9
System version: arm64/linux
Golang version: go1.14.3
OS: Ubuntu 20.04 LTS aarch64 
Host: Raspberry Pi 4 Model B Rev 1.2 
Kernel: 5.4.0-1011-raspi 
Uptime: 2 days, 43 mins 
Packages: 669 (dpkg), 6 (snap) 
Shell: bash 5.0.16 
Terminal: /dev/pts/0 
CPU: BCM2835 (4) @ 1.500GHz 
Memory: 885MiB / 3793MiB 

Description:

I'm trying to build a mirror of the Ubuntu Archive on IPNS using a Raspberry Pi and a 2 TB external HDD. So far, things are going pretty well, but I think I've encountered a breaking bug.

```sources.list
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-updates main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-backports main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-security main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-proposed main restricted universe multiverse # IPNS
```

Running `apt update` against those sources yields:

```dash
Err:11 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-updates InRelease
  Connection failed [IP: 127.0.0.1 8080]
Err:12 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-backports InRelease
  Connection failed [IP: 127.0.0.1 8080]
Err:13 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-security InRelease
  Connection failed [IP: 127.0.0.1 8080]
Err:14 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-proposed InRelease
  Connection failed [IP: 127.0.0.1 8080]
Fetched 265 kB in 4min 0s (1 102 B/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
12 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-updates/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-backports/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-security/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-proposed/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Some index files failed to download. They have been ignored, or old ones used instead.
```

According to those logs, the problem occurs at http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists.

I'm using this to query multiple public gateways to check whether they can access the file.

To speed up discovery: `ipfs swarm connect /p2p/QmV8TePNsdZiXUpq62739hp5MJLSk8SdpSWcpLxaqhRQdR`.
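For reference, a minimal sketch of that kind of gateway probe; the gateway list and the checked path here are just examples:

```bash
#!/usr/bin/env bash
# Probe a few public gateways for the same IPNS path and print the HTTP status.
# Gateway list and path are examples only.
path="/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal/InRelease"
for gw in https://ipfs.io https://dweb.link https://cloudflare-ipfs.com; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 60 "$gw$path")
  echo "$gw -> HTTP $code"
done
```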

Labels: kind/bug, need/triage

Most helpful comment

So, my ideal solution here would be to just not use the go-ipfs daemon, but instead write a custom Dropbox-like IPFS service by cobbling together bitswap, libp2p, a datastore, and the DHT. It would:

  1. Monitor a directory for changes.
  2. When a file is added, it would chunk, hash, and _index_ (but not copy) the file. You could even store the results in an SQL database instead of using a datastore.
  3. When a file is removed/changed, it would remove references to the file.

The database schema would be:

  • Table: files

    • filename (primary key)

    • modtime

  • Table: blocks

    • id (primary key)

    • cid (indexed)

    • filename (indexed)

    • offset

On start:

  • Scan for changed files, comparing with the mod times in the database.

On add/update:

  • Add the file to the files table.
  • Run `DELETE FROM blocks WHERE filename=filename` (just in case).
  • Chunk the file, adding each block to the blocks table.

On remove:

  • Run `DELETE FROM blocks WHERE filename=filename` (just in case).
  • Remove the file from the files table.

All 21 comments

Is it a symlink?

There's a good chance it is; there are way too many links in there. I noticed some of them were downloaded as plain files, and it looks like some others just aren't reachable.

Could you give me your full multiaddr? I can't find your node.

(but yeah, we need to follow symlinks on the gateway)

I can't seem to reach any of those addresses. But you can check whether it's a symlink by calling `ipfs get` on the file in question.
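For example (the CID is the one from the sources.list above; whether a UnixFS symlink is materialized as an on-disk symlink may depend on your go-ipfs version):

```bash
# Fetch the single file and inspect what lands on disk; a symlink in the DAG
# should show up as a symlink (or at least not as the expected file contents).
ipfs get /ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal/InRelease -o InRelease.check
ls -l InRelease.check
file InRelease.check
```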

Oh. I think we found the problem.

```dash
ipfs get bafybeihocm6ufvyz44kde6fewu2wsj4qfiecfzbjubbekvcnw3hr7u3smq/ubuntu/dists/focal-updates/
Saving file(s) to focal-updates
 311.33 MiB / 311.33 MiB [==================================================================================] 100.00% 1s
Error: data in file did not match. mirrors/ubuntu/dists/focal-updates/InRelease offset 0
```

Because rsync takes 10 minutes to re-sync and IPFS takes multiple hours to re-sync, there's no way the InRelease file can match.

Is there a way to make the adding process faster? Right now, the command I'm using is `ipfs add --recursive --hidden --quieter --wrap-with-directory --chunker=rabin --nocopy --fscache --cid-version=1`.

I saw in https://github.com/ipfs-inactive/package-managers/issues/18 that removing `--nocopy` yielded huge improvements, but that's kinda hard when the Ubuntu Archive is 1.24 TB and I have only 2 TB available 🤔

Removing `--fscache` may help. Other than that, which datastore are you using? Could you post the output of `ipfs config show`?


`ipfs config show`:

```json
{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "Gateway": "/ip4/0.0.0.0/tcp/8080",
    "NoAnnounce": [],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip6/::/udp/4001/quic"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
  ],
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "child": {
        "path": "badgerds",
        "syncWrites": false,
        "truncate": true,
        "type": "badgerds"
      },
      "prefix": "badger.datastore",
      "type": "measure"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": true,
      "Interval": 10
    }
  },
  "Experimental": {
    "FilestoreEnabled": true,
    "GraphsyncEnabled": true,
    "Libp2pStreamMounting": true,
    "P2pHttpProxy": true,
    "ShardingEnabled": true,
    "StrategicProviding": true,
    "UrlstoreEnabled": true
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "QmV8TePNsdZiXUpq62739hp5MJLSk8SdpSWcpLxaqhRQdR"
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Type": "dht"
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 900,
      "LowWater": 600,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "DisableRelay": false,
    "EnableAutoRelay": true,
    "EnableRelayHop": true
  }
}
```

Since I got the `data in file did not match` error, I removed the `--nocopy` option, but now I need 2.48 TB of storage and I only have 1.80 TB. I think this project will sink for me ^^

Right now, I'm using Btrfs and duperemove to save on the duplication, but it looks like not much of the Badger datastore can be deduplicated. If I could deduplicate just enough to stay within my 1.8 TB budget, I would be able to publish this mirror and actually use it.
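Roughly the kind of run I mean; the paths and hashfile location are just my layout, adjust as needed:

```bash
# Hash extents block by block and submit duplicate extents for btrfs dedup.
# /srv/mirror is the rsync'd archive, ~/.ipfs is the go-ipfs repo (examples).
sudo duperemove -dr --hashfile=/var/tmp/duperemove.hash /srv/mirror ~/.ipfs
```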


`apt show duperemove`:

```
Package: duperemove
Version: 0.11.1-3
Priority: optional
Section: universe/admin
Origin: Ubuntu
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Peter Záhradník <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 260 kB
Depends: libc6 (>= 2.14), libglib2.0-0 (>= 2.31.8), libsqlite3-0 (>= 3.7.15)
Enhances: btrfs-progs
Homepage: https://markfasheh.github.io/duperemove/
Download-Size: 70.6 kB
APT-Sources: http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages
Description: extent-based deduplicator for file systems
 Duperemove is a tool for finding duplicated extents and submitting them for
 deduplication.  When given a list of files it will hash their contents on a
 block by block basis and compare those hashes to each other, finding and
 categorizing extents that match each other.
 .
 On BTRFS and, experimentally, XFS, it can then reflink such extents in a
 race-free way.  Unlike hardlink-based solutions, affected files appear
 independent in any way other than reduced disk space used.
```

Got it. I wanted to make sure you were using badger without sync writes enabled.

I'm not sure why removing --nocopy helps and I'm not entirely sure that that's still true after some optimizations we've made.

Note: I'd consider using snapshots to decouple these. That is, you can:

  1. Rsync in one loop every 10? minutes.
  2. In a separate loop:

    1. Take a btrfs snapshot (use flock to make sure an rsync run isn't running?).

    2. Add this btrfs snapshot to IPFS with nocopy.

That will mean that the IPFS mirror will always be a _bit_ behind but you'll never have to stall the HTTP mirror to wait on the IPFS mirror. This will _also_ ensure that you never modify files after adding them to IPFS.
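A rough sketch of that second loop, assuming /srv/mirror is the rsync target (a btrfs subvolume), snapshots live under /srv/snapshots, and the rsync job takes the same lock file:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Paths and lock file are assumptions; adjust to your layout.
snap="/srv/snapshots/ubuntu-$(date +%Y%m%d-%H%M)"

# Take a read-only snapshot while no rsync run is in flight.
flock /var/lock/mirror-rsync.lock \
  btrfs subvolume snapshot -r /srv/mirror "$snap"

# Add the now-immutable snapshot without copying data into the repo.
cid=$(ipfs add --recursive --hidden --quiet --nocopy --cid-version=1 "$snap" | tail -n 1)

# Publish the new root under the node's IPNS name.
ipfs name publish "/ipfs/$cid"
```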

Oh, that's very interesting. For the `--nocopy` option to work, new files have to be in a different path than the old files, and unchanged files mustn't be removed. That means I'll end up with an ever-growing number of snapshots, roughly one per `ipfs add`.

Is there a way to clean up the snapshots? What happens if I add a file with `--nocopy` that already exists elsewhere?

> That means I'll end up with an ever-growing number of snapshots, roughly one per `ipfs add`.

Yes, but the snapshots should dedup.

> Is there a way to clean up the snapshots? What happens if I add a file with `--nocopy` that already exists elsewhere?

Unfortunately, I don't think it's possible to override _old_ files with _new_ files. I believe for performance reasons, we don't bother replacing old "filestore no copy" records with ones pointing to new files.

Honestly, I think the best approach here would be to create a _new_ repo, add a _new_ snapshot, then delete the old repos and the old snapshots (once every few days). I assume the repos (with --nocopy) aren't _too_ large, right?
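Sketching that out (one repo per snapshot; every path here is illustrative):

```bash
# Create a fresh repo for the new snapshot, enable the filestore, and add
# the snapshot with --nocopy. Old repos/snapshots can be deleted later.
snap="/srv/snapshots/ubuntu-20200527"
export IPFS_PATH="/srv/ipfs-repos/$(basename "$snap")"

ipfs init --profile badgerds
ipfs config --json Experimental.FilestoreEnabled true
ipfs add --recursive --hidden --quiet --nocopy --cid-version=1 "$snap"
```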

Otherwise, we may be able to find a way to bypass the "do I already have this block" check by adding yet another flag (but I'd prefer not to if possible).

> Otherwise, we may be able to find a way to bypass the "do I already have this block" check by adding yet another flag (but I'd prefer not to if possible).

This seems very useful. In fact, it's confusing that it's not already the case; if I add a new file using `--nocopy`, I expect the unpinned ones to be replaced. Another approach could be to allow `--nocopy` files to have multiple sources, but I'm not sure that's very useful. I think I prefer just overriding the previous link.

I believe the benefits are real. Should I raise an issue for that?

It deserves an issue, but I'm not sure about the best approach. A really nice property of the current blockstore is that it's idempotent. This change would break that.

@Stebalien

1. Check whether we already have the block.
2. Validate it.
3. Replace it if the old block is bad.

I'm closing this as it's not really a bug. Removing/changing a file on disk after adding it to go-ipfs with the --nocopy flag isn't allowed.

Hey! I just wanted to add that I've updated my script to manage snapshots as you suggested.


I had to create a Btrfs subvolume and move the mirror over, but with that done overnight, I'm now adding it back to IPFS using a fresh badgerds. It seems to take a very long time.

The problem with the program I made is that it's now dependent on Btrfs. While I do love Btrfs, I'm not sure it's a great idea for my ipfs-mirror-manager to be tied to a filesystem. Moreover, the `--nocopy` option makes it mandatory for the node to run from the same drive as the mirror itself. It would be nice to be able to separate them.

Nonetheless, successfully pulling off an IPFS mirror of the Ubuntu archive on a Raspberry Pi would be very impressive, and I'm extremely proud that IPFS has come this far.

At this time, the .ipfs folder is only 1.3G. The total disk usage is 1.3T.

So, my ideal solution here would be to just not use the go-ipfs daemon, but instead write a custom Dropbox-like IPFS service by cobbling together bitswap, libp2p, a datastore, and the DHT. It would:

1. Monitor a directory for changes.

2. When a file is added, it would chunk, hash, and _index_ (but not copy) the file. You could even store the results in an SQL database instead of using a datastore.

3. When a file is removed/changed, it would remove references to the file.

The database schema would be:

* Table: files

  * filename (primary key)
  * modtime

* Table: blocks

  * id (primary key)
  * cid (indexed)
  * filename (indexed)
  * offset

On start:

* Scan for changed files, comparing with the mod times in the database.

On add/update:

* Add the file to the files table.
* Run `DELETE FROM blocks WHERE filename=filename` (just in case).
* Chunk the file, adding each block to the blocks table.

On remove:

* Run `DELETE FROM blocks WHERE filename=filename` (just in case).
* Remove the file from the files table.
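A minimal sketch of that schema using sqlite3 (the database file name and the column types are placeholders; only the tables, columns, and indexes mirror the list above):

```bash
# Create the index database described above.
sqlite3 index.db <<'SQL'
CREATE TABLE IF NOT EXISTS files (
  filename TEXT PRIMARY KEY,
  modtime  INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS blocks (
  id       INTEGER PRIMARY KEY,
  cid      TEXT NOT NULL,
  filename TEXT NOT NULL REFERENCES files(filename),
  "offset" INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS blocks_cid      ON blocks(cid);
CREATE INDEX IF NOT EXISTS blocks_filename ON blocks(filename);
SQL
```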

Perhaps you should create a new issue to track the development of this idea.

Good point. I've filed an issue here: https://github.com/ipfs/notes/issues/434

I can't really afford the time it would take to build a custom IPFS daemon; I have to make do with what I have. And right now, what I have is a mirror that takes around 2 days per update. I posted it on Reddit.

In the meantime, is there any way to optimize it?

Right now, the command I'm using is `ipfs add --recursive --hidden --quieter --progress --chunker=rabin --nocopy --cid-version=1`.

CPU usage is about 40% and HDD read speeds are at about 15-30 Mbps.

Don't use `--chunker=rabin`. Our rabin implementation is terrible. For now, I recommend `--chunker=buzhash`. You could also try passing `--inline` to inline small (<=32 bytes) files into directory entries.
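Concretely, something like this (the path is a placeholder; the other flags are carried over from the command above):

```bash
# Same add as before, but with the buzhash chunker and small-block inlining.
ipfs add --recursive --hidden --quieter --progress \
  --chunker=buzhash --inline --nocopy --cid-version=1 /srv/mirror
```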
