Stack: Timeout when accessing `casa.fpcomplete.com`

Created on 15 Sep 2020 · 48 Comments · Source: commercialhaskell/stack

General summary/comments (optional)

The following happens in CI:

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout

Seems to be consistent right now. I guess a few questions about this:

  1. Why is it failing? (rate limiting / down time?)
  2. Is it possible to customize this? If so, we could configure our Nexus proxy to proxy casa.fpcomplete.com and point Stack to the proxy.

On our CentOS CI job, it seems to happen around 72/352. Our Mac job got to 233/357.

Steps to reproduce

N/A

Expected

N/A

Actual

N/A

Stack version

We pin Stack to 2.3.1

Method of installation

We copied the Stack executable from GitHub into S3 and install from there

All 48 comments

We're also seeing this issue in our CI, stack version 2.3.3

Also seeing this intermittently, about 10 minutes ago was the last time, but CircleCI succeeded before and after that. I have /root/.stack/ and .stack-work cached, so I would have expected the build to proceed fine offline.

Also getting this on TravisCI. The build runs for up to about 10 minutes installing stack dependencies before this happens, sometimes less (3 minutes). Sometimes it goes through, but in most cases it does not. Stack is 2.3.3.

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}

 ConnectionTimeout

The command "stack --no-terminal --install-ghc test --only-dependencies" failed and exited with 1 during .

I still get this randomly. Interestingly it mostly happens on:

Registering library for $some-dependency

I'm curious why stack decides to pull data from a remote source when locally registering a dependency.

OT: I also noticed that during initial dependency resolution stack does lots of N+1 queries to that endpoint. I wonder whether they should be batched.

I see this also happening on the CI of this very repo. I made a tiny PR yesterday and a couple of CI jobs failed with the same error.

Is there any workaround available?
Sadly it's a total showstopper for me right now.

Works fine for me now

Would it be possible to test out a build from master? It should already have the fix in place. Hopefully we'll get a new release out soon.

Also: the server outage this morning is resolved for now, so most occurrences of this issue are hopefully addressed for the moment.

So I'm trying to build the master stack on CI to work around this, but I can't because I keep getting timeouts! If a release is going to take some time, would it be possible to host binaries somewhere? This is a total blocker for us.

All server issues should be fixed, but check out the Actions tab for builds. Latest: https://github.com/commercialhaskell/stack/actions/runs/265180359

Still getting

#!/bin/bash -eux -o pipefail
curl -sSL https://get.haskellstack.org/ | sh
stack --version
+ curl -sSL https://get.haskellstack.org/
+ sh
curl: (35) Encountered end of file

Exited with code exit status 35
CircleCI received exit code 35

EDIT: Seems to be working now

We're still seeing timeouts to casa.fpcomplete.com. Are there still lingering issues for others?

I can confirm I still have issues with Casa timeouts

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout

Can confirm on my end as well. Using a stack from master seems to have resolved things.

Casa is still failing, and using stack from master is giving us Pantry errors: https://github.com/commercialhaskell/pantry/issues/27

FWIW our build has ~500 dependencies

Most of the comments above have been about CI systems. I was seeing identical failures locally (early on with the same ~500 packages as @brandon-leapyear was describing).

We were concerned that running our own casa server would require populating it first before we could start using it. However, after some experiments, this does not seem to be the case:

  • build and run casa-server
```bash
DBCONN="/var/tmp/casa.sqlite" PORT=80 AUTHORIZED_PORT=443 stack exec casa-server
```
  • Update the non-project config (/etc/stack/config.yaml or ~/.stack/config.yaml) to include the line

    casa-repo-prefix: http://localhost
    

and building works again (even when I disallow network access to casa.fpcomplete.com).
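
A minimal sketch of the config step above, for anyone scripting it in CI. The function name `set_casa_prefix` and its arguments are hypothetical; only the `casa-repo-prefix` key and the config file locations come from the comment above. It appends the line only if no `casa-repo-prefix` entry exists yet, so re-running it does not duplicate the setting.

```bash
# set_casa_prefix CONFIG_FILE PREFIX  (hypothetical helper, not part of Stack)
# Appends "casa-repo-prefix: PREFIX" to CONFIG_FILE unless the key is already set.
set_casa_prefix() {
  local config_file="$1"
  local prefix="$2"

  # Create the file (and its directory) if it does not exist yet.
  mkdir -p "$(dirname "$config_file")"
  touch "$config_file"

  if grep -q '^casa-repo-prefix:' "$config_file"; then
    echo "casa-repo-prefix already set in ${config_file}; leaving it unchanged" >&2
    return 0
  fi

  # Make sure we start on a fresh line if the file lacks a trailing newline.
  if [ -s "$config_file" ] && [ -n "$(tail -c1 "$config_file")" ]; then
    echo >> "$config_file"
  fi
  printf 'casa-repo-prefix: %s\n' "$prefix" >> "$config_file"
}
```

Usage would look like `set_casa_prefix ~/.stack/config.yaml http://localhost`, run after the local casa-server is up.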

This allows me to update our CI config to run casa-server in a container (https://github.com/LeapYear/casa/pull/1) alongside our build.

This is probably not a long-term solution; it was only meant to unblock my team. I worry that future versions of stack could break with an unpopulated casa-server (see https://www.fpcomplete.com/blog/casa-and-stack/). A better solution would be to hook it up to a real Postgres database, run casa-server in my org's cloud, and restrict access to our CI system. This disruption was pretty bad for us; we typically use a private-cloud HA proxy to mirror resources, and supporting a no-POST mode in casa would have been a much simpler resolution.

I'm still seeing failures as of 06:43:49 UTC. Both when building locally using Docker and from a GitHub Actions Docker build (docker/build-push-action@v1).

Running locally:

ansi-wl-pprint                   > copy/register
ansi-wl-pprint                   > Installing library in /root/.stack/snapshots/x86_64-linux/2d48e525c7ecdea92684d3c412a53b2518257e58d0af82a8d15dcb5bcad14a89/8.8.3/lib/x86_64-linux-ghc-8.8.3/ansi-wl-pprint-0.6.9-8g4uNwkOiJA5RF873eOrQv
ansi-wl-pprint                   > Registering library for ansi-wl-pprint-0.6.9..
HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout
The command '/bin/sh -c stack install --dependencies-only' returned a non-zero code: 1

GitHub actions:

2020-09-22T06:43:49.2411772Z appar                    > copy/register
2020-09-22T06:43:49.2634776Z appar                    > Installing library in /root/.stack/snapshots/x86_64-linux/2d48e525c7ecdea92684d3c412a53b2518257e58d0af82a8d15dcb5bcad14a89/8.8.3/lib/x86_64-linux-ghc-8.8.3/appar-0.1.8-Ivne98HYrwb7S9Jxqgw6mR
2020-09-22T06:43:49.4233626Z appar                    > Registering library for appar-0.1.8..
2020-09-22T06:43:49.4647761Z HttpExceptionRequest Request {
2020-09-22T06:43:49.4648362Z   host                 = "casa.fpcomplete.com"
2020-09-22T06:43:49.4648932Z   port                 = 443
2020-09-22T06:43:49.4649598Z   secure               = True
2020-09-22T06:43:49.4649918Z   requestHeaders       = []
2020-09-22T06:43:49.4650235Z   path                 = "/v1/pull"
2020-09-22T06:43:49.4650772Z   queryString          = ""
2020-09-22T06:43:49.4651264Z   method               = "POST"
2020-09-22T06:43:49.4651553Z   proxy                = Nothing
2020-09-22T06:43:49.4651843Z   rawBody              = False
2020-09-22T06:43:49.4652142Z   redirectCount        = 10
2020-09-22T06:43:49.4652652Z   responseTimeout      = ResponseTimeoutDefault
2020-09-22T06:43:49.4653187Z   requestVersion       = HTTP/1.1
2020-09-22T06:43:49.4653485Z }
2020-09-22T06:43:49.4653943Z  ConnectionTimeout
2020-09-22T06:43:49.6827680Z The command '/bin/sh -c stack install --dependencies-only' returned a non-zero code: 1
2020-09-22T06:43:49.6862619Z Error: exit status 1

It is still happening here too, for me on AppVeyor: both casa and get.haskellstack are failing from time to time.

I'm seeing this too for both windows and macOS builds for pandoc on GitHub actions. It occurs regularly now, but I don't remember seeing it until about a week ago.

~Stack on master works for us, after clearing all .stack-work and ~/.stack directories.~

Never mind, it worked the first run, but still fails on subsequent runs

These continuous timeouts are blocking our development process in ghcide (azure) and haskell-language-server (circleci). It failed last time just 2 hours ago.

I think it might help us to move forward if we can gather some information:

  1. Why isn't this happening locally? (Is it happening locally?)
  2. What is the purpose of making these requests? Are they optional or at least recoverable?

EDIT: Missed some comments above. Thanks, Brandon, for answering

  1. A few comments prior mentioned seeing it locally (https://github.com/commercialhaskell/stack/issues/5387#issuecomment-696535040)
  2. They're supposed to be recoverable; this is implemented in the master version of stack, but not any of the released versions yet

Update: I'm now able to get this locally by doing

mv ~/.stack ~/.stack.bak
stack build

These continuous timeouts are blocking our development process in ghcide (azure) and haskell-language-server (circleci). It failed last time just 2 hours ago.

ghc-lib on Azure is blocked too, in both the digital asset repo and my clone. It manifests as failures at random points during package downloads.

mv ~/.stack ~/.stack.bak
stack build

Yes, because ~/.stack is moved out of the way, the build invokes downloads.

Are there workarounds other than running a casa server locally? Stack is completely unusable for me atm.

The casa.fpcomplete.com server seems quite bad today: I've had to make my CI builds retry stack commands up to 10 times to start getting successful builds due to the POST /v1/pull timing out. I also had this happen locally with stack install shake on a clean machine, and I had to retry 3 times before it completed successfully.

Here's an example build where stack build --only-snapshot took 5 retries on windows, and 5 retries on mac: https://github.com/avh4/elm-format/actions/runs/267891714 (to see the logs, click into either build job in the left sidebar, and then expand the "stack build --only-snapshot" step).
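
For anyone who wants the same stopgap, here is a minimal sketch of a retry wrapper of the kind described above. The function name `retry` and the `RETRY_DELAY` variable are my own assumptions, not something from the linked build; it simply re-runs a command until it succeeds or the attempt limit is hit.

```bash
# retry MAX_ATTEMPTS COMMAND [ARGS...]  (hypothetical helper)
# Re-runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have been made.
# RETRY_DELAY is the pause in seconds between attempts (defaults to 5).
retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  local status=0

  while true; do
    # Success on any attempt ends the loop immediately.
    "$@" && return 0
    status=$?

    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return "$status"
    fi

    echo "retry: attempt ${attempt}/${max_attempts} failed (exit ${status}); retrying" >&2
    sleep "${RETRY_DELAY:-5}"
    attempt=$((attempt + 1))
  done
}
```

In a CI script this would be used as `retry 10 stack build --only-snapshot`. It only papers over intermittent timeouts; it does nothing if the server is down for the whole retry window.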

My apologies that this has remained an issue, I don't know why the new server is giving so many connection timeouts. I've moved the service over to a new EC2 instance, I'm hoping that that resolves the intermittent connection timeouts.

FYI: DNS is still propagating, the new IP address is 107.20.95.51

I think a post mortem on this issue — once resolved — would be fitting. I had to spend most of my time yesterday on migrating my Docker builds to cabal in order to deliver software to a customer. I’d like to understand why this issue occurred in the first place, and gain confidence in it not happening again.

The post mortem I can share now is:

  1. A DevOps mistake took down the Kube cluster hosting Casa (initial downtime)
  2. A new version of the cluster included a new version of a network layer, which included a bug that dropped about 0.5% of connections. This did not get picked up by initial testing, since those tests all passed. However, by law of large numbers, it killed basically every CI job (downtime before my temporary server update 4 hours ago)
  3. Separately: there was a bug in Stack 2.3 that treated a Casa downtime as a failure. It was never supposed to do that, and that issue has been fixed. We're testing the new version and intend to release a patched Stack

Mitigation going forward is really focused on the Stack bug. While we want to have Casa as a high availability service, we did not intend to rely on it that way. We try to only rely on services like S3 with strong SLAs for that.
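
As a rough illustration of why a 0.5% drop rate was enough to fail nearly every CI job: assume each request fails independently with probability p = 0.005 and that any single failure is fatal to the build (as it was in Stack 2.3). The request count N = 500 below is a hypothetical figure, loosely based on the ~500-dependency build mentioned earlier in the thread; the real number of requests per build is not given here.

$$
P(\text{at least one failure}) = 1 - (1 - p)^{N} = 1 - 0.995^{500} \approx 0.92
$$

Under the same assumptions, a build making 2000 requests would fail with probability 1 - 0.995^{2000}, which is about 0.99996.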

I've tested in multiple ways since the comment above, and I believe this issue is now fully resolved. We are still isolating which DevOps tool introduced the instability, but I believe it was an interaction between Istio ingress controller and Windows node groups. We are running this again on our Kubernetes cluster, which has been defenestrated while we test the ingress controller elsewhere.

Has anyone seen these timeout issues occur in the past 6 hours?

All builds have been OK for us since 23 Sept. 2020 at 7:31 CEST.
@snoyberg Many thanks for taking care of this and sharing the post mortem info.

Builds succeeding on my end too. Thanks for taking care of this!

Thanks for the confirmation, and again, apologies to everyone for the hassle this caused. Hopefully the new version of Stack will be ready for testing and then release soon, and even if we have server issues again (which I hope isn't the case), things will continue working.

Can we re-open? Been getting this error for the past few days when attempting to build on GitLab.

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout
ERROR: Job failed: exit code 1

@mattaudesse Thank you ! @snoyberg @mattaudesse, any ideas on how best to combat this issue?

@dmjio I'm late to the game and don't have much to add RE: possible mitigations, but did you already try using a more recent stack release candidate and/or manually building from git?

@mattaudesse

I've attempted to use the latest stack 2.3.3 (as shown below), but am still hitting timeout errors. Is there some kind of aggressive throttling policy in place?

```
$ stack --version
Version 2.3.3, Git revision cb44d51bed48b723a5deb08c3348c0b3ccfc437e x86_64 hpack-0.33.0
$ ./scripts/build.sh

  • [ -z true ]
  • [ = master ]
  • PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/bin/:/root/.local/bin:.stack-work/dist stack build --fast --ghc-options=-j8 -ddump-to-file -ddump-hi --test --bench --no-run-benchmarks --local-bin-path .stack-work/dist --copy-bins
    Preparing to install GHC (tinfo6) to an isolated location.
    This will not interfere with any system-level installation.
    Preparing to download ghc-tinfo6-8.8.4 ...
    ghc-tinfo6-8.8.4: download has begun
    ghc-tinfo6-8.8.4: 36.66 MiB / 198.61 MiB ( 18.46%) downloaded...
    ghc-tinfo6-8.8.4: 81.00 MiB / 198.61 MiB ( 40.78%) downloaded...
    ghc-tinfo6-8.8.4: 126.02 MiB / 198.61 MiB ( 63.45%) downloaded...
    ghc-tinfo6-8.8.4: 171.56 MiB / 198.61 MiB ( 86.38%) downloaded...
    ghc-tinfo6-8.8.4: 198.61 MiB / 198.61 MiB (100.00%) downloaded...
    Downloaded ghc-tinfo6-8.8.4.
    Unpacking GHC into /builds/.stack-root/programs/x86_64-linux/ghc-tinfo6-8.8.4.temp/ ...
    Configuring GHC ...
    Installing GHC ...
    Installed GHC
    HttpExceptionRequest Request {
    host = "casa.fpcomplete.com"
    port = 443
    secure = True
    requestHeaders = []
    path = "/v1/pull"
    queryString = ""
    method = "POST"
    proxy = Nothing
    rawBody = False
    redirectCount = 10
    responseTimeout = ResponseTimeoutDefault
    requestVersion = HTTP/1.1
    }
    ConnectionTimeout
    ERROR: Job failed: exit code 1
```

@dmjio Sorry - I can't speak to the throttling question since it sounds like that'd be something internal to FP Complete's hosting (although reading through the comments here, I would've guessed the answer is probably no?).

I also haven't bumped into this issue personally, so my suggestion about using a newer stack comes from this response from @snoyberg rather than direct experience:

  1. Separately: there was a bug in Stack 2.3 that treated a Casa downtime as a failure. It was never supposed to do that, and that issue has been fixed. We're testing the new version and intend to release a patched Stack

2.3.3 seems to be the latest stack, so not sure what else I can do here. Sometimes the build finishes, which leads me to believe this is a server issue. Would be happy to be proven wrong.

Correct, 2.3.3 is the latest official Stack. But there's a _release candidate_ for 2.5.0.1. You can upgrade to it with:

stack upgrade --binary-version 2.5.0.1

That seems to have done it. Thanks @snoyberg ! 🥳

No problem. FTR, I don't believe the current issue is a server issue. Our monitoring has not shown any specific outage. But it's certainly expected that when making thousands of HTTP requests to the Casa server, one of them may fail. And the previous code treated that as fatal. We're not going to be investigating server outages further (though of course we'll keep alerts and metrics in place), and are pushing to get 2.5 stabilized and released.
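
To make the "fatal versus recoverable" distinction concrete, here is a minimal shell sketch of the general pattern. It is not Stack's actual code (which is Haskell); the helper name `with_fallback` and the idea of passing a primary and a fallback fetcher are assumptions for illustration only. The point is that a failure of the optional source produces a warning and a fallback instead of aborting the whole run.

```bash
# with_fallback PRIMARY FALLBACK [ARGS...]  (hypothetical helper)
# Runs PRIMARY with ARGS; if it fails, warns and runs FALLBACK with the same ARGS.
# The overall result is a failure only if the fallback fails as well.
with_fallback() {
  local primary="$1"
  local fallback="$2"
  shift 2

  if "$primary" "$@"; then
    return 0
  fi

  # The primary source is treated as optional: log and carry on.
  echo "warning: ${primary} failed; falling back to ${fallback}" >&2
  "$fallback" "$@"
}
```

Here `PRIMARY` and `FALLBACK` would be shell functions or commands that fetch the same data from two different sources.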

Maybe related to hackage.fpcomplete.com timeout? https://github.com/commercialhaskell/stack/issues/5417

This issue happens again in stack 2.5.1 using Docker.

Update:

This only happens during the local Docker image build. After changing the Docker DNS settings, the issue is gone
(click Docker, Preferences -> Docker Engine):

{
  "features": {
    "buildkit": true
  },
  "experimental": false,
  "dns": [
    "8.8.8.8",
    "192.168.0.1"
  ]
}