Stack: Timeout when accessing `casa.fpcomplete.com`

Created on 15 Sep 2020 · 48 Comments · Source: commercialhaskell/stack

General summary/comments (optional)

The following happens in CI:

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout

Seems to be consistent right now. I guess a few questions about this:

Why is it failing? (rate limiting / down time?)
Is it possible to customize this? If so, we could configure our Nexus proxy to proxy casa.fpcomplete.com and point Stack to the proxy.

On our CentOS CI job, it seems to happen around 72/352. Our Mac job got to 233/357.

Steps to reproduce

N/A

Expected

N/A

Actual

N/A

Stack version

We pin Stack to 2.3.1

Method of installation

We copied the Stack executable from GitHub into S3 and install from there

Source

brandon-leapyear

👀2 👍2

Most helpful comment

The post mortem I can share now is:

A DevOps mistake took down the Kube cluster hosting Casa (initial downtime)
A new version of the cluster included a new version of a network layer, which included a bug that dropped about 0.5% of connections. This did not get picked up by initial testing, since those tests all passed. However, by law of large numbers, it killed basically every CI job (downtime before my temporary server update 4 hours ago)
Separately: there was a bug in Stack 2.3 that treated a Casa downtime as a failure. It was never supposed to do that, and that issue has been fixed. We're testing the new version and intend to release a patched Stack

Mitigation going forward is really focused on the Stack bug. While we want to have Casa as a high availability service, we did not intend to rely on it that way. We try to only rely on services like S3 with strong SLAs for that.

snoyberg on 23 Sep 2020

❤7 👍3

All 48 comments

We're also seeing this issue in our CI, stack version 2.3.3

jcberentsen on 16 Sep 2020

Also seeing this intermittently, about 10 minutes ago was the last time, but CircleCI succeeded before and after that. I have /root/.stack/ and .stack-work cached, so I would have expected the build to proceed fine offline.

ejconlon on 16 Sep 2020

Also getting this on TravisCI. The build spends up to about 10 minutes installing Stack dependencies before this happens, sometimes less (3 minutes). Sometimes it goes through, but in most cases it does not. Stack is 2.3.3.

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}

 ConnectionTimeout

The command "stack --no-terminal --install-ghc test --only-dependencies" failed and exited with 1 during .

Martinsos on 17 Sep 2020

👍1

I still get this randomly. Interestingly it mostly happens on:

Registering library for $some-dependency

I'm curious why stack decides to pull data from a remote source when locally registering a dependency.

OT: I also noticed that during initial dependency resolution stack does lots of N+1 queries to that endpoint. I wonder if they should be batched.

mbj on 21 Sep 2020
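
To make the batching suggestion above concrete, here is a minimal sketch of grouping lookup keys into fixed-size batches so that each batch could go out as one request. It is purely illustrative: the function name `batches` and the idea of one request per batch are assumptions, and nothing here reflects how Stack or Pantry actually talk to Casa.

```python
from typing import Iterator, List, Sequence


def batches(keys: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `keys`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Step through the keys in strides of batch_size; the last slice may be shorter.
    for start in range(0, len(keys), batch_size):
        yield list(keys[start:start + batch_size])
```

With 352 keys and a batch size of 50, this would produce 8 batches instead of 352 individual lookups.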

I see this also happening on CI of this very repo. I made a tiny PR yesterday and a couple of CI jobs failed with the same error.

Martinsos on 21 Sep 2020

Is there any workaround available?
Sadly it's a total showstopper for me right now.

CarstenKoenig on 21 Sep 2020

Works fine for me now

vsklamm on 21 Sep 2020

👍2

Would it be possible to test out a build from master? It should already have the fix in place. Hopefully we'll get a new release out soon.

snoyberg on 21 Sep 2020

Also: the server outage this morning is resolved for now, so most occurrences of this issue are hopefully addressed for the moment.

snoyberg on 21 Sep 2020

❤6

Now I'm getting errors from get.haskellstack.org?
https://app.circleci.com/pipelines/github/LeapYear/aeson-schemas/233/workflows/756e3e3a-6ee2-4258-bbef-bd11102dae09/jobs/1159
https://app.circleci.com/pipelines/github/LeapYear/aeson-schemas/233/workflows/756e3e3a-6ee2-4258-bbef-bd11102dae09/jobs/1158

brandon-leapyear on 21 Sep 2020

So I'm trying to build the master stack on CI to work around this, but I can't because I keep getting timeouts! If a release is going to take some time, would it be possible to host binaries somewhere? This is a total blocker for us.

TOTBWF on 21 Sep 2020

😕1

All server issues should be fixed, but check out the Actions tab for builds. Latest: https://github.com/commercialhaskell/stack/actions/runs/265180359

snoyberg on 21 Sep 2020

👍1

Still getting

#!/bin/bash -eux -o pipefail
curl -sSL https://get.haskellstack.org/ | sh
stack --version
+ curl -sSL https://get.haskellstack.org/
+ sh
curl: (35) Encountered end of file

Exited with code exit status 35
CircleCI received exit code 35

EDIT: Seems to be working now

brandon-leapyear on 21 Sep 2020

We're still seeing timeouts to casa.fpcomplete.com. Are there still lingering issues for others?

aviaviavi on 21 Sep 2020

I can confirm I still have issues with Casa timeouts

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout

vaclavsvejcar on 21 Sep 2020

Can confirm on my end as well. Using a stack from master seems to have resolved things.

TOTBWF on 22 Sep 2020

Casa is still failing, and using stack from master is giving us Pantry errors: https://github.com/commercialhaskell/pantry/issues/27

FWIW our build has ~500 dependencies

brandon-leapyear on 22 Sep 2020

👍2

Most of the comments above have been about CI systems. I was seeing identical failures locally (early on with the same ~500 packages as @brandon-leapyear was describing).

We were concerned that running our own casa server would require populating it first before we could start using it. However, after some experiments, this does not seem to be the case:

Build and run casa-server:

```bash
DBCONN="/var/tmp/casa.sqlite" PORT=80 AUTHORIZED_PORT=443 stack exec casa-server
```

Update the non-project config (/etc/stack/config.yaml or ~/.stack/config.yaml) to include the line:

```yaml
casa-repo-prefix: http://localhost
```

and building works again (even when I disallow network access to casa.fpcomplete.com).

This allows me to update our CI config to run casa-server in a container (https://github.com/LeapYear/casa/pull/1) alongside our build.

This is probably not a long-term solution; it was only meant to unblock my team. I worry that future versions of Stack could break with an unpopulated casa-server (https://www.fpcomplete.com/blog/casa-and-stack/). A better solution would be to hook it up to a real Postgres database, run casa-server in my org's cloud, and restrict access to our CI system. This disruption was pretty bad for us; we typically use a private-cloud HA proxy to mirror resources, and supporting a no-POST mode in Casa would have been a much simpler resolution.

liam-ly on 22 Sep 2020

I'm still seeing failures as of 06:43:49 UTC. Both when building locally using Docker and from a GitHub Actions Docker build (docker/build-push-action@v1).

Running locally:

ansi-wl-pprint                   > copy/register
ansi-wl-pprint                   > Installing library in /root/.stack/snapshots/x86_64-linux/2d48e525c7ecdea92684d3c412a53b2518257e58d0af82a8d15dcb5bcad14a89/8.8.3/lib/x86_64-linux-ghc-8.8.3/ansi-wl-pprint-0.6.9-8g4uNwkOiJA5RF873eOrQv
ansi-wl-pprint                   > Registering library for ansi-wl-pprint-0.6.9..
HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout
The command '/bin/sh -c stack install --dependencies-only' returned a non-zero code: 1

GitHub actions:

2020-09-22T06:43:49.2411772Z appar                    > copy/register
2020-09-22T06:43:49.2634776Z appar                    > Installing library in /root/.stack/snapshots/x86_64-linux/2d48e525c7ecdea92684d3c412a53b2518257e58d0af82a8d15dcb5bcad14a89/8.8.3/lib/x86_64-linux-ghc-8.8.3/appar-0.1.8-Ivne98HYrwb7S9Jxqgw6mR
2020-09-22T06:43:49.4233626Z appar                    > Registering library for appar-0.1.8..
2020-09-22T06:43:49.4647761Z HttpExceptionRequest Request {
2020-09-22T06:43:49.4648362Z   host                 = "casa.fpcomplete.com"
2020-09-22T06:43:49.4648932Z   port                 = 443
2020-09-22T06:43:49.4649598Z   secure               = True
2020-09-22T06:43:49.4649918Z   requestHeaders       = []
2020-09-22T06:43:49.4650235Z   path                 = "/v1/pull"
2020-09-22T06:43:49.4650772Z   queryString          = ""
2020-09-22T06:43:49.4651264Z   method               = "POST"
2020-09-22T06:43:49.4651553Z   proxy                = Nothing
2020-09-22T06:43:49.4651843Z   rawBody              = False
2020-09-22T06:43:49.4652142Z   redirectCount        = 10
2020-09-22T06:43:49.4652652Z   responseTimeout      = ResponseTimeoutDefault
2020-09-22T06:43:49.4653187Z   requestVersion       = HTTP/1.1
2020-09-22T06:43:49.4653485Z }
2020-09-22T06:43:49.4653943Z  ConnectionTimeout
2020-09-22T06:43:49.6827680Z The command '/bin/sh -c stack install --dependencies-only' returned a non-zero code: 1
2020-09-22T06:43:49.6862619Z Error: exit status 1

runeksvendsen on 22 Sep 2020

It is still happening here too, for me on AppVeyor: both Casa and get.haskellstack.org are failing from time to time.

Martinsos on 22 Sep 2020

I'm seeing this too for both windows and macOS builds for pandoc on GitHub actions. It occurs regularly now, but I don't remember seeing it until about a week ago.

jgm on 22 Sep 2020

~Stack on master works for us, after clearing all .stack-work and ~/.stack directories.~

Never mind, it worked the first run, but still fails on subsequent runs

brandon-leapyear on 22 Sep 2020

These continuous timeouts are blocking our development process in ghcide (azure) and haskell-language-server (circleci). It failed last time just 2 hours ago.

jneira on 22 Sep 2020

I think it might help us to move forward if we can gather some information:

Why isn't this happening locally? (Is it happening locally?)
What is the purpose of making these requests? Are they optional or at least recoverable?

EDIT: Missed some comments above. Thanks, Brandon, for answering

ejconlon on 23 Sep 2020

A few comments prior mentioned seeing it locally (https://github.com/commercialhaskell/stack/issues/5387#issuecomment-696535040)
They're supposed to be recoverable; this is implemented in the master version of stack, but not any of the released versions yet

Update: I'm now able to get this locally by doing

mv ~/.stack ~/.stack.bak
stack build

brandon-leapyear on 23 Sep 2020

👍1

These continuous timeouts are blocking our development process in ghcide (azure) and haskell-language-server (circleci). It failed last time just 2 hours ago.

ghc-lib on Azure is blocked too, both the digital asset repo and my clone. It manifests as a failure at random points during package downloads.

shayne-fletcher on 23 Sep 2020

mv ~/.stack ~/.stack.bak
stack build

Yes, because ~/.stack is moved out of the way, the build invokes downloads.

shayne-fletcher on 23 Sep 2020

Are there workarounds other than running a casa server locally? Stack is completely unusable for me atm.

Lemmih on 23 Sep 2020

The casa.fpcomplete.com server seems quite bad today: I've had to make my CI builds retry stack commands up to 10 times to start getting successful builds due to the POST /v1/pull timing out. I also had this happen locally with stack install shake on a clean machine, and I had to retry 3 times before it completed successfully.

Here's an example build where stack build --only-snapshot took 5 retries on windows, and 5 retries on mac: https://github.com/avh4/elm-format/actions/runs/267891714 (to see the logs, click into either build job in the left sidebar, and then expand the "stack build --only-snapshot" step).

avh4 on 23 Sep 2020
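
The retry workaround described above can be sketched roughly as follows. This is an illustration only, not the script used in the linked build: the helper names `retry` and `run_stack`, the default of 10 attempts, and the 5-second delay are all assumptions. It also retries on any exception, which is cruder than retrying only on connection timeouts.

```python
import subprocess
import time
from typing import Callable, Sequence


def retry(action: Callable[[], None], attempts: int = 10, delay_seconds: float = 5.0) -> int:
    """Run `action` until it succeeds; return the number of attempts used.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            action()
            return attempt
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            time.sleep(delay_seconds)
    # Only reached when the loop body never ran, i.e. attempts < 1.
    raise ValueError("attempts must be at least 1")


def run_stack(args: Sequence[str]) -> None:
    """Run a stack command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(["stack", *args], check=True)
```

A CI step could then call `retry(lambda: run_stack(["build", "--only-snapshot"]))`. Because Stack keeps already-built snapshot packages, each retry should pick up roughly where the previous attempt stopped.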

My apologies that this has remained an issue. I don't know why the new server is giving so many connection timeouts. I've moved the service over to a new EC2 instance, and I'm hoping that resolves the intermittent connection timeouts.

snoyberg on 23 Sep 2020

FYI: DNS is still propagating, the new IP address is 107.20.95.51

snoyberg on 23 Sep 2020

I think a post mortem on this issue — once resolved — would be fitting. I had to spend most of my time yesterday on migrating my Docker builds to cabal in order to deliver software to a customer. I’d like to understand why this issue occurred in the first place, and gain confidence in it not happening again.

runeksvendsen on 23 Sep 2020

👍3

The post mortem I can share now is:

A DevOps mistake took down the Kube cluster hosting Casa (initial downtime)
A new version of the cluster included a new version of a network layer, which included a bug that dropped about 0.5% of connections. This did not get picked up by initial testing, since those tests all passed. However, by law of large numbers, it killed basically every CI job (downtime before my temporary server update 4 hours ago)
Separately: there was a bug in Stack 2.3 that treated a Casa downtime as a failure. It was never supposed to do that, and that issue has been fixed. We're testing the new version and intend to release a patched Stack

snoyberg on 23 Sep 2020

❤7 👍3
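
For a rough sense of why a 0.5% drop rate was enough to break nearly every CI job, here is a back-of-the-envelope sketch. It assumes that each dependency triggers one independent request to Casa, which the thread does not confirm; the request counts are simply the build sizes mentioned in earlier comments (72 and 352 from the CentOS job, roughly 500 from a larger build).

```python
def failure_probability(drop_rate: float, requests: int) -> float:
    """Probability that at least one of `requests` independent requests is dropped."""
    return 1.0 - (1.0 - drop_rate) ** requests


if __name__ == "__main__":
    # 0.005 is the ~0.5% drop rate from the post mortem. Treating the package
    # counts from the thread as request counts is an assumption.
    for n in (72, 352, 500):
        chance = failure_probability(0.005, n)
        print(f"{n:4d} requests -> {chance:.1%} chance of at least one drop")
```

Under that assumption, a build making 352 requests has roughly an 83% chance of hitting at least one dropped connection, and since Stack 2.3 treated a single failed request as fatal, that was enough to fail the job.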

I've tested in multiple ways since the comment above, and I believe this issue is now fully resolved. We are still isolating which DevOps tool introduced the instability, but I believe it was an interaction between Istio ingress controller and Windows node groups. We are running this again on our Kubernetes cluster, which has been defenestrated while we test the ingress controller elsewhere.

Has anyone seen these timeout issues occur in the past 6 hours?

snoyberg on 23 Sep 2020

All builds have been OK for us since 23 Sept. 2020 at 7:31 CEST.
@snoyberg Many thanks for taking care of this and sharing the post mortem info.

jneira on 23 Sep 2020

❤1 👍1

Builds succeeding on my end too. Thanks for taking care of this!

jes-moore on 23 Sep 2020

❤1

Thanks for the confirmation, and again, apologies to everyone for the hassle this caused. Hopefully the new version of Stack will be ready for testing and then release soon, and even if we have server issues again (which I hope isn't the case), things will continue working.

snoyberg on 23 Sep 2020

❤6 👍2

Can we re-open? Been getting this error for the past few days when attempting to build on GitLab.

HttpExceptionRequest Request {
  host                 = "casa.fpcomplete.com"
  port                 = 443
  secure               = True
  requestHeaders       = []
  path                 = "/v1/pull"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}
 ConnectionTimeout
ERROR: Job failed: exit code 1

dmjio on 11 Oct 2020

@mattaudesse Thank you ! @snoyberg @mattaudesse, any ideas on how best to combat this issue?

dmjio on 12 Oct 2020

@dmjio I'm late to the game and don't have much to add RE: possible mitigations, but did you already try using a more recent stack release candidate and/or manually building from git?

mattaudesse on 12 Oct 2020

@mattaudesse

I've attempted to use the latest stack 2.3.3 (as shown below), but am still hitting timeout errors. Is there some kind of aggressive throttling policy in place?

```
$ stack --version
Version 2.3.3, Git revision cb44d51bed48b723a5deb08c3348c0b3ccfc437e x86_64 hpack-0.33.0
$ ./scripts/build.sh

[ -z true ]
[ = master ]
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/bin/:/root/.local/bin:.stack-work/dist stack build --fast --ghc-options=-j8 -ddump-to-file -ddump-hi --test --bench --no-run-benchmarks --local-bin-path .stack-work/dist --copy-bins
Preparing to install GHC (tinfo6) to an isolated location.
This will not interfere with any system-level installation.
Preparing to download ghc-tinfo6-8.8.4 ...
ghc-tinfo6-8.8.4: download has begun
ghc-tinfo6-8.8.4: 36.66 MiB / 198.61 MiB ( 18.46%) downloaded...
ghc-tinfo6-8.8.4: 81.00 MiB / 198.61 MiB ( 40.78%) downloaded...
ghc-tinfo6-8.8.4: 126.02 MiB / 198.61 MiB ( 63.45%) downloaded...
ghc-tinfo6-8.8.4: 171.56 MiB / 198.61 MiB ( 86.38%) downloaded...
ghc-tinfo6-8.8.4: 198.61 MiB / 198.61 MiB (100.00%) downloaded...
Downloaded ghc-tinfo6-8.8.4.
Unpacking GHC into /builds/.stack-root/programs/x86_64-linux/ghc-tinfo6-8.8.4.temp/ ...
Configuring GHC ...
Installing GHC ...
Installed GHC
HttpExceptionRequest Request {
host = "casa.fpcomplete.com"
port = 443
secure = True
requestHeaders = []
path = "/v1/pull"
queryString = ""
method = "POST"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
}
ConnectionTimeout
ERROR: Job failed: exit code 1
```

dmjio on 12 Oct 2020

@dmjio Sorry - I can't speak to the throttling question since it sounds like that'd be something internal to FP Complete's hosting (although reading through the comments here, I would've guessed the answer is probably no?).

I also haven't bumped into this issue personally, so my suggestion about using a newer stack comes from this response from @snoyberg rather than direct experience:

Separately: there was a bug in Stack 2.3 that treated a Casa downtime as a failure. It was never supposed to do that, and that issue has been fixed. We're testing the new version and intend to release a patched Stack

mattaudesse on 12 Oct 2020

2.3.3 seems to be the latest stack, so not sure what else I can do here. Sometimes the build finishes, which leads me to believe this is a server issue. Would be happy to be proven wrong.

dmjio on 12 Oct 2020

Correct, 2.3.3 is the latest official Stack. But there's a _release candidate_ for 2.5.0.1. You can upgrade to it with:

stack upgrade --binary-version 2.5.0.1

snoyberg on 12 Oct 2020

That seems to have done it. Thanks @snoyberg ! 🥳

dmjio on 12 Oct 2020

No problem. FTR, I don't believe the current issue is a server issue. Our monitoring has not shown any specific outage. But it's certainly expected that when making thousands of HTTP requests to the Casa server, one of them may fail. And the previous code treated that as fatal. We're not going to be investigating server outages further (though of course we'll keep alerts and metrics in place), and are pushing to get 2.5 stabilized and released.

snoyberg on 12 Oct 2020

👍1
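
The difference between treating one failed request as fatal and treating it as recoverable can be sketched like this. It is not Stack's or Pantry's actual code: the function `fetch_with_fallback` and the idea of an ordered list of sources (for example a cache first, then an origin server) are hypothetical and only illustrate the general pattern.

```python
from typing import Callable, Optional, Sequence, Tuple

# A named source: (label used in log messages, function that fetches a key).
Fetcher = Tuple[str, Callable[[str], bytes]]


def fetch_with_fallback(key: str, fetchers: Sequence[Fetcher]) -> bytes:
    """Try each (name, fetch) pair in order and return the first successful result.

    A failure from one source is logged and skipped rather than treated as fatal;
    an error is raised only if every source fails.
    """
    last_error: Optional[Exception] = None
    for name, fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:
            print(f"{name} failed for {key!r}: {exc}; trying next source")
            last_error = exc
    raise RuntimeError(f"all sources failed for {key!r}") from last_error
```

With this shape, a timeout from an optional cache only costs some time, and the build fails only when no source at all can supply the data.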

Maybe related to hackage.fpcomplete.com timeout? https://github.com/commercialhaskell/stack/issues/5417

Kitanotori on 30 Oct 2020

This issue happens again in stack 2.5.1 using Docker.

Update:

This only happens during the local Docker image build. After changing the Docker DNS settings, the issue is gone:
(Click Docker, Preferences -> Docker Engine)

{
  "features": {
    "buildkit": true
  },
  "experimental": false,
  "dns": [
    "8.8.8.8",
    "192.168.0.1"
  ]
}

cht8687 on 28 Dec 2020
