The following happens in CI:
```
HttpExceptionRequest Request {
  host = "casa.fpcomplete.com"
  port = 443
  secure = True
  requestHeaders = []
  path = "/v1/pull"
  queryString = ""
  method = "POST"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
}
ConnectionTimeout
```
Seems to be consistent right now. I guess a few questions about this:
On our CentOS CI job, it seems to happen around 72/352. Our Mac job got to 233/357.
We pin Stack to 2.3.1
We copied the Stack executable from GitHub into S3 and install from there
We're also seeing this issue in our CI, stack version 2.3.3
Also seeing this intermittently, about 10 minutes ago was the last time, but CircleCI succeeded before and after that. I have /root/.stack/ and .stack-work cached, so I would have expected the build to proceed fine offline.
Also getting this on TravisCI, build goes for about up to 10 minutes of installing stack dependencies before this happens, sometimes less (3 minutes). Sometimes it goes through, but in most cases it does not. Stack is 2.3.3.
```
HttpExceptionRequest Request {
  host = "casa.fpcomplete.com"
  port = 443
  secure = True
  requestHeaders = []
  path = "/v1/pull"
  queryString = ""
  method = "POST"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
}
ConnectionTimeout
```
The command "stack --no-terminal --install-ghc test --only-dependencies" failed and exited with 1 during .
I still get this randomly. Interestingly it mostly happens on:
Registering library for $some-dependency
I'm curious why stack decides to pull data from a remote source when locally registering a dependency.
OT: I also noticed that during initial dependency resolution stack does lots of N+1 queries to that endpoint. I wonder if they should be batched.
I see this also happening on CI of this very repo; I made a tiny PR yesterday and a couple of CI jobs failed with the same error.
Is there any workaround available?
Sadly it's a total showstopper for me right now.
Works fine for me now
Would it be possible to test out a build from master? It should already have the fix in place. Hopefully we'll get a new release out soon.
Also: the server outage this morning is resolved for now, so most occurrences of this issue are hopefully addressed for the moment.
Now I'm getting errors from get.haskellstack.org?
https://app.circleci.com/pipelines/github/LeapYear/aeson-schemas/233/workflows/756e3e3a-6ee2-4258-bbef-bd11102dae09/jobs/1159
https://app.circleci.com/pipelines/github/LeapYear/aeson-schemas/233/workflows/756e3e3a-6ee2-4258-bbef-bd11102dae09/jobs/1158
So I'm trying to build the master stack on CI to work around this, but I can't because I keep getting timeouts! If a release is going to take some time, would it be possible to host binaries somewhere? This is a total blocker for us.
All server issues should be fixed, but check out the Actions tab for builds. Latest: https://github.com/commercialhaskell/stack/actions/runs/265180359
Still getting
```sh
#!/bin/bash -eux -o pipefail
curl -sSL https://get.haskellstack.org/ | sh
stack --version
```

```
+ curl -sSL https://get.haskellstack.org/
+ sh
curl: (35) Encountered end of file
Exited with code exit status 35
CircleCI received exit code 35
```
EDIT: Seems to be working now
We're still seeing timeouts to casa.fpcomplete.com. Are there still lingering issues for others?
I can confirm I still have issues with Casa timeouts
```
HttpExceptionRequest Request {
  host = "casa.fpcomplete.com"
  port = 443
  secure = True
  requestHeaders = []
  path = "/v1/pull"
  queryString = ""
  method = "POST"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
}
ConnectionTimeout
```
Can confirm on my end as well. Using a stack from master seems to have resolved things.
Casa is still failing, and using stack from master is giving us Pantry errors: https://github.com/commercialhaskell/pantry/issues/27
FWIW our build has ~500 dependencies
Most of the comments above have been about CI systems. I was seeing identical failures locally (early on with the same ~500 packages as @brandon-leapyear was describing).
We were concerned that running our own casa server would require populating it first before we could start using it. However, after some experiments, this does not seem to be the case:
```bash
DBCONN="/var/tmp/casa.sqlite" PORT=80 AUTHORIZED_PORT=443 stack exec casa-server
```
Update the non-project config (/etc/stack/config.yaml or ~/.stack/config.yaml) to include the line

```yaml
casa-repo-prefix: http://localhost
```
and building works again (even when I disallow network access to casa.fpcomplete.com).
This allows me to update our CI config to run casa-server in a container (https://github.com/LeapYear/casa/pull/1) alongside our build.
This is probably not a long-term solution: it was only meant to unblock my team, and I worry that future versions of stack could break with an unpopulated casa-server (https://www.fpcomplete.com/blog/casa-and-stack/). A better solution would be to hook it up to a real Postgres database, run casa-server in my org's cloud, and restrict access to our CI system. This disruption was pretty bad for us; we typically use a private-cloud HA proxy to mirror resources, and supporting a no-POST mode in casa would have been a much simpler resolution.
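Putting the pieces together, the workaround can be sketched as a single CI bootstrap step. This is a sketch under assumptions: it presumes a casa-server binary on PATH, write access to ~/.stack, and that starting from an empty SQLite database is acceptable (as the experiments above suggest):

```shell
#!/usr/bin/env bash
set -euo pipefail

# 1. Serve Casa blob lookups locally so builds no longer depend on
#    casa.fpcomplete.com (DB path and ports follow the command above).
DBCONN="/var/tmp/casa.sqlite" PORT=80 AUTHORIZED_PORT=443 casa-server &

# 2. Point Stack at the local mirror via the non-project config.
mkdir -p ~/.stack
echo 'casa-repo-prefix: http://localhost' >> ~/.stack/config.yaml
```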
I'm still seeing failures as of 06:43:49 UTC. Both when building locally using Docker and from a GitHub Actions Docker build (docker/build-push-action@v1).
Running locally:
```
ansi-wl-pprint > copy/register
ansi-wl-pprint > Installing library in /root/.stack/snapshots/x86_64-linux/2d48e525c7ecdea92684d3c412a53b2518257e58d0af82a8d15dcb5bcad14a89/8.8.3/lib/x86_64-linux-ghc-8.8.3/ansi-wl-pprint-0.6.9-8g4uNwkOiJA5RF873eOrQv
ansi-wl-pprint > Registering library for ansi-wl-pprint-0.6.9..
HttpExceptionRequest Request {
  host = "casa.fpcomplete.com"
  port = 443
  secure = True
  requestHeaders = []
  path = "/v1/pull"
  queryString = ""
  method = "POST"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
}
ConnectionTimeout
The command '/bin/sh -c stack install --dependencies-only' returned a non-zero code: 1
```
GitHub actions:
```
2020-09-22T06:43:49.2411772Z appar > copy/register
2020-09-22T06:43:49.2634776Z appar > Installing library in /root/.stack/snapshots/x86_64-linux/2d48e525c7ecdea92684d3c412a53b2518257e58d0af82a8d15dcb5bcad14a89/8.8.3/lib/x86_64-linux-ghc-8.8.3/appar-0.1.8-Ivne98HYrwb7S9Jxqgw6mR
2020-09-22T06:43:49.4233626Z appar > Registering library for appar-0.1.8..
2020-09-22T06:43:49.4647761Z HttpExceptionRequest Request {
2020-09-22T06:43:49.4648362Z   host = "casa.fpcomplete.com"
2020-09-22T06:43:49.4648932Z   port = 443
2020-09-22T06:43:49.4649598Z   secure = True
2020-09-22T06:43:49.4649918Z   requestHeaders = []
2020-09-22T06:43:49.4650235Z   path = "/v1/pull"
2020-09-22T06:43:49.4650772Z   queryString = ""
2020-09-22T06:43:49.4651264Z   method = "POST"
2020-09-22T06:43:49.4651553Z   proxy = Nothing
2020-09-22T06:43:49.4651843Z   rawBody = False
2020-09-22T06:43:49.4652142Z   redirectCount = 10
2020-09-22T06:43:49.4652652Z   responseTimeout = ResponseTimeoutDefault
2020-09-22T06:43:49.4653187Z   requestVersion = HTTP/1.1
2020-09-22T06:43:49.4653485Z }
2020-09-22T06:43:49.4653943Z ConnectionTimeout
2020-09-22T06:43:49.6827680Z The command '/bin/sh -c stack install --dependencies-only' returned a non-zero code: 1
2020-09-22T06:43:49.6862619Z Error: exit status 1
```
It is also still happening here, for me on AppVeyor: both casa and get.haskellstack.org are failing from time to time.
I'm seeing this too for both windows and macOS builds for pandoc on GitHub actions. It occurs regularly now, but I don't remember seeing it until about a week ago.
~Stack on master works for us, after clearing all .stack-work and ~/.stack directories.~
Never mind, it worked the first run, but still fails on subsequent runs
These continuous timeouts are blocking our development process in ghcide (azure) and haskell-language-server (circleci). It failed last time just 2 hours ago.
I think it might help us to move forward if we can gather some information:
EDIT: Missed some comments above. Thanks, Brandon, for answering
master version of stack, but not any of the released versions yet

Update: I'm now able to get this locally by doing
```sh
mv ~/.stack ~/.stack.bak
stack build
```
> These continuous timeouts are blocking our development process in ghcide (azure) and haskell-language-server (circleci). It failed last time just 2 hours ago.
ghc-lib's Azure CI is blocked too, both the Digital Asset repo and my clone. It manifests as failures at random points during package downloads.
> mv ~/.stack ~/.stack.bak
> stack build
Yes, because ~/.stack is moved out of the way, the build invokes downloads.
Are there workarounds other than running a casa server locally? Stack is completely unusable for me atm.
The casa.fpcomplete.com server seems quite bad today: I've had to make my CI builds retry stack commands up to 10 times to start getting successful builds due to the POST /v1/pull timing out. I also had this happen locally with stack install shake on a clean machine, and I had to retry 3 times before it completed successfully.
Here's an example build where stack build --only-snapshot took 5 retries on windows, and 5 retries on mac: https://github.com/avh4/elm-format/actions/runs/267891714 (to see the logs, click into either build job in the left sidebar, and then expand the "stack build --only-snapshot" step).
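Until a fixed Stack release lands, the stopgap described above is simply retrying the flaky command. A minimal sketch of such a wrapper (the `retry` helper name and the retry count are illustrative assumptions, not part of Stack or any CI system):

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry N CMD...: run CMD up to N times, stopping at the first success.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed; retrying..." >&2
    sleep 1
  done
  return 1
}

# Hypothetical usage in a CI step:
# retry 10 stack build --only-snapshot
```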
My apologies that this has remained an issue, I don't know why the new server is giving so many connection timeouts. I've moved the service over to a new EC2 instance, I'm hoping that that resolves the intermittent connection timeouts.
FYI: DNS is still propagating, the new IP address is 107.20.95.51
I think a post mortem on this issue — once resolved — would be fitting. I had to spend most of my time yesterday on migrating my Docker builds to cabal in order to deliver software to a customer. I’d like to understand why this issue occurred in the first place, and gain confidence in it not happening again.
The post mortem I can share now is:
Mitigation going forward is really focused on the Stack bug. While we want to have Casa as a high availability service, we did not intend to rely on it that way. We try to only rely on services like S3 with strong SLAs for that.
I've tested in multiple ways since the comment above, and I believe this issue is now fully resolved. We are still isolating which DevOps tool introduced the instability, but I believe it was an interaction between Istio ingress controller and Windows node groups. We are running this again on our Kubernetes cluster, which has been defenestrated while we test the ingress controller elsewhere.
Has anyone seen these timeout issues occur in the past 6 hours?
All builds have been OK for us since 23 Sept 2020 at 7:31 CEST.
@snoyberg Many thanks for taking care of this and sharing the post mortem info.
Builds succeeding on my end too. Thanks for taking care of this!
Thanks for the confirmation, and again, apologies to everyone for the hassle this caused. Hopefully the new version of Stack will be ready for testing and then release soon, and even if we have server issues again (which I hope isn't the case), things will continue working.
Can we re-open? Been getting this error for the past few days when attempting to build on GitLab.
```
HttpExceptionRequest Request {
  host = "casa.fpcomplete.com"
  port = 443
  secure = True
  requestHeaders = []
  path = "/v1/pull"
  queryString = ""
  method = "POST"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
}
ConnectionTimeout
ERROR: Job failed: exit code 1
```
@mattaudesse Thank you! @snoyberg @mattaudesse, any ideas on how best to combat this issue?
@dmjio I'm late to the game and don't have much to add RE: possible mitigations, but did you already try using a more recent stack release candidate and/or manually building from git?
@mattaudesse
I've attempted to use the latest stack 2.3.3 (as shown below), but am still hitting timeout errors. Is there some kind of aggressive throttling policy in place?
```
$ stack --version
Version 2.3.3, Git revision cb44d51bed48b723a5deb08c3348c0b3ccfc437e x86_64 hpack-0.33.0
$ ./scripts/build.sh
```
@dmjio Sorry - I can't speak to the throttling question since it sounds like that'd be something internal to FP Complete's hosting (although reading through the comments here, I would've guessed the answer is probably no?).
I also haven't bumped into this issue personally, so my suggestion about using a newer stack comes from this response from @snoyberg rather than direct experience:
- Separately: there was a bug in Stack 2.3 that treated a Casa downtime as a failure. It was never supposed to do that, and that issue has been fixed. We're testing the new version and intend to release a patched Stack
2.3.3 seems to be the latest stack, so not sure what else I can do here. Sometimes the build finishes, which leads me to believe this is a server issue. Would be happy to be proven wrong.
Correct, 2.3.3 is the latest official Stack. But there's a _release candidate_ for 2.5.0.1. You can upgrade to it with:
```sh
stack upgrade --binary-version 2.5.0.1
```
That seems to have done it. Thanks @snoyberg ! 🥳
No problem. FTR, I don't believe the current issue is a server issue. Our monitoring has not shown any specific outage. But it's certainly expected that when making thousands of HTTP requests to the Casa server, one of them may fail. And the previous code treated that as fatal. We're not going to be investigating server outages further (though of course we'll keep alerts and metrics in place), and are pushing to get 2.5 stabilized and released.
Maybe related to hackage.fpcomplete.com timeout? https://github.com/commercialhaskell/stack/issues/5417
This issue is happening again with stack 2.5.1 using Docker.

Update: this only happens during the local Docker image build; after changing the Docker DNS settings, the issue is gone (Docker Desktop: Preferences -> Docker Engine):
```json
{
  "features": {
    "buildkit": true
  },
  "experimental": false,
  "dns": [
    "8.8.8.8",
    "192.168.0.1"
  ]
}
```