Etcd: Cannot run etcd as a Windows service

Created on 26 Jan 2019 · 24 Comments · Source: etcd-io/etcd

Repro:

  1. Extract etcd binaries to C:\etcd
  2. mkdir C:\etcd\data
  3. Grant "Full Access" (rwx) to "NT AUTHORITY\Local Service" on C:\etcd
  4. Start an elevated command prompt
  5. Install the service: sc create etcd binpath= "C:\etcd\etcd.exe --data-dir C:\etcd\data" obj= "NT AUTHORITY\Local Service"
  6. Start the service: net start etcd

Expected:

etcd service starts

Actual:

  • etcd service start times out
  • Windows event log shows two errors:

    • Service Control Manager: "A timeout was reached (120000 milliseconds) while waiting for the etcd service to connect."

    • Service Control Manager: "The etcd service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion."

  • C:\etcd\data contains the following files:

    FullName, Length
    C:\etcd\data\member, 1
    C:\etcd\data\member\snap, 1
    C:\etcd\data\member\wal, 1
    C:\etcd\data\member\snap\db, 32768
    C:\etcd\data\member\snap\db.lock, 0
    C:\etcd\data\member\wal\0.tmp, 64000000
    C:\etcd\data\member\wal\0000000000000000-0000000000000000.wal, 64000000

Workaround:

  • None

Additional information:

Running etcd.exe from the command prompt works fine. However, etcd service won't even run as "LocalSystem" (that's the "Do whatever you want" built-in account).
I was able to reproduce the issue on multiple Win10 machines.
I assume that it has something to do with the working directory (that's at least the most likely cause from my experience if an application can be started from cmd.exe but not as a service). The default working directory for a Windows service is C:\Windows\system32 (which is locked down for good reasons).

Environment:

  • Windows 10.0.16299 Build 16299 x64
  • etcd.exe --version
    etcd Version: 3.3.11
    Git SHA: 2cf9e51d2
    Go Version: go1.10.7
    Go OS/Arch: windows/amd64

Labels: Help Wanted, Windows, good first issue, stale

All 24 comments

Hi @jasper-d, this looks like a known issue. Maybe you can help us fix it? I have neither access to nor expertise with Windows machines to test on, so your help would be greatly appreciated.

ref:
https://github.com/etcd-io/etcd/issues/3351
https://github.com/etcd-io/etcd/pull/3410

@hexfusion I'll look into it, but it may take a few days because I have little to no experience with Go.

@jasper-d we can assist with the Go if you can assist with Windows testing. Take a look at the old existing PR above and see if it gives you any hints. Basically, can you review the existing research that was done and see what the proper method is for managing a Windows service with Go? Maybe it is still the same, in which case we can reuse that PR as a starting point.

1.) review existing PRs and issues.
2.) research current best practices for Windows service and golang

From here we have a good place to start, this will move it forward without code. Thanks!

@hexfusion I don't mind trying out some things and learning some Go in the process. I got the PR working with some minor changes and will take a look at some Go services that run on Windows (e.g. gnatsd, Elastic Filebeat) to see how they do it.

Hi @jasper-d, just checking in. Do you have any questions?

@hexfusion Not yet, I was occupied with some more pressing issues. I probably won't have time to look into it before next weekend.

Just wanted to let you know that this is not a general issue.
I am running etcd and the etcd grpc proxy as a Windows service across a wide variety of Windows machines (Windows Server 2012 R2, Windows Server 2016, Windows Server 2019, Windows 10) and have been for >6 months and across various etcd versions.

Of note:

  • I am also running them under the Local System account. I am not sure whether I tried running them with virtual service accounts
  • I am managing the installation and execution via nssm and not sc. I never tried installing them with sc so not sure whether that makes a difference.

I am using the pre-release version 2.24-101 linked on this page: https://nssm.cc/download
Not sure whether the normal version would work.

With nssm I am specifying the etcd directory as the startup directory. Since you mentioned working directories, that might make a difference.

I am not doing anything too special parameter-wise. I am specifying various bindings explicitly, though. And also I am not binding anything to localhost, 127.0.0.1 or 0.0.0.0. Not that that should make a difference, though. The service usually fails to start very promptly if a port/binding is in use.

Example:

etcd --name etcd3 --client-cert-auth=true --listen-client-urls https://1.2.3.4:2379 --advertise-client-urls https://etcd3.example.com:2379 --listen-peer-urls https://1.2.3.4:2380 --initial-advertise-peer-urls https://etcd3.example.com:2380 --initial-cluster-token etcd-cluster-1 --discovery-srv example.com --initial-cluster-state existing --peer-cert-file C:\somepath\member3.pem --peer-key-file C:\somepath\member3-key.pem --peer-trusted-ca-file C:\somepath\ca.pem --cert-file C:\somepath\member3.pem --key-file C:\somepath\member3-key.pem --trusted-ca-file C:\somepath\ca.pem

@jasper-d If you want etcd to work on Windows, you have to make it a Windows service.

#3410 achieves that on Windows, but it does not work on Linux.

I would like to improve on it.

@tskarman
I can reproduce jasper-d's problem when I use sc on Windows 10.

@hexfusion
Can I create a new PR?
It would work on both Windows and Linux.

@haroldHT #3410 does not gracefully stop etcd and has some other flaws. The reason that etcd does not work as a service is that it doesn't communicate with the SCM. #3410 adds some basic support for that (using Go's svc package, which essentially all Windows services written in Go use). Properly handling stop/shutdown as well as redirecting stdout/stderr (i.e. enabling log output) requires some more work. You're welcome to contribute of course. 🙂
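
To make the SCM point concrete, here is a minimal, portable sketch of the handshake a service is expected to perform: report that it is starting, report that it is running (the step whose absence leads to the start timeout), and acknowledge stop requests. The State and Cmd types and runService are hypothetical stand-ins written for this illustration, loosely modelled on the handler shape in golang.org/x/sys/windows/svc. They are not that package's API, and this is not the code from #3410.

```go
package main

import "fmt"

// State and Cmd are simplified stand-ins (assumptions for illustration)
// for the status and control codes a service exchanges with the SCM.
type State int

const (
	StartPending State = iota
	Running
	StopPending
	Stopped
)

type Cmd int

const (
	Stop Cmd = iota
	Shutdown
	Interrogate
)

// runService models the service side of the handshake. It reports
// StartPending, calls start, reports Running, and then serves control
// requests until it is told to stop or the request channel is closed.
func runService(requests <-chan Cmd, status chan<- State, start, stop func()) {
	status <- StartPending
	start()
	// A process that never reports Running is timed out by the SCM.
	status <- Running
	for cmd := range requests {
		switch cmd {
		case Interrogate:
			status <- Running // answer with the current state
		case Stop, Shutdown:
			status <- StopPending
			stop() // graceful shutdown of the server would go here
			status <- Stopped
			return
		}
	}
}

func main() {
	requests := make(chan Cmd)
	status := make(chan State, 8) // buffered so the model never blocks
	done := make(chan struct{})
	go func() {
		runService(requests, status, func() {}, func() {})
		close(done)
	}()
	requests <- Interrogate
	requests <- Stop
	<-done
	close(status)
	for s := range status {
		fmt.Println("reported state:", s)
	}
}
```

In a real Windows service the start and stop callbacks would wrap starting and shutting down the etcd server, and the loop would live inside the handler that the service package invokes.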

@tskarman NSSM does a lot of stuff (i.e. stdout/stderr redirection). I reckon it does communicate with SCM as well which would explain why you can start etcd as a service when using it. However, relying on hackish 3rd party tools is a workaround, not a solution from my point of view.

@jasper-d Yes, I completely understand and am now interested in a solution as well. Let me know when I can help you. My Go is rusty and not a priority for me right now, but I could help with testing across the aforementioned operating systems.

That being said, I run etcd like this in production and have not encountered any reliability, responsiveness, or service signalling issues. So I would recommend this as a workaround for the time being.

@jasper-d #10460

But I am confused by the log output.
For the location of the log files (i.e. etcd_err.log, etcd_out.log) I can use cfg.ec.Dir,
but for writing the log output, does etcd have some utilities that I can use?

Also, I do not know how to connect etcd's logs to the service.
Thanks.

@hexfusion
Can I create a new PR?
It would work on both Windows and Linux.

@haroldHT thanks for showing interest in resolving this. Please work with @jasper-d and @tskarman on a solution then let me know if you have any questions.

@hexfusion Sorry, I keep @-mentioning the wrong people.
etcd has so many kinds of logs that it confuses me; I need to spend a lot of time to understand them.

cc @wenjiaswe

Thank you all for helping out! @haroldHT also contacted me offline and showed interest in contributing on this as his first etcd contribution. I will assign @haroldHT for now, @jasper-d and @tskarman any help is welcome!

/assign @haroldHT

Well, it seems I cannot assign you right now, @haroldHT. This is a good place to start your contribution. Thanks!

@haroldHT I managed to botch up the code base enough to make etcd run as a Windows service. It properly interacts with the Windows Service Control Manager through x/sys/windows/svc. I haven't fully tested it yet, but as far as I can tell it works for ordinary cluster members, layer 4 gateways and gRPC proxies. Log output is redirected to the Windows Event Log or a file (logs are confusing indeed, I ended up redirecting every log that didn't hide well enough 🙃).
I need to clean up some stuff before I can push it to a public repo, but I will do so tomorrow so you can take a look.

cc: @hexfusion @tskarman

+1

Changes (with some comments) are here: https://github.com/jasper-d/etcd/commit/9d9235226cc0d4c0f8de72d5d6e99d76ea30062c

The good:

  • Runs as a Windows service, notifies SCM (Service Control Manager) when started (i.e. isn't timed out by SCM anymore) and shuts down gracefully when receiving a stop signal from SCM
  • Works for cluster members as well as gRPC and layer 4 proxy
  • Logs are (partially) redirected (stdout/stderr doesn't work for Windows services), either to Windows event log or a specified file (I'm using lumberjack here to get easy log rotation because there is no logrotate on Windows).

The bad:

  • 0 coverage 😱
  • There are several logs that aren't redirected yet. The ones I know of are the gRPC log, TCP log and raft log. As @haroldHT mentioned, the number of logs is rather confusing.
  • SCM reports an error when stopping the service (A system error has occurred. System error 1067 has occurred. The process terminated unexpectedly.). I need to investigate what's happening there.
  • Linux/tests are probably broken, haven't checked yet
  • Running a multi-node cluster doesn't work right now. I have yet to verify if that's an issue with my cluster config or a bug introduced by my changes.
  • There is no shutdown logic yet for proxies/gateways. I assume that sockets/grpc client/server must be properly closed when terminating

So, there is still a lot of work to do. Before continuing I would need to add at least some tests and set up a proper testing environment. The main problems remain the wealth of logs (I would certainly need some advice here) and the different ways in which etcd is started. I think that should be unified ideally, but that would probably be quite a refactoring (i.e. should be done by someone with a better understanding of the code base and go).

@haroldHT How is it going for you?
@hexfusion I wouldn't mind investing some more time, but you may want to take a look at it first to determine if it's worth the effort. I also cannot make any definitive commitments to a timeline because it's essentially a pet project for learning some Go in my spare time.

@wenjiaswe sorry, I did not reply in time.
I will continue to follow your suggestion.

@jasper-d At first I wanted to use kardianos/service to turn etcd into a service.
It seems like a good solution if we can manage the number of logs.
Your solution has also helped me a lot, thanks.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
