Faas: Suggestion: surface logs in REST API

Created on 9 Jul 2018  路  30Comments  路  Source: openfaas/faas

Suggestion from @burtonr

Can we, and should we surface function logs via the REST API?

If we do then these could be queried from the OpenFaaS UI/CLI also.

faas-cli store deploy figlet
faas-cli logs figlet

Is tailing logs in or out for the first version?

areux desigreview help wanted sizxl skiladvanced

Most helpful comment

Exposing logs via the REST API also means that the requesting user does not need cli access to the cluster (e.g.kubectl). This makes it easier to provide the needed information to more people without giving all of them direct access to the cluster. This should make it easier to deploy and operate OpenFaaS with larger teams and in organizations with more access controls in place.

I think we should target logs streaming. The orchestrator sdks support streaming logs already so it is mostly a matter of picking a streaming method (websockets versus http-streaming with ndjson).

All 30 comments

Being able to access the logs from the faas-cli would mean that users wouldn't need to switch context when working with their functions between faas-cli deploy and docker service logs <function> or kubectl logs <function pod>

This would make it easier to be able to deploy and work with OpenFaaS regardless of the orchestrator used.

Also, being able to view the logs from the UI would make it easier for users who are not familiar or comfortable with Swarm or Kubernetes to be able to easier debug their functions and add more utility to the UI.

Exposing logs via the REST API also means that the requesting user does not need cli access to the cluster (e.g.kubectl). This makes it easier to provide the needed information to more people without giving all of them direct access to the cluster. This should make it easier to deploy and operate OpenFaaS with larger teams and in organizations with more access controls in place.

I think we should target logs streaming. The orchestrator sdks support streaming logs already so it is mostly a matter of picking a streaming method (websockets versus http-streaming with ndjson).

Wouldn't this mean the introduction of another component to store those logs? As aggregation and possibly rotation would be necessary on garbage collected functions.

I think that would depend on what kind of support we want to offer. If the goal is to show the current logs, like tail, it would be enough to scrape the current containers and stream those back to the user.

_Side note_ This would required multiple goroutines, at least for the kubernetes implementation because I believe the API only allows you to fetch/stream for a single container

If we want to provide functionality like "logs since \

@shaunwarman no. It would expose whatever is in the Swarm/K8s/Fargate logs API.

The first version of this should be scoped to fetch the "tail" of N log lines, but not "watch/follow" as in "tail -f" or "kubectl logs -f".

If aggregating to a third-party component such as ELK then we'd have to interface with ELK to fetch logs.

Based upon feedback from the community some people already do this and have their own solutions for it.

There is value in providing a link to the logs API provided by kubectl logs / docker service logs especially in development. I was speaking to Wireline.io about this with @telackey (ping 馃敂 )

If we were to provide an endpoint to follow logs such as /system/logs/ - this would need to connect to all available replicas (which can change at any time) so may be best implemented in a web socket and would probably have to have its own go-routine looking after detecting which replicas to collect from and which to disconnect from.

If not following logs, but just tailing N lines from all current replicas this is made much simpler but may be of more limited use.

I'd like more input on this and suggestions around design.

@alexellis

I have returned from vacation. I'd be interested in taking a look at this (with the warning that I'll have to work it in amongst other commitments).

For development purposes, I think the most attractive option is, as you mention, tapping into the APIs used for kubectl logs / docker service logs.

@telackey I have worked with the logs sdk for Kubernetes quiet a bit, so if you run into any issues there, please feel free to reach out to me

Would be good if you guys could share some analysis or thoughts around design/implementation. What does kubetail do for instance on Kubernetes?

Hey, @telackey just wanted to check in an see how things are going. Is this something you would still like to work on?

I think it's fine if we want to start exploring prototypes individually. Don't let anyone be put off looking into this because it's a big feature and we have to be mindful of the design and any impact on performance.

+1

Having spoken to @stefanprodan we think that in production people likely will use their own stuff for log aggregation and analysis like ELK, but in dev and smaller deployments perhaps we could connect to each pod and stream? This would offer a lot of utility to developers having one consistent way to get function logs.

An alternative is putting the burden of ELK into our core stack which can take GB and GB of RAM even for a small catalog of data.

Any other thoughts from those on the thread?

Alex

For Kubernetes we could import the kubectl package and use this https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/polymorphichelpers/logsforobject.go#L50

Since we have a standard name for the function container we can use it to avoid getting the logs for sidecars like Envoy.

@LucasRoesler didn't you also find the clientset code for accessing logs? Can you share a link here too?

Then for streaming what do we do? Http with an near infinite timeout or websockets?

The CoreV1 clientset has a GetLogs method for pods,

https://github.com/kubernetes/client-go/blob/master/kubernetes/typed/core/v1/pod_expansion.go
and https://github.com/kubernetes/api/blob/de5c567eef5cb050b5476484727b074df9185088/core/v1/types.go#L4290

usage looks roughly like this

logReq := c.clientset.CoreV1().Pods(namespace).GetLogs(name, opts)
stream, err = logReq.Stream()
...

I had never seen the kubectl code though, it looks nice.

What if we have 5 replicas in the cluster when we start then that goes up to 7.. how do we respond to that or if it goes down to 1?

During scale up, I think we would have to listen to the Pod events and see when one moves to ready and check if it matches the query. If the pod dies while we are streaming logs, that log stream would die anyway, we could see the stream end and report it to stdout, when the number of streams hits zero, we stop the cli.

Derek assign me

I have been working on an implementation that would provide live developer logs via the API by leveraging the swarm/k8s apis to query the current containers and stream those logs.

Broadly the design would result in

  • a new interface and and new endpoint being defined in the openfaas-provider
  • an implementation of the LogRequestor interface in swarm
  • an implementation of the LogRequestor interface in faas-netes

The API would allow the user to request logs for a function since a certain timestamp and/or matching a pattern.

I have not implemented the pattern matching yet, but the swarm API has otherwise been implemented in my swarm fork.

I have run into a couple issues in kubernetes. k8s is more complicated because it does not provide a simple API for querying the logs of a Deployment. You must watch for function pods to scale up or down and then start/stop a log stream from those containers.

I started a simplified implementation by leveraging kail This project has already implemented many of the required watchers and log streaming to provide a simplified CLI for tailing k8s container logs. This also mostly works with a couple of exceptions. I think to use kail to power the logs in the k8s provider we would _need_ to make at least 2 changes upstream and would like 1 other change:

  1. we need to request timestamps for each log line. This is an option in the k8s api and we would want to turn it one so that (like in the swarm api) we could parse it would of the logs
  2. we need more control over the log buffer in kail. Currently, requesting the history of a function or a function under even light load via hey causes messages to be dropped while watching logs live. curl -v -X GET "localhost:8081/system/logs?name=go-echo&limit=1000&since=2019-02-24T00:00:00Z" has at least 2000 dropped log messages
    The relevant section of code for why these messages are dropped is here https://github.com/boz/kail/blob/09872a2c4df097879642f4d0e43f1102638f0047/monitor.go#L196-L210
  3. enable control of the follow flag, it currently true by default, so we can't read until now and terminate. We also can't fake it because we don't have timestamps on the log lines

I have also noticed a couple issues with trying to stream the log results over http. I implemented a handler that returns the logs using Transfer-Encoding: chunked. This works ... but, if there is a long pause in the log stream (say 15 seconds) everything appears to behave as expected except that the http request closes. curl says

* transfer closed with outstanding read data remaining
* stopped the pause stream!
* Closing connection 0

while the server logs indicate that "the client closed the connection".

Additionally, I am currently communicating directly with the providers when testing my PoC implementations. The actual implementation would need to be exposed through the gateway server which enforces a short connection timeout. meaning, we can't hold a connection for 1+ minutes to steam a day's worth of log. This would require either disabling the server wide timeout and enforcing it via a middleware on the endpoints that require it _or_ using they hijack method to takeover the connection and reset/clear the timeouts for that one request.

type Hijacker interface {
        // Hijack lets the caller take over the connection.
        // After a call to Hijack the HTTP server library
        // will not do anything else with the connection.
        //
        // It becomes the caller's responsibility to manage
        // and close the connection.
        //
        // The returned net.Conn may have read or write deadlines
        // already set, depending on the configuration of the
        // Server. It is the caller's responsibility to set
        // or clear those deadlines as needed.
        //
        // The returned bufio.Reader may contain unprocessed buffered
        // data from the client.
        //
        // After a call to Hijack, the original Request.Body must not
        // be used. The original Request's Context remains valid and
        // is not canceled until the Request's ServeHTTP method
        // returns.
        Hijack() (net.Conn, *bufio.ReadWriter, error)
}

This of course brings other complications.

I would love to technical feedback and suggestions. Currently, I like how the code reads with kail, but there are some critical issues with using it this way. Implementing the logs directly would result in very nearly the same code as kail's in many cases. So, my thoughts are to suggest and "fix" these issues upstream.

It is worth adding, the base implementations that I am currently working on are intended to be minimal "allow developers to see the current logs" type of implementations. More fully featured logging support can be provided via the interface but backed by a proper logging system like Elasticsearch or a cloud provider.

These base implementations should still be good for local development and dev clusters so that engineers can still gain insight and evaluate the system.

I choose to continue forward by forking kail for now and making two changes to it:

  1. requesting timestamps from k8s
  2. removing the ticker that causes log messages to drop when the stream buffer is full, instead it will block

The second change definitely requires some feedback about potential downsides. I suspect there could be issues if the system is under high load.

I pushed my initial working implementation of the k8s log provider to Docker Hub. Brief description of testing it is provided here:

https://gist.github.com/LucasRoesler/254fa07e42adb980d87e9e5a658e0743

I have left one issue in the code. There is a known issue with the "long" periods of inactivity causing the connection to close and therefore stop the log stream. This is due to the server-wide write timeout configuration. This was mentioned in my last comment.

I am still tinkering with some alternative implementations in the provider, essentially I need to learn more about the Hijack interface. But, I view this as a bug that can be resolved while we assess the quality of the interface and experience.

I have spent most of today looking at how to properly hijack the http request and to ensure that I still listen to the client, close connections, and clean up correctly. At the moment I think it would be simpler to

  1. write a "TimeoutMiddleware" that we apply to the other handlers and then remove the server timeout settings to that the logging endpoint does not need to use hijack, _or_
  2. we split the logger endpoint to a separate server (so a separate port) and ultimately have the gateway talk to that extra port. This also provides for future flexibility to allow the log provider to be implemented by a separate server/service, e.g. plugin an ElasticSearch provider.

I favor the second approach

@alexellis and @burtonr i have implemented log handlers in the gateway, faas-netes, and in faas-swarm. This can now be tested together using these instructions https://gist.github.com/LucasRoesler/ec0c365d02296c29e66be683247be8aa

The implementation diffs for each repo:

One thing that I noticed at this point, I put the logs endpoint at /system/logs. This endpoint might be better at /function/logs. We probably should not nest it under /function/<name>/logs due to the sub-path routing support for functions.

At this point, i need to write the CLI command to query the logs _and_ I need code review to assess the code. I could open draft PRs, thoughts?

Also, I haven't written any tests yet.

I have started writing tests and have gotten basic tests into the CLI, gateway, and provider. My initial tests focus on the parsing in the CLI and avoiding goroutine leaks in the gateway and provide layers.

I have created a quick demo of the functionality that demonstrates filtering, following the logs, and setting the starting time.

asciicast

The CLI also supports filtering on the function instance name and you can limit the number of log lines returned. As much as possible, the filter and limit behavior is implemented in the provider layer, so that the upstream (swarm, k8s, nomad, etc) providers only need to focus on creating log channel.

I will be opening the initial pull requests soon so that people can start reviewing it while i work on an ELK stack provider, which will be more suitable for production deployments.

I have a couple of concerns about this proposal.

  1. Streaming logs from all instances can be too voluminous to be useful at scale.
  2. Making the user specify the instance-id breaks the 'serverless' semantics.

Is there a way to get logs per invocation? There are two scenarios that can be anticipated:

  1. Show me logs for my current invocation.
  2. Show me logs for invocation with a particular ID.

Is it possible to add an invoke-with-logs command/ header to enable logs for current invocation? For allowing historical searches by invocation id, it might be necessary to leverage ELK?

I agree that adding an "invocation" filter makes a lot of sense and is not precluded from this design. We should figure out a way to make it well integrated. It should even be straight forward to attach an id and propagate it through the headers. once the value is exposed and emitted in the logs, we can parse it and apply a filter on it. (this is of course going to work better with ELK or in a low load system for the default loggers)

One possible issue is that a function can be any container server on port 8080, it might not always emit logs with the required invocation id, so for those cases when the developer has fully customized the function image, we would need to ensure that the invocation_id can be safely ignored.

I am not sure how scenario 1 would work. In particular, how would the server/cli know the "current" invocation, especially in a high load system? Most likely you would need to already know an id value and then it just becomes scenario 2.

Integration with ELK or any other provider with an API is definitely possible and was explicitly thought about in the proposed implementation. The ELK implementation simply needs to be written. See https://github.com/openfaas/faas-provider/pull/15 for the interface. This interface is designed to be split from the provider explicitly for the ELK use case.

I also think that using ELK or some other log provider is going to be important in production systems. I wouldn't recommend using the default k8s/swarm implementations other than in local/dev environments. But it is nice and good that for a developer looking to try out OF, or running on a small home RaspPi cluster, they don't need to know how to manage ELK right away.

The idea is that the default for development is to get logs from all pods behind the function, for production it is to use ELK or similar.

Lucas has implementations for the Kubernetes pod streaming based upon nail, which we would appreciate your direct testing on and the ELK library will be developed after that.

The feature is optional and of course you don't _have_ to use it. We do have a request-id that is injected at the gateway if you want to trace logs.

@LucasRoesler 馃憢 could we close this tracking issue now? I think we have the feature in faas-netes/swarm and the Operator?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

alexellis picture alexellis  路  4Comments

derailed picture derailed  路  6Comments

saad749 picture saad749  路  5Comments

maxbry picture maxbry  路  7Comments

edouardkleinhans picture edouardkleinhans  路  8Comments