@paulfantom I am writing here a TODO list for python.d based on my findings.
logs_per_interval = 200 and log_interval = 3600. So, every hour (this_hour = int(now / log_interval)) it should reset the counter and allow up to logs_per_interval log entries until the next hour. This is how netdata does it: https://github.com/firehol/netdata/blob/d7b083430de1d39d0196b82035162b4483c08a3c/src/log.c#L33-L107
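A minimal sketch of that counter logic in Python (class and method names are hypothetical, not the actual python.d code):

```python
import time

class LogLimiter:
    """Hypothetical sketch of the hourly log flood protection described above."""
    LOGS_PER_INTERVAL = 200
    LOG_INTERVAL = 3600  # seconds

    def __init__(self):
        self.this_hour = -1
        self.logged_this_hour = 0

    def can_log(self, now=None):
        now = time.time() if now is None else now
        hour = int(now / self.LOG_INTERVAL)
        if hour != self.this_hour:      # a new interval started: reset the counter
            self.this_hour = hour
            self.logged_this_hour = 0
        self.logged_this_hour += 1
        return self.logged_this_hour <= self.LOGS_PER_INTERVAL
```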
redis and squid)

NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically. This variable is used when netdata runs in a container. The system directories /proc, /sys of the host should be exposed with this prefix.

Connection: keep-alive.
```
python.d ERROR: exim_local exim [Errno 2] No such file or directory
python.d ERROR: exim_local exim [Errno 2] No such file or directory
python.d ERROR: exim: is misbehaving. Reason:'NoneType' object has no attribute '__getitem__'
No unix socket specified. Trying TCP/IP socket.
[Errno 111] Connection refused
[Errno -9] Address family for hostname not supported
name:
python.d ERROR: redis/local: check() function reports failure.
```
```sh
# ./plugins.d/python.d.plugin debug cpufreq 1
INFO: Using python v2
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/root/netdata/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['cpufreq']
python.d DEBUG: cpufreq: loading module configuration: '/etc/netdata/python.d/cpufreq.conf'
python.d DEBUG: cpufreq: reading configuration
python.d DEBUG: cpufreq: job added
python.d INFO: Disabled cpufreq/None
python.d ERROR: cpufreq/None: check() function reports failure.
python.d FATAL: no more jobs
DISABLE
```
python.d.conf to set the PATH to be searched for commands. By default everything in /usr/sbin/ is not found.

Made it in #695
- incremental dimensions in LogService
- hddtemp.chart.py
- more to come...
The last four should be easy to implement, but the first three will need some more work.
added connection keep-alive to urlservice
added issue with cpufreq and exim (the PATH).
added default retries and logging of update() failures
I think this one is solved:
This should state the job name, not the name
When a service reports failure to collect data (during update()), there should be log entry stating the reason of failure.
In SimpleService (which is a parent of all modules) method update() can fail only in two cases:
_get_data(), which is implemented in modules, but I also added self.debug("_get_data() returned no data") as a precaution). Those are basically the same:
1. This message could state where it tried to connect:

   `[Errno 111] Connection refused`

2. This message could state the hostname it tried to resolve:

   `[Errno -9] Address family for hostname not supported`
Those errors come from the exception handler in the _get_raw_data() method of SocketService. The lines

```python
except Exception as e:
    self.error(str(e))
```

are responsible for it.
And it is solved in #696
I believe this:
This should state what is the problem
```sh
# ./plugins.d/python.d.plugin debug cpufreq 1
INFO: Using python v2
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/root/netdata/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['cpufreq']
python.d DEBUG: cpufreq: loading module configuration: '/etc/netdata/python.d/cpufreq.conf'
python.d DEBUG: cpufreq: reading configuration
python.d DEBUG: cpufreq: job added
python.d INFO: Disabled cpufreq/None
python.d ERROR: cpufreq/None: check() function reports failure.
python.d FATAL: no more jobs
DISABLE
```
is also solved in #696
This message should be a debug log
No unix socket specified. Trying TCP/IP socket.
solved in #696
marked the ones you said are included in #696 as done.
added log flood protection.
do you want me to add to this list the redis response parsing to keep the socket alive?
Currently I don't close socket in redis. I introduced variable _keep_alive in SocketService which has default value of True.
But I think you could check this as completed:
This should state the job name, not the name
Also I think I did this today:
When a service reports failure to collect data (during update()), there should be log entry stating the reason of failure.
Currently I don't close socket in redis. I introduced variable _keep_alive in SocketService which has default value of True.
I am confused. How does this work?
look at the end of _get_raw_data in SocketService. It just checks if connection should be closed, and _sock freed or not. Since I cannot reconnect to socket and I can only recreate it, this variable decides if socket should be recreated every time _get_raw_data is executed.
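That decision can be shown as a self-contained sketch (a fake socket stands in for the real one; helper names are illustrative, not the exact python.d code):

```python
class FakeSock:
    """Stand-in for a real socket, just tracking close()."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class SocketServiceSketch:
    """Hypothetical sketch of the _keep_alive logic described above."""
    def __init__(self, keep_alive=True):
        self._keep_alive = keep_alive
        self._sock = None

    def _connect(self):
        return FakeSock()           # a real service would create a socket here

    def _receive(self):
        return "raw data"           # a real service would recv() here

    def _get_raw_data(self):
        if self._sock is None:      # recreate the socket only when needed
            self._sock = self._connect()
        data = self._receive()
        if not self._keep_alive:    # tear down so the next run recreates it
            self._sock.close()
            self._sock = None
        return data
```

With _keep_alive left at its default of True, the same socket object survives across calls; with False it is closed and rebuilt every time.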
Also I am thinking about different approach to SocketService. I will probably implement some basic wrappers like _connect(), _close(), _send(), and _receive() and leave them to be used in _get_data() implemented per module.
Maybe I will leave slightly modified version of _get_raw_data just to be used for quick prototyping.
look at the end of _get_raw_data in SocketService. It just checks if connection should be closed, and _sock freed or not. Since I cannot reconnect to socket and I can only recreate it, this variable decides if socket should be recreated every time _get_raw_data is executed.
ok. let's assume you want to keep the socket open. The key question is: does the socket already include any data from the previous request?
The only way to know, is if you have parsed the response according to its protocol. I don't think there is any other way of doing it right.
So, if this was a squid (http-like), you should parse Content-Length in response headers and match exactly that many bytes. For redis you should parse the first line, and again get exactly that many bytes.
Am I missing something?
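For the redis case, "parse the first line, and again get exactly that many bytes" could be sketched like this (a hypothetical helper; it assumes a bulk reply of the form `$<length>\r\n<data>\r\n`, which is what INFO returns):

```python
def redis_reply_complete(raw):
    """Return True once a redis bulk reply has fully arrived (sketch)."""
    header, sep, body = raw.partition("\r\n")
    if not sep or not header.startswith("$") or not header[1:].isdigit():
        return False                 # length line not (fully) received yet
    expected = int(header[1:])       # e.g. "$2159" -> 2159 payload bytes follow
    return len(body) >= expected + 2 # payload plus the trailing "\r\n"
```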
Also I am thinking about different approach to SocketService. I will probably implement some basic wrappers like _connect(), _close(), _send(), and _receive() and leave them to be used in _get_data() implemented in module.
You can do this, but the callback function seems pretty awesome to me. It simplifies the modules a lot.
The only way to know, is if you have parsed the response according to its protocol
That's why I want to create wrappers and move the data gathering logic to module instead of parent class.
Also I don't want to get rid of the callback function.
And currently I can be pretty sure socket doesn't have any data from previous request. Only way there could be data from previous request is if response is fragmented, and fragments arrive with large enough delays.
And currently I can be pretty sure socket doesn't have any data from previous request. Only way there could be data from previous request is if response is fragmented, and fragments arrive with large enough delays.
It might work like that for localhost. But it will lead to errors if redis, for example, is on a distant location (e.g. cross data-center, VPNs, etc). I wouldn't rely on it.
Also I don't want to get rid of the callback function.
Then how will it work with _connect(), _recv(), etc and the callback?
I think that if a module is willing to use its own socket implementation, it can - using directly python features. There is no need to provide wrappers.
The only reason to provide a wrapper is to simplify it.
Here is why:
Basically it is TCP Window Scaling (and this is a very reliable mechanism).
It won't work if ping between netdata host and service host is greater than 500ms.
And that is the reason I want to move logic to modules.
The only reason to provide a wrapper is to simplify it.
That's exactly the reason.
For example this could be _connect():
```python
def _connect(self):
    if self._sock is None:
        try:
            if self.unix_socket is None:
                self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                self._sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                #sock.setsockopt(socket.SOL_SOCKET, socket.TCP_NODELAY, 1)
                #sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
                #sock.settimeout(self.update_every)
                self._sock.settimeout(0.5)
                self._sock.connect((self.host, self.port))
                self._sock.settimeout(0.5)  # Just to be sure
            else:
                self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
                #self._sock.settimeout(self.update_every)
                self._sock.settimeout(0.05)
                self._sock.connect(self.unix_socket)
                self._sock.settimeout(0.05)  # Just to be sure
        except Exception as e:
            self.error(str(e),
                       "Cannot create socket with following configuration: host:", str(self.host),
                       "port:", str(self.port),
                       "socket:", str(self.unix_socket))
            self._sock = None
            return None
```
Wait a moment. I am preparing a response.
It won't work if ping between netdata host and service host is greater than 500ms.
Yes, that is the problem with this. You cannot rely on timings for socket communication. TCP has a default timeout of 2 days!
I think the way it is now it is very close to what we need:
This does not depend on timing and will allow the module to handle the keep-alive (via its callback).
Also, the callback should have verified the data according to the protocol implemented.
Can you find something wrong in this scenario?
If I remove the timeout completely, create in python.d.plugin can be blocked, thus the whole python.d.plugin will be blocked. And if I set the socket timeout to something greater than or equal to update_every we will have wrong chart timings (as we had previously, when socket timeout was equal to update_every).
The size of a variable in python is limited only by host memory, so I can create a buffer of almost any size.
Receive some data, append them to the data you already have and give the whole of it to the callback
But when should I stop appending and return it to callback? I don't have any mechanism to check if connection is closed or not (only timeout).
SocketService opens the socket and sends the request
Not every time we need to send something. Look at hddtemp.chart.py
If I remove the timeout completely, create in python.d.plugin can be blocked, thus the whole python.d.plugin will be blocked.
I thought you were using non-blocking for check() and threads for update(). How will this block the entire plugin?
And if I set the socket timeout to something greater than or equal to update_every we will have wrong chart timings (as we had previously, when socket timeout was equal to update_every).
No problem with this. netdata will show a gap on the chart as soon as the plugin reports new data.
But when should I stop appending and return it to callback? I don't have any mechanism to check if connection is closed or not (only timeout).
You stop on 3 conditions:
Why do you need any other way?
I thought you were using non-blocking for check() and threads for update(). How will this block the entire plugin?
create is blocking.
No problem with this. netdata will show a gap on the chart as soon as the plugin reports new data.
It isn't about that. Do you remember that case when squid had update_every set to 1, but was running every 2 seconds? It is about this.
callback returns FALSE
Callback cannot return anything, since it wasn't called. I am asking when should I stop gathering data to pass to callback.
And I don't use non-blocking sockets (because they cannot have timeout)
And I don't know when socket was closed by server. I couldn't find how to do it correctly.
I am totally lost... I don't know python and this is probably the problem...
Here is what I know (not python or python.d.plugin specific):
errno is set accordingly. So, to timeout you have to do it yourself. So, I guess this will work for blocking sockets (sorry, I am not a python expert so it may have syntax errors):
```python
all_data = ""
wait = True
while wait:
    data = socket.recv(...)
    if not data:  # this is equivalent to -1 in C
        # socket closed
        break
    all_data += data
    wait = callback(all_data)
return all_data
```
In the squid example, the server will close the connection. In the redis example, the callback will return FALSE.
Summing up:
- check and create of every module are run before spawning new threads
- the socket module doesn't support any type of notification about the server closing the connection (it seems everyone is using recv with a timeout)
- run_time

I tried to do this like you suggested in the example. It should work, but it didn't and I really don't know why.
I would be very happy to get rid of timeouts and move to non-blocking sockets, I just don't know how to do this.
And I spent last two days to get this working in current version.
ok, one last try because I think what you say is not valid.
Give me 10 mins. I will try to adapt the test for squid you gave me yesterday. 10 mins.
If you are trying to adapt code to run in non-blocking mode, don't bother. I know that squid testing snippet can run without timeouts. I tried it. But when I tried to include it into SocketService it failed.
It is very probable I did something wrong.
I will try to clean up SocketService, write wrappers, and educate myself on sockets in python. After this I will start tweaking with data gathering implementation.
I just don't want to break something that is currently running for majority of implementations.
for squid this works:
```python
#!/usr/bin/env python
import socket

HOST = 'localhost'
PORT = 3128
URL = 'GET cache_object://localhost:3128/counters HTTP/1.0\r\n\r\n'

def callback(data):
    return True

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.connect((HOST, PORT))
sock.send(URL.encode())

raw = ""
wait = True
while wait:
    buf = sock.recv(65536)
    if not buf:
        break
    raw += buf
    wait = callback(raw)

print(raw)
```
I bet it will also work for redis with the proper callback that can return False if it received the whole reply.
I am 100% confident you should not rely on timing. Please don't do this.
EDIT: updated the code with just one call to recv().
wrong paste. I fixed it to use just one call to recv()
Ok, I will move it to non-blocking, just as I said in the previous message - not now. Could you add this to the list at the beginning?
probably non-blocking confused you. To use non-blocking you need select.select() or select.poll() or select.epoll(). So you connect one or more sockets and then you select() on them to find which one is ready to send or receive data.
I.e. in non-blocking mode you should only pause on select(), not on recv() or send().
I have to leave. I'll not respond for 1-2 hours.
@paulfantom sorry for insisting on this, but I am sure we should not rely on timing.
An example of non-blocking I/O with timeout:
http://whitepythons.blogspot.gr/2014/05/non-blocking-mode-and-timeouts-sockets.html
The above example is not optimal, but it shows how to use errno and timeouts.
At the end of the article it mentions select which allows the program to wait for one or more non-blocking sockets. This is the right way to do non-blocking I/O.
To effectively use non-blocking, the python core should work like that:
select.select() or select.poll() or select.epoll(), giving it all sockets/files (the optimal for linux is select.epoll(), but only select.select() is portable). This call will freeze the process until any of the sockets/files is ready for read or write (depending on what was requested).

I have seen a lot of python examples on the net that use non-blocking I/O by waiting between recv() calls for 10ms, 20ms, etc and looping 500.000 times until they receive the data. All these examples are garbage. They shouldn't use non-blocking I/O that way.
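Reduced to a runnable sketch, the pattern looks like this (socketpair() stands in for real connections; in the plugin, the list passed to select() would hold every module's socket):

```python
import select
import socket

# Pause only in select(); recv() is called only on sockets reported ready.
a, b = socket.socketpair()
b.sendall(b"hello")

readable, _, _ = select.select([a], [], [], 1.0)  # waits until 'a' is readable
data = a.recv(4096) if readable else b""

a.close()
b.close()
```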
In netdata there is a single threaded web server implementation. Check it here: https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L389-L528
Here is the select statement: https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L429
Here it finds the "job" responsible for the socket:
https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L473-L477
Here it calls the "job" based on what was requested (read or write):
https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L491-L509
In netdata, I have in one case made the assumption that a very small request or reply can call send() immediately after opening or accepting a connection. This works, since we know the socket is already ready to receive data from us. However, even this "innocent" assumption should be avoided to properly use non-blocking I/O. Example of this "fault" (for sending the HTTP header back to the web browser) is here: https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_client.c#L2059
On the other hand, in blocking mode, timing is not needed. Calls to recv() or send() block. If we have one thread per job, there is no need to timeout. So a blocking design should only call recv() and send() while in a job thread.
I have the feeling you want me to do this before any other thing I might do for netdata.
I know it shouldn't use timeout, I also said I will move it to non-blocking, just give me some time. For now if you think this is totally wrong, you can disable squid.chart.py, redis.chart.py and hddtemp.chart.py in installer and configuration files and use bash counterparts for new installations.
Remember that I already did an example of using select with timeout on non-blocking sockets. It is here and is very similar to the first example you mentioned. Probably this will be the base for non-blocking SocketService.
Once again. I cannot do blocking sockets since check and create are not in multiple threads. If you want me to change this design issue, I would propose to revert state of git repository to the point where we have bash plugin as a default and python as experimental, since this will take me some time to accomplish.
Nonetheless thank you for providing me some examples.
I think this:
The service that runs external commands should be more descriptive. Example running exim plugin when exim is not installed:
is resolved in #703
ok. I had the impression you were resisting this change. Since you understand you have to change it, I am good.
I don't mind if you are going to make it blocking or non-blocking. Both are good designs, the threads overhead is not that high, and non-blocking is more efficient but has some complexity. So there is a trade-off here: best efficiency (non-blocking) or simplicity (blocking)? Your call. I am happy with either, as long as they are implemented properly. I advise you though to avoid using both of them, at different states of the plugin. You will get the cons of both and the pros of none.
I don't mind about the time this is done. What I know is that if you move on and add more modules, you will probably multiply the amount of work you will have to do to when you fix it. This is technical debt. If you believe the changes required will only affect SocketService, I think you can delay it. But if a total re-structuring of the plugin might be required, I believe it is best to do it now. But this is again your call.
Finally, there is no going back. netdata works primarily with the python plugin. Period. This is also why I asked you to add pyyaml to it.
@paulfantom I sent you an invitation to join the netdata development team.
I think that current python.d.plugin design is quite good. Yes there is an issue with blocking check or create, but I am resolving it by implementing precautions in framework classes (like SocketService).
Also since I know how much you don't like current design of SocketService (I don't like it either), I will try to change it by tomorrow (maybe even today).
Thank you for the invitation.
@ktsaou I believe I got SocketService working with your guidelines (well, most of them). Basically what I did is in this commit. However I got one problem.
I've implemented callback method _check_raw_data which should return True if all data is received, or false if it isn't. In redis it was simple, just parse first line and voila, but when I wanted to use Content-Length http header in squid.chart.py I found one disturbing issue: squid doesn't set Content-Length. Do you have any proposition, how can I check if squid.chart.py received all data from squid?
Here is squid response header:
```
HTTP/1.1 200 OK
Server: squid/3.5.19
Mime-Version: 1.0
Date: Mon, 18 Jul 2016 21:20:35 GMT
Content-Type: text/plain;charset=utf-8
Expires: Mon, 18 Jul 2016 21:20:35 GMT
Last-Modified: Mon, 18 Jul 2016 21:20:35 GMT
X-Cache: MISS from localhost.localdomain
X-Cache-Lookup: MISS from localhost.localdomain:3128
Via: 1.1 localhost.localdomain (squid/3.5.19)
Connection: close
```
Also I think I have implemented IPv6 in _connect method. I haven't tested this, but it is taken from python documentation so it should work (it works for IPv4).
Currently I don't think SocketService from my repo is ready for PR, since it needs some polishing, but do you think I am going in the right direction?
Short description of what is going on in the code (example of executing _get_raw_data):
None if sending failed (also disconnect from socket)

_check_raw_data)

_keep_alive is set to False

No. 5 detailed:
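The IPv6-capable _connect mentioned above typically follows the getaddrinfo() loop shown in the Python socket documentation; a hypothetical sketch (the function name is illustrative):

```python
import socket

def connect_any(host, port, timeout=0.5):
    """Try every address family getaddrinfo() returns (IPv4 and IPv6)."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock                    # first address that works wins
        except OSError as e:
            last_err = e
            sock.close()
    raise last_err or OSError("getaddrinfo returned no usable address")
```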
I've implemented callback method _check_raw_data which should return True if all data is received, or false if it isn't. In redis it was simple, just parse first line and voila, but when I wanted to use Content-Length http header in squid.chart.py I found one disturbing issue: squid doesn't set Content-Length. Do you have any proposition, how can I check if squid.chart.py received all data from squid?
Let me check:
```
# curl -v --raw "http://localhost:3128/squid-internal-mgr/counters"
* Trying ::1...
* Connected to localhost (::1) port 3128 (#0)
> GET /squid-internal-mgr/counters HTTP/1.1
> Host: localhost:3128
> User-Agent: curl/7.49.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: squid/3.5.20
< Mime-Version: 1.0
< Date: Mon, 18 Jul 2016 22:41:14 GMT
< Content-Type: text/plain;charset=utf-8
< Expires: Mon, 18 Jul 2016 22:41:14 GMT
< Last-Modified: Mon, 18 Jul 2016 22:41:14 GMT
< X-Cache: MISS from boxe
< X-Cache-Lookup: MISS from boxe:3128
< Transfer-Encoding: chunked
< Connection: keep-alive
<
528
sample_time = 1468881627.947637 (Mon, 18 Jul 2016 22:40:27 GMT)
client_http.requests = 282328
client_http.hits = 4138
client_http.errors = 89
client_http.kbytes_in = 116414
client_http.kbytes_out = 16647550
client_http.hit_kbytes_out = 27102
server.all.requests = 109631
server.all.errors = 0
server.all.kbytes_in = 16347655
server.all.kbytes_out = 106222
server.http.requests = 94305
server.http.errors = 0
server.http.kbytes_in = 5432451
server.http.kbytes_out = 31545
server.ftp.requests = 16
server.ftp.errors = 0
server.ftp.kbytes_in = 11192
server.ftp.kbytes_out = 1
server.other.requests = 15310
server.other.errors = 0
server.other.kbytes_in = 10904011
server.other.kbytes_out = 74675
icp.pkts_sent = 0
icp.pkts_recv = 0
icp.queries_sent = 0
icp.replies_sent = 0
icp.queries_recv = 0
icp.replies_recv = 0
icp.query_timeouts = 0
icp.replies_queued = 0
icp.kbytes_sent = 0
icp.kbytes_recv = 0
icp.q_kbytes_sent = 0
icp.r_kbytes_sent = 0
icp.q_kbytes_recv = 0
icp.r_kbytes_recv = 0
icp.times_used = 0
cd.times_used = 0
cd.msgs_sent = 0
cd.msgs_recv = 0
cd.memory = 0
cd.local_memory = 349
cd.kbytes_sent = 0
cd.kbytes_recv = 0
unlink.requests = 3508
page_faults = 20
select_loops = 33936090
cpu_time = 1311.020000
wall_time = 46.297577
swap.outs = 5836
swap.ins = 746
swap.files_cleaned = 82
aborted_requests = 525
0
* Connection #0 to host localhost left intact
```
web servers may respond with either Content-Length: BYTES or Transfer-Encoding: chunked. In the chunked case, the first line after \r\n\r\n has the number of bytes following (in hex, check in netdata here: https://github.com/firehol/netdata/blob/8027db0a458960e75285732986d62f15795e8cbc/src/web_client.c#L2121). A zero, means EOF (check in netdata here: https://github.com/firehol/netdata/blob/8027db0a458960e75285732986d62f15795e8cbc/src/web_client.c#L2167).
To avoid implementing all the HTTP protocol, you could use HTTP/1.0 where squid does not seem to support keep-alive, does not send Content-Length and closes the connection.
```
# curl -v -0 --raw -H "Connection: keep-alive" "http://localhost:3128/squid-internal-mgr/counters"
* Trying ::1...
* Connected to localhost (::1) port 3128 (#0)
> GET /squid-internal-mgr/counters HTTP/1.0
> Host: localhost:3128
> User-Agent: curl/7.49.1
> Accept: */*
> Connection: keep-alive
>
< HTTP/1.1 200 OK
< Server: squid/3.5.20
< Mime-Version: 1.0
< Date: Mon, 18 Jul 2016 22:50:31 GMT
< Content-Type: text/plain;charset=utf-8
< Expires: Mon, 18 Jul 2016 22:50:31 GMT
< Last-Modified: Mon, 18 Jul 2016 22:50:31 GMT
< X-Cache: MISS from boxe
< X-Cache-Lookup: MISS from boxe:3128
< Connection: close
<
sample_time = 1468882227.957787 (Mon, 18 Jul 2016 22:50:27 GMT)
client_http.requests = 283186
client_http.hits = 4139
client_http.errors = 89
client_http.kbytes_in = 116668
client_http.kbytes_out = 16649723
client_http.hit_kbytes_out = 27103
server.all.requests = 109930
server.all.errors = 0
server.all.kbytes_in = 16348915
server.all.kbytes_out = 106446
server.http.requests = 94553
server.http.errors = 0
server.http.kbytes_in = 5432833
server.http.kbytes_out = 31612
server.ftp.requests = 16
server.ftp.errors = 0
server.ftp.kbytes_in = 11192
server.ftp.kbytes_out = 1
server.other.requests = 15361
server.other.errors = 0
server.other.kbytes_in = 10904889
server.other.kbytes_out = 74832
icp.pkts_sent = 0
icp.pkts_recv = 0
icp.queries_sent = 0
icp.replies_sent = 0
icp.queries_recv = 0
icp.replies_recv = 0
icp.query_timeouts = 0
icp.replies_queued = 0
icp.kbytes_sent = 0
icp.kbytes_recv = 0
icp.q_kbytes_sent = 0
icp.r_kbytes_sent = 0
icp.q_kbytes_recv = 0
icp.r_kbytes_recv = 0
icp.times_used = 0
cd.times_used = 0
cd.msgs_sent = 0
cd.msgs_recv = 0
cd.memory = 0
cd.local_memory = 349
cd.kbytes_sent = 0
cd.kbytes_recv = 0
unlink.requests = 3508
page_faults = 20
select_loops = 33948560
cpu_time = 1312.190000
wall_time = 3.918642
swap.outs = 5844
swap.ins = 748
swap.files_cleaned = 82
aborted_requests = 525
* Closing connection 0
```
As you can see, we requested keep-alive, but the server responded Connection: close.
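The two termination rules described above (Content-Length vs Transfer-Encoding: chunked, with connection close as the HTTP/1.0 fallback) can be sketched as a completeness check for the callback (a hypothetical helper; real chunked parsing would also handle trailers, this only applies the rules just described):

```python
def http_response_complete(raw):
    """Return True once an HTTP response has fully arrived (sketch)."""
    head, sep, body = raw.partition("\r\n\r\n")
    if not sep:
        return False                          # headers not complete yet
    headers = {}
    for line in head.split("\r\n")[1:]:       # skip the status line
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()
    if headers.get("transfer-encoding", "").lower() == "chunked":
        return body.endswith("0\r\n\r\n")     # zero-length chunk ends the reply
    if "content-length" in headers:
        return len(body) >= int(headers["content-length"])
    return False                              # HTTP/1.0: wait for server close
```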
Regarding the implementation, all seem good, except the first bullet of point 5.
Can you point to the code where you do this check?
I verified squid behaves the same with the cache_object://... URLs
file base.py, method _receive, lines:

```python
try:
    ready_to_read, _, in_error = select.select([self._sock], [], [], 0.01)
except Exception as e:
    self.debug("SELECT", str(e))
    self._disconnect()
    break
if len(ready_to_read) > 0:
    buf = self._sock.recv(4096)
```

I am using select to check if the socket is readable, then I check if the array ready_to_read has something in it.
Also thanks for the information about HTTP.
I have already run some tests and keeping the socket alive (HTTP/1.1) cuts down run time by ~9% (from 11ms to 10ms). But I don't know if it will be the case after changing _check_raw_data from just returning True to something else.
I know you would prefer to change this line:
```python
ready_to_read, _, in_error = select.select([self._sock], [], [], 0.01)
```
to something like:
```python
ready_to_read, _, in_error = select.select([self._sock1, self._sock2, self._sock3, self._sock4], [], [], 0.01)
```
and then check which socket is available, but I don't know how to implement it in current version.
I was wrong. Using HTTP 1.0 and closing connections won't time out select.select and data gathering will be run in 1ms. However if I use HTTP 1.1, select.select will wait and time out (thus running in minimal time of 10ms).
I will be still investigating.
@paulfantom what would happen if you just self._sock.recv(4096) directly without the select? Do you need this to prevent it from waiting forever or is there any other reason?
Do you need this to prevent it from waiting forever
Exactly this.
Exactly this.
ok. What will happen if this waits forever? Is it about check()/create()? Because I guess in update() it is ok to wait forever. Right?
It is about check/create. You are right.
ok, why don't you then call the select() statement only if this code is called in check()/create()? Can you know when it is called?
I can set some variable with initial value of 2 and decrease it.
But I will try to find some other way.
I think this is good. If you just do it in create/update only, then:
Keep in mind in both cases this has to be in a loop. Do not assume the response is just one packet.
Ok, will work on it.
Hmmm this way I can set up socket in blocking mode, and just change timeout after create to something like 5 minutes.
Hmmm this way I can set up socket in blocking mode, and just change timeout after create to something like 5 minutes.
initialization or update() ?
When I said to bypass select I meant to use blocking mode. Otherwise it will not work. non-blocking without select is a no-go (you will again end-up playing with timings).
What is not clear to me is if check/create and update both use the same blocking mode or not. Are you using non-blocking and blocking?
Hmmm this way I can set up socket in blocking mode, and just change timeout after create to something like 5 minutes.
I think it will work. As I see in python docs the settimeout() method can be used to set a timeout on blocking sockets.
So, you could just set it blocking everywhere and settimeout() something like 30secs or 1min to be used for both check/create and update.
Remember to collect data in a loop. Either the server will close the connection (no need to timeout) or the callback will let you know it is done.
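The suggestion above is just a blocking socket with a deadline; for example (60 seconds is an arbitrary example value):

```python
import socket

# A blocking socket with settimeout(): connect()/recv()/send() still block,
# but now raise socket.timeout if nothing happens within 60 seconds.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(60)
timeout_value = sock.gettimeout()
sock.close()
```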
Ok, I got it working. With select, with non-blocking socket, with HTTP/1.0 and HTTP/1.1 and with 60 second timeout on select.
I got one thing wrong. Placement of callback method _check_raw_data. It was in _get_raw_data but should be in _receive. When I changed it, squid and redis have run times under 1ms.
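With the callback inside the receive loop, reading can stop as soon as the reply is complete instead of waiting out a select() timeout; a sketch of that shape (names illustrative; socketpair() stands in for a real connection):

```python
import select
import socket

def receive_all(sock, check_raw_data, timeout=60):
    """Read chunks until the callback reports a complete reply or the
    server closes the connection (sketch, not the actual python.d code)."""
    raw = ""
    while True:
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            raise socket.timeout("no data within timeout")
        buf = sock.recv(4096)
        if not buf:                 # empty read: server closed the connection
            break
        raw += buf.decode()
        if check_raw_data(raw):     # callback decided the reply is complete
            break
    return raw
```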
nice!
I have committed those changes to my repository. It still needs some testing, but as of now it works pretty well.
The main problem is with the _check_raw_data methods. In redis it was easy, but squid is not that friendly, so for now it just checks if the needed squid data is there. Tomorrow I will be looking for some better way.
Also I saw you have redis on IPv6, could you test if my current version can connect on IPv6?
Also, do you think that if I use HTTP/1.1 to communicate with squid, it will be enough to check if the last 7 bytes are equal to "\r\n0\r\n\r\n"?
For some reason, redis has stopped working for me, even on the merged version.
Redis is here:
```
# telnet localhost 6379
Trying ::1...
Connected to localhost.
Escape character is '^]'.
INFO
$2159
# Server
redis_version:3.2.1
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:e19bb8c3d1c28291
redis_mode:standalone
os:Linux 4.4.6-gentoo x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:5.3.0
process_id:1
run_id:20358fe7a1243ebc0d4adb7d8835d0519d884093
tcp_port:6379
uptime_in_seconds:524905
uptime_in_days:6
hz:10
lru_clock:9269583
executable:/data/redis-server
config_file:

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:821736
used_memory_human:802.48K
used_memory_rss:1056768
used_memory_rss_human:1.01M
used_memory_peak:965584
used_memory_peak_human:942.95K
total_system_memory:8266366976
total_system_memory_human:7.70G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:1.29
mem_allocator:jemalloc-4.0.3

# Persistence
loading:0
rdb_changes_since_last_save:14
rdb_bgsave_in_progress:0
rdb_last_save_time:1468362470
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:497205
total_commands_processed:497470
instantaneous_ops_per_sec:0
total_net_input_bytes:2986385
total_net_output_bytes:1073491329
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:1
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:527.06
used_cpu_user:237.62
used_cpu_sys_children:527.06
used_cpu_user_children:237.62

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1,expires=0,avg_ttl=0
^]
telnet> c
Connection closed.
```
Your repo:
# ./plugins.d/python.d.plugin debug redis 1
INFO: Using python v3
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/data/src/netdata-paulfantom.git/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['redis']
python.d DEBUG: redis: loading module configuration: '/etc/netdata/python.d/redis.conf'
python.d DEBUG: redis: reading configuration
python.d DEBUG: redis/socket2: job added
python.d DEBUG: redis/socket1: job added
python.d DEBUG: redis/localhost: job added
python.d DEBUG: redis/socket3: job added
python.d DEBUG: redis/localipv4: job added
python.d DEBUG: redis/localipv6: job added
python.d DEBUG: all job objects [<Service(socket2, initial daemon)>, <Service(socket1, initial daemon)>, <Service(localhost, initial daemon)>, <Service(socket3, initial daemon)>, <Service(localipv4, initial daemon)>, <Service(localipv6, initial daemon)>]
python.d DEBUG: redis_socket2 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket2 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/run/redis/redis.sock
python.d ERROR: redis_socket2 no data received
python.d ERROR: redis_socket2 check function failed.
python.d INFO: Disabled redis/socket2
python.d DEBUG: redis_socket1 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket1 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /tmp/redis.sock
python.d ERROR: redis_socket1 no data received
python.d ERROR: redis_socket1 check function failed.
python.d INFO: Disabled redis/socket1
python.d DEBUG: redis_localhost No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localhost No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localhost
python.d ERROR: redis_localhost cannot find check() function.
python.d DEBUG: redis_socket3 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket3 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/lib/redis/redis.sock
python.d ERROR: redis_socket3 no data received
python.d ERROR: redis_socket3 check function failed.
python.d INFO: Disabled redis/socket3
python.d DEBUG: redis_localipv4 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv4 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv4
python.d ERROR: redis_localipv4 cannot find check() function.
python.d DEBUG: redis_localipv6 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv6 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv6
python.d ERROR: redis_localipv6 cannot find check() function.
python.d DEBUG: overridden job names: []
python.d DEBUG: all remaining job objects: []
python.d FATAL: no more jobs
DISABLE
merged repo:
# ./plugins.d/python.d.plugin debug redis 1
INFO: Using python v3
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/data/src/netdata.git/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['redis']
python.d DEBUG: redis: loading module configuration: '/etc/netdata/python.d/redis.conf'
python.d DEBUG: redis: reading configuration
python.d DEBUG: redis/localipv4: job added
python.d DEBUG: redis/socket2: job added
python.d DEBUG: redis/socket1: job added
python.d DEBUG: redis/localipv6: job added
python.d DEBUG: redis/socket3: job added
python.d DEBUG: redis/localhost: job added
python.d DEBUG: all job objects [<Service(localipv4, initial daemon)>, <Service(socket2, initial daemon)>, <Service(socket1, initial daemon)>, <Service(localipv6, initial daemon)>, <Service(socket3, initial daemon)>, <Service(localhost, initial daemon)>]
python.d DEBUG: redis_localipv4 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv4 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv4
python.d ERROR: redis_localipv4_localipv4 cannot find check() function.
python.d DEBUG: redis_socket2 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket2_socket2 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/run/redis/redis.sock
python.d ERROR: redis_socket2_socket2 no data received
python.d ERROR: redis_socket2_socket2 check function failed.
python.d INFO: Disabled redis/socket2
python.d DEBUG: redis_socket1 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket1_socket1 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /tmp/redis.sock
python.d ERROR: redis_socket1_socket1 no data received
python.d ERROR: redis_socket1_socket1 check function failed.
python.d INFO: Disabled redis/socket1
python.d DEBUG: redis_localipv6 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv6 No request specified. Using: 'INFO
'
python.d ERROR: redis_localipv6_localipv6 [Errno -9] Address family for hostname not supported Cannot create socket with following configuration: host: ::1 port: 6379 socket: None
python.d ERROR: redis_localipv6_localipv6 no data received
python.d ERROR: redis_localipv6_localipv6 check function failed.
python.d INFO: Disabled redis/localipv6
python.d DEBUG: redis_socket3 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket3_socket3 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/lib/redis/redis.sock
python.d ERROR: redis_socket3_socket3 no data received
python.d ERROR: redis_socket3_socket3 check function failed.
python.d INFO: Disabled redis/socket3
python.d DEBUG: redis_localhost No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localhost No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localhost
python.d ERROR: redis_localhost_localhost cannot find check() function.
python.d DEBUG: overridden job names: []
python.d DEBUG: all remaining job objects: []
python.d FATAL: no more jobs
DISABLE
Any ideas?
Also, do you think that if I use 'HTTP/1.1' to communicate with squid, it will be enough to check whether the last 7 bytes are equal to "\r\n0\r\n\r\n"?
Nice idea! Yes. I suggest, though, to also check for Transfer-Encoding: chunked; otherwise, for non-chunked responses, this check will never match anything.
Keep in mind you should also handle a connection close from the server.
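As a sketch of that end-of-response test (the function name and raw-data handling here are my own, not the squid module's actual code):

```python
def is_chunked_reply_complete(raw):
    """Heuristic end-of-response test for a raw HTTP/1.1 reply.

    Only meaningful when the server actually used chunked transfer
    encoding, so the header is checked first; a chunked body
    terminates with the seven-byte sequence CRLF '0' CRLF CRLF.
    """
    header = raw.split(b"\r\n\r\n", 1)[0]
    if b"Transfer-Encoding: chunked" not in header:
        return False
    return raw[-7:] == b"\r\n0\r\n\r\n"
```

Note that this only decides "the reply is complete"; a server closing the connection mid-body still has to be handled separately in the receive loop.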
I think this message is the key:
python.d ERROR: redis_localhost_localhost cannot find check() function.
It is printed only when an AttributeError is thrown inside python.d.plugin, here:
if not job.check():
msg.error(job.chart_name, "check function failed.")
self._stop(job)
else:
msg.debug(job.chart_name, "check succeeded")
i += 1
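A minimal demonstration of why that message can be misleading: any AttributeError raised inside a check() that does exist takes the same path (the Job class here is hypothetical, for illustration only):

```python
class Job:
    def check(self):
        # a Python 3 bug inside check(): str has no decode(),
        # so this raises AttributeError even though check() exists
        return "INFO\r\n".decode("utf-8")

try:
    Job().check()
except AttributeError as err:
    # an "except AttributeError" clause at the call site would
    # misreport this as a missing check() function
    print("cannot find check() function:", err)
```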
I pushed a possible fix and more debug statements to my branch. Could you check?
Didn't work:
# ./plugins.d/python.d.plugin debug redis 1
INFO: Using python v3
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/data/src/netdata-paulfantom.git/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['redis']
python.d DEBUG: redis: loading module configuration: '/etc/netdata/python.d/redis.conf'
python.d DEBUG: redis: reading configuration
python.d DEBUG: redis/localipv4: job added
python.d DEBUG: redis/socket3: job added
python.d DEBUG: redis/socket2: job added
python.d DEBUG: redis/localipv6: job added
python.d DEBUG: redis/socket1: job added
python.d DEBUG: redis/localhost: job added
python.d DEBUG: all job objects [<Service(localipv4, initial daemon)>, <Service(socket3, initial daemon)>, <Service(socket2, initial daemon)>, <Service(localipv6, initial daemon)>, <Service(socket1, initial daemon)>, <Service(localhost, initial daemon)>]
python.d DEBUG: redis_localipv4 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv4 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv4
python.d ERROR: redis_localipv4 cannot find check() function.
python.d DEBUG: 'str' object has no attribute 'decode'
python.d DEBUG: redis_socket3 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket3 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/lib/redis/redis.sock
python.d ERROR: redis_socket3 no data received
python.d ERROR: redis_socket3 check function failed.
python.d INFO: Disabled redis/socket3
python.d DEBUG: redis_socket2 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket2 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/run/redis/redis.sock
python.d ERROR: redis_socket2 no data received
python.d ERROR: redis_socket2 check function failed.
python.d INFO: Disabled redis/socket2
python.d DEBUG: redis_localipv6 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv6 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv6
python.d ERROR: redis_localipv6 cannot find check() function.
python.d DEBUG: 'str' object has no attribute 'decode'
python.d DEBUG: redis_socket1 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket1 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /tmp/redis.sock
python.d ERROR: redis_socket1 no data received
python.d ERROR: redis_socket1 check function failed.
python.d INFO: Disabled redis/socket1
python.d DEBUG: redis_localhost No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localhost No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localhost
python.d ERROR: redis_localhost cannot find check() function.
python.d DEBUG: 'str' object has no attribute 'decode'
python.d DEBUG: overridden job names: []
python.d DEBUG: all remaining job objects: []
python.d FATAL: no more jobs
DISABLE
It works with python2 though,
because of this:
'str' object has no attribute 'decode'
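The root cause is the Python 2/3 string split: on Python 3, socket data arrives as bytes, while str has no decode() method. A defensive conversion (a sketch, not necessarily the fix that landed in the branch) looks like:

```python
def to_text(data, encoding="utf-8"):
    """Return native text on both Python 2 and 3.

    Python 3 sockets yield bytes, which must be decoded; on
    Python 2 a plain str is already bytes, and blindly calling
    decode() on a Python 3 str raises the AttributeError seen
    in the log, hence the isinstance guard.
    """
    if isinstance(data, bytes) and not isinstance(data, str):
        return data.decode(encoding)
    return data
```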
Check it now, if you can.
Yes, it works with python3, and it works over IPv6 too.
And squid should also be finished. I have implemented handling of a server disconnect in _receive, and checks for "\r\n0\r\n\r\n" and "Transfer-Encoding: chunked" in _check_raw_data.
All should work.
Created PR #709.
What's going on with those labels when I create PR (just curious)?
ktsaou added the in progress label just now
netdata passes the environment variable NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically.
Currently it is possible to specify sys_dir in the configuration file, apart from the default /sys/devices. Should I prepend NETDATA_HOST_PREFIX to a user-defined sys_dir or not?
What's going on with those labels when I create PR (just curious)?
I don't know either. I didn't add the in progress label by hand. Github did it somehow...
Currently it is possible to specify sys_dir in the configuration file, apart from the default /sys/devices. Should I prepend NETDATA_HOST_PREFIX to a user-defined sys_dir or not?
Let me suggest this:
We will comment it out in the config file and compose it in code.
Then the user can uncomment it in the config file to set his own.
How does that sound?
Just like I just implemented in #709.
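A sketch of that composition (the function and key names are illustrative; see #709 for the real change):

```python
import os

def compose_sys_dir(configuration):
    """Pick the cpufreq sysfs directory (illustrative sketch).

    If the user uncommented sys_dir in the config file, it is used
    verbatim; otherwise the default is built from
    NETDATA_HOST_PREFIX, which netdata exports when running in a
    container where the host's /sys is mounted under a prefix.
    """
    if 'sys_dir' in configuration:
        return configuration['sys_dir']
    return os.getenv('NETDATA_HOST_PREFIX', '') + '/sys/devices'
```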
I marked
netdata passes the environment variable NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically. This variable is used when netdata runs in a container. The system directories /proc, /sys of the host should be exposed with this prefix.
as done, since cpufreq is using this variable.
OK. What would also be nice is to log a date in front of each log line, much like netdata does. The way it is now, it is hard to spot delays...
Already added.
Hi,
I think this is the right place to report what I've been struggling with. I wanted to use the postfix python.d chart, but in my case the SMTP server is running in an LXC container. I set up everything needed with sudo and tried command: 'sudo /usr/bin/lxc-attach -n smtp-lxc -- postqueue -p', which failed miserably. I decided to debug this down to the core. Thinking it was a parsing error, I wrapped the command in a shell script like this:
#!/bin/sh
exec sudo /usr/bin/lxc-attach -n $1 -- postqueue -p
and used command: NDPostfixChart smtp-lxc, which also failed with [Errno 2] No such file or directory.
Further investigation of base.py showed the problem(s):
Why are /sbin and /usr/sbin hardcoded as possible paths for the commands in the ExecutableService.check() method? This isn't documented anywhere, and sudo resides in /usr/bin, which prevents anyone from using it. IMHO, this part of the code is simply useless and confusing:
# test command and search for it in /usr/sbin or /sbin when failed
base = command[0].split('/')[-1]
if self._get_raw_data() is None:
for prefix in ['/sbin/', '/usr/sbin/']:
command[0] = prefix + base
if os.path.isfile(command[0]):
break
Using msg.debug() showed that my wrapper was being looked up at /usr/sbin/NDPostfixChart, where it is NOT, because the for loop above never breaks and leaves WRONG content in the command variable!
In the ExecutableService.check() method, the line command = self.command.split(' ') is very bad practice for splitting a shell command line. It will FAIL on numerous simple cases, like command: 'somecommand -option "this is a single quoted arg"'. The proper way to split a shell command line in Python is:
>>> import shlex
>>> cmd = 'somecommand -option "this is a single quoted arg"'
>>> shlex.split( cmd )
['somecommand', '-option', 'this is a single quoted arg']
Can these issues please get fixed? Thank you.
If it helps, here's the patch I'm applying on my local NetData version. It works perfectly and without side effects so far.
--- a/python.d/python_modules/base.py
+++ b/python.d/python_modules/base.py
@@ -28,6 +28,7 @@
import urllib2
from subprocess import Popen, PIPE
+import shlex
import threading
import msg
@@ -780,19 +781,12 @@
self.command = str(self.configuration['command'])
except (KeyError, TypeError):
self.error("No command specified. Using: '" + self.command + "'")
- command = self.command.split(' ')
+ command = shlex.split(self.command)
for arg in command[1:]:
if any(st in arg for st in self.bad_substrings):
self.error("Bad command argument:" + " ".join(self.command[1:]))
return False
- # test command and search for it in /usr/sbin or /sbin when failed
- base = command[0].split('/')[-1]
- if self._get_raw_data() is None:
- for prefix in ['/sbin/', '/usr/sbin/']:
- command[0] = prefix + base
- if os.path.isfile(command[0]):
- break
self.command = command
if self._get_data() is None or len(self._get_data()) == 0:
self.error("Command", self.command, "returned no data")
@MelkorLord I know this part of the code might be confusing, but it is not checking only in /sbin and /usr/sbin: first it looks for the command in the locations specified by the $PATH environment variable (by executing the command). Looking in /sbin and /usr/sbin was a workaround for some systems (like Fedora) which don't include those in $PATH.
Thanks for the pointer to shlex; I will take a look at this Python module and maybe include your patch in netdata.
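For reference, a $PATH-based lookup of the kind the plugin could rely on instead of rewriting commands behind the user's back (my own sketch, not the code that was merged):

```python
import os

def find_binary(name):
    """Locate an executable via the $PATH that netdata already
    composes for its plugins, instead of forcing /sbin or
    /usr/sbin prefixes.

    Returns the absolute path, or None when nothing matches.
    """
    if os.path.sep in name:
        # an explicit path was given; honor it as-is
        return name if os.access(name, os.X_OK) else None
    for directory in os.environ.get('PATH', '').split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Returning None lets the caller report "command not found" honestly, instead of handing Popen a path like /usr/sbin/NDPostfixChart that was never verified.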
Keep in mind netdata sets these already for all its plugins, and there is a configuration option to allow the user set them via netdata.conf:
https://github.com/firehol/netdata/blob/54ea4803f2b9ea02838b6287ca7846c51cf996b7/src/main.c#L448-L449
@paulfantom This code snippet really is confusing, in fact. As you can see in my initial message, using a shell-script wrapper placed in /usr/bin, or using sudo directly, didn't work, because the for loop finished leaving command[0] set to /usr/sbin/xxx, where neither my shell wrapper nor sudo resides. Leaving this snippet in is a recipe for trouble, both for usability and for debugging. Imagine my confusion when msg.debug(self.command) right before the call to Popen() showed /usr/sbin/xxx...
Moreover, as @ktsaou clearly states, there's no need to search within known paths for the command as NetData already provides a comprehensive $PATH environment.
As for shlex, this is clearly the way to go when manipulating shell command lines, especially when they are fed to the subprocess module.
@MelkorLord what you are saying is true. This confusing piece of code was written before netdata could manipulate the $PATH variable, and it is still here only because I haven't had time to remove it. Could you create a pull request with your patch, so I can test and merge it ASAP?
I'll try to find some time for that. For now, my plate is full (of problems :)
@ktsaou please close.