@paulfantom I am writing here a TODO list for python.d based on my findings.
logs_per_interval = 200 and log_interval = 3600. So, every hour (this_hour = int(now / log_interval)) it should reset the counter and allow up to logs_per_interval log entries until the next hour. This is how netdata does it: https://github.com/firehol/netdata/blob/d7b083430de1d39d0196b82035162b4483c08a3c/src/log.c#L33-L107
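A minimal sketch of that counter logic in Python (class and method names are hypothetical, not the actual python.d code):

```python
import time

class LogLimiter:
    """Hypothetical sketch of the hourly log flood protection described above."""
    LOGS_PER_INTERVAL = 200
    LOG_INTERVAL = 3600  # seconds

    def __init__(self):
        self.this_hour = -1
        self.logged_this_hour = 0

    def can_log(self, now=None):
        now = time.time() if now is None else now
        hour = int(now / self.LOG_INTERVAL)
        if hour != self.this_hour:      # a new interval started: reset the counter
            self.this_hour = hour
            self.logged_this_hour = 0
        self.logged_this_hour += 1
        return self.logged_this_hour <= self.LOGS_PER_INTERVAL
```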
redis and squid)

NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically. This variable is used when netdata runs in a container. The system directories /proc, /sys of the host should be exposed with this prefix.

Connection: keep-alive.
```
python.d ERROR: exim_local exim [Errno 2] No such file or directory
python.d ERROR: exim_local exim [Errno 2] No such file or directory
python.d ERROR: exim: is misbehaving. Reason:'NoneType' object has no attribute '__getitem__'
No unix socket specified. Trying TCP/IP socket.
[Errno 111] Connection refused
[Errno -9] Address family for hostname not supported
name:
python.d ERROR: redis/local: check() function reports failure.
```
```sh
# ./plugins.d/python.d.plugin debug cpufreq 1
INFO: Using python v2
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/root/netdata/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['cpufreq']
python.d DEBUG: cpufreq: loading module configuration: '/etc/netdata/python.d/cpufreq.conf'
python.d DEBUG: cpufreq: reading configuration
python.d DEBUG: cpufreq: job added
python.d INFO: Disabled cpufreq/None
python.d ERROR: cpufreq/None: check() function reports failure.
python.d FATAL: no more jobs
DISABLE
```
python.d.conf to set the PATH to be searched for commands. By default everything in /usr/sbin/ is not found.

Made it in #695
- incremental dimensions in LogService
- hddtemp.chart.py
- more to come...
The last four should be easy to implement, but the first three will need some more work.
added connection keep-alive to urlservice
added issue with cpufreq and exim (the PATH).
added default retries and logging of update() failures
I think this one is solved:
This should state the job name, not the name
When a service reports failure to collect data (during update()), there should be log entry stating the reason of failure.
In SimpleService (which is a parent of all modules) method update() can fail only in two cases:
_get_data(), which is implemented in modules, but I also added self.debug("_get_data() returned no data") as a precaution). Those are basically the same:
1. This message could state where it tried to connect:

   `[Errno 111] Connection refused`

2. This message could state the hostname it tried to resolve:

   `[Errno -9] Address family for hostname not supported`
Those errors come from the exception handler in the _get_raw_data() method of SocketService. The lines

```python
except Exception as e:
    self.error(str(e))
```

are responsible for it.
And it is solved in #696
I believe this:
This should state what is the problem
```sh
# ./plugins.d/python.d.plugin debug cpufreq 1
INFO: Using python v2
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/root/netdata/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['cpufreq']
python.d DEBUG: cpufreq: loading module configuration: '/etc/netdata/python.d/cpufreq.conf'
python.d DEBUG: cpufreq: reading configuration
python.d DEBUG: cpufreq: job added
python.d INFO: Disabled cpufreq/None
python.d ERROR: cpufreq/None: check() function reports failure.
python.d FATAL: no more jobs
DISABLE
```
is also solved in #696
This message should be a debug log
No unix socket specified. Trying TCP/IP socket.
solved in #696
marked the ones you said are included in #696 as done.
added log flood protection.
do you want me to add to this list the redis response parsing to keep the socket alive?
Currently I don't close socket in redis. I introduced variable _keep_alive in SocketService which has default value of True.
But I think you could check this as completed:
This should state the job name, not the name
Also I think I did this today:
When a service reports failure to collect data (during update()), there should be log entry stating the reason of failure.
Currently I don't close socket in redis. I introduced variable _keep_alive in SocketService which has default value of True.
I am confused. How does this work?
look at the end of _get_raw_data in SocketService. It just checks if connection should be closed, and _sock freed or not. Since I cannot reconnect to socket and I can only recreate it, this variable decides if socket should be recreated every time _get_raw_data is executed.
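That decision can be shown as a self-contained sketch (a fake socket stands in for the real one; helper names are illustrative, not the exact python.d code):

```python
class FakeSock:
    """Stand-in for a real socket, just tracking close()."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class SocketServiceSketch:
    """Hypothetical sketch of the _keep_alive logic described above."""
    def __init__(self, keep_alive=True):
        self._keep_alive = keep_alive
        self._sock = None

    def _connect(self):
        return FakeSock()           # a real service would create a socket here

    def _receive(self):
        return "raw data"           # a real service would recv() here

    def _get_raw_data(self):
        if self._sock is None:      # recreate the socket only when needed
            self._sock = self._connect()
        data = self._receive()
        if not self._keep_alive:    # tear down so the next run recreates it
            self._sock.close()
            self._sock = None
        return data
```

With _keep_alive left at its default of True, the same socket object survives across calls; with False it is closed and rebuilt every time.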
Also I am thinking about different approach to SocketService. I will probably implement some basic wrappers like _connect(), _close(), _send(), and _receive() and leave them to be used in _get_data() implemented per module.
Maybe I will leave slightly modified version of _get_raw_data just to be used for quick prototyping.
look at the end of _get_raw_data in SocketService. It just checks if connection should be closed, and _sock freed or not. Since I cannot reconnect to socket and I can only recreate it, this variable decides if socket should be recreated every time _get_raw_data is executed.
ok. let's assume you want to keep the socket open. The key question is: does the socket already include any data from the previous request?
The only way to know, is if you have parsed the response according to its protocol. I don't think there is any other way of doing it right.
So, if this was a squid (http-like), you should parse Content-Length in response headers and match exactly that many bytes. For redis you should parse the first line, and again get exactly that many bytes.
Am I missing something?
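For the redis case, "parse the first line, and again get exactly that many bytes" could be sketched like this (a hypothetical helper; it assumes a bulk reply of the form `$<length>\r\n<data>\r\n`, which is what INFO returns):

```python
def redis_reply_complete(raw):
    """Return True once a redis bulk reply has fully arrived (sketch)."""
    header, sep, body = raw.partition("\r\n")
    if not sep or not header.startswith("$") or not header[1:].isdigit():
        return False                 # length line not (fully) received yet
    expected = int(header[1:])       # e.g. "$2159" -> 2159 payload bytes follow
    return len(body) >= expected + 2 # payload plus the trailing "\r\n"
```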
Also I am thinking about different approach to SocketService. I will probably implement some basic wrappers like _connect(), _close(), _send(), and _receive() and leave them to be used in _get_data() implemented in module.
You can do this, but the callback function seems pretty awesome to me. It simplifies the modules a lot.
The only way to know, is if you have parsed the response according to its protocol
That's why I want to create wrappers and move the data gathering logic to module instead of parent class.
Also I don't want to get rid of the callback function.
And currently I can be pretty sure socket doesn't have any data from previous request. Only way there could be data from previous request is if response is fragmented, and fragments arrive with large enough delays.
And currently I can be pretty sure socket doesn't have any data from previous request. Only way there could be data from previous request is if response is fragmented, and fragments arrive with large enough delays.
It might work like that for localhost. But it will lead to errors if redis, for example, is on a distant location (e.g. cross data-center, VPNs, etc). I wouldn't rely on it.
Also I don't want to get rid of the callback function.
Then how will it work with _connect(), _recv(), etc and the callback?
I think that if a module is willing to use its own socket implementation, it can - using directly python features. There is no need to provide wrappers.
The only reason to provide a wrapper is to simplify it.
Here is why:
Basically it is TCP Window Scaling (and this is a very reliable mechanism).
It won't work if ping between netdata host and service host is greater than 500ms.
And that is the reason I want to move logic to modules.
The only reason to provide a wrapper is to simplify it.
That's exactly the reason.
For example this could be _connect():
```python
def _connect(self):
    if self._sock is None:
        try:
            if self.unix_socket is None:
                self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                self._sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                #sock.setsockopt(socket.SOL_SOCKET, socket.TCP_NODELAY, 1)
                #sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
                #sock.settimeout(self.update_every)
                self._sock.settimeout(0.5)
                self._sock.connect((self.host, self.port))
                self._sock.settimeout(0.5)  # Just to be sure
            else:
                self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
                #self._sock.settimeout(self.update_every)
                self._sock.settimeout(0.05)
                self._sock.connect(self.unix_socket)
                self._sock.settimeout(0.05)  # Just to be sure
        except Exception as e:
            self.error(str(e),
                       "Cannot create socket with following configuration: host:", str(self.host),
                       "port:", str(self.port),
                       "socket:", str(self.unix_socket))
            self._sock = None
            return None
```
Wait a moment. I am preparing a response.
It won't work if ping between netdata host and service host is greater than 500ms.
Yes, that is the problem with this. You cannot rely on timings for socket communication. TCP has a default timeout of 2 days!
I think the way it is now it is very close to what we need:
This does not depend on timing and will allow the module to handle the keep-alive (via its callback).
Also, the callback should have verified the data according to the protocol implemented.
Can you find something wrong in this scenario?
If I remove the timeout completely, create in python.d.plugin can be blocked, thus the whole python.d.plugin will be blocked. And if I set the socket timeout to something greater than or equal to update_every we will have wrong chart timings (as we had previously, when socket timeout was equal to update_every).
The size of a variable in python is limited only by host memory, so I can create a buffer of almost any size.
Receive some data, append them to the data you already have and give the whole of it to the callback
But when should I stop appending and return it to callback? I don't have any mechanism to check if connection is closed or not (only timeout).
SocketService opens the socket and sends the request
Not every time we need to send something. Look at hddtemp.chart.py
If I remove the timeout completely, create in python.d.plugin can be blocked, thus the whole python.d.plugin will be blocked.
I thought you were using non-blocking for check() and threads for update(). How will this block the entire plugin?
And if I set the socket timeout to something greater than or equal to update_every we will have wrong chart timings (as we had previously, when socket timeout was equal to update_every).
No problem with this. netdata will show a gap on the chart as soon as the plugin reports new data.
But when should I stop appending and return it to callback? I don't have any mechanism to check if connection is closed or not (only timeout).
You stop on 3 conditions:
Why do you need any other way?
I thought you were using non-blocking for check() and threads for update(). How will this block the entire plugin?
create is blocking.
No problem with this. netdata will show a gap on the chart as soon as the plugin reports new data.
It isn't about that. Do you remember that case when squid had update_every set to 1, but was running every 2 seconds? It is about this.
callback returns FALSE
Callback cannot return anything, since it wasn't called. I am asking when should I stop gathering data to pass to callback.
And I don't use non-blocking sockets (because they cannot have timeout)
And I don't know when socket was closed by server. I couldn't find how to do it correctly.
I am totally lost... I don't know python and this is probably the problem...
Here is what I know (not python or python.d.plugin specific):
errno is set accordingly. So, to timeout you have to do it yourself. So, I guess this will work for blocking sockets (sorry, I am not a python expert so it may have syntax errors):
```python
all_data = ""
wait = True
while wait:
    data = socket.recv(...)
    if not data:  # this is equivalent to -1 in C
        # socket closed
        break
    all_data += data
    wait = callback(all_data)
return all_data
```
In the squid example, the server will close the connection. In the redis example, the callback will return FALSE.
Summing up:
- check and create of every module are run before spawning new threads
- the socket module doesn't support any type of notification about the server closing the connection (it seems everyone is using recv with a timeout)
- run_time

I tried to do this like you suggested in the example. It should work, but it didn't and I really don't know why.
I would be very happy to get rid of timeouts and move to non-blocking sockets, I just don't know how to do this.
And I spent last two days to get this working in current version.
ok, one last try because I think what you say is not valid.
Give me 10 mins. I will try to adapt the test for squid you gave me yesterday. 10 mins.
If you are trying to adapt code to run in non-blocking mode, don't bother. I know that squid testing snippet can run without timeouts. I tried it. But when I tried to include it into SocketService it failed.
It is very probable I did something wrong.
I will try to clean up SocketService, write wrappers, and educate myself on sockets in python. After this I will start tweaking with data gathering implementation.
I just don't want to break something that is currently running for majority of implementations.
for squid this works:
```python
#!/usr/bin/env python
import socket

HOST = 'localhost'
PORT = 3128
URL = 'GET cache_object://localhost:3128/counters HTTP/1.0\r\n\r\n'

def callback(data):
    return True

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.connect((HOST, PORT))
sock.send(URL.encode())

raw = ""
wait = True
while wait:
    buf = sock.recv(65536)
    if not buf:
        break
    raw += buf
    wait = callback(raw)

print(raw)
```
I bet it will also work for redis with the proper callback that can return False if it received the whole reply.
I am 100% confident you should not rely on timing. Please don't do this.
EDIT: updated the code with just one call to recv().
wrong paste. I fixed it to use just one call to recv()
Ok, I will move it to non-blocking, just as I said in the previous message - not now. Could you add this to the list at the beginning?
probably non-blocking confused you. To use non-blocking you need select.select() or select.poll() or select.epoll(). So you connect one or more sockets and then you select() on them to find which one is ready to send or receive data.
I.e. in non-blocking mode you should only pause on select(), not on recv() or send().
I have to leave. I'll not respond for 1-2 hours.
@paulfantom sorry for insisting on this, but I am sure we should not rely on timing.
An example of non-blocking I/O with timeout:
http://whitepythons.blogspot.gr/2014/05/non-blocking-mode-and-timeouts-sockets.html
The above example is not optimal, but it shows how to use errno and timeouts.
At the end of the article it mentions select which allows the program to wait for one or more non-blocking sockets. This is the right way to do non-blocking I/O.
To effectively use non-blocking, the python core should work like that:
select.select() or select.poll() or select.epoll(), giving it all sockets/files (the optimal for linux is select.epoll(), but only select.select() is portable). This call will freeze the process until any of the sockets/files is ready for read or write (depending on what was requested).

I have seen a lot of python examples on the net that use non-blocking I/O by waiting between recv() calls for 10ms, 20ms, etc and looping 500.000 times until they receive the data. All these examples are garbage. They shouldn't use non-blocking I/O that way.
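Reduced to a runnable sketch, the pattern looks like this (socketpair() stands in for real connections; in the plugin, the list passed to select() would hold every module's socket):

```python
import select
import socket

# Pause only in select(); recv() is called only on sockets reported ready.
a, b = socket.socketpair()
b.sendall(b"hello")

readable, _, _ = select.select([a], [], [], 1.0)  # waits until 'a' is readable
data = a.recv(4096) if readable else b""

a.close()
b.close()
```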
In netdata there is a single threaded web server implementation. Check it here: https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L389-L528
Here is the select statement: https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L429
Here it finds the "job" responsible for the socket:
https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L473-L477
Here it calls the "job" based on what was requested (read or write):
https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_server.c#L491-L509
In netdata, I have in one case made the assumption that a very small request or reply can call send() immediately after opening or accepting a connection. This works, since we know the socket is already ready to receive data from us. However, even this "innocent" assumption should be avoided to properly use non-blocking I/O. Example of this "fault" (for sending the HTTP header back to the web browser) is here: https://github.com/firehol/netdata/blob/9dd69c55d7f4c7f80c779029f83bb9caebabb935/src/web_client.c#L2059
On the other hand, in blocking mode, timing is not needed. Calls to recv() or send() block. If we have one thread per job, there is no need to timeout. So a blocking design should only call recv() and send() while in a job thread.
I have the feeling you want me to do this before any other thing I might do for netdata.
I know it shouldn't use timeout, I also said I will move it to non-blocking, just give me some time. For now if you think this is totally wrong, you can disable squid.chart.py, redis.chart.py and hddtemp.chart.py in installer and configuration files and use bash counterparts for new installations.
Remember that I already did an example of using select with timeout on non-blocking sockets. It is here and is very similar to the first example you mentioned. Probably this will be the base for non-blocking SocketService.
Once again. I cannot do blocking sockets since check and create are not in multiple threads. If you want me to change this design issue, I would propose to revert state of git repository to the point where we have bash plugin as a default and python as experimental, since this will take me some time to accomplish.
Nonetheless thank you for providing me some examples.
I think this:
The service that runs external commands should be more descriptive. Example running exim plugin when exim is not installed:
is resolved in #703
ok. I had the impression you were resisting this change. Since you understand you have to change it, I am good.
I don't mind if you are going to make it blocking or non-blocking. Both are good designs, the threads overhead is not that high, and non-blocking is more efficient but has some complexity. So there is a trade-off here: best efficiency (non-blocking) or simplicity (blocking)? Your call. I am happy with either, as long as they are implemented properly. I advise you though to avoid using both of them, at different states of the plugin. You will get the cons of both and the pros of none.
I don't mind about the time this is done. What I know is that if you move on and add more modules, you will probably multiply the amount of work you will have to do to when you fix it. This is technical debt. If you believe the changes required will only affect SocketService, I think you can delay it. But if a total re-structuring of the plugin might be required, I believe it is best to do it now. But this is again your call.
Finally, there is no going back. netdata works primarily with the python plugin. Period. This is also why I asked you to add pyyaml to it.
@paulfantom I sent you an invitation to join the netdata development team.
I think that current python.d.plugin design is quite good. Yes there is an issue with blocking check or create, but I am resolving it by implementing precautions in framework classes (like SocketService).
Also since I know how much you don't like current design of SocketService (I don't like it either), I will try to change it by tomorrow (maybe even today).
Thank you for the invitation.
@ktsaou I believe I got SocketService working with your guidelines (well, most of them). Basically what I did is in this commit. However I got one problem.
I've implemented callback method _check_raw_data which should return True if all data is received, or false if it isn't. In redis it was simple, just parse first line and voila, but when I wanted to use Content-Length http header in squid.chart.py I found one disturbing issue: squid doesn't set Content-Length. Do you have any proposition, how can I check if squid.chart.py received all data from squid?
Here is squid response header:
```
HTTP/1.1 200 OK
Server: squid/3.5.19
Mime-Version: 1.0
Date: Mon, 18 Jul 2016 21:20:35 GMT
Content-Type: text/plain;charset=utf-8
Expires: Mon, 18 Jul 2016 21:20:35 GMT
Last-Modified: Mon, 18 Jul 2016 21:20:35 GMT
X-Cache: MISS from localhost.localdomain
X-Cache-Lookup: MISS from localhost.localdomain:3128
Via: 1.1 localhost.localdomain (squid/3.5.19)
Connection: close
```
Also I think I have implemented IPv6 in _connect method. I haven't tested this, but it is taken from python documentation so it should work (it works for IPv4).
Currently I don't think SocketService from my repo is ready for PR, since it needs some polishing, but do you think I am going in the right direction?
Short description of what is going on in the code (example of executing _get_raw_data):
None if sending failed (also disconnect from socket)

_check_raw_data)

_keep_alive is set to False

No. 5 detailed:
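The IPv6-capable _connect mentioned above typically follows the getaddrinfo() loop shown in the Python socket documentation; a hypothetical sketch (the function name is illustrative):

```python
import socket

def connect_any(host, port, timeout=0.5):
    """Try every address family getaddrinfo() returns (IPv4 and IPv6)."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock                    # first address that works wins
        except OSError as e:
            last_err = e
            sock.close()
    raise last_err or OSError("getaddrinfo returned no usable address")
```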
I've implemented callback method _check_raw_data which should return True if all data is received, or false if it isn't. In redis it was simple, just parse first line and voila, but when I wanted to use Content-Length http header in squid.chart.py I found one disturbing issue: squid doesn't set Content-Length. Do you have any proposition, how can I check if squid.chart.py received all data from squid?
Let me check:
```
# curl -v --raw "http://localhost:3128/squid-internal-mgr/counters"
* Trying ::1...
* Connected to localhost (::1) port 3128 (#0)
> GET /squid-internal-mgr/counters HTTP/1.1
> Host: localhost:3128
> User-Agent: curl/7.49.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: squid/3.5.20
< Mime-Version: 1.0
< Date: Mon, 18 Jul 2016 22:41:14 GMT
< Content-Type: text/plain;charset=utf-8
< Expires: Mon, 18 Jul 2016 22:41:14 GMT
< Last-Modified: Mon, 18 Jul 2016 22:41:14 GMT
< X-Cache: MISS from boxe
< X-Cache-Lookup: MISS from boxe:3128
< Transfer-Encoding: chunked
< Connection: keep-alive
<
528
sample_time = 1468881627.947637 (Mon, 18 Jul 2016 22:40:27 GMT)
client_http.requests = 282328
client_http.hits = 4138
client_http.errors = 89
client_http.kbytes_in = 116414
client_http.kbytes_out = 16647550
client_http.hit_kbytes_out = 27102
server.all.requests = 109631
server.all.errors = 0
server.all.kbytes_in = 16347655
server.all.kbytes_out = 106222
server.http.requests = 94305
server.http.errors = 0
server.http.kbytes_in = 5432451
server.http.kbytes_out = 31545
server.ftp.requests = 16
server.ftp.errors = 0
server.ftp.kbytes_in = 11192
server.ftp.kbytes_out = 1
server.other.requests = 15310
server.other.errors = 0
server.other.kbytes_in = 10904011
server.other.kbytes_out = 74675
icp.pkts_sent = 0
icp.pkts_recv = 0
icp.queries_sent = 0
icp.replies_sent = 0
icp.queries_recv = 0
icp.replies_recv = 0
icp.query_timeouts = 0
icp.replies_queued = 0
icp.kbytes_sent = 0
icp.kbytes_recv = 0
icp.q_kbytes_sent = 0
icp.r_kbytes_sent = 0
icp.q_kbytes_recv = 0
icp.r_kbytes_recv = 0
icp.times_used = 0
cd.times_used = 0
cd.msgs_sent = 0
cd.msgs_recv = 0
cd.memory = 0
cd.local_memory = 349
cd.kbytes_sent = 0
cd.kbytes_recv = 0
unlink.requests = 3508
page_faults = 20
select_loops = 33936090
cpu_time = 1311.020000
wall_time = 46.297577
swap.outs = 5836
swap.ins = 746
swap.files_cleaned = 82
aborted_requests = 525
0
* Connection #0 to host localhost left intact
```
web servers may respond with either Content-Length: BYTES or Transfer-Encoding: chunked. In the chunked case, the first line after \r\n\r\n has the number of bytes following (in hex, check in netdata here: https://github.com/firehol/netdata/blob/8027db0a458960e75285732986d62f15795e8cbc/src/web_client.c#L2121). A zero, means EOF (check in netdata here: https://github.com/firehol/netdata/blob/8027db0a458960e75285732986d62f15795e8cbc/src/web_client.c#L2167).
To avoid implementing all the HTTP protocol, you could use HTTP/1.0 where squid does not seem to support keep-alive, does not send Content-Length and closes the connection.
```
# curl -v -0 --raw -H "Connection: keep-alive" "http://localhost:3128/squid-internal-mgr/counters"
* Trying ::1...
* Connected to localhost (::1) port 3128 (#0)
> GET /squid-internal-mgr/counters HTTP/1.0
> Host: localhost:3128
> User-Agent: curl/7.49.1
> Accept: */*
> Connection: keep-alive
>
< HTTP/1.1 200 OK
< Server: squid/3.5.20
< Mime-Version: 1.0
< Date: Mon, 18 Jul 2016 22:50:31 GMT
< Content-Type: text/plain;charset=utf-8
< Expires: Mon, 18 Jul 2016 22:50:31 GMT
< Last-Modified: Mon, 18 Jul 2016 22:50:31 GMT
< X-Cache: MISS from boxe
< X-Cache-Lookup: MISS from boxe:3128
< Connection: close
<
sample_time = 1468882227.957787 (Mon, 18 Jul 2016 22:50:27 GMT)
client_http.requests = 283186
client_http.hits = 4139
client_http.errors = 89
client_http.kbytes_in = 116668
client_http.kbytes_out = 16649723
client_http.hit_kbytes_out = 27103
server.all.requests = 109930
server.all.errors = 0
server.all.kbytes_in = 16348915
server.all.kbytes_out = 106446
server.http.requests = 94553
server.http.errors = 0
server.http.kbytes_in = 5432833
server.http.kbytes_out = 31612
server.ftp.requests = 16
server.ftp.errors = 0
server.ftp.kbytes_in = 11192
server.ftp.kbytes_out = 1
server.other.requests = 15361
server.other.errors = 0
server.other.kbytes_in = 10904889
server.other.kbytes_out = 74832
icp.pkts_sent = 0
icp.pkts_recv = 0
icp.queries_sent = 0
icp.replies_sent = 0
icp.queries_recv = 0
icp.replies_recv = 0
icp.query_timeouts = 0
icp.replies_queued = 0
icp.kbytes_sent = 0
icp.kbytes_recv = 0
icp.q_kbytes_sent = 0
icp.r_kbytes_sent = 0
icp.q_kbytes_recv = 0
icp.r_kbytes_recv = 0
icp.times_used = 0
cd.times_used = 0
cd.msgs_sent = 0
cd.msgs_recv = 0
cd.memory = 0
cd.local_memory = 349
cd.kbytes_sent = 0
cd.kbytes_recv = 0
unlink.requests = 3508
page_faults = 20
select_loops = 33948560
cpu_time = 1312.190000
wall_time = 3.918642
swap.outs = 5844
swap.ins = 748
swap.files_cleaned = 82
aborted_requests = 525
* Closing connection 0
```
As you can see, we requested keep-alive, but the server responded Connection: close.
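The two termination rules described above (Content-Length vs Transfer-Encoding: chunked, with connection close as the HTTP/1.0 fallback) can be sketched as a completeness check for the callback (a hypothetical helper; real chunked parsing would also handle trailers, this only applies the rules just described):

```python
def http_response_complete(raw):
    """Return True once an HTTP response has fully arrived (sketch)."""
    head, sep, body = raw.partition("\r\n\r\n")
    if not sep:
        return False                          # headers not complete yet
    headers = {}
    for line in head.split("\r\n")[1:]:       # skip the status line
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()
    if headers.get("transfer-encoding", "").lower() == "chunked":
        return body.endswith("0\r\n\r\n")     # zero-length chunk ends the reply
    if "content-length" in headers:
        return len(body) >= int(headers["content-length"])
    return False                              # HTTP/1.0: wait for server close
```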
Regarding the implementation, all seem good, except the first bullet of point 5.
Can you point to the code where you do this check?
I verified squid behaves the same with the cache_object://... URLs
file base.py, method _receive, lines:

```python
try:
    ready_to_read, _, in_error = select.select([self._sock], [], [], 0.01)
except Exception as e:
    self.debug("SELECT", str(e))
    self._disconnect()
    break
if len(ready_to_read) > 0:
    buf = self._sock.recv(4096)
```

I am using select to check if the socket is readable, then I check if the array ready_to_read has something in it.
Also thanks for the information about HTTP.
I have already run some tests and keeping the socket alive (HTTP/1.1) cuts down run time by ~9% (from 11ms to 10ms). But I don't know if it will be the case after changing _check_raw_data from just returning True to something else.
I know you would prefer to change this line:
```python
ready_to_read, _, in_error = select.select([self._sock], [], [], 0.01)
```
to something like:
```python
ready_to_read, _, in_error = select.select([self._sock1, self._sock2, self._sock3, self._sock4], [], [], 0.01)
```
and then check which socket is available, but I don't know how to implement it in current version.
I was wrong. Using HTTP 1.0 and closing connections won't time out select.select and data gathering will be run in 1ms. However if I use HTTP 1.1, select.select will wait and time out (thus running in minimal time of 10ms).
I will be still investigating.
@paulfantom what would happen if you just self._sock.recv(4096) directly without the select? Do you need this to prevent it from waiting forever or is there any other reason?
Do you need this to prevent it from waiting forever
Exactly this.
Exactly this.
ok. What will happen if this waits forever? Is it about check()/create()? Because I guess in update() it is ok to wait forever. Right?
It is about check/create. You are right.
ok, why don't you then call the select() statement only if this code is called in check()/create()? Can you know when it is called?
I can set some variable with initial value of 2 and decrease it.
But I will try to find some other way.
I think this is good. If you just do it in create/update only, then:
Keep in mind in both cases this has to be in a loop. Do not assume the response is just one packet.
Ok, will work on it.
Hmmm this way I can set up socket in blocking mode, and just change timeout after create to something like 5 minutes.
Hmmm this way I can set up socket in blocking mode, and just change timeout after create to something like 5 minutes.
initialization or update() ?
When I said to bypass select I meant to use blocking mode. Otherwise it will not work. non-blocking without select is a no-go (you will again end-up playing with timings).
What is not clear to me is if check/create and update both use the same blocking mode or not. Are you using non-blocking and blocking?
Hmmm this way I can set up socket in blocking mode, and just change timeout after create to something like 5 minutes.
I think it will work. As I see in python docs the settimeout() method can be used to set a timeout on blocking sockets.
So, you could just set it blocking everywhere and settimeout() something like 30secs or 1min to be used for both check/create and update.
Remember to collect data in a loop. Either the server will close the connection (no need to timeout) or the callback will let you know it is done.
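The suggestion above is just a blocking socket with a deadline; for example (60 seconds is an arbitrary example value):

```python
import socket

# A blocking socket with settimeout(): connect()/recv()/send() still block,
# but now raise socket.timeout if nothing happens within 60 seconds.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(60)
timeout_value = sock.gettimeout()
sock.close()
```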
Ok, I got it working. With select, with non-blocking socket, with HTTP/1.0 and HTTP/1.1 and with 60 second timeout on select.
I got one thing wrong. Placement of callback method _check_raw_data. It was in _get_raw_data but should be in _receive. When I changed it, squid and redis have run times under 1ms.
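With the callback inside the receive loop, reading can stop as soon as the reply is complete instead of waiting out a select() timeout; a sketch of that shape (names illustrative; socketpair() stands in for a real connection):

```python
import select
import socket

def receive_all(sock, check_raw_data, timeout=60):
    """Read chunks until the callback reports a complete reply or the
    server closes the connection (sketch, not the actual python.d code)."""
    raw = ""
    while True:
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            raise socket.timeout("no data within timeout")
        buf = sock.recv(4096)
        if not buf:                 # empty read: server closed the connection
            break
        raw += buf.decode()
        if check_raw_data(raw):     # callback decided the reply is complete
            break
    return raw
```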
nice!
I have committed those changes to my repository. It still needs some testing, but as of now it works pretty well.
The main problem is with the _check_raw_data methods. In redis it was easy, but squid is not that friendly, so for now it just checks if the needed squid data is there. Tomorrow I will be looking for some better way.
Also I saw you have redis on IPv6, could you test if my current version can connect on IPv6?
Also, do you think that if I use HTTP/1.1 to communicate with squid, it will be enough to check if the last 7 bytes are equal to "\r\n0\r\n\r\n"?
For some reason, redis has stopped working for me, even on the merged version.
Redis is here:
```
# telnet localhost 6379
Trying ::1...
Connected to localhost.
Escape character is '^]'.
INFO
$2159
# Server
redis_version:3.2.1
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:e19bb8c3d1c28291
redis_mode:standalone
os:Linux 4.4.6-gentoo x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:5.3.0
process_id:1
run_id:20358fe7a1243ebc0d4adb7d8835d0519d884093
tcp_port:6379
uptime_in_seconds:524905
uptime_in_days:6
hz:10
lru_clock:9269583
executable:/data/redis-server
config_file:

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:821736
used_memory_human:802.48K
used_memory_rss:1056768
used_memory_rss_human:1.01M
used_memory_peak:965584
used_memory_peak_human:942.95K
total_system_memory:8266366976
total_system_memory_human:7.70G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:1.29
mem_allocator:jemalloc-4.0.3

# Persistence
loading:0
rdb_changes_since_last_save:14
rdb_bgsave_in_progress:0
rdb_last_save_time:1468362470
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:497205
total_commands_processed:497470
instantaneous_ops_per_sec:0
total_net_input_bytes:2986385
total_net_output_bytes:1073491329
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:1
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:527.06
used_cpu_user:237.62
used_cpu_sys_children:527.06
used_cpu_user_children:237.62

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1,expires=0,avg_ttl=0
^]
telnet> c
Connection closed.
```
Your repo:
# ./plugins.d/python.d.plugin debug redis 1
INFO: Using python v3
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/data/src/netdata-paulfantom.git/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['redis']
python.d DEBUG: redis: loading module configuration: '/etc/netdata/python.d/redis.conf'
python.d DEBUG: redis: reading configuration
python.d DEBUG: redis/socket2: job added
python.d DEBUG: redis/socket1: job added
python.d DEBUG: redis/localhost: job added
python.d DEBUG: redis/socket3: job added
python.d DEBUG: redis/localipv4: job added
python.d DEBUG: redis/localipv6: job added
python.d DEBUG: all job objects [<Service(socket2, initial daemon)>, <Service(socket1, initial daemon)>, <Service(localhost, initial daemon)>, <Service(socket3, initial daemon)>, <Service(localipv4, initial daemon)>, <Service(localipv6, initial daemon)>]
python.d DEBUG: redis_socket2 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket2 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/run/redis/redis.sock
python.d ERROR: redis_socket2 no data received
python.d ERROR: redis_socket2 check function failed.
python.d INFO: Disabled redis/socket2
python.d DEBUG: redis_socket1 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket1 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /tmp/redis.sock
python.d ERROR: redis_socket1 no data received
python.d ERROR: redis_socket1 check function failed.
python.d INFO: Disabled redis/socket1
python.d DEBUG: redis_localhost No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localhost No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localhost
python.d ERROR: redis_localhost cannot find check() function.
python.d DEBUG: redis_socket3 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket3 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/lib/redis/redis.sock
python.d ERROR: redis_socket3 no data received
python.d ERROR: redis_socket3 check function failed.
python.d INFO: Disabled redis/socket3
python.d DEBUG: redis_localipv4 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv4 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv4
python.d ERROR: redis_localipv4 cannot find check() function.
python.d DEBUG: redis_localipv6 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv6 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv6
python.d ERROR: redis_localipv6 cannot find check() function.
python.d DEBUG: overridden job names: []
python.d DEBUG: all remaining job objects: []
python.d FATAL: no more jobs
DISABLE
merged repo:
# ./plugins.d/python.d.plugin debug redis 1
INFO: Using python v3
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/data/src/netdata.git/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['redis']
python.d DEBUG: redis: loading module configuration: '/etc/netdata/python.d/redis.conf'
python.d DEBUG: redis: reading configuration
python.d DEBUG: redis/localipv4: job added
python.d DEBUG: redis/socket2: job added
python.d DEBUG: redis/socket1: job added
python.d DEBUG: redis/localipv6: job added
python.d DEBUG: redis/socket3: job added
python.d DEBUG: redis/localhost: job added
python.d DEBUG: all job objects [<Service(localipv4, initial daemon)>, <Service(socket2, initial daemon)>, <Service(socket1, initial daemon)>, <Service(localipv6, initial daemon)>, <Service(socket3, initial daemon)>, <Service(localhost, initial daemon)>]
python.d DEBUG: redis_localipv4 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv4 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv4
python.d ERROR: redis_localipv4_localipv4 cannot find check() function.
python.d DEBUG: redis_socket2 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket2_socket2 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/run/redis/redis.sock
python.d ERROR: redis_socket2_socket2 no data received
python.d ERROR: redis_socket2_socket2 check function failed.
python.d INFO: Disabled redis/socket2
python.d DEBUG: redis_socket1 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket1_socket1 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /tmp/redis.sock
python.d ERROR: redis_socket1_socket1 no data received
python.d ERROR: redis_socket1_socket1 check function failed.
python.d INFO: Disabled redis/socket1
python.d DEBUG: redis_localipv6 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv6 No request specified. Using: 'INFO
'
python.d ERROR: redis_localipv6_localipv6 [Errno -9] Address family for hostname not supported Cannot create socket with following configuration: host: ::1 port: 6379 socket: None
python.d ERROR: redis_localipv6_localipv6 no data received
python.d ERROR: redis_localipv6_localipv6 check function failed.
python.d INFO: Disabled redis/localipv6
python.d DEBUG: redis_socket3 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket3_socket3 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/lib/redis/redis.sock
python.d ERROR: redis_socket3_socket3 no data received
python.d ERROR: redis_socket3_socket3 check function failed.
python.d INFO: Disabled redis/socket3
python.d DEBUG: redis_localhost No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localhost No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localhost
python.d ERROR: redis_localhost_localhost cannot find check() function.
python.d DEBUG: overridden job names: []
python.d DEBUG: all remaining job objects: []
python.d FATAL: no more jobs
DISABLE
Any ideas?
Also, do you think that if I use 'HTTP/1.1' to communicate with squid, it will be enough to check whether the last 7 bytes are equal to "\r\n0\r\n\r\n"?
Nice idea! Yes. I suggest, though, to also check for Transfer-Encoding: chunked; otherwise, for non-chunked responses, this check will never match anything.
Keep in mind you should also handle a connection close from the server.
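As a sketch of that end-of-response test (the function name and raw-data handling here are my own, not the squid module's actual code):

```python
def is_chunked_reply_complete(raw):
    """Heuristic end-of-response test for a raw HTTP/1.1 reply.

    Only meaningful when the server actually used chunked transfer
    encoding, so the header is checked first; a chunked body
    terminates with the seven-byte sequence CRLF '0' CRLF CRLF.
    """
    header = raw.split(b"\r\n\r\n", 1)[0]
    if b"Transfer-Encoding: chunked" not in header:
        return False
    return raw[-7:] == b"\r\n0\r\n\r\n"
```

Note that this only decides "the reply is complete"; a server closing the connection mid-body still has to be handled separately in the receive loop.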
I think this message is the key:
python.d ERROR: redis_localhost_localhost cannot find check() function.
It is printed only when an AttributeError is thrown inside python.d.plugin, here:
if not job.check():
msg.error(job.chart_name, "check function failed.")
self._stop(job)
else:
msg.debug(job.chart_name, "check succeeded")
i += 1
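A minimal demonstration of why that message can be misleading: any AttributeError raised inside a check() that does exist takes the same path (the Job class here is hypothetical, for illustration only):

```python
class Job:
    def check(self):
        # a Python 3 bug inside check(): str has no decode(),
        # so this raises AttributeError even though check() exists
        return "INFO\r\n".decode("utf-8")

try:
    Job().check()
except AttributeError as err:
    # an "except AttributeError" clause at the call site would
    # misreport this as a missing check() function
    print("cannot find check() function:", err)
```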
I pushed a possible fix and more debug statements to my branch. Could you check?
Didn't work:
# ./plugins.d/python.d.plugin debug redis 1
INFO: Using python v3
python.d INFO: reading configuration file: /etc/netdata/python.d.conf
python.d INFO: MODULES_DIR='/data/src/netdata-paulfantom.git/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['redis']
python.d DEBUG: redis: loading module configuration: '/etc/netdata/python.d/redis.conf'
python.d DEBUG: redis: reading configuration
python.d DEBUG: redis/localipv4: job added
python.d DEBUG: redis/socket3: job added
python.d DEBUG: redis/socket2: job added
python.d DEBUG: redis/localipv6: job added
python.d DEBUG: redis/socket1: job added
python.d DEBUG: redis/localhost: job added
python.d DEBUG: all job objects [<Service(localipv4, initial daemon)>, <Service(socket3, initial daemon)>, <Service(socket2, initial daemon)>, <Service(localipv6, initial daemon)>, <Service(socket1, initial daemon)>, <Service(localhost, initial daemon)>]
python.d DEBUG: redis_localipv4 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv4 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv4
python.d ERROR: redis_localipv4 cannot find check() function.
python.d DEBUG: 'str' object has no attribute 'decode'
python.d DEBUG: redis_socket3 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket3 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/lib/redis/redis.sock
python.d ERROR: redis_socket3 no data received
python.d ERROR: redis_socket3 check function failed.
python.d INFO: Disabled redis/socket3
python.d DEBUG: redis_socket2 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket2 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /var/run/redis/redis.sock
python.d ERROR: redis_socket2 no data received
python.d ERROR: redis_socket2 check function failed.
python.d INFO: Disabled redis/socket2
python.d DEBUG: redis_localipv6 No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localipv6 No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localipv6
python.d ERROR: redis_localipv6 cannot find check() function.
python.d DEBUG: 'str' object has no attribute 'decode'
python.d DEBUG: redis_socket1 No request specified. Using: 'INFO
'
python.d ERROR: redis_socket1 [Errno 2] No such file or directory Cannot create socket with following configuration: host: localhost port: 6379 socket: /tmp/redis.sock
python.d ERROR: redis_socket1 no data received
python.d ERROR: redis_socket1 check function failed.
python.d INFO: Disabled redis/socket1
python.d DEBUG: redis_localhost No unix socket specified. Trying TCP/IP socket.
python.d DEBUG: redis_localhost No request specified. Using: 'INFO
'
python.d INFO: Disabled redis/localhost
python.d ERROR: redis_localhost cannot find check() function.
python.d DEBUG: 'str' object has no attribute 'decode'
python.d DEBUG: overridden job names: []
python.d DEBUG: all remaining job objects: []
python.d FATAL: no more jobs
DISABLE
It works with python2 though,
because of this:
'str' object has no attribute 'decode'
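The root cause is the Python 2/3 string split: on Python 3, socket data arrives as bytes, while str has no decode() method. A defensive conversion (a sketch, not necessarily the fix that landed in the branch) looks like:

```python
def to_text(data, encoding="utf-8"):
    """Return native text on both Python 2 and 3.

    Python 3 sockets yield bytes, which must be decoded; on
    Python 2 a plain str is already bytes, and blindly calling
    decode() on a Python 3 str raises the AttributeError seen
    in the log, hence the isinstance guard.
    """
    if isinstance(data, bytes) and not isinstance(data, str):
        return data.decode(encoding)
    return data
```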
Check it now, if you can.
Yes, it works with python3, and it works over IPv6 too.
And squid should also be finished. I have implemented handling of a server disconnect in _receive, and checks for "\r\n0\r\n\r\n" and "Transfer-Encoding: chunked" in _check_raw_data.
All should work.
Created PR #709.
What's going on with those labels when I create PR (just curious)?
ktsaou added the in progress label just now
netdata passes the environment variable NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically.
Currently it is possible to specify sys_dir in the configuration file, apart from the default /sys/devices. Should I prepend NETDATA_HOST_PREFIX to a user-defined sys_dir or not?
What's going on with those labels when I create PR (just curious)?
I don't know either. I didn't add the in progress label by hand. Github did it somehow...
Currently it is possible to specify sys_dir in the configuration file, apart from the default /sys/devices. Should I prepend NETDATA_HOST_PREFIX to a user-defined sys_dir or not?
Let me suggest this:
We will comment it out in the config file and compose it in code.
Then the user can uncomment it in the config file to set his own.
How does that sound?
Just like I just implemented in #709.
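A sketch of that composition (the function and key names are illustrative; see #709 for the real change):

```python
import os

def compose_sys_dir(configuration):
    """Pick the cpufreq sysfs directory (illustrative sketch).

    If the user uncommented sys_dir in the config file, it is used
    verbatim; otherwise the default is built from
    NETDATA_HOST_PREFIX, which netdata exports when running in a
    container where the host's /sys is mounted under a prefix.
    """
    if 'sys_dir' in configuration:
        return configuration['sys_dir']
    return os.getenv('NETDATA_HOST_PREFIX', '') + '/sys/devices'
```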
I marked
netdata passes the environment variable NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically. This variable is used when netdata runs in a container. The system directories /proc, /sys of the host should be exposed with this prefix.
as done, since cpufreq is using this variable.
OK. What would also be nice is to log a date in front of each log line, much like netdata does. The way it is now, it is hard to spot delays...
Already added.
Hi,
I think this is the right place to report what I've been struggling with. I wanted to use the postfix python.d chart, but in my case the SMTP server is running in an LXC container. I set up everything needed with sudo and tried command: 'sudo /usr/bin/lxc-attach -n smtp-lxc -- postqueue -p', which failed miserably. I decided to debug this down to the core. Thinking it was a parsing error, I wrapped the command in a shell script like this:
#!/bin/sh
exec sudo /usr/bin/lxc-attach -n $1 -- postqueue -p
and used command: NDPostfixChart smtp-lxc, which also failed with [Errno 2] No such file or directory.
Further investigation of base.py showed the problem(s):
Why are /sbin and /usr/sbin hardcoded as possible paths for the commands in the ExecutableService.check() method? This isn't documented anywhere, and sudo resides in /usr/bin, which prevents anyone from using it. IMHO, this part of the code is simply useless and confusing:
# test command and search for it in /usr/sbin or /sbin when failed
base = command[0].split('/')[-1]
if self._get_raw_data() is None:
for prefix in ['/sbin/', '/usr/sbin/']:
command[0] = prefix + base
if os.path.isfile(command[0]):
break
Using msg.debug() showed that my wrapper was being looked up at /usr/sbin/NDPostfixChart, where it is NOT, because the for loop above never breaks and leaves WRONG content in the command variable!
In the ExecutableService.check() method, the line command = self.command.split(' ') is very bad practice for splitting a shell command line. It will FAIL on numerous simple cases, like command: 'somecommand -option "this is a single quoted arg"'. The proper way to split a shell command line in Python is:
>>> import shlex
>>> cmd = 'somecommand -option "this is a single quoted arg"'
>>> shlex.split( cmd )
['somecommand', '-option', 'this is a single quoted arg']
Can these issues please get fixed? Thank you.
If it helps, here's the patch I'm applying on my local NetData version. It works perfectly and without side effects so far.
--- a/python.d/python_modules/base.py
+++ b/python.d/python_modules/base.py
@@ -28,6 +28,7 @@
import urllib2
from subprocess import Popen, PIPE
+import shlex
import threading
import msg
@@ -780,19 +781,12 @@
self.command = str(self.configuration['command'])
except (KeyError, TypeError):
self.error("No command specified. Using: '" + self.command + "'")
- command = self.command.split(' ')
+ command = shlex.split(self.command)
for arg in command[1:]:
if any(st in arg for st in self.bad_substrings):
self.error("Bad command argument:" + " ".join(self.command[1:]))
return False
- # test command and search for it in /usr/sbin or /sbin when failed
- base = command[0].split('/')[-1]
- if self._get_raw_data() is None:
- for prefix in ['/sbin/', '/usr/sbin/']:
- command[0] = prefix + base
- if os.path.isfile(command[0]):
- break
self.command = command
if self._get_data() is None or len(self._get_data()) == 0:
self.error("Command", self.command, "returned no data")
@MelkorLord I know this part of the code might be confusing, but it is not checking only in /sbin and /usr/sbin: first it looks for the command in the locations specified by the $PATH environment variable (by executing the command). Looking in /sbin and /usr/sbin was a workaround for some systems (like Fedora) which don't include those in $PATH.
Thanks for the pointer to shlex; I will take a look at this Python module and maybe include your patch in netdata.
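For reference, a $PATH-based lookup of the kind the plugin could rely on instead of rewriting commands behind the user's back (my own sketch, not the code that was merged):

```python
import os

def find_binary(name):
    """Locate an executable via the $PATH that netdata already
    composes for its plugins, instead of forcing /sbin or
    /usr/sbin prefixes.

    Returns the absolute path, or None when nothing matches.
    """
    if os.path.sep in name:
        # an explicit path was given; honor it as-is
        return name if os.access(name, os.X_OK) else None
    for directory in os.environ.get('PATH', '').split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Returning None lets the caller report "command not found" honestly, instead of handing Popen a path like /usr/sbin/NDPostfixChart that was never verified.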
Keep in mind netdata sets these already for all its plugins, and there is a configuration option to allow the user set them via netdata.conf:
https://github.com/firehol/netdata/blob/54ea4803f2b9ea02838b6287ca7846c51cf996b7/src/main.c#L448-L449
@paulfantom This code snippet really is confusing, in fact. As you can see in my initial message, using a shell-script wrapper placed in /usr/bin, or using sudo directly, didn't work, because the for loop finished leaving command[0] set to /usr/sbin/xxx, where neither my shell wrapper nor sudo resides. Leaving this snippet in is a recipe for trouble, both for usability and for debugging. Imagine my confusion when msg.debug(self.command) right before the call to Popen() showed /usr/sbin/xxx...
Moreover, as @ktsaou clearly states, there's no need to search within known paths for the command as NetData already provides a comprehensive $PATH environment.
As for shlex, this is clearly the way to go when manipulating shell command lines, especially when they are fed to the subprocess module.
@MelkorLord what you are saying is true. This confusing piece of code was written before netdata could manipulate the $PATH variable, and it is still here only because I haven't had time to remove it. Could you create a pull request with your patch, so I can test and merge it ASAP?
I'll try to find some time for that. For now, my plate is full (of problems :)
@ktsaou please close.