Large PDF files, and other file types, are generally requested in byte ranges, which generates many separate requests and 206 Partial Content responses. These requests cause the URL or host address to float to the top of any list when in reality the URL may not be requested that often.
It would be wonderful if GoAccess could detect and aggregate these requests to prevent them from skewing the results or making the IP address appear to be making excessive requests.
That's a good point, and I'm thinking it could be implemented as an option. However, how would you identify these requests and aggregate them? Should it be a combination of IP + file + date?
The requests follow this pattern in my logs. The initial request has a 200 result and subsequent requests have a 206 (partial content) result and the referrer is changed to the full URL of the file (not the referrer of the original request).
192.168.1.1 - - [04/Nov/2014:11:25:18 -0500] "GET /files/some.pdf HTTP/1.1" 200 589361 "http://www.example.com/referrer.html" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
192.168.1.1 - - [04/Nov/2014:11:25:18 -0500] "GET /files/some.pdf HTTP/1.1" 206 32768 "http://www.example.com/files/some.pdf" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
192.168.1.1 - - [04/Nov/2014:11:25:18 -0500] "GET /files/some.pdf HTTP/1.1" 206 309260 "http://www.example.com/files/some.pdf" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
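To make the pattern above concrete, here is a minimal Python sketch that parses combined-format lines like these and flags the 206 follow-ups. The regular expression and the names `parse_line` and `is_range_followup` are illustrative assumptions, not anything GoAccess provides, and the referrer check is only the heuristic described in this comment; other clients may behave differently.

```python
import re

# Assumed pattern for the combined log format used in the sample lines above.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    return rec

def is_range_followup(rec):
    """Heuristic: a 206 whose referrer ends with the requested URI itself."""
    return rec["status"] == 206 and rec["referrer"].endswith(rec["uri"])
```

With the three sample lines, only the second and third would be flagged as follow-ups.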
I suggest that these could be aggregated by IP + URI + UA, but even then it is difficult to disambiguate multiple requests from behind the same IP address (corporate or government) on high-traffic sites. The timestamp isn't needed if you can detect a new request by its 200 result, and it doesn't help disambiguate requests either, in my opinion.
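As a rough illustration of the IP + URI + UA idea, the sketch below counts logical downloads per URI. It assumes the records are already parsed into `(ip, uri, ua, status)` tuples in log order; the function name `count_downloads` is hypothetical, and this is not how GoAccess works internally. It has the limitation noted above: two people behind the same IP with the same user agent are indistinguishable.

```python
from collections import Counter

def count_downloads(records):
    """Count logical downloads per URI.

    records: iterable of (ip, uri, ua, status) tuples in log order.
    A 200 always starts a new download. A 206 is folded into an earlier
    request with the same (ip, uri, ua) key, or counted as a new
    download if that key has not been seen before.
    """
    seen = set()
    downloads = Counter()
    for ip, uri, ua, status in records:
        key = (ip, uri, ua)
        if status == 200 or (status == 206 and key not in seen):
            downloads[uri] += 1
        if status in (200, 206):
            seen.add(key)
    return downloads
```

Applied to the three sample lines above, this would report a single download of /files/some.pdf instead of three hits. A client that only ever sends range requests is counted once per key, which may undercount repeat downloads.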
You are right, it could be problematic if those requests come from the same IP. I guess this could be mitigated by logging %{Cookie}i data or something that can identify one client from the other. I can certainly look into this as it sounds like an interesting feature to implement. Please keep this open.
I'm using GoAccess to analyse log files for our podcast static assets. Currently I am using ignore-status 206. This reduces the download count for static files by about 20%. I'd love to see GoAccess add support for aggregating 206 requests so we can get a better idea of listenership.
I just wrote a script to do the merging. Your mileage may vary. https://github.com/pretaweb/merge206
@djay Thanks for sharing it!
Any updates on aggregating 206 responses?
@Hexhu Thanks for the reminder. I'll bump this up as I think it should be up on the to-do list.