In continuing onward with the tcb performance tarpitting as described in: #886
To handle some monster logs, I have devised the following procedure to keep the parsing speeds over 1KB/s late in the month. The average is usually around 2KB/s to 3KB/s until around the 25th day of the month.
The directory structure is simple:
logs
logs/.goa (long term storage for tcb files)
logs/.goaProcess (ephemeral shadow tmpfs for fast goaccess processing)
If I don't use the .goaProcess tmpfs, goaccess parsing drops down to ~100KB/s to ~900KB/s range with many stalls.
Anyhow, I made an error when creating the .goaProcess tmpfs: instead of du -bs .goa + 3GB, it was created at +1GB. Processing continued without any reported problems; however, I later found that the .goaProcess directory was out of disk space. This at least answered the question as to why the tcbmgr optimize (-df) was failing with invalid metadata errors.
So my question is, is there a way for the tokyo cabinet library to alert goaccess that where it is writing the tcb files to has run out of disk space and thereby the parsing operation can be halted? If it isn't halted, then the postrun operations can rsync back the corrupted tcb files to .goa long term storage.
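Until the library can report this condition itself, one possible workaround is a post-run guard that refuses to copy anything back when the tmpfs has (nearly) filled up. The following is only a minimal sketch: the function names and the 64 MiB threshold are assumptions made for illustration, not part of GoAccess or Tokyo Cabinet, and free space at the end of a run is a heuristic rather than proof that every write succeeded.

```python
import shutil

# Illustrative threshold: below this much free space, assume writes may have failed.
DEFAULT_MIN_FREE = 64 * 1024 * 1024


def safe_to_sync_back(free_bytes, min_free_bytes=DEFAULT_MIN_FREE):
    """Return True when enough space is left that a full tmpfs is unlikely."""
    return free_bytes >= min_free_bytes


def check_process_dir(path, min_free_bytes=DEFAULT_MIN_FREE):
    """Look up the free space of the filesystem holding `path` and apply the guard."""
    free = shutil.disk_usage(path).free
    return safe_to_sync_back(free, min_free_bytes)
```

A postrun script could call check_process_dir("logs/.goaProcess") and skip the rsync step when it returns False, leaving the last good copy in .goa untouched.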
FWIW, the largest set of logs I'm dealing with requires around 10GB of .goaProcess tmpfs to handle properly for a full month of -df optimize tcb files. Also, I've never seen a day's disk usage delta be over 2GB, which is why I use 3GB just to be safe. Also using compression during the month carries a fair amount of overhead, which is why I don't bzip2 (optimize -tb) until after the month is done and current month is rotated out for archival.
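The sizing rule described above (current size of .goa plus a fixed headroom) can be sketched as follows. This is an approximation under stated assumptions: it sums the apparent sizes of regular files only, so it may differ slightly from du -bs, and the helper names are hypothetical.

```python
import os

GIB = 1024 ** 3


def dir_size_bytes(path):
    """Sum the apparent size of every file below `path` (roughly what du -bs reports)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total


def tmpfs_size_bytes(store_path, headroom_bytes=3 * GIB):
    """Size for the tmpfs: what is already stored plus headroom for the day's growth."""
    return dir_size_bytes(store_path) + headroom_bytes
```

The result can be passed as the size= option when mounting the tmpfs. Computing it in one place avoids the kind of +1GB versus +3GB slip described above.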
Interesting findings you got there and thanks for sharing this. I was trying to find a way to prevent further writes if it hits a threshold, but unfortunately I didn't see anything that would do this in tcb.
One way to reduce the size of the database would be to use a fixed-length database. Per this document, implementing this shouldn't be too bad since GoAccess is currently using the Abstract Database API (it might be as simple as replacing the tcb extension with tcf on these files and adding the params "mode", "width", and "limsiz" here; I haven't tested it). BTW, if you don't need the query part from the request, you can remove it using -q, which may help save some space.
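For illustration, the Abstract Database API takes its tuning parameters as part of the database name, separated by '#', with each parameter written as name=value. A rough sketch of what such a name might look like for a fixed-length database is below. The helper name, file name, and parameter values are assumptions chosen for the example, and this has not been tested against GoAccess.

```python
def tcf_name(path, mode="wc", width=255, limsiz=256 * 1024 * 1024):
    """Build an abstract-database name for a fixed-length (.tcf) database.

    `width` is the fixed width of each value and `limsiz` the size limit
    of the database file; both values here are placeholders.
    """
    if not path.endswith(".tcf"):
        raise ValueError("fixed-length databases are selected by the .tcf suffix")
    return f"{path}#mode={mode}#width={width}#limsiz={limsiz}"


# Hypothetical file name; the real names are whatever GoAccess uses for its tcb files.
print(tcf_name("example.tcf", width=1024, limsiz=1024 ** 3))
```

Note that the width is a hard cap on value length, so it only suits data whose values are known to fit.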
Also, it's probably too early to tell, but I'm working on refactoring the on-disk implementation to use lmdb instead, which appears to outperform tcb. Hope to have some news about this soon.
Would an optimized NoSQL type database work better for this and get away from straight Berkeley db type key/value stores? It would allow you to have multiple values per key that have fast indexed lookups on the key itself. I believe it might offer more extensible flexibility in handling more metrics in the future. With key/value, it can paint you into a corner, that you have to solve by adding more db's to expand its scope/range.
I think my biggest concern with lmdb is that it uses mmap to bring the files into memory, possibly gobbling up tons of memory during its processing. Also, from reading its caveats, it seems that lmdb might be a little bit fragile in production. My current largest log set is using 9.3GB of tcb (-df optimized) for 27 days. I do have some that will be much larger, but I will have to handle those differently due to limitations of the hardware on which they reside.
As for Tokyo Cabinet, shame on their API/implementation for not properly handling an out of disk space condition and bubbling that up to the API consumer. :(
Using a fixed-length db wouldn't really work for us because we have some logs whose referrers are > 1024 chars. It's not that I want to limit the size of the tcbs, I just want to more accurately allocate the tmpfs storage space, and if I somehow underestimate it, running out of disk space should at least not cause malformed tcbs to be pushed back to the hard disk, which happens now since there is no error condition to stop the process. I will need to add another check and/or sub-process that monitors the disk space while it's running; that sidecar process would be able to STOP the main process, resize the tmpfs on the fly, and then kick it with a CONT. My only consternation is that more moving parts are being added to handle something that should be stopped dead with an error from GoAccess/TokyoCabinet.
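The sidecar idea described above (STOP the parser when space runs low, resize, then CONT) could look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the polling approach, and the grow callback, which stands in for whatever actually resizes the tmpfs. The disk_usage and kill parameters exist only so the logic can be exercised without touching a real process.

```python
import os
import shutil
import signal


def next_signal(free_bytes, min_free_bytes, paused):
    """Decide which signal, if any, the sidecar should send on this poll."""
    if not paused and free_bytes < min_free_bytes:
        return signal.SIGSTOP
    if paused and free_bytes >= min_free_bytes:
        return signal.SIGCONT
    return None


def poll_once(pid, path, min_free_bytes, paused, grow=None,
              disk_usage=shutil.disk_usage, kill=os.kill):
    """Run one monitoring step and return the new paused state."""
    free = disk_usage(path).free
    sig = next_signal(free, min_free_bytes, paused)
    if sig == signal.SIGSTOP:
        kill(pid, sig)      # freeze the parser before it can hit a full disk
        if grow is not None:
            grow(path)      # hypothetical hook that enlarges the tmpfs
        return True
    if sig == signal.SIGCONT:
        kill(pid, sig)      # space is available again, let the parser resume
        return False
    return paused
```

A real sidecar would call poll_once in a loop with a short sleep. Because polling can miss a burst of writes, the threshold should be comfortably larger than what the parser can write in one interval.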
One more thing on Tokyo Cabinet: I have found that tcbs created on Ubuntu won't work on my Gentoo system. Both are using the exact same version of tcbmgr and library [1.4.48 (911:1.0)], but when I try to optimize a tcb written by Ubuntu on a Gentoo system (and vice versa), I get hit with the dreaded metadata error. :(
> Would an optimized NoSQL type database work better for this and get away from straight Berkeley db type key/value stores?
When you say NoSQL, are you thinking of something more like Cassandra? I completely agree that the ability to have multiple values per key would simplify a lot of things and give much more flexibility when it comes to new metrics.
Having said that, I'd also like to keep goaccess with minimal dependencies, and where there is a dependency, I think it should be installable without much hassle. However, as it stands right now, I'm open to having a couple of implementations where the user has the option to switch between them (e.g., default to hash tables for cases where the dataset does fit in memory and an additional on-disk implementation such as tokyo cabinet or fill-in-the-blank-a-db).
As far as lmdb goes, I still need to play with it and see how it performs. Though, I like the fact that it supports multiple values per key through MDB_DUPSORT. Also, I like the fact that it can make use of multiple cores. There is an FAQ that may clarify some questions. Regarding production readiness, as I said, I'll have to run some good tests and see how it goes. There's a list of projects that apparently make use of it in some way, including InfluxDB, which at some point I considered.
> One more thing on Tokyo Cabinet, I have found that tcbs created on Ubuntu, won't work on my Gentoo system.
Thanks for sharing that. I was not aware of this and that explains why I wasn't able to read some files sent to me as part of a crash report.
Interesting that you mentioned InfluxDB, as I've used it for various metrics handling in our Kubernetes cluster. More specifically: InfluxDB/Telegraf/Grafana and it seems to work rather well. We also use Prometheus/Grafana as well since it ties in better to deep Kubernetes (Heapster/etc) metrics. I can't really say that I like one over the other, however for CLI work, I lean towards InfluxDB for its SQL like interface, vs Prometheus's functional type syntax. Both (TSDBs) though get the job done rather well for time series storage/query.
I see GoAccess handling metrics like a miner pickaxing a mine, where you read in the log and pick out individual metrics and bin them to key/value stores. With a time series db, you would just parse/store the logs into the db, and pick out what you want for the graphs without having to bin everything.
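To make the contrast concrete, here is a toy sketch of the two approaches. The sample hits and function names are invented for illustration, and a plain Python list stands in for the time series store; it is not how GoAccess or any TSDB is implemented.

```python
from collections import Counter

# Hypothetical, already-parsed hits: (unix_timestamp, path, status)
HITS = [
    (1500000000, "/index.html", 200),
    (1500000030, "/index.html", 200),
    (1500000065, "/missing", 404),
    (1500003700, "/index.html", 200),
]


def bin_at_parse_time(hits):
    """Pickaxe style: every metric has to be chosen and binned while parsing."""
    by_path, by_status = Counter(), Counter()
    for _ts, path, status in hits:
        by_path[path] += 1
        by_status[status] += 1
    return by_path, by_status


def query_later(hits, start, end, field):
    """Store-then-query style: keep the rows, aggregate any field on demand."""
    return Counter(h[field] for h in hits if start <= h[0] < end)


# Requests per path within the first hour only, decided after the fact.
first_hour = query_later(HITS, 1500000000, 1500003600, field=1)
```

With binning, a new metric or a different time window means re-parsing the logs. With stored rows, it is just another query, at the cost of keeping every row around.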
After working with and deep diving GoAccess/TokyoCabinet (TC), I believe that GoAccess is now at the end of the road with TC for several reasons:
After thinking about it, I believe taking a 2nd look at TSDBs might provide a better future growth path than NoSQL type DBs. In short, let the TSDBs handle the metric heavy lifting (injection/storage/retrieval) and GoAccess focus on metric extraction/UX.
I really do like GoAccess UX due to its clear and concise view of one's web traffic with an 'at-a-glance-on-the-go' type overview in which it excels. If one really wants to dive deeper with an analytics bent (also increased administrative complexity), then there are things like Piwik/GoogleAnalytics/customGrafana/SawMill/nowDefunctUrchin(RIP).
_Disclaimer, I am a C to Go convert, and have never looked back (for new projects)._
Just as an aside, have you looked at re-coding GoAccess into another language, like Go? The more I worked with the C code, the more I saw a lot of (carefully) allocated string handling, a lot of memory handling in general, and also extensive structs usage that begged for tight methods.
I have been nothing but impressed with Go as it forces one to pretty much error check/handle everything (which is really good), along with struct/methods/composition, extensive library, multiple return values, slices, built-in maps, goroutines, channels, go fmt, go doc, etc...
There are some custom modules I need to add to GoAccess for our particular infrastructure that will use cgo as the interface.
One more thing I should add, I really like that the Report output (report.html) is an all-inclusive/self-contained single file. I believe that was a really good design decision. :)
Since we have talked about Tokyo Cabinet and Go, just wanted to note that I've been doing some work on the Go Tokyo Cabinet library bindings at TerraTech/go-tokyocabinet
Actually I need to make a correction from my previous post, quoting from the current (v1.3) InfluxDB docs, it seems like they are now using their own implementation called Time Structured Merge Tree storage engine:
> The 0.8 line of InfluxDB allowed multiple storage engines, including LevelDB, RocksDB, HyperLevelDB, and LMDB. The 0.9 line of InfluxDB used BoltDB as the underlying storage engine
I agree that TC is not really helping where it should be doing its job. I'll be honest, when parsing large logs, I've been spoiled with a machine that's mostly idle with a bunch of RAM on it, so I really haven't had the need for TC very much.
However, I'm aware of the performance issue as the data set grows and hash collisions start to occur, thus the need for a db replacement. Let me give as well a second thought to a TSDB implementation. I'm curious how InfluxDB would play and perform with goaccess, some immediate concerns that I can see here are the need for a JSON parser and SQL.
> I really do like GoAccess UX due to its clear and concise view of one's web traffic with an 'at-a-glance-on-the-go' type overview in which it excels. If one really wants to dive deeper with an analytics bent (also increased administrative complexity).
I'm glad to hear you like GoAccess' report. I agree, there are a bunch of tools out there that get you some interesting metrics. One of my goals with goaccess is to implement #117 (it's been requested a lot of times), allowing the user to filter the dataset, which is one of the features that I personally need ASAP.
> Just as an aside, have you looked at re-coding GoAccess into another language, like Go?
I think implementing this in a language like Go would almost certainly attract more users and perhaps more contributions :) All kidding aside, I haven't really paid much attention to re-coding goaccess in another language. I can't really comment on Go's features, but I've been following Rust for some time and indeed some of the features are really nice. However, I still wouldn't give up the ease of C's fairly trivial building and running of programs on vanilla *nix.
I've coded in both Rust and Go, and I would choose Go almost every time for a new project. The only exception I'd make for Rust is if I want to get as close to the metal as possible. Rust has a rather steep learning curve, and I still get frustrated with it. Go, on the other hand, is just fun to write in and you feel like you are being productive almost all the time. My biggest speed bumps with Go were a couple of things: 1) using channels effectively and 2) using interfaces effectively. Once I got past that, I could just sit down with a blank page and an idea and just start coding with beat and flow in Go. Start with mapping out all the interfaces and data structures first, then write the methods to operate and support them.
So if you have taken the time to look at Rust, then I would urge you to spend a few days with Go and let me know how you feel about your productivity. :)
It may seem like I'm a fanboy of Go, but I'm really not, I've just had the pleasure of working with it and really appreciate what it has to offer. It gives one the flexibility and ease of a scripting language, that compiles down to an all-inclusive single binary. It also forces good programming techniques on you with its strict typing (which can be both a blessing and a curse). If one is willing to accept the strict typing and work with it (instead of fighting it), then you are golden. One more thing to mention, is that by nature I'm a procedural coder, and never really liked OOP or functional type coding, but Go got me to change my mind on that due to how they designed it. You can mix and match all three styles effortlessly depending on what you are trying to solve and best of all, it just makes sense.
I think this really popped up in my head as I was seeing a metric ton of untrusted and arbitrary data (that is under an external attacker's control) being fed into a somewhat sophisticated C program with a lot of processing paths. From a security standpoint, it really does take careful and exceptional C coding to handle that kind of data. To note, I see that you have spent a large amount of time on your memory management and bounds checking, but the old axiom still holds: you must protect and cover everything, while an attacker only needs to find one little (missed) pinhole to crawl through. Even if your code is airtight, you still have to be concerned about everything you link to (geoip, tokyocabinet, etc).
v1.4 has been released and uses a new optimized in-memory storage. A dataset of about 400M hits (74G size) is parsed in ~1H 20 mins (in-memory) consuming about 12GB RAM. Please give it a shot and feel free to share the results.
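For a rough sense of scale, those figures can be turned into rates. This is back-of-the-envelope arithmetic only, assuming "74G" means 74 x 10^9 bytes, "12GB" means 12 x 10^9 bytes, and the run took exactly 80 minutes.

```python
def rates(hits, size_bytes, seconds, ram_bytes):
    """Return (hits per second, bytes per second, RAM bytes per hit)."""
    return hits / seconds, size_bytes / seconds, ram_bytes / hits


hits_per_s, bytes_per_s, ram_per_hit = rates(
    hits=400_000_000,
    size_bytes=74 * 10**9,
    seconds=80 * 60,
    ram_bytes=12 * 10**9,
)
print(f"{hits_per_s:,.0f} hits/s, {bytes_per_s / 1e6:.1f} MB/s, {ram_per_hit:.0f} bytes of RAM per hit")
```

Under those assumptions this works out to roughly 83,000 hits per second, about 15 MB/s of log data, and around 30 bytes of RAM per hit.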