Watchman no longer gives us useful events when the children of directories are created/deleted, which is a large and surprising change in behaviour that means we need to over-invalidate (or add additional syscalls to differentiate creates/updates/deletes).
We should explore whether using something like the notify crate would result in fewer assumptions... we already know that it would result in one fewer daemon and less memory usage (as we have no need to track anything that isn't warm in the graph).
The primary uncertainty is whether consuming a queue of events in process would require filtering to decide which events to ignore, and which require additional syscalls to determine their actual meaning: for example:
accesstime updates?Also, even if this looks like a good approach, it should likely be enabled behind a pants global option, so that both watchman and the notify crate can be used _at once_ (EDIT at @illicitonion's suggestion) for a while.
thinking that this is probably a longer-term goal in the face of our current priorities, but agree it would be nice to fold filesystem event monitoring directly into the daemon if we can do it as good or better than watchman.
we should probably also attempt to make a case upstream for feature flagging the behavior we want if that's the primary driving factor here.
Another relevant bit here is that watchman walks the entire filesystem during initialization, which is strictly more* (*fixed) than we need to do, afaict.
Assigning @jsirois : updated the description to make it clear that even if we decide to ship this, we'll want to support toggling between watchman and native-notify via a global flag.
@illicitonion : Assuming you're not taking this one, let's leave it assigned to @jsirois .
@jsirois : If you're able to work on this one next week, it would be much appreciated. I can lay some groundwork this week to unblock you (basically, moving the "fork lock" to the rust side in order to make the background thread that @illicitonion determined would be necessary here viable).
Yeah, enjoy @jsirois!
Actually @jsirois, I think we need you on #5426 with higher priority... but perhaps you can start this later in the week.
Ack. I'll be back on it tomorrow morning.
Regarding the fork_lock... I implemented moving it from python to rust and was busy writing the PR description when I realized that we probably don't need to acquire it here.
Assuming what what we need for the inotify crate is just a background thread, we can stop/join that background thread in: https://github.com/pantsbuild/pants/blob/e3c38bac286ac379d62b3c0be47790a7915e9418/src/rust/engine/src/context.rs#L77
Any time we stop the thread, we need to re-crawl everything we were watching, because we may have missed notifications while the thread was stopped. This is slow, blocking, and generally sad.
Ok. Posted #5521
Notes about API: It would be nice to just be able to ask for changes since X for a root. Even better would be where X is not a time but a token and an initial scan of a root returns a token. This would allow any weirdness with time to be held behind the API.
I kicked off a small branch for this over the holiday: https://github.com/pantsbuild/pants/compare/master...twitter:stuhood/notify-crate ... it does not yet work, but it compiles.
It has a number of important TODOs before it is landable:
InvalidationWatcher::watch method is blocking, and (likely) makes syscalls in order to accomplish what it is doing. We call it in the branch from an async context here: https://github.com/pantsbuild/pants/compare/master...twitter:stuhood/notify-crate#diff-11db197014e1e80c9551e9b61aba6aa0R1224 ... that is almost certainly making things slower than they should be. We should probably give the InvalidationWatcher a reference to the Core's Executor in here (by just cloning it: it is internally reference counted), and then having the fn watch method return an impl Future that we would consume in Node::run instead.InvalidationWatcher starts). Should run with something like MODE=debug ./pants --enable-pantsd list 3rdparty:: and watch .pants.d/pantsd/pantsd.log to see invalidation log messages: the goal would be to receive similar log messages for this codepath as we do in the existing watchman invalidation path here.The first thing is purely a performance improvement, but I'd suggest looking at it first because it will make debugging the second thing easier not to have to wait long periods for pantsd to start up.
Fixing those two things would leave us in a place where we are using two systems to invalidate, but hopefully in a way that has no noticeable impact for end users.
Then, in a second PR, we should allow for watchman to be disabled, most likely by making edits to the python-side FSEventService
InvalidationWatcher::running method.Keith got a draft of this out here: #9089: it contains parts of the first two items above. Thanks Keith!
The implementation of notify watcher in #9318 leads to some user noticeable performance hits for users who don't have pantsd enabled. It's possible that there are some improvements we can make by ignoring files that are already gitignored that don't need to be watched for events. This would be an ad-hoc solution to make notify faster for us but does nothing for consumers, who would likely have to do the same thing. (Link to issue about ignoring based on project .gitignore?)
Questions that will affect how we move forward:
Next Steps
The plan to complete the migration to notify away from watchman is to follow up #9318 with a PR which allows opt-in to disable watchman, and in the longer term, deprecating it entirely. However this assumes that the performance of using notify is good enough for everyone that we can enable notify by default. If it isn't, and we don't plan to enable pantsd by default then we may have to maintain multiple watchers until we decide to. @stuhood suggested some kind of tri-state flag which is either {notify, notify+watchman, watchman} though I don't understand the use case of the notify+watchman mode.
though I don't understand the use case of the notify+watchman mode.
Without having had looked closely at the implementation, that does indeed sound difficult and error-prone to implement. I would think either notify or watchman would be sufficient? If our goal is to remove watchman, then using notify + watchman should be redundant because notify presumably does the same thing.
that does indeed sound difficult and error-prone to implement
Error prone yes, just because watchman is already error-prone, and we would have to maintain more code. Difficult, not really, currently in #9318 the default is to use both simultaneously. They both hook in to the same code path in the engine, but we get redundant invalidation.
Based on changes implemented in https://github.com/pantsbuild/pants/pull/9318/commits/3f52ee519f74ad2d860dc5b75fa5cfda6f40a214 to improve latency of setting up file watches it seems like we are no longer between a rock and a hard place with how to move forward! Watching the entire build root at startup time adds minimal overhead to pants (https://github.com/pantsbuild/pants/pull/9318#issuecomment-603613434). We can safely land #9318 without any penalty to users, so we don't need to put it behind a feature gate. This means the next steps are much clearer:
Revised Next Steps
Most helpful comment
I kicked off a small branch for this over the holiday: https://github.com/pantsbuild/pants/compare/master...twitter:stuhood/notify-crate ... it does not yet work, but it compiles.
It has a number of important TODOs before it is landable:
InvalidationWatcher::watchmethod is blocking, and (likely) makes syscalls in order to accomplish what it is doing. We call it in the branch from an async context here: https://github.com/pantsbuild/pants/compare/master...twitter:stuhood/notify-crate#diff-11db197014e1e80c9551e9b61aba6aa0R1224 ... that is almost certainly making things slower than they should be. We should probably give theInvalidationWatchera reference to the Core's Executor in here (by just cloning it: it is internally reference counted), and then having thefn watchmethod return animpl Futurethat we would consume inNode::runinstead.InvalidationWatcherstarts). Should run with something likeMODE=debug ./pants --enable-pantsd list 3rdparty::and watch.pants.d/pantsd/pantsd.logto see invalidation log messages: the goal would be to receive similar log messages for this codepath as we do in the existing watchman invalidation path here.The first thing is purely a performance improvement, but I'd suggest looking at it first because it will make debugging the second thing easier not to have to wait long periods for pantsd to start up.
Fixing those two things would leave us in a place where we are using two systems to invalidate, but hopefully in a way that has no noticeable impact for end users.
Then, in a second PR, we should allow for watchman to be disabled, most likely by making edits to the python-side FSEventService
InvalidationWatcher::runningmethod.