Watchdog: [Question] How to detect files that were created before app starts

Created on 15 Jan 2020  ·  17 Comments  ·  Source: gorakhargosh/watchdog

Win 10 or Linux: if CSV files are already present in the directory before the app starts, the app does not detect them. How can this be resolved?

To be precise: how can I detect files created previously, with or without a time window (say, within the last 30 days), so that they are processed when the app is launched for the first time, but are ignored as already processed when the app is stopped and restarted in a later session?

Is this even possible? If not, would I need something like Kafka to process files only once?

All 17 comments

You could provide a DirectorySnapshot of the directory and an empty DirectorySnapshot to DirectorySnapshotDiff. This way all the files of the directory would be set as created.

I did this with my own application, but if @BoboTiG allows me I'll create a PR so he can review the code and apply it to the library if he deems necessary.

Also, to avoid processing the files the next time you start your application, you should pickle the DirectorySnapshot with the last processed content and recover it on the next application start.

The resulting code should look something like this:

if file_with_pickled_snapshot_exists():
    previous_snapshot = recover_pickled_snapshot()
else:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)
pickle_snapshot(current_snapshot)

Also, you should probably use the parameter ignore_device if you are planning to diff snapshots taken across different boots (PR #597).

@Ajordat yeah, open a PR. And even if it may not be merged, it will help others :)

@Ajordat could you show a demo that we could run and try out? Thanks. The current way is to use some database to store and retrieve, so I'm not sure how this approach would differ.

Also, I'm not sure what “pickle the DirectorySnapshot” means.

I've just created the PR #613. If @BoboTiG believes that code might be useful for anybody else, you will be able to use the new class EmptyDirectorySnapshot (I'm already using it on my own project).

Regarding pickling: it's the serialization of an object into bytes so that it can be recovered later. For your use case, just pickle the DirectorySnapshot with the processed changes and recover it later to avoid processing the same content. It should be something like this:

try:
    with open('directory_snapshot.pickle', 'rb') as file:
        previous_snapshot = pickle.load(file)
except FileNotFoundError:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)

with open('directory_snapshot.pickle', 'wb') as file:
    pickle.dump(current_snapshot, file)

As you can see, if the file doesn't exist you make the diff using the EmptyDirectorySnapshot; whereas if it exists, you recover the pickled DirectorySnapshot to avoid processing the files present on the previous execution.
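
To make the pickling step concrete, here is a minimal, stdlib-only sketch of the same load-or-fallback pattern. The helpers `load_or_default`, `save` and `demo` are hypothetical names, and a plain set of paths stands in for the snapshot; a real DirectorySnapshot would be dumped and loaded with the same two calls.

```python
import os
import pickle
import tempfile


def load_or_default(pickle_path, default):
    """Return the unpickled object, or `default` if no file exists yet."""
    try:
        with open(pickle_path, 'rb') as file:
            return pickle.load(file)
    except FileNotFoundError:
        return default


def save(pickle_path, obj):
    """Serialize `obj` to bytes and write it to `pickle_path`."""
    with open(pickle_path, 'wb') as file:
        pickle.dump(obj, file)


def demo():
    """Simulate a first start (no state file) and a second start (state found)."""
    with tempfile.TemporaryDirectory() as tmp:
        state_file = os.path.join(tmp, 'directory_snapshot.pickle')
        first = load_or_default(state_file, set())    # nothing saved yet
        save(state_file, {'/folder/file_a.txt', '/folder/file_b.txt'})
        second = load_or_default(state_file, set())   # recovered from disk
        return first, second
```

The first call falls back to the empty default because the file is missing; the second call gets back exactly what was saved.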

Can I use this now as it is?

Well, you should do a few things first:

  • Add the imports.
import pickle
from watchdog.utils.dirsnapshot import DirectorySnapshot
from watchdog.utils.dirsnapshot import DirectorySnapshotDiff
from watchdog.utils.dirsnapshot import EmptyDirectorySnapshot
  • Implement the function handle_diff to do whatever you want.
def handle_diff(diff: DirectorySnapshotDiff) -> None:
    pass
  • Wait for this to be in a release (if it makes it to one). If you don't want to wait, look at the source code of PR #613 and copy the created class.
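
As one possible shape for handle_diff, here is a small sketch that collects the created and modified files reported by the diff and passes the CSV ones to a callback. It relies only on the `files_created` and `files_modified` attributes of DirectorySnapshotDiff; the `process` callback, the CSV filter and the returned list are assumptions made for illustration, and `fake_diff` is a stand-in so the function can be tried without watchdog installed.

```python
from types import SimpleNamespace


def handle_diff(diff, process=print):
    """Pass every created or modified CSV file in `diff` to `process`.

    `diff` only needs `files_created` and `files_modified` attributes.
    `process` is a hypothetical callback; replace it with real work.
    """
    handled = []
    for path in sorted(set(diff.files_created) | set(diff.files_modified)):
        if path.lower().endswith('.csv'):  # assumption: only CSV files matter
            process(path)
            handled.append(path)
    return handled


# Stand-in for a DirectorySnapshotDiff, used only to try the function out.
fake_diff = SimpleNamespace(
    files_created=['/folder/b.csv', '/folder/notes.txt'],
    files_modified=['/folder/a.csv'],
)
```

Calling `handle_diff(fake_diff)` prints the two CSV paths and skips notes.txt.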

Also, since you are asking this question, are you sure you understand what that piece of code does?

Closed automatically when the PR was merged. I'm reopening it and letting @scheung38 handle the state.

Not exactly sure. Could you demonstrate? Say:

  1. fileA is created or modified before the app starts.
  2. The app starts, so it should pick up fileA.
  3. The app stops and restarts; now it should ignore fileA since it has already processed it.
  4. fileB is created and modified; now the app starts and it should only process fileB, and so on?

Yes, that's exactly what would happen. I thought it was what you were asking for, wasn't it?

Yes, but I need to understand your logic before trying it. Appreciate it.

This applies to files that were either created or modified before the app starts, correct?

Why can't EmptyDirectorySnapshot be imported?

I can import the other classes, DirectorySnapshot and DirectorySnapshotDiff, though.

EDIT: from what I can see, EmptyDirectorySnapshot is in master, but not in dirsnapshot.py?

Why can't EmptyDirectorySnapshot be imported?

It is part of a version that has not been released yet. You have to install the version from the master branch instead of the one from PyPI.

So “pip install watchdog” does not install from master?

Then the company firewall might prevent the pip install, since I get:

python -m pip install git+https://github.com/gorakhargosh/watchdog --user

Looking in indexes: http://CLIENT_URL/artifactory/api/pypi/pypi-repos/simple

Collecting git+https://github.com/gorakhargosh/watchdog

Error RPC failed; HTTP 403

I copied and pasted only the new EmptyDirectorySnapshot class:

from watchdog.utils.dirsnapshot import DirectorySnapshot
from watchdog.utils.dirsnapshot import DirectorySnapshotDiff
import pickle


class EmptyDirectorySnapshot(object):
    """Class to implement an empty snapshot. This is used together with
    DirectorySnapshot and DirectorySnapshotDiff in order to get all the files/folders
    in the directory as created.
    """

    @staticmethod
    def path(_):
        """Mock up method to return the path of the received inode. As the snapshot
        is intended to be empty, it always returns None.
        :returns:
            None.
        """
        return None

    @property
    def paths(self):
        """Mock up method to return a set of file/directory paths in the snapshot. As
        the snapshot is intended to be empty, it always returns an empty set.
        :returns:
            An empty set.
        """
        return set()


def handle_diff(diffs):
    print(diffs)


try:
    with open('Y:\\data\sample.csv', 'rb') as file:
        previous_snapshot = pickle.load(file)
except FileNotFoundError:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('Y:\\data')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)

with open('Y:\\data\sample.csv', 'wb') as file:
    pickle.dump(current_snapshot, file)

Returns:

Traceback: line 37, in
previous_snapshot = pickle.load(file)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent load function was specified.

It seems to work now, but does it work with CSV files? The CSV files seem to be corrupted the next time I open them in Excel.

EDIT: but sometimes I still get the above error, and need to restart PyCharm?

My fault: I should be opening a file.pkl with rb and wb instead.

Sorry for the late reply: I work full time, I have other duties, and it took me quite some time to write this response.

First, pip installs the latest release uploaded to PyPI, so even if the master branch is updated with a commit, the PyPI package isn't automatically updated.

Moving on to the code you've shown, it seems like you forgot to escape a backslash. Could that be the reason for your error?
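
As a quick illustration of the escaping point, the sketch below shows three equivalent ways to spell a Windows path in Python and one spelling that silently goes wrong. The file name directory_snapshot.pkl is a hypothetical example; only the Y:\data directory comes from the code above.

```python
from pathlib import PureWindowsPath

# Three spellings of the same path.
escaped = 'Y:\\data\\directory_snapshot.pkl'   # every backslash doubled
raw = r'Y:\data\directory_snapshot.pkl'        # raw string, nothing to escape
joined = str(PureWindowsPath('Y:/data') / 'directory_snapshot.pkl')

# A spelling that breaks: '\t' is a recognized escape and becomes a TAB.
broken = 'Y:\temp'
```

Note that an unrecognized escape such as `\s` is kept as a literal backslash followed by `s` (recent Python versions emit a warning), so a single backslash there is unlikely to change the resulting path by itself; recognized escapes such as `\t` or `\n` are the ones that silently alter it.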

Also, it seems like you are trying to use a CSV file and open it. That's a big indicator of a misunderstanding. I'll try to explain it in more depth:

The file that we use only stores the DirectorySnapshot object for the next execution of your application. It is written in binary and is not meant to be human-readable. What we store there is just the DirectorySnapshot, which records which files and folders were inside a directory at a certain point in time (when the object was created).

We need to store that information to avoid processing those files again the next time we start the application. But why? Here's the flow:

└── folder
    ├── file_a.txt
    └── file_b.txt
  1. Start the application that will look for changes in the directory /folder/. That's your application.
  2. Since there isn't any record of a previous execution (the file directory_snapshot.pickle is missing), we must process all files, so we take the EmptyDirectorySnapshot as the reference.
  3. We take the DirectorySnapshot of the directory /folder/.
  4. We diff both snapshots. Since the first one is empty, all the files in the second snapshot are detected as CREATED. That means both files inside the directory: /folder/file_a.txt and /folder/file_b.txt.
  5. We call the function handle_diff with the results of the operation.
  6. Since we have processed both files as created and we don't want to do it again the next time the application starts, we store (pickle) the DirectorySnapshot that we just took in a file (directory_snapshot.pickle). That way we avoid processing the same files again next time.

Now, what will happen if a file (/folder/file_c.txt) gets added and that piece of code is executed again?

  1. The application checks whether there's any record of a previous execution. It finds the file directory_snapshot.pickle and loads (unpickles) its contents. It gets the DirectorySnapshot created on the previous execution as the reference.
  2. We take the DirectorySnapshot of the directory /folder/.
  3. We diff both snapshots. Since the files /folder/file_a.txt and /folder/file_b.txt exist in both snapshots and haven't been updated, they are ignored in the diff. The file /folder/file_c.txt is different: because it's new, it gets detected as CREATED.
  4. We call the function handle_diff with the results of the operation.
  5. Since we have processed the new file and we don't want to do it again next time, we store (pickle) the second DirectorySnapshot in a file (directory_snapshot.pickle). This way we avoid processing the same file next time.
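
The two walkthroughs above can be reproduced with a stdlib-only sketch. It is not watchdog's implementation: a dict mapping each path to (mtime, size) stands in for DirectorySnapshot, an empty dict plays the role of EmptyDirectorySnapshot, and `take_snapshot`, `run_once` and `demo` are hypothetical helper names. Keeping the state file outside the watched folder is an assumption made here so that it never shows up in the diff.

```python
import os
import pickle
import tempfile


def take_snapshot(directory):
    """Map each file path under `directory` to its (mtime_ns, size)."""
    snapshot = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            snapshot[path] = (info.st_mtime_ns, info.st_size)
    return snapshot


def run_once(directory, state_file):
    """One application start: load the old state, diff, save the new state."""
    try:
        with open(state_file, 'rb') as file:
            previous = pickle.load(file)
    except FileNotFoundError:
        previous = {}  # plays the role of EmptyDirectorySnapshot
    current = take_snapshot(directory)
    created = sorted(p for p in current if p not in previous)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    with open(state_file, 'wb') as file:
        pickle.dump(current, file)
    return created, modified


def demo():
    """Run three 'application starts' and report the created files of each."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = os.path.join(tmp, 'folder')
        os.mkdir(folder)
        state_file = os.path.join(tmp, 'directory_snapshot.pickle')
        for name in ('file_a.txt', 'file_b.txt'):
            with open(os.path.join(folder, name), 'w') as file:
                file.write('data')
        first = run_once(folder, state_file)    # no previous record
        second = run_once(folder, state_file)   # nothing changed
        with open(os.path.join(folder, 'file_c.txt'), 'w') as file:
            file.write('data')
        third = run_once(folder, state_file)    # only file_c.txt is new

        def names(run):
            return [os.path.basename(p) for p in run[0]]

        return names(first), names(second), names(third)
```

The first start reports file_a.txt and file_b.txt as created, the second reports nothing, and the third reports only file_c.txt. In a real application the call to handle_diff would sit between computing the diff and saving the new state.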

I hope you now have a better understanding of how the code works.

Fully appreciated, thanks. Hence my last comment was about using a file.pkl and not the actual CSV file.
