Win 10 or Linux: if CSV files are already present in the directory before the app starts, the app does not detect them. How can this be resolved?
To be precise: how can I detect files that were created previously, with or without a time window (say, within the last 30 days), so that they are processed when the app is launched for the first time, but are ignored as already processed when the app is stopped and restarted in a later session?
Is this even possible? If not, would I need something like Kafka to process each file only once?
You could provide a DirectorySnapshot of the directory and an empty DirectorySnapshot to DirectorySnapshotDiff. This way all the files of the directory would be set as created.
I did this with my own application, but if @BoboTiG allows me I'll create a PR so he can review the code and apply it to the library if he deems necessary.
Also, to avoid processing the files the next time you start your application, you should pickle the DirectorySnapshot with the last processed content and recover it on the next application start.
The resulting code should look something like this:
if file_with_pickled_snapshot_exists():
    previous_snapshot = recover_pickled_snapshot()
else:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)
pickle_snapshot(current_snapshot)
Also, you should probably use the ignore_device parameter if you are planning to diff snapshots taken across different boots (PR #597).
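To illustrate why ignore_device can matter: as far as I understand, the diff identifies a file by its inode number together with its device ID, and the device ID is not guaranteed to stay the same across reboots. The sketch below is not watchdog code; same_file and the ID values are made up purely for illustration.

```python
def same_file(ref_id, cur_id, ignore_device=False):
    """Compare two (inode, device) pairs the way a snapshot diff might.

    Hypothetical helper for illustration only, not part of watchdog.
    """
    if ignore_device:
        # Compare only the inode number and ignore the device ID.
        return ref_id[0] == cur_id[0]
    return ref_id == cur_id


# Same inode, but the device ID changed between two boots (made-up values).
before_reboot = (1234, 2049)
after_reboot = (1234, 2065)

strict = same_file(before_reboot, after_reboot)
relaxed = same_file(before_reboot, after_reboot, ignore_device=True)
print(strict, relaxed)
```

With the strict comparison, an unchanged file would look like a different file after a reboot, so it could be reported again.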
@Ajordat yeah, open a PR. And even if it may not be merged, it will help others :)
@Ajordat could you show a demo that we could run and try out? Thanks. The current way uses a database for storing and retrieving, so I'm not sure how this approach would differ.
Also, I'm not sure what “pickle the DirectorySnapshot” means.
I've just created PR #613. If @BoboTiG believes that code might be useful for anybody else, you will be able to use the new class EmptyDirectorySnapshot (I'm already using it in my own project).
Regarding pickling, it's the serialization of an object into bytes so that it can be recovered later. For your use case, just pickle the DirectorySnapshot that reflects the processed changes and recover it later to avoid processing the same content again. It should be something like this:
try:
    with open('directory_snapshot.pickle', 'rb') as file:
        previous_snapshot = pickle.load(file)
except FileNotFoundError:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)

with open('directory_snapshot.pickle', 'wb') as file:
    pickle.dump(current_snapshot, file)
As you can see, if the file doesn't exist you make the diff using the EmptyDirectorySnapshot; whereas if it exists, you recover the pickled DirectorySnapshot to avoid processing the files present on the previous execution.
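The original question also mentioned a time window (for example, the last 30 days), which the snippet above does not cover. One possible way, sketched here under assumptions, is to filter the paths reported by the diff by their modification time. recent_paths is a hypothetical helper, not part of watchdog, and it looks at the modification time rather than the creation time.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 24 * 60 * 60


def recent_paths(paths, max_age_days=30, now=None):
    """Return the paths modified within the last max_age_days.

    Hypothetical helper for illustration, not part of watchdog.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return [p for p in paths if os.path.getmtime(p) >= cutoff]


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as folder:
    new_file = os.path.join(folder, "new.csv")
    old_file = os.path.join(folder, "old.csv")
    for path in (new_file, old_file):
        with open(path, "w") as handle:
            handle.write("a,b\n")

    # Pretend old.csv was last modified 60 days ago.
    sixty_days_ago = time.time() - 60 * SECONDS_PER_DAY
    os.utime(old_file, (sixty_days_ago, sixty_days_ago))

    kept = recent_paths([new_file, old_file])
    kept_names = [os.path.basename(p) for p in kept]

print(kept_names)
```

In the application this would presumably be called on the created files of the diff, for example recent_paths(diff.files_created), before processing them.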
Can I use this now as it is?
Well, you should do a few things first:
Import the required names:
import pickle
from watchdog.utils.dirsnapshot import DirectorySnapshot
from watchdog.utils.dirsnapshot import DirectorySnapshotDiff
from watchdog.utils.dirsnapshot import EmptyDirectorySnapshot
Implement handle_diff to do whatever you want:
def handle_diff(diff: DirectorySnapshotDiff) -> None:
    pass
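As one possible way to fill in handle_diff for the CSV use case, here is a minimal sketch. It only relies on the files_created and files_modified attributes of the diff; the CSV filtering and the returned list are my own additions, and the SimpleNamespace object is just a stand-in so the example runs without watchdog installed.

```python
from types import SimpleNamespace


def handle_diff(diff):
    """Collect the CSV files that the diff reports as created or modified."""
    candidates = list(diff.files_created) + list(diff.files_modified)
    csv_files = [p for p in candidates if p.lower().endswith(".csv")]
    for path in csv_files:
        # Replace this print with the real processing of each file.
        print("processing", path)
    return csv_files


# Stand-in object exposing the same two attributes (made-up paths).
fake_diff = SimpleNamespace(
    files_created=["/folder/file_a.csv", "/folder/notes.txt"],
    files_modified=["/folder/file_b.CSV"],
)
result = handle_diff(fake_diff)
```

Whether modified files should be processed again depends on the application; dropping files_modified from the list is enough to ignore them.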
Also, since you are asking this question, are you sure you understand what that piece of code does?
This was closed automatically when the PR was merged. I'm reopening it and will let @scheung38 handle the state.
Not exactly sure, could you demonstrate? Say if
Yes, that's exactly what would happen. I thought it was what you were asking for, wasn't it?
Yes, but I need to understand your logic first before trying. Appreciate it.
This applies to files that are either created or modified before the app starts, correct?
Why can't EmptyDirectorySnapshot be imported?
I can import the other classes, DirectorySnapshot and DirectorySnapshotDiff, though.
EDIT: From what I can see, EmptyDirectorySnapshot is in master but not in dirsnapshot.py?
Why EmptyDirectorySnapshot cannot be imported?
It is part of a version that has not been released yet. You have to install the version from the master branch instead of the one from PyPI.
So “pip install watchdog” does not install from master?
Then the company firewall might prevent the pip install, since
python -m pip install git+https://github.com/gorakhargosh/watchdog --user
Looking in indexes: http://CLIENT_URL/artifactory/api/pypi/pypi-repos/simple
Collecting git+https://github.com/gorakhargosh/watchdog
Error RPC failed; HTTP 403
Copy and pasted only the new EmptyDirectorySnapshot class:
from watchdog.utils.dirsnapshot import DirectorySnapshot
from watchdog.utils.dirsnapshot import DirectorySnapshotDiff
import pickle


class EmptyDirectorySnapshot(object):
    """Class to implement an empty snapshot. This is used together with
    DirectorySnapshot and DirectorySnapshotDiff in order to get all the files/folders
    in the directory as created.
    """

    @staticmethod
    def path(_):
        """Mock up method to return the path of the received inode. As the snapshot
        is intended to be empty, it always returns None.
        :returns:
            None.
        """
        return None

    @property
    def paths(self):
        """Mock up method to return a set of file/directory paths in the snapshot. As
        the snapshot is intended to be empty, it always returns an empty set.
        :returns:
            An empty set.
        """
        return set()


def handle_diff(diffs):
    print(diffs)


try:
    with open('Y:\\data\sample.csv', 'rb') as file:
        previous_snapshot = pickle.load(file)
except FileNotFoundError:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('Y:\\data')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)

with open('Y:\\data\sample.csv', 'wb') as file:
    pickle.dump(current_snapshot, file)
Returns:
Traceback: line 37, in
previous_snapshot = pickle.load(file)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent load function was specified.
Seems to work now, but does it work with CSV files? It seems the CSV files are now corrupted the next time I open them in Excel.
EDIT: but sometimes I still get the above error and need to restart PyCharm?
My fault, I should be opening a file.pkl with rb and wb instead.
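For anyone hitting the same problem, here is a minimal sketch of keeping the pickled state in its own file so the data files are never opened for writing. load_state, save_state and the file names are assumptions for illustration, and a plain set stands in for the snapshot so the example runs without watchdog.

```python
import os
import pickle
import tempfile


def load_state(state_path, default):
    """Return the unpickled state, or default if no state file exists yet."""
    try:
        with open(state_path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return default


def save_state(state_path, state):
    """Pickle the state into its own binary file."""
    with open(state_path, "wb") as handle:
        pickle.dump(state, handle)


with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "data")
    os.mkdir(data_dir)
    csv_path = os.path.join(data_dir, "sample.csv")
    # The state file lives outside the data directory.
    state_path = os.path.join(root, "snapshot.pkl")

    with open(csv_path, "w") as handle:
        handle.write("a,b\n1,2\n")

    first_load = load_state(state_path, default=set())
    save_state(state_path, {csv_path})
    second_load = load_state(state_path, default=set())

    # The CSV itself is untouched by loading and saving the state.
    with open(csv_path) as handle:
        csv_text = handle.read()
```

Keeping the state file outside the watched directory also avoids it showing up in the next diff as a created or modified file.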
Sorry for the late reply: I work full time, I have other duties and it took me quite a good time to write this response.
First, pip takes the latest release uploaded to PyPI, so even if the master branch is updated with a commit, the PyPI repository isn't automatically updated.
Moving to the code you've shown, it seems like you forgot to escape a backslash; might that be the reason for your error?
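On escaping backslashes in Windows paths, here is a small sketch of what can go wrong and three ways to write such a path. The table.csv name is a made-up example chosen because "\t" is a real escape sequence. In the posted path, "\s" is not an escape sequence Python recognizes, so it is kept as a backslash followed by "s", although recent Python versions warn about it.

```python
# "\t" inside a normal string literal is a tab character,
# not a backslash followed by the letter "t".
mangled = "Y:\\data\table.csv"

# Three ways of writing the intended path.
escaped = "Y:\\data\\table.csv"
raw = r"Y:\data\table.csv"
forward = "Y:/data/table.csv"

print("\t" in mangled)
print(escaped == raw)
```

A raw string or forward slashes are usually the least error-prone options; as far as I know, Python accepts forward slashes in file paths on Windows as well.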
Also, it seems like you are trying to use a CSV file and open it. That's a big indicator of a misunderstanding. I'll try to explain it a bit more in depth:
The file that we use is just to store the DirectorySnapshot object for the next execution of your application, it is done in binary and it is not supposed to be readable for humans. What we store there is just the DirectorySnapshot with the data of which files and folders were inside a directory at a certain point in time (when the object is created).
We need to store that information to avoid processing it the next time we start the application, but why? Here's the flow:
1. We have the following directory:
└── folder
    ├── file_a.txt
    └── file_b.txt
2. We execute the piece of code on the directory /folder/. That's your application.
3. As there's no previous snapshot stored (directory_snapshot.pickle), we must process all files, so we take the EmptyDirectorySnapshot as reference.
4. We take a DirectorySnapshot of the directory /folder/.
5. We make the diff between both snapshots. It detects as created the files /folder/file_a.txt and /folder/file_b.txt.
6. We call handle_diff with the results of the operation.
7. We store the DirectorySnapshot that we previously took in a file (directory_snapshot.pickle). So we can avoid processing again the same files the next time.

Now, what will happen if a file (/folder/file_c.txt) gets added and that piece of code is executed again?

1. The application finds directory_snapshot.pickle and takes (unpickles) its contents. It gets the first DirectorySnapshot created on a previous execution as reference.
2. It takes a new DirectorySnapshot of the directory /folder/.
3. It makes the diff between both snapshots. As /folder/file_a.txt and /folder/file_b.txt exist in both snapshots and haven't been updated, they are ignored on the diff. This doesn't happen with the file /folder/file_c.txt: because it's new, it gets detected as CREATED.
4. It calls handle_diff with the results of the operation.
5. It stores the new DirectorySnapshot in a file (directory_snapshot.pickle). This way we can avoid processing the same file the next time.

I hope you now have a better understanding of how the code works.
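Since a runnable demo was requested earlier, the two executions described above can be reproduced with the standard library alone. This is only a stand-in to show the flow: take_snapshot and run_once are hypothetical names, a plain dict plays the role of DirectorySnapshot, an empty dict plays the role of EmptyDirectorySnapshot, and only created files are reported.

```python
import os
import pickle
import tempfile


def take_snapshot(folder):
    """Map each file name in folder to its modification time."""
    return {entry.name: entry.stat().st_mtime_ns
            for entry in os.scandir(folder) if entry.is_file()}


def run_once(folder, state_path):
    """One application start: load the old state, diff, save the new state."""
    try:
        with open(state_path, "rb") as handle:
            previous = pickle.load(handle)
    except FileNotFoundError:
        previous = {}  # plays the role of EmptyDirectorySnapshot
    current = take_snapshot(folder)
    created = sorted(set(current) - set(previous))
    with open(state_path, "wb") as handle:
        pickle.dump(current, handle)
    return created


def touch(path):
    """Create a small file at path."""
    with open(path, "w") as handle:
        handle.write("x\n")


with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "folder")
    os.mkdir(folder)
    state_path = os.path.join(root, "directory_snapshot.pickle")
    touch(os.path.join(folder, "file_a.txt"))
    touch(os.path.join(folder, "file_b.txt"))

    first_run = run_once(folder, state_path)   # no state file yet
    touch(os.path.join(folder, "file_c.txt"))
    second_run = run_once(folder, state_path)  # only the new file
    third_run = run_once(folder, state_path)   # nothing changed

print(first_run, second_run, third_run)
```

The first run reports both existing files, the second only file_c.txt, and the third nothing, which mirrors the behaviour described for the pickled DirectorySnapshot.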
Fully appreciated, thanks. That's why my last comment used a file.pkl and not the actual CSV file.