hello all,
i am starting a new discussion because i didn't find any previous discussion on the topic of automatically updating (or finding updates) for casks.
my current impression is that we are moving towards more automation for keeping casks current, but the effort is mainly done 'outside' HBC by observing the appcasts and then sending pull requests.
i am not sure how far along this is (i am often finding astonishingly outdated casks with perfectly fine appcasts) but i have the impression that it is being worked on, and is a problem that can be solved reasonably (if there are valid sparkle or github appcasts).
what i'd like to discuss is keeping casks current where we don't have any valid or easily parsable appcasts. i think i can provide something of value here to the HBC community. as discussed previously, through our app i have access to version information of apps installed by thousands of users. most of the time, /someone/ has the latest version of an app installed.
i am able to automatically compare the version that is in HBC to the versions stored in my database, and generate a report for each app where a newer version is available.
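the comparison described above could be sketched roughly like this — a minimal, hypothetical illustration (the real script and database schema are not shown in this thread, so all names and structures here are assumptions):

```python
# sketch: compare the version listed in a cask against versions seen
# in a (hypothetical) crowd-sourced database, and report casks where
# a strictly newer version exists.

def version_key(v):
    """turn '1.2.3' into (1, 2, 3) for comparison; non-numeric parts
    fall back to 0 so odd version strings don't crash the compare."""
    parts = []
    for p in v.split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def find_outdated(cask_versions, db_versions):
    """cask_versions: {cask_name: version currently in HBC}
    db_versions: {cask_name: [versions seen in the wild]}
    returns [(cask, hbc_version, newest_seen)] for outdated casks."""
    report = []
    for cask, hbc_ver in cask_versions.items():
        seen = db_versions.get(cask, [])
        if not seen:
            continue
        newest = max(seen, key=version_key)
        if version_key(newest) > version_key(hbc_ver):
            report.append((cask, hbc_ver, newest))
    return report
```

a real implementation would of course need a smarter version comparison (build numbers, letters, etc.), but the core "does anyone out there run something newer than the cask says" check is this simple.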
i just implemented a first go at this, and already noticed quite a few outdated casks only by looking at the results:
some points:
• obviously there are a lot of false positives. keep in mind this is the very first output.
looking through the file may make it seem there is quite a lot of noise, but i think we can easily cover 70% of the apps without an appcast with practically zero noise, given some improvements and filtering. most apps don't have beta versions.
is this useful to HBC? basically i can commit to improving the script, running it regularly, and posting its output publicly for others to use accordingly. what i cannot do is build the infrastructure to filter all false positives and send pull requests automatically (or manually).
looking forward to your thoughts.
just sent a dozen PRs based on the txt file. the success rate is already quite good.
I would say for now that you should either:
a) build a tool that consumes that information on your side and submits PRs with cask-repair
b) publish the info at an api/url and see if others can build onto it. it may eventually get migrated into HBC depending on what happens.
You're right that there is a movement towards automation (I think you and I are mostly leading it at this point), but there have been issues automating these things in the past.
The current line of thinking around this is to customize any automation to each cask. Basically for a given cask (let's take splashtop-streamer.rb, randomly), there would be a separate file which specifically detailed how to check that cask for updates in an automated way. Fetch this page, look for this link, click on it, look for this part of the page, grep the version number, etc.
In some other thread which I can't find now, most of these use cases could be boiled down to 4 or 5 main flows (one for github, one for sourceforge, etc). So then all you would really need to write is the deviations from the baseline flow. For example if we had a cask that uses github but the "generic" automation for github casks doesn't work, you can just write what needs to be different.
Does any of that make sense?
That is kind of what I'm working on now.
EDIT: I should also note that there will be questions of privacy/legality and whatnot around how you came to possess the information you are using, which would be a barrier for HBC integration. In short, that information collection would have to be approved and internalized... but since it comes from a 3rd party (I believe you run a business and this information is a byproduct?) that is unlikely to happen.
hey @brianmorton , thanks for your feedback.
i wasn't prepared to do a) as it involves probably more time than i can afford to put into this at the moment. option b) sounds good and workable at the moment.
for the time being i've seen little interest here and so we'll use the info to do cask-repairs of those casks that interest us manually. if things change i'm prepared to clean this up and periodically generate and upload it.
The current line of thinking around this is to customize any automation to each cask.
that's also what i have envisioned as the ultimate solution to the given problem. i see two main problems: 1.) writing tailored code for 4000 casks and 2.) maintaining that code as upstream changes, e.g. websites change, go offline, go online again, feeds get broken, get repaired again, etc.
For example if we had a cask that uses github but the "generic" automation
for github casks doesn't work, you can just write what needs to be different.
looking at the cask-repairs i've done i see that many people seem to randomly change filenames each release which is obviously bad for our use-case.
Does any of that make sense?
yes it does and thanks for your feedback!
privacy/legality
i don't really see much connection here
• the information that "appversion 1.2.3 exists" is nothing "private". private means connected to a human. it's not connected to any human. we don't track users or user behaviour and don't intend to. we track appversions, which are not something private.
• the information that "appversion 1.2.3 exists" isn't even copyrightable at all because it's a fact about our world and not something some person or company has produced (like the app itself). just like the length of the latest BMW or the number of legs on some exotic animal, it's a fact about our world and not copyrightable. (exception: the specific aggregation of non-copyrightable things can still be copyrightable, which is a loophole that some collections of public-domain works use to prevent their databases from being copied. since we did the aggregation ourselves there is no problem here.)
That is kind of what I'm working on now.
ok that's great!
don't really see much connection here
I was alluding to the fact that you as a private entity/business are collecting information on your users and then sharing said information. But my understanding of what you do or how you get your information may be flawed.
i think it is ;) you are mixing up 'information' and 'personal information':
https://en.wikipedia.org/wiki/Personally_identifiable_information
while 'personal information' is protected in most jurisdictions (especially now in the EU with the GDPR) in how you can collect, store, process, use and share it, information that is not connected to humans is not protected (apart from the usual copyright and trade-secret stuff).
we don't collect any 'personal information'.
I understand all that, but the information is collected as part of your business practices. Ergo it would be difficult for homebrew to indefinitely use it - they would likely want/need to internalize the collection. If your business changed priorities or went away, that would be problematic if something here was built on top of having access to that information.
i see. however, if we just publish a file containing a list of (possibly) outdated casks with their actual newer versions each day - i.e. plan b) , i doubt that could become problematic.
If you did that I would be happy to build something that worked with it :)
sounds great ;) i'll try to get something to work this week.
It’s been difficult for me to give attention to this thread, but I’ve read it all now and you’re going in the right direction. Agree completely with the progression so far.
ok i've been working on this a bit but i am undecided about some architectural questions. basically i generate a textual output about any apps where i see newer versions in our database than the cask lists.
the thing is, we need filtering to keep noise down.
i already filter app-versions from the database that look suspiciously like beta releases. there are about 10 filters here (e.g. a version '1.2.3b3' will be filtered because it looks beta-ish), so no report is generated for any of those. this filtering works fine for what i want in my app, but isn't "enough" here. we've processed the 'caskversioncheck.txt' i posted initially, and i've had a look at the results: in which cases were there really new versions that resulted in a cask-repair being done, and in which cases were there false positives.
i am about 1/3 through and came up with 3 more filtering categories we need:
1.) a list of cask-names to be ignored for this facility altogether because noise is too high because the vendor publishes beta versions that are indistinguishable from real versions (e.g. visual-studio, skype-for-business, origin, etc)
2.) a list of filter strings that should be applied to filter out some beta versions. e.g. caret|.-rc. mendeley|.-dev. => every version of the cask 'caret' that is floating around and has "rc" in its name should be ignored because it's obviously a 'release candidate' and not a real new version that should be entered into HBC
3.) a list of specific versions to be ignored. many of these are one-off versions that are not real releases. e.g. fs-uae|2.9.7|2.9.7 desmume|0.9.12|0.9.12 ebmac|1.43.3|1.43.3
if we apply these 3 additional filtering facilities to my output, the noise is pretty low and most of the output is useful and should result in real new versions to be entered into HBC.
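the three filter categories above could be sketched roughly like this — the filter data is illustrative (taken from the examples in this thread), not the actual postprocessing script:

```python
import re

# 1.) casks ignored altogether because vendor betas are
#     indistinguishable from real releases (examples from the thread)
IGNORED_CASKS = {"visual-studio", "skype-for-business", "origin"}

# 2.) per-cask regexes that mark beta-ish versions
BETA_PATTERNS = {"caret": r"-rc", "mendeley": r"-dev"}

# 3.) specific one-off versions that are not real releases
IGNORED_VERSIONS = {("fs-uae", "2.9.7"), ("desmume", "0.9.12")}

def keep(cask, new_version):
    """return True if a (cask, new_version) report should survive
    all three filtering categories."""
    if cask in IGNORED_CASKS:                    # category 1
        return False
    pat = BETA_PATTERNS.get(cask)
    if pat and re.search(pat, new_version):      # category 2
        return False
    if (cask, new_version) in IGNORED_VERSIONS:  # category 3
        return False
    return True
```

keeping the three categories as plain data tables like this makes it cheap for whoever maintains the filters (on either end) to add entries without touching the logic.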
the thing is, where to apply these filters? on my end? or on the consuming end?
on my end:
+) i already have filtering being done, adding more isn't difficult for me and keeps all the filtering in one place
+) this will make the output i generate much higher quality as-is without any additional post-processing
-) i have no problem adding the filtering on my end, but if this system is in use, new filters will need to get added all the time. since i am quite busy and sometimes unavailable there might be a bottleneck there.
on the consuming end:
+) maintaining the filters here seems much more flexible as the round-trip to me goes away
+) my output probably needs to be processed in some way anyway, adding filtering shouldn't be difficult
-) having some filters on my end and additional filters somewhere else seems like a weird distinction.
hm => why not remove even the beta filters on my end and output something completely unprocessed and keep the filters in one place?
I personally lean more towards this being a raw data source that would need processing, however I am sensitive to the fact that some burden is being put upon you to provide this dataset so making it smaller or more manageable (upstream of me) may be necessary.
edit: typo
ok, things are moving slowly, but i finally made some progress. it's still not in a state where i can run it periodically but it's getting there
have a look here:
http://macupdater.net/cask-repair/
you'll find:
cr-raw-2018-10-21.txt
this is the raw output from running the comparison of my database against the HBC database
postprocessCaskRepairOutput.py.txt
this is the script to postprocess the raw output by filtering things that are known or highly likely to be false positives. i had to append .txt to make my web server happy
cr-processed-2018-10-21.txt
this is the file you'll get when running the postprocessing python script on the raw source output. i think most of the entries in there should lead to valid cask-repairs
i've worked on this some more, and i can do everything now in a semi-automated fashion. i'll try to do it daily from now on.
i've uploaded a new postprocessing script and new outputs. i think most of the entries in cr-raw-2018-11-01.txt should yield valid cask-repairs.
what do you think?
It looked good when I checked it out 10 days ago... but I haven't had much time lately to do HBC stuff - hoping that will change today!
the file from yesterday was like 80% real cask-repairs and only a few false positives.
@suschizu and i have done all those already.
i've made some adjustments to the postprocessing script too and the output from tomorrow should be pretty small since most casks seem to be up-to-date now.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.