hello all,
i am starting a new discussion because i didn't find any previous discussion on the topic of automatically updating (or finding updates) for casks.
my current impression is that we are moving towards more automation for keeping casks current, but the effort is mainly done 'outside' HBC by observing the appcasts and then sending pull requests.
i am not sure how far along this is (i am often finding astonishingly outdated casks with perfectly fine appcasts) but i have the impression that it is being worked on, and is a problem that can be solved reasonably (if there are valid sparkle or github appcasts).
what i'd like to discuss is keeping casks current where we don't have any valid or easily parsable appcasts. i think i can provide something of value here to the HBC community. as discussed previously, through our app i have access to version information of apps installed by thousands of users. most of the time, /someone/ has the latest version of an app installed.
i am able to automatically compare the version that is in HBC to the versions stored in my database, and generate a report for each app where a newer version is available.
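the comparison described above could be sketched roughly like this — a minimal, hypothetical illustration (the real script and database schema are not shown in this thread, so all names and structures here are assumptions):

```python
# sketch: compare the version listed in a cask against versions seen
# in a (hypothetical) crowd-sourced database, and report casks where
# a strictly newer version exists.

def version_key(v):
    """turn '1.2.3' into (1, 2, 3) for comparison; non-numeric parts
    fall back to 0 so odd version strings don't crash the compare."""
    parts = []
    for p in v.split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def find_outdated(cask_versions, db_versions):
    """cask_versions: {cask_name: version currently in HBC}
    db_versions: {cask_name: [versions seen in the wild]}
    returns [(cask, hbc_version, newest_seen)] for outdated casks."""
    report = []
    for cask, hbc_ver in cask_versions.items():
        seen = db_versions.get(cask, [])
        if not seen:
            continue
        newest = max(seen, key=version_key)
        if version_key(newest) > version_key(hbc_ver):
            report.append((cask, hbc_ver, newest))
    return report
```

a real implementation would of course need a smarter version comparison (build numbers, letters, etc.), but the core "does anyone out there run something newer than the cask says" check is this simple.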
i just implemented a first go at this, and already noticed quite a few outdated casks only by looking at the results:
some points:
• obviously there are a lot of false positives. keep in mind this is the very first output.
looking through the file may make it seem there is quite a lot of noise, but i think we can easily cover 70% of the apps without an appcast with practically zero noise, given some improvements and filtering. most apps don't have beta versions.
is this useful to HBC? basically i can commit to improving the script, running it regularly, and posting its output publicly for others to use accordingly. what i cannot do is build the infrastructure to filter all false positives and send pull requests automatically (or manually).
looking forward to your thoughts.
just sent a dozen PRs based on the txt file. the success rate is already quite good.
I would say for now that you should either:
a) build a tool that consumes that information on your side and submits PRs with cask-repair
b) publish the info at an api/url and see if others can build onto it. it may eventually get migrated into HBC depending on what happens.
You're right that there is a movement towards automation (I think you and I are mostly leading it at this point), but there have been issues automating these things in the past.
The current line of thinking around this is to customize any automation to each cask. Basically for a given cask (let's take splashtop-streamer.rb, randomly), there would be a separate file which specifically detailed how to check that cask for updates in an automated way. Fetch this page, look for this link, click on it, look for this part of the page, grep the version number, etc.
In some other thread which I can't find now, most of these use cases could be boiled down to 4 or 5 main flows (one for github, one for sourceforge, etc). So then all you would really need to write is the deviations from the baseline flow. For example if we had a cask that uses github but the "generic" automation for github casks doesn't work, you can just write what needs to be different.
Does any of that make sense?
That is kind of what I'm working on now.
EDIT: I should also note that there will be questions of privacy/legality and whatnot around how you came to possess the information you are using, which would be a barrier for HBC integration. In short, that information collection would have to be approved and internalized... but since it comes from a 3rd party (I believe you run a business and this information is a byproduct?) that is unlikely to happen.
hey @brianmorton , thanks for your feedback.
i wasn't prepared to do a) as it involves probably more time than i can afford to put into this at the moment. option b) sounds good and workable at the moment.
for the time being i've seen little interest here and so we'll use the info to do cask-repairs of those casks that interest us manually. if things change i'm prepared to clean this up and periodically generate and upload it.
The current line of thinking around this is to customize any automation to each cask.
that's also what i have envisioned as the ultimate solution to the given problem. i see two main problems: 1.) writing tailored code for 4000 casks and 2.) maintaining that code as upstream changes, e.g. websites change, go offline, go online again, feeds get broken, get repaired again, etc.
For example if we had a cask that uses github but the "generic" automation
for github casks doesn't work, you can just write what needs to be different.
looking at the cask-repairs i've done i see that many people seem to randomly change filenames each release which is obviously bad for our use-case.
Does any of that make sense?
yes it does and thanks for your feedback!
privacy/legality
i don't really see much connection here
• the information that "appversion 1.2.3 exists" is nothing "private". private means connected to a human. it's not connected to any human. we don't track users or user behaviour and don't intend to. we track appversions, which are not something private.
• the information that "appversion 1.2.3 exists" isn't even copyrightable at all because it's a fact about our world and not something some person or company has produced (like the app itself). just like the length of the latest BMW or the number of legs on some exotic animal, it's a fact about our world and not copyrightable. (exception: the specific aggregation of non-copyrightable things can still be copyrightable, which is a loophole that some collections of public-domain works use to prevent their databases from being copied. since we did the aggregation ourselves there is no problem here.)
That is kind of what I'm working on now.
ok that's great!
don't really see much connection here
I was alluding to the fact that you as a private entity/business are collecting information on your users and then sharing said information. But my understanding of what you do or how you get your information may be flawed.
i think it is ;) you are mixing up 'information' and 'personal information':
https://en.wikipedia.org/wiki/Personally_identifiable_information
while 'personal information' is protected in most jurisdictions (especially now in the EU with the GDPR) in how you can collect, store, process, use and share it, information that is not connected to humans is not protected (apart from the usual copyright and trade-secret stuff).
we don't collect any 'personal information'.
I understand all that, but the information is collected as part of your business practices. Ergo it would be difficult for homebrew to indefinitely use it - they would likely want/need to internalize the collection. If your business changed priorities or went away, that would be problematic if something here was built on top of having access to that information.
i see. however, if we just publish a file containing a list of (possibly) outdated casks with their actual newer versions each day - i.e. plan b) , i doubt that could become problematic.
If you did that I would be happy to build something that worked with it :)
sounds great ;) i'll try to get something to work this week.
It’s been difficult for me to give attention to this thread, but I’ve read it all now and you’re going in the right direction. Agree completely with the progression so far.
ok i've been working on this a bit but i am undecided about some architectural questions. basically i generate a textual output about any apps where i see newer versions in our database than the cask lists.
the thing is, we need filtering to keep noise down.
i already filter app-versions from the database that look suspiciously like beta releases. there are about 10 filters here (e.g. a version '1.2.3b3' will be filtered because it looks beta-ish), so no report is generated for any of those. this filtering works fine for what i want in my app, but isn't "enough" here. we've processed the 'caskversioncheck.txt' i posted initially, and i've had a look at the results: in which cases were there really new versions that resulted in a cask-repair being done, and in which cases were there false positives.
i am about 1/3 through and came up with 3 more filtering categories we need:
1.) a list of cask-names to be ignored for this facility altogether because noise is too high because the vendor publishes beta versions that are indistinguishable from real versions (e.g. visual-studio, skype-for-business, origin, etc)
2.) a list of filter strings that should be applied to filter out some beta versions. e.g. caret|.-rc. mendeley|.-dev. => every version of the cask 'caret' that is floating around and has "rc" in its name should be ignored because it's obviously a 'release candidate' and not a real new version that should be entered into HBC
3.) a list of specific versions to be ignored. many of these are one-off versions that are not real releases. e.g. fs-uae|2.9.7|2.9.7 desmume|0.9.12|0.9.12 ebmac|1.43.3|1.43.3
if we apply these 3 additional filtering facilities to my output, the noise is pretty low and most of the output is useful and should result in real new versions to be entered into HBC.
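the three filter categories above could be sketched roughly like this — the filter data is illustrative (taken from the examples in this thread), not the actual postprocessing script:

```python
import re

# 1.) casks ignored altogether because vendor betas are
#     indistinguishable from real releases (examples from the thread)
IGNORED_CASKS = {"visual-studio", "skype-for-business", "origin"}

# 2.) per-cask regexes that mark beta-ish versions
BETA_PATTERNS = {"caret": r"-rc", "mendeley": r"-dev"}

# 3.) specific one-off versions that are not real releases
IGNORED_VERSIONS = {("fs-uae", "2.9.7"), ("desmume", "0.9.12")}

def keep(cask, new_version):
    """return True if a (cask, new_version) report should survive
    all three filtering categories."""
    if cask in IGNORED_CASKS:                    # category 1
        return False
    pat = BETA_PATTERNS.get(cask)
    if pat and re.search(pat, new_version):      # category 2
        return False
    if (cask, new_version) in IGNORED_VERSIONS:  # category 3
        return False
    return True
```

keeping the three categories as plain data tables like this makes it cheap for whoever maintains the filters (on either end) to add entries without touching the logic.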
the thing is, where to apply these filters? on my end? or on the consuming end?
on my end:
+) i already have filtering being done, adding more isn't difficult for me and keeps all the filtering in one place
+) this will make the output i generate much higher quality as-is without any additional post-processing
-) i have no problem adding the filtering on my end, but if this system is in use, new filters will need to get added all the time. since i am quite busy and sometimes unavailable there might be a bottleneck there.
on the consuming end:
+) maintaining the filters here seems much more flexible as the round-trip to me goes away
+) my output probably needs to be processed in some way anyway, adding filtering shouldn't be difficult
-) having some filters on my end and additional filters somewhere else seems like a weird distinction.
hm => why not remove even the beta filters on my end and output something completely unprocessed and keep the filters in one place?
I personally lean more towards this being a raw data source that would need processing, however I am sensitive to the fact that some burden is being put upon you to provide this dataset so making it smaller or more manageable (upstream of me) may be necessary.
edit: typo
ok, things are moving slowly, but i finally made some progress. it's still not in a state where i can run it periodically but it's getting there
have a look here:
http://macupdater.net/cask-repair/
you'll find:
cr-raw-2018-10-21.txt
this is the raw output from running the comparison of my database against the HBC database
postprocessCaskRepairOutput.py.txt
this is the script to postprocess the raw output by filtering things that are known or highly likely to be false positives. i had to append .txt to make my web server happy
cr-processed-2018-10-21.txt
this is the file you'll get when running the postprocessing python script on the raw source output. i think most of the entries in there should lead to valid cask-repairs
i've worked on this some more, and i can do everything now in a semi-automated fashion. i'll try to do it daily from now on.
i've uploaded a new postprocessing script and new outputs. i think most of the entries in cr-raw-2018-11-01.txt should yield valid cask-repairs.
what do you think?
It looked good when I checked it out 10 days ago... but I haven't had much time lately to do HBC stuff - hoping that will change today!
the file from yesterday was like 80% real cask-repairs and only a few false positives.
@suschizu and i have done all those already.
i've made some adjustments to the postprocessing script too and the output from tomorrow should be pretty small since most casks seem to be up-to-date now.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.