Pkp-lib: consider case-insensitive bot match

Created on 30 Jan 2019 · 18 Comments · Source: pkp/pkp-lib

It seems that the regexp match in Core::isUserAgentBot() is currently case-sensitive: see https://github.com/pkp/pkp-lib/blob/stable-3_1_2/classes/core/Core.inc.php#L112.
The documentation for the COUNTER bot list (see https://github.com/atmire/COUNTER-Robots, last sentence) says: "When matching against the patterns in this list, we recommend to use case-insensitive matching."
Thus matching against this bot list should be case-insensitive.
However, Core::isUserAgentBot() is also used for other bot lists, e.g. botAgents.txt, which may not need to be matched case-insensitively.
So it may be best to add an optional parameter $caseInsensitive to isUserAgentBot(), so that each bot list can be matched as needed. A patch for that, using $caseInsensitive = true for the COUNTER bot list, is coming.
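
To make the proposal concrete, here is a minimal sketch of a matcher with an optional case-insensitivity switch. It is written in Python purely for illustration and is not the pkp-lib code: the function name `is_user_agent_bot`, the `case_insensitive` parameter, and the one-regex-per-entry list format are assumptions that only mirror the idea described above.

```python
import re

def is_user_agent_bot(user_agent, patterns, case_insensitive=True):
    """Return True if any regex in `patterns` matches `user_agent`.

    Hypothetical helper: `patterns` is assumed to be an iterable of
    regular expressions, one per bot list entry.
    """
    flags = re.IGNORECASE if case_insensitive else 0
    for pattern in patterns:
        pattern = pattern.strip()
        if not pattern:
            continue  # ignore blank entries
        if re.search(pattern, user_agent, flags):
            return True
    return False

# User agent string taken from the logs quoted later in this thread.
yeti_ua = "Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd)"

print(is_user_agent_bot(yeti_ua, ["yeti"]))                          # True
print(is_user_agent_bot(yeti_ua, ["yeti"], case_insensitive=False))  # False
```

With the switch off, a lowercase list entry misses the mixed-case user agent, which is the situation the COUNTER recommendation is meant to avoid.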

See also https://github.com/pkp/pkp-lib/issues/4390

Enhancement

All 18 comments

My initial thought is: does it matter if https://github.com/pkp/pkp-lib/blob/master/registry/botAgents.txt is checked case-insensitively?

A connected issue is: why do we need two lists? Why not just use the atmire list?

(either you have moved further east or you are awake early?)

PRs:
pkp-lib stable-3_1_2: https://github.com/pkp/pkp-lib/pull/4404
ojs stable-3_1_2: https://github.com/pkp/ojs/pull/2256 (only submodules updates)

pkp-lib master: https://github.com/pkp/pkp-lib/pull/4426

@asmecher, could you please take a look?
@ajnyga, would it be possible for you to test the patches?

> My initial thought is: does it matter if https://github.com/pkp/pkp-lib/blob/master/registry/botAgents.txt is checked case-insensitively?
>
> A connected issue is: why do we need two lists? Why not just use the atmire list?

I am not 100% sure about either :-(
Regarding the second question, see also https://github.com/pkp/pkp-lib/issues/3209.
@asmecher, what do you think?

> (either you have moved further east or you are awake early?)

Yes, somehow I couldn't sleep :-(
And it seems you never sleep? :-)

> And it seems you never sleep? :-)

Working with OJS is so great, impossible to sleep

I haven't dug deeply into this aspect of OJS -- IIRC I introduced the botAgents.txt list way back before Counter stats were on our roadmap, and we were counting views in a column for various entities. I'm all in favour of using a unified list, especially if it's externally maintained and aligned with a recognised standard (e.g. Counter). I can't think of any reason to worry about case sensitivity -- I suggest making the match case insensitive.

Hmmm... OK. Then I will change it so that only the COUNTER list is used and the match is case-insensitive by default. Any objections?

Maybe @ctgraham has an opinion on the issue?

I think that a single list would make sense.

Yes. Above all, what does it mean to be COUNTER-compatible? E.g. using the exact bot list...

The intent of COUNTER is to exclude all bot traffic when reporting statistics. The intent of the COUNTER Robots repository is not to limit other User Agent strings (or other detection mechanisms) from being used, but rather to provide an official "known bot" list. If we have an additional list or method (such as bad robots by ip) which we want to use within the application, this should apply equally to the COUNTER stats as well as any other internal usage.

The Core::isUserAgentBot() method should draw on the COUNTER Robots list, but also should be free to use any additional functionality as needed.
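
As a rough illustration of that idea, the sketch below merges the COUNTER patterns with an additional local list into one case-insensitive check that every caller shares. It is Python rather than the project's PHP, `compile_bot_matcher` is a hypothetical name, and the three "COUNTER" entries are short stand-ins, not the actual contents of the list.

```python
import re

def compile_bot_matcher(*pattern_lists):
    """Combine several pattern lists into one case-insensitive matcher.

    Hypothetical helper: each list is assumed to hold plain regular
    expressions without inline flags or backreferences, so they can be
    joined safely into a single alternation.
    """
    patterns = []
    for pattern_list in pattern_lists:
        for pattern in pattern_list:
            pattern = pattern.strip()
            if pattern and pattern not in patterns:
                patterns.append(pattern)  # drop blanks and exact duplicates
    if not patterns:
        return lambda user_agent: False
    combined = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
    return lambda user_agent: combined.search(user_agent) is not None

counter_patterns = ["bot", "spider", "crawl"]  # placeholder entries only
local_patterns = ["ScoutJet", "Yeti"]          # extra entries kept locally

is_bot = compile_bot_matcher(counter_patterns, local_patterns)

print(is_bot("Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd)"))  # True
print(is_bot("bidswitchbot/1.0"))                                          # True
```

Because the COUNTER statistics and any other internal usage would call the same matcher, an entry added to either list applies everywhere.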

Do we have a use case where the check should be case-sensitive? If not, maybe better to just make it case insensitive everywhere.

Yes, I will then just make it case insensitive everywhere. Now I am just not sure if we should use just the COUNTER list or both (COUNTER and botAgents.txt) :-\
@asmecher, where do we have this other list botAgents.txt and how do we maintain it?

There are only a few entries in our "botAgents.txt" that are not already covered by the COUNTER list:

Accoona-AI-Agent
B-l-i-t-z-B-O-T
Cerberian Drtrs
Charlotte
cosmos
Covario IDS
igdeSpyder
mabontland
mogimogi
MVAClient
NetResearchServer
NewsGator
NG-Search
Nymesis
oegp
Orbiter
Peew
Pompos
PostPost
Qseero
Radian6
SBIder
ScoutJet
Scrubby
SearchSight
semanticdiscovery
ShopWiki
silk
Snappy
TinEye
truwoGPS
Vagabondo
Vortex
voyager
VYU2
Websquash.com
wf84
WomlpeFactory
Yeti
yoogliFetchAgent
Zao

Sampling some recent data, I see the following counts/matches in our logs currently:

008
      1 Dalvik/2.1.0 (Linux; U; Android 6.0; ASUS_X008DB Build/MRA58K)
     48 Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 [FBAN/FBIOS;FBAV/157.0.0.42.96;FBBV/90008621;FBDV/iPhone7,2;FBMD/iPhone;FBSN/iOS;FBSV/11.4.1;FBSS/2;FBCR/Ufone;FBID/phone;FBLC/en_GB;FBOP/5;FBRV/0]
     51 Mozilla/5.0 (iPhone; CPU iPhone OS 12_0_1 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) GSA/45.0.188348008 Mobile/16A404 Safari/604.1
    114 Mozilla/5.0 (Linux; Android 4.4.2; SG008 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 5.0; ASUS_Z008 Build/LRX21V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.99 Mobile Safari/537.36 GSA/8.91.5.21.x86
      1 Mozilla/5.0 (Linux; Android 5.0; ASUS_Z008D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.80 Mobile Safari/537.36
      2 Mozilla/5.0 (Linux; Android 5.0; ASUS_Z008D Build/LRX21V) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/39.0.0.0 Mobile Safari/537.36
      1 Mozilla/5.0 (Linux; Android 6.0; 7008 Build/MRA58K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/45.0.2454.95 Mobile Safari/537.36 GSA/6.8.23.21.arm
      1 Mozilla/5.0 (Linux; Android 6.0; ASUS_X008DB Build/MRA58K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.89 Mobile Safari/537.36
     49 Mozilla/5.0 (Linux; Android 7.0; 9008A Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.99 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/202.0.0.40.99;]
     96 Mozilla/5.0 (Linux; Android 7.0; ASUS_X008DA) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.99 Mobile Safari/537.36
      1 Mozilla/5.0 (Linux; Android 7.0; ASUS_X008DA Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36
     47 Mozilla/5.0 (Linux; Android 7.0; ASUS_X008D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.99 Mobile Safari/537.36
      1 Mozilla/5.0 (Linux; Android 7.0; ASUS_X008D Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.126 Mobile Safari/537.36
      1 Mozilla/5.0 (Linux; Android 7.0; ASUS_X008D Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Mobile Safari/537.36
      1 Mozilla/5.0 (Linux; U; Android 4.2.2; en-us; Lenovo A316i/S008) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
     49 Mozilla/5.0 (Linux; U; Android 5.1; en-US; SM-J5008 Build/LMY47O) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.4.8.1012 Mobile Safari/537.36
      1 Mozilla/5.0 (Linux; U; Android 7.0; ASUS_X008DA Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.99 Mobile Safari/537.36 OPR/37.6.2254.134291
      3 Mozilla/5.0 (Linux; U; Android 7.0; en-US; ASUS_X008DA Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 UCBrowser/12.9.9.1155 Mobile Safari/537.36
      2 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_26_73) AppleWebKit/531.77.24 (KHTML, like Gecko) Chrome/55.1.6535.1263 Safari/532.06 Edge/36.00854
      2 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_35_72) AppleWebKit/531.76.23 (KHTML, like Gecko) Chrome/55.1.6534.1262 Safari/532.06 Edge/36.00836
      2 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.13; ko; rv:1.9.1b2) Gecko/20081201 Firefox/60.0
      1 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6;en-US; rv:1.9.2.9) Gecko/20100824 Firefox/3.6.9
    222 Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0
      1 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.2.1.13) Gecko/20080311 Firefox/2.0.0.13
     18 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13
    181 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
    112 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0
      1 Mozilla/5.0 (Windows; U; Windows NT 5.1; ja-JP; rv:1.9.2.8) Gecko/20100817 Firefox/3.6.8 (Palemoon/3.6.8a)
     35 Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.0.1) Gecko/2008070208
      2 Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9b4) Gecko/2008030317 Firefox/3.0b4
      2 Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080129 Firefox/52.0
    147 Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6
      1 Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092816 Iceweasel/3.0.1 (Debian-3.0.1-1)
      2 Mozilla/5.0 (X11; U; Linux i686; es-ES; rv:1.8.1.14) Gecko/20080419 Ubuntu/8.04 (hardy) Firefox/52.7.3
      1 Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.8.1.6) Gecko/20071008 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11
      1 Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.25 (jaunty) Firefox/3.8
    288 Mozilla/5.0 (X11; U; Linux i686; pt-BR; rv:1.9.0.3) Gecko/2008101315 Ubuntu/8.10 (intrepid)  Firefox/3.0.3
      2 Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.20) Gecko/20081217 Firefox/52.4.1
      2 Mozilla/5.0 (X11; U; Linux x86_64; es-AR; rv:1.9.0.3) Gecko/2008092515 Ubuntu/8.10 (intrepid) Firefox/50.0.1
IDS
     72 bidswitchbot/1.0
      1 OutclicksBot/4 +https://www.outclicks.net/agent/j6CcoBJztJlsqWqCBkIrWTeZVsREcZPFCSwcOidsH05
      3 WordPress/4.9.9; http://capitaldescribefunctionlessamberoids.dresaj.tk
NewsGator
      2 NewsGator FetchLinks extension/0.2.0 (http://graemef.com)
      2 NewsGatorOnline/2.0 (http://www.newsgator.com; 1 subscribers)
ScoutJet
     59 Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)
silk
     54 Mozilla/5.0 (Linux; Android 4.0.3; KFTT) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.1.106 like Chrome/71.0.3578.98 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 4.0.3; KFTT) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.3.1 like Chrome/71.0.3578.98 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 4.0.4; KFJWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/70.5.1 like Chrome/70.0.3538.110 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 4.4.3; KFSOWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.2.4 like Chrome/71.0.3578.98 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 4.4.3; KFTHWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.2.4 like Chrome/71.0.3578.98 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 4.4.3; KFTHWI Build/KTU84M) AppleWebKit/537.36 (KHTML, like Gecko) Silk/47.1.79 like Chrome/47.0.2526.80 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 5.1.1; KFAUWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/70.5.1 like Chrome/70.0.3538.110 Safari/537.36
      2 Mozilla/5.0 (Linux; Android 5.1.1; KFAUWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.2.4 like Chrome/71.0.3578.98 Safari/537.36
     53 Mozilla/5.0 (Linux; Android 5.1.1; KFDOWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/70.5.1 like Chrome/70.0.3538.110 Safari/537.36
     43 Mozilla/5.0 (Linux; Android 5.1.1; KFDOWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.2.4 like Chrome/71.0.3578.98 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 5.1.1; KFFOWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/70.4.2 like Chrome/70.0.3538.80 Safari/537.36
     91 Mozilla/5.0 (Linux; Android 5.1.1; KFFOWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/70.5.1 like Chrome/70.0.3538.110 Safari/537.36
     17 Mozilla/5.0 (Linux; Android 5.1.1; KFFOWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.2.4 like Chrome/71.0.3578.98 Safari/537.36
     53 Mozilla/5.0 (Linux; Android 5.1.1; KFGIWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/70.5.1 like Chrome/70.0.3538.110 Safari/537.36
     46 Mozilla/5.0 (Linux; Android 5.1.1; KFSUWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/70.5.1 like Chrome/70.0.3538.110 Safari/537.36
    160 Mozilla/5.0 (Linux; Android 5.1.1; KFSUWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/71.2.4 like Chrome/71.0.3578.98 Safari/537.36
      1 Mozilla/5.0 (Linux; Android 5.1.1; KFTBWI Build/LVY48F) AppleWebKit/537.36 (KHTML, like Gecko) Silk/62.2.2 like Chrome/62.0.3202.73 Safari/537.36
Vagabondo
      5 Mozilla/4.0 (compatible;  Vagabondo/4.0; webcrawler at wise-guys dot nl; http://webagent.wise-guys.nl/; http://www.wise-guys.nl/)
voyager
      2 Mozilla/3.01 (compatible; AmigaVoyager/2.95; AmigaOS/MC680x0)
Yeti
    412 Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd)

The strings "008", "IDS", and "Voyager" seem to be false positives or redundant.
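
A small sketch of how such a tally can expose over-broad entries, assuming each entry is applied as a case-insensitive regex search. It is Python for illustration only, `tally_matches` is a hypothetical helper, and the user agent strings are copied from the log sample above.

```python
import re
from collections import Counter

def tally_matches(patterns, user_agents):
    """Count, per pattern, how many user agents it matches (case-insensitive)."""
    counts = Counter()
    for pattern in patterns:
        regex = re.compile(pattern, re.IGNORECASE)
        counts[pattern] = sum(1 for ua in user_agents if regex.search(ua))
    return counts

user_agents = [
    "Mozilla/5.0 (Linux; Android 7.0; ASUS_X008DA) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.99 Mobile Safari/537.36",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0",
    "bidswitchbot/1.0",
    "Mozilla/3.01 (compatible; AmigaVoyager/2.95; AmigaOS/MC680x0)",
    "Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd)",
]

counts = tally_matches(["008", "IDS", "voyager", "Yeti"], user_agents)
for pattern, count in counts.items():
    print(pattern, count)
```

Here "008" matches an ASUS device model and a Gecko build date, "voyager" matches the AmigaVoyager browser, and "IDS" only matches inside "bidswitchbot", so none of the three identifies the bot it was presumably meant for.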

The others could be submitted as pull requests to COUNTER, and the local botAgents.txt could be removed.

Thanks a lot @ctgraham! :+1:
Would you maybe like to submit the PR for COUNTER?
And should we then move the COUNTER submodule to pkp-lib/registry/ or lib/, or can it stay in plugins/usageStats/lib/? (@asmecher?)

Hmm, no strong preference about where to keep the COUNTER submodule, but maybe lib/pkp/lib?

Further review suggests that "silk" is also a false positive. I think only ScoutJet and Yeti appear to be bot activity which is not otherwise covered by the COUNTER list.

Upstream PR: https://github.com/atmire/COUNTER-Robots/pull/21

I also have a PR there for Feedbin; hopefully they will start merging these.

> I also have a PR there for Feedbin; hopefully they will start merging these.

on it
