
Bug 411019

Summary: Bots/crawlers/spiders/scripts increasing message view counts to unrealistic numbers
Product: Community
Reporter: Denis Roy <denis.roy>
Component: Forums and Newsgroups
Assignee: Forums and Newsgroups inbox <forums-inbox>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: alexandra.schladebeck, Ed.Merks, pwebster, webmaster
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Whiteboard:
Attachments: User agents from a single IP address in Belarus (flags: none)

Description Denis Roy CLA 2013-06-18 09:52:42 EDT
Ed (cc'd) brought this up. The view counts on the forums seem to be totally bogus. Currently, my Git Systems Integrator thread in the Jobs forum is listed at over 9000 views:
http://www.eclipse.org/forums/index.php/f/94/

Checking the Apache logs, I can't see more than 200 hits to that page, which, frankly, is much more realistic.

If I click the thread, press back, then refresh, the count is correctly incremented by one.  So somewhere down the road the count is being incorrectly set.
Comment 1 Denis Roy CLA 2013-06-18 10:01:52 EDT
As I look at the MySQL binary logs, it seems that new threads get their views count incremented rapidly after the initial post.

I see tons of queries like this in the logs, often the same query repeating itself.  It's incrementing a bunch of thread views all at once, repeatedly:
UPDATE fud_thread SET views=views+1 WHERE id IN(489200,489188,489178,489165,489152,489115,489112,489059,488763,488762,488751,488750,489207,489206,489203,489120,489205,488712,489083,489204,489198,489197,489190,489136,489103,489070,489065,489202,489196,489118,489043,489024,489167,489148,489053,488756,489199,489180,489147,489026)
Comment 2 Denis Roy CLA 2013-06-18 11:14:15 EDT
I have two threads in the "Test" forum...  The first one had its view count climb to just over 500 very rapidly (5 minutes) then it stopped.

Hitting the "Today's Messages", "Unread Messages" and "Unanswered Messages" links is what causes a batch of thread view counts to be incremented at once, as in comment 1.

Later on I opened a second thread.  Its view count after 20 minutes is still at about 15, which seems normal.

Looking at the Apache logs, it appears there are a _lot_ of garbage requests.  I plowed through and could find one bot (ezooms) requesting the Unanswered Messages list (and there is one per thread, then you can combine with Unread, Today's, etc.) 68 times in the 60-second period of 10:04 this morning.  My logs are full of this garbage.

In the end, there is no problem with the code. It's the increase in volume of useless bots/spiders/scripts/researchers/etc that generates _tons_ of artificial traffic.  Messages that go unanswered for days suffer even more.
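
The kind of log check described above can be sketched in Python. This is only an illustration, not the commands actually used: it assumes the servers write the Apache "combined" log format, and it borrows the `/forums/index.php/sel` prefix from the robots.txt rule in comment 4 as the path of the message-list pages. The function name `hits_per_minute` is made up for this sketch.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption; adjust if the servers log differently).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def hits_per_minute(lines, path_prefix="/forums/index.php/sel"):
    """Count requests per (user agent, minute) for paths under path_prefix."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None or not m.group("path").startswith(path_prefix):
            continue
        # Timestamps look like "18/Jun/2013:10:04:12 -0400";
        # the first 17 characters identify the minute.
        minute = m.group("ts")[:17]
        counts[(m.group("agent"), minute)] += 1
    return counts
```

Calling `hits_per_minute(open(path))` on an access log and then `.most_common(20)` on the result would surface agents like the one above that request the list pages dozens of times within a single minute.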
Comment 3 Denis Roy CLA 2013-06-18 11:39:39 EDT
Created attachment 232506 [details]
User agents from a single IP address in Belarus

Just to illustrate what I'm up against, attached is a list of user agents (and hit counts) from one single IP address in Belarus that appears to be hitting various combinations of the Unread/Today's/Unanswered messages so far today.  We're not even 12 hours into the day yet.

This looks like some kind of script that was simply cycling through a dictionary of user agents/platforms to disguise itself while ... accessing our forum pages repeatedly?  Why?
Comment 4 Denis Roy CLA 2013-06-18 13:25:57 EDT
Karl pointed out that our robots.txt was missing on 2 of the 3 servers...  So that is fixed.  I also added a rule:

User-agent: *
Disallow: /forums/index.php/sel

I'm not convinced that will resolve the issue entirely.  Perhaps a better solution would be to alter the forum template and simply not display the view counts at all.
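
A quick way to sanity-check what the rule above covers is Python's standard `urllib.robotparser`. This is a minimal sketch: the helper name `is_allowed`, the `ExampleBot` agent and the `/sel/example` path are placeholders invented for illustration, while the thread and forum URLs are the ones quoted elsewhere in this bug.

```python
from urllib.robotparser import RobotFileParser

# The rule exactly as added in this comment.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forums/index.php/sel
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if a compliant crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://www.eclipse.org" + path)


if __name__ == "__main__":
    for path in ("/forums/index.php/sel/example",   # hypothetical list page
                 "/forums/index.php/t/653226/",     # a thread page
                 "/forums/index.php/f/94/"):        # a forum page
        print(path, is_allowed(path))
```

Keep in mind that robots.txt is advisory: it only affects crawlers that choose to honor it, which is consistent with the doubt expressed above.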
Comment 5 Denis Roy CLA 2013-06-18 13:51:08 EDT
Ok, I've greatly expanded our robots.txt to remove lots of cruft, double-indexing and what not.  Also, I've added some hard rules for the Ezooms bot; it will receive an HTTP 418 "I'm a teapot!" response.

Let's let this one simmer and see what happens as time goes by.
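
The bug does not say where the hard rule lives (web server configuration or application code). As a rough illustration of the idea only, here is a user-agent check written as a small Python WSGI wrapper. The substring `ezooms` and the names `is_blocked` and `teapot_middleware` are assumptions made for this sketch, not the actual rule.

```python
# Hypothetical substrings to reject; the real rule's match pattern is not given here.
BLOCKED_AGENT_SUBSTRINGS = ("ezooms",)


def is_blocked(user_agent):
    """Case-insensitive substring match against the block list."""
    ua = (user_agent or "").lower()
    return any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS)


def teapot_middleware(app):
    """Wrap a WSGI app so that blocked agents get HTTP 418 instead of a page."""
    def wrapped(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("418 I'm a teapot", [("Content-Type", "text/plain")])
            return [b"I'm a teapot\n"]
        return app(environ, start_response)
    return wrapped
```

A check like this depends entirely on the string the client chooses to send, so it stops working as soon as the client sends a different user agent.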
Comment 6 Alexandra Schladebeck CLA 2014-01-22 02:02:15 EST
This looks like it's happening again - I remember it being better after the last comment, and now the Jubula forum has rather high numbers again.
Comment 7 Denis Roy CLA 2014-02-18 15:24:57 EST
I just created thread # 653226
http://www.eclipse.org/forums/index.php/t/653226/

Within 2 minutes it had a view count of 400+

If I examine the MySQL binary logs, I see the thread view count was indeed updated 400+ times:
mysqlbinlog dbmaster.007287 | egrep "UPDATE fud_thread.*653226" | wc -l
427


There's a single query that updates only my thread... likely from my viewing it immediately after creation.  All 426 others are:

UPDATE fud_thread SET views=views+1 WHERE id IN(653226,653225,651210,648987,648724,653209,651646,649638,649456,648931,648786,652720,652689,652634,652585,652552,652007,650559,648851,648552,651966,649376,653116,648956,653160,652978,652797,648772,648299,652855,648725,647779,653004,648723,648193,652905,649413,647817,653081,651941)

I'll track down that query in the code and add some debugging comments.
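
The split between the single-thread update and the batched ones can also be computed rather than read off by eye. The following Python sketch assumes the decoded `mysqlbinlog` output has one statement per line and that every view update uses the `WHERE id IN(...)` form shown above; an update written differently (for example `WHERE id=...`) would not be counted. The function name `classify_updates` is made up for this example.

```python
import re

# Matches the view-count UPDATE, with an optional SQL comment after UPDATE.
VIEW_UPDATE = re.compile(
    r"UPDATE\s+(?:/\*.*?\*/\s*)?fud_thread\s+SET\s+views=views\+1\s+"
    r"WHERE\s+id\s+IN\s*\(([\d,\s]+)\)"
)


def classify_updates(lines, thread_id):
    """Return (single, batched) counts of view updates that touch thread_id."""
    single = batched = 0
    for line in lines:
        m = VIEW_UPDATE.search(line)
        if m is None:
            continue
        ids = {int(x) for x in m.group(1).split(",") if x.strip()}
        if thread_id not in ids:
            continue
        if len(ids) == 1:
            single += 1   # only this thread: a real thread view
        else:
            batched += 1  # this thread bumped together with others
    return single, batched
```

Fed the decoded log on standard input, `classify_updates(sys.stdin, 653226)` would return the two counts. Unlike a plain `egrep` on the number, it compares whole IDs, so a thread ID that merely contains `653226` as a substring is not counted.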
Comment 8 Denis Roy CLA 2014-02-18 15:44:59 EST
I tracked the query to selmsg.php and altered it to add IP address information.

q('UPDATE /* selmsg:881 IP:[' . $_SERVER['REMOTE_ADDR'] . '] */ fud_thread SET views=views+1 WHERE id IN('. implode(',', $thl) .')');


I get the feeling the NNTP sync is what is responsible for these invalid counts.  If that's the case, I'll wrap the above query in an if() to check for a valid IP address before UPDATE'ing the count.
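
The proposed guard can be sketched as follows. This is a Python illustration of the idea, not the PHP change to selmsg.php: it interprets "valid IP address" as "a string that parses as an IPv4 or IPv6 address", which is an assumption, and the helper names `valid_ip` and `build_view_update` are invented for the example.

```python
import ipaddress


def valid_ip(remote_addr):
    """Return a normalized IP string, or None if remote_addr is missing or malformed."""
    if not remote_addr:
        return None
    try:
        return str(ipaddress.ip_address(remote_addr.strip()))
    except ValueError:
        return None


def build_view_update(thread_ids, remote_addr):
    """Return the tagged UPDATE statement, or None when views should not be counted."""
    ip = valid_ip(remote_addr)
    ids = [int(t) for t in thread_ids]  # raises ValueError on non-numeric IDs
    if ip is None or not ids:
        return None
    id_list = ",".join(str(i) for i in ids)
    return ("UPDATE /* selmsg:881 IP:[%s] */ fud_thread "
            "SET views=views+1 WHERE id IN(%s)" % (ip, id_list))
```

Parsing the address before embedding it also keeps arbitrary text out of the SQL comment, since only a well-formed address can reach the query string.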
Comment 9 Denis Roy CLA 2014-02-20 09:26:22 EST
> I plowed through and could find one bot (ezooms) 

So the Ezooms bot was back at it... they had changed their user-agent, so the fix in comment 5 wasn't working anymore.

Regardless, even Googlebot and Bing were generating a fair amount of requests.  To reiterate, the bulk of the problem is the Today's Messages :: Unread Messages :: Unanswered Messages links, which increase the thread view count for numerous threads.

To resolve this issue once and for all, I've simply removed the DB query that increments multiple thread views at once.  Our overall view counts may be a bit lower, but will more accurately reflect threads that humans (and search engines) actually click on.

I'll close this as FIXED.  New threads should begin to have a view count which is much more realistic.
Comment 10 Alexandra Schladebeck CLA 2014-02-20 09:53:37 EST
Thanks Denis :)