| Summary: | Bots/crawlers/spiders/scripts increasing message view counts to unrealistic numbers | | |
|---|---|---|---|
| Product: | Community | Reporter: | Denis Roy <denis.roy> |
| Component: | Forums and Newsgroups | Assignee: | Forums and Newsgroups inbox <forums-inbox> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | alexandra.schladebeck, Ed.Merks, pwebster, webmaster |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| Whiteboard: | | | |
Description
Denis Roy
As I look at the MySQL binary logs, it seems that new threads get their view count incremented rapidly after the initial post. I see tons of queries like this in the logs, often the same query repeating itself. It's incrementing a bunch of thread views all at once, repeatedly:

UPDATE fud_thread SET views=views+1 WHERE id IN(489200,489188,489178,489165,489152,489115,489112,489059,488763,488762,488751,488750,489207,489206,489203,489120,489205,488712,489083,489204,489198,489197,489190,489136,489103,489070,489065,489202,489196,489118,489043,489024,489167,489148,489053,488756,489199,489180,489147,489026)

I have two threads in the "Test" forum. The first one had its view count climb to just over 500 very rapidly (5 minutes), then it stopped.

Hitting the Today's Messages, Unread Messages and Unanswered Messages links is what's causing a bunch of thread view counts to be incremented, as in comment 1.

Later on I opened a second thread. Its view count after 20 minutes is still at about 15, which seems normal.

Looking at the Apache logs, it appears there are a _lot_ of garbage requests. I plowed through and could find one bot (ezooms) requesting the Unanswered Messages list (and there is one per thread, then you can combine with Unread, Today's, etc.) 68 times in the 60-second period of 10:04 this morning. My logs are full of this garbage.

In the end, there is no problem with the code. It's the increase in volume of useless bots/spiders/scripts/researchers/etc. that generates _tons_ of artificial traffic. Messages that go unanswered for days suffer even more.

Created attachment 232506: User agents from a single IP address in Belarus
Just to illustrate what I'm up against, attached is a list of user agents (and hit counts) from one single IP address in Belarus that appears to be hitting various combinations of the Unread/Today's/Unanswered messages so far today. We're not even 12 hours into the day yet.
This looks like some kind of script that was simply cycling through a dictionary of user agents/platforms to disguise itself from ... accessing our forum pages repeatedly? Why?
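A tally like the attached one (user agents and hit counts for one IP address) could be produced along these lines. This is only a minimal sketch, not the method actually used for the attachment. It assumes the Apache logs are in the standard "combined" format and that the message-list pages live under /forums/index.php/sel (inferred from the robots.txt rule in this report). The function name, the sample log lines, the agent names and the IP addresses (documentation ranges) are all made up for illustration.

```python
import re
from collections import Counter

# Apache "combined" format (assumed):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_user_agents(lines, ip, path_prefix="/forums/index.php/sel"):
    """Count hits per user agent for one IP on the message-list pages."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match.group("ip") != ip:
            continue  # unparsable line, or a different client
        parts = match.group("request").split()
        # parts is [method, path, protocol] for a well-formed request
        if len(parts) < 2 or not parts[1].startswith(path_prefix):
            continue
        counts[match.group("agent")] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE_LINES = [
    '192.0.2.1 - - [01/Jan/2000:10:04:01 +0000] "GET /forums/index.php/sel/a HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
    '192.0.2.1 - - [01/Jan/2000:10:04:02 +0000] "GET /forums/index.php/sel/b HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
    '192.0.2.1 - - [01/Jan/2000:10:04:03 +0000] "GET /forums/index.php/sel/c HTTP/1.1" 200 512 "-" "OtherAgent/2.0"',
    '192.0.2.1 - - [01/Jan/2000:10:04:04 +0000] "GET /forums/index.php/t/1/ HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
    '198.51.100.9 - - [01/Jan/2000:10:04:05 +0000] "GET /forums/index.php/sel/a HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
]

agent_counts = count_user_agents(SAMPLE_LINES, "192.0.2.1")
```

Calling `agent_counts.most_common()` lists the agents from the busiest down. A client that rotates through a dictionary of user agents shows up as many distinct agents with similar hit counts for the same IP.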
Karl pointed out that our robots.txt was missing on 2 of the 3 servers, so that is fixed. I also added a rule:

User-agent: *
Disallow: /forums/index.php/sel

I'm not convinced that will resolve the issue entirely. Perhaps a better solution would be to alter the forum template and simply not display the view counts at all.

OK, I've greatly expanded our robots.txt to remove lots of cruft, double-indexing and whatnot. Also, I've added some hard rules for the Ezooms bot; it will receive an HTTP 418 "I'm a teapot!" response. Let's let this one simmer and see what happens as time goes by.

This looks like it's happening again. I remember it being better after the last comment, and now the Jubula forum has rather high numbers again.

I just created thread #653226: http://www.eclipse.org/forums/index.php/t/653226/

Within 2 minutes it had a view count of 400+. If I examine the MySQL binary logs, I see the thread view count was indeed updated 400+ times:

mysqlbinlog dbmaster.007287 | egrep "UPDATE fud_thread.*653226" | wc -l
427

There's a single query that updates only my thread, likely from my viewing it immediately after creation. All 426 others are:

UPDATE fud_thread SET views=views+1 WHERE id IN(653226,653225,651210,648987,648724,653209,651646,649638,649456,648931,648786,652720,652689,652634,652585,652552,652007,650559,648851,648552,651966,649376,653116,648956,653160,652978,652797,648772,648299,652855,648725,647779,653004,648723,648193,652905,649413,647817,653081,651941)

I'll track down that query in the code and add some debugging comments.

I tracked the query to selmsg.php and altered it to add IP address information.
q('UPDATE /* selmsg:881 IP:[' . $_SERVER['REMOTE_ADDR'] . '] */ fud_thread SET views=views+1 WHERE id IN('. implode(',', $thl) .')');
I get the feeling the NNTP sync is what is responsible for these invalid counts. If that's the case, I'll wrap the above query in an if() that checks for a valid IP address before UPDATE'ing the count.
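Once the tagged query is live, the IP comment could be tallied from the binary log text to see who is driving the increments. This is a rough sketch under two assumptions: that the SQL comment survives into the `mysqlbinlog` output, and that a request with no remote address (which is what a non-web process such as the NNTP sync would presumably produce) shows up as an empty `IP:[]`. The function name and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the tagged query added in selmsg.php, e.g.
# UPDATE /* selmsg:881 IP:[192.0.2.7] */ fud_thread SET views=views+1 WHERE id IN(...)
TAGGED_UPDATE = re.compile(
    r"UPDATE /\* selmsg:\d+ IP:\[([^\]]*)\] \*/ fud_thread SET views=views\+1"
)


def count_updates_by_ip(binlog_lines):
    """Count bulk view-count UPDATEs per client IP found in binlog text."""
    counts = Counter()
    for line in binlog_lines:
        match = TAGGED_UPDATE.search(line)
        if match is None:
            continue  # untagged statement or unrelated binlog line
        # An empty address is grouped separately (assumed non-web origin).
        counts[match.group(1) or "(no address)"] += 1
    return counts


# Hypothetical binlog excerpts, for illustration only.
SAMPLE_BINLOG = [
    "UPDATE /* selmsg:881 IP:[192.0.2.7] */ fud_thread SET views=views+1 WHERE id IN(653226,653225)",
    "UPDATE /* selmsg:881 IP:[192.0.2.7] */ fud_thread SET views=views+1 WHERE id IN(653226,651210)",
    "UPDATE /* selmsg:881 IP:[] */ fud_thread SET views=views+1 WHERE id IN(653226)",
    "UPDATE fud_thread SET views=views+1 WHERE id=653226",
]

updates_by_ip = count_updates_by_ip(SAMPLE_BINLOG)
```

In practice the lines would come from piping `mysqlbinlog` output into the script via standard input. A large "(no address)" bucket would support the NNTP sync theory, while a handful of busy addresses would point back at crawlers.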
> I plowed through and could find one bot (ezooms)

So the Ezooms bot was back at it... they had changed their user-agent, so the fix in comment 2 wasn't working anymore. Regardless, even Googlebot and Bing were generating a fair amount of requests.

To reiterate, the bulk of the problem is the Today's Messages :: Unread Messages :: Unanswered Messages links, which increase the thread view count for numerous threads.

To resolve this issue once and for all, I've simply removed the DB query that increments multiple thread views at once. Our overall view counts may be a bit lower, but will more accurately reflect threads that humans (and search engines) actually click on.

I'll close this as FIXED. New threads should begin to have a view count which is much more realistic.

Thanks Denis :)