
Bug 365956

Summary: can robots exclude list allow "link checker"?
Product: Community
Component: Wiki
Status: RESOLVED FIXED
Severity: enhancement
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Reporter: David Williams <david_williams>
Assignee: Eclipse Webmaster <webmaster>
QA Contact:
CC: chris.guindon
Whiteboard:

Attachments: sample log from running "check links" (flags: none)

Description David Williams CLA 2011-12-07 15:10:19 EST
I tried to use the W3C link checker to validate the links in one of my wiki documents:

http://wiki.eclipse.org/SimRel/Simultaneous_Release_Requirements

But I only got messages such as 
= = = = 
info Line: 379 http://build.eclipse.org/juno/simrel/reports/nonUniqueVersions.txt
    Status: (N/A) Forbidden by robots.txt

    The link was not checked due to robots exclusion rules. Check the link manually. 
= = = =

I don't know that much about robots.txt, but this site at
http://validator.w3.org/docs/checklink#bot 

says it would be something like the following (which seems counterintuitive):

User-Agent: *
Disallow: /

User-Agent: W3C-checklink
Disallow:
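
The empty Disallow: line is what seems counterintuitive but makes this work: an empty value means "nothing is disallowed" for that agent, so W3C-checklink may fetch everything while every other agent stays blocked. A minimal sketch using Python's standard urllib.robotparser shows the effect (the agent name "SomeCrawler" and the example URL are made up for illustration):

= = = = 
# Sketch: parse the rules above and see which agents may fetch a URL.
from urllib import robotparser

rules = """\
User-Agent: *
Disallow: /

User-Agent: W3C-checklink
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A generic crawler is blocked from the whole site ...
print(rp.can_fetch("SomeCrawler", "http://example.org/any/page"))    # False
# ... but W3C-checklink may fetch anything: empty Disallow allows all.
print(rp.can_fetch("W3C-checklink", "http://example.org/any/page"))  # True
= = = =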
Comment 1 Eclipse Webmaster CLA 2011-12-08 09:59:39 EST
We use robots.txt to try to prevent our servers from getting hosed by search engines/crawlers (at least those that respect robots.txt).

What are you trying to achieve?

-M.
Comment 2 David Williams CLA 2011-12-08 10:36:49 EST
(In reply to comment #1)
> We use robots.txt to try to prevent our servers from getting hosed by search
> engines/crawlers (at least those that respect robots.txt).
> 
> What are you trying to achieve?
> 
> -M.

Oh, sorry, I thought everyone knew what "w3c link checker" was :) 

It's a tool which scans the "links" in a page (usually tags like <a href="someURL">link</a>)
and provides a "summary" of links that are "broken" ... that is, the "someURL" no longer exists or otherwise returns some error response. 

Then the links can be fixed. It's a handy tool when a page has dozens (or hundreds) of links, especially when the page "lives" for years, so some links that once worked no longer do. 
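
For intuition, here is a rough sketch in Python of the general idea; this is not the actual W3C tool (and unlike the real checker, this sketch does not consult robots.txt at all):

= = = = 
# Sketch of a minimal link checker: collect href values from a page,
# then issue a HEAD request for each one and report failures.
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def check_page(page_url):
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    for link in collector.links:
        try:
            # HEAD keeps the load light: headers only, no body transferred.
            urlopen(Request(link, method="HEAD"), timeout=10)
        except HTTPError as err:
            print(f"broken ({err.code}): {link}")
        except URLError as err:
            print(f"unreachable ({err.reason}): {link}")

check_page("http://wiki.eclipse.org/SimRel/Simultaneous_Release_Requirements")
= = = =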

Actually, the "excluded list" seems much shorter today when I tried again just now (did you fix it already? :) Or maybe I was reading it wrong yesterday?) 

The list of messages saying "link was not checked due to exclusion rule" only came from URLs for 

http://build.eclipse.org, such as 
http://build.eclipse.org/juno/simrel/reports/nonUniqueVersions.txt

and 

https://bugs.eclipse.org/, such as 
https://bugs.eclipse.org/bugs/show_bug.cgi?id=217339
[This latter one, for "bugs", would never break over time ... bugs aren't removed from the database ... but links might still be broken due to typos.] 

This only affected about 7 links out of the 100 on the page, so it isn't too bad to "check manually". 

So, at this point, consider this a very low priority request. I still think it'd be nice to add the exception to build.eclipse.org and bugs.eclipse.org; I'm not sure why it doesn't work there, but it does on other wiki and main pages. I'll attach a "log" of the checking it does ... it appears to always use HEAD requests. (I think there is a way to tell it to "check recursively", which, I suppose, could end up causing tens of thousands of HEAD requests if "misused".) 
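
Presumably each host publishes its own robots.txt, which would explain why the checker is blocked on build.eclipse.org and bugs.eclipse.org but not on the wiki: for every link, it fetches /robots.txt from that link's host and asks whether its user agent may proceed. A small sketch of that per-host test ("allowed" is a hypothetical helper name):

= = = = 
# Sketch: how a checker decides a link is "Forbidden by robots.txt".
from urllib import robotparser
from urllib.parse import urlsplit, urlunsplit

def allowed(agent, url):
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = robotparser.RobotFileParser(robots_url)
    rp.read()  # download and parse the host's robots.txt
    return rp.can_fetch(agent, url)

print(allowed("W3C-checklink",
              "http://build.eclipse.org/juno/simrel/reports/nonUniqueVersions.txt"))
= = = =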

And feel free to close as WONTFIX if you fear "opening the floodgates" or anything. Like I said, it seems better today ... I would have sworn that yesterday it was listing all 100 links as "excluded" ... I could have been looking at it wrong.
Comment 3 David Williams CLA 2011-12-08 10:37:40 EST
Created attachment 208098 [details]
sample log from running "check links"
Comment 4 Christopher Guindon CLA 2019-02-19 14:14:41 EST
The "w3c link checker" is currently working for me.

Closing this bug!