Bug 421390

Summary: download.eclipse.org has seriously broken html
Product: [Eclipse Project] Platform
Reporter: aditsu <aditsu>
Component: Releng
Assignee: David Williams <david_williams>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: david_williams, wayne.beaton
Version: 4.4
Target Milestone: 4.5 M4
Hardware: PC
OS: Linux
Whiteboard:
Bug Depends on:
Bug Blocks: 434596, 451808
Attachments:
  Description: bash script to get a particular download
  Flags: none

Description aditsu CLA 2013-11-09 06:47:18 EST
You can check (with validator.w3.org) any page from that site and find tons of errors, but the one I am mainly concerned about is http://download.eclipse.org/eclipse/downloads/drops4/R-4.3.1-201309111000/
Among the 862 errors reported, I think the most serious and easily fixed are &nbsp without semicolon and unclosed div tags.
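As a rough illustration of the two error classes named in the description, the following Python sketch counts `&nbsp` references that lack the trailing semicolon and compares the number of opening and closing `div` tags. It is not a substitute for validator.w3.org and says nothing about the other reported errors; `check_html` and `DivCounter` are hypothetical names chosen for this example.

```python
import re
from html.parser import HTMLParser

# Matches "&nbsp" that is not immediately followed by ";".
BARE_NBSP = re.compile(r"&nbsp(?!;)")


class DivCounter(HTMLParser):
    """Counts <div> start tags and </div> end tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.closed += 1


def check_html(text):
    """Return counts for bare &nbsp references and unbalanced div tags."""
    counter = DivCounter()
    counter.feed(text)
    counter.close()
    return {
        "bare_nbsp": len(BARE_NBSP.findall(text)),
        "unclosed_divs": counter.opened - counter.closed,
    }
```

A positive `unclosed_divs` value only says that more `div` elements were opened than closed; it does not locate the offending tag.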
Comment 1 aditsu CLA 2013-11-09 06:51:16 EST
Actually, download.eclipse.org seems to redirect to www.eclipse.org/downloads; to be clear, I'm talking about http://download.eclipse.org/eclipse/downloads/.
Comment 2 David Williams CLA 2013-11-10 09:26:35 EST
aditsu, I assume you mean errors in the sense reported by a validation tool such as the following?

http://validator.w3.org/

Some of these could be in "our" generated code, but I know the "style sheet" from Eclipse.org introduces some. 

Some of them, such as 
http://download.eclipse.org/eclipse/downloads/drops4/S-4.4M3-201310302000/
are "machine generated", so I have no plans to fix those by hand, but I will take HTML validation into account when rewriting the Java code that produces them. 


If there are any errors that actually cause you problems, be sure to name those separately.
Comment 3 aditsu CLA 2013-11-10 13:28:44 EST
(In reply to David Williams from comment #2)
> aditsu, I assume you mean errors in the sense of using a tool such as 
> http://validator.w3.org/ ?

Yes, I did mention that in the description.

> Some of them, such as 
> http://download.eclipse.org/eclipse/downloads/drops4/S-4.4M3-201310302000/
> are "machine generated" so have no plans to fix those by hand

Yes, the generator should be fixed, not just the output. Do you plan to fix it and regenerate the pages?

> If there are any errors that actually cause you problems, be sure to name
> those separately.

Well, the unclosed divs I mentioned are causing me problems, but not with the common browsers I usually use. I'm parsing the page using jtidy, and it attempts to fix the problem but does it in a bad way (this is also something to fix in jtidy). If you want to see what I'm talking about, you can try loading that page with this browser I made: http://tidysaucer.sourceforge.net/ (which uses jtidy).

For now I did a string replace to work around the problem in my program, but I also think it should be fixed on your side.
Comment 4 David Williams CLA 2013-11-10 19:23:12 EST
(In reply to aditsu from comment #3)
> (In reply to David Williams from comment #2)

> Yes, the generator should be fixed, not just the output. Do you plan to fix
> it and regenerate the pages?

No. 

In general, "screen scraping" is not something we support or encourage. I'm not sure what your ultimate goal is, but be forewarned that the format/content of these pages will change from time to time (and, probably, quite a bit over the next few months). 

If you can state what you are trying to accomplish, I can keep that in mind as other changes are made, and see if there is some way I can help.
Comment 5 aditsu CLA 2013-11-11 03:15:16 EST
(In reply to David Williams from comment #4)
> > Do you plan to fix it and regenerate the pages?
> 
> No [...] be forewarned that the
> format/content of these pages will change from time to time (and, probably,
> quite a bit over the next few months).

You seem to immediately contradict yourself :) Unless the new format will use equally broken html (perhaps intentionally).

> If you can state what you are trying to accomplish, I can keep that in mind
> as other changes are made, and see if there is some way I can help.

My current goal is to easily download the latest stable platform runtime binary for my OS. I don't think I'm misusing the site by scraping, but just automating a task that always consumed my time and nerves in the past.
Why do I want the platform runtime binary? Because the standard packages include too many things I don't want.
Why did it consume my time and nerves? Because it can not be found on www.eclipse.org/downloads, I don't even know how to find the right site except by googling (and then bookmarking), and this download site is not as streamlined, so it makes me work to get to the right download link.
Comment 6 David Williams CLA 2013-11-11 05:05:07 EST
(In reply to aditsu from comment #5)
> (In reply to David Williams from comment #4)
> > > Do you plan to fix it and regenerate the pages?
> > 
> > No [...] be forewarned that the
> > format/content of these pages will change from time to time (and, probably,
> > quite a bit over the next few months).
> 
> You seem to immediately contradict yourself :) Unless the new format will
> use equally broken html (perhaps intentionally).
> 

Don't know what you mean ... but not sure it matters. The point is, there is no plan to fix things "generated in the past" ... but I will try to improve things generated in the future. And by "change", I mean that the file name is now often in the third column of certain tables ... and will likely move to the second column in the future, along with other changes. 

> > If you can state what you are trying to accomplish, I can keep that in mind
> > as other changes are made, and see if there is some way I can help.
> 
> My current goal is to easily download the latest stable platform runtime
> binary for my OS. I don't think I'm misusing the site by scraping, but just
> automating a task that always consumed my time and nerves in the past.
> Why do I want the platform runtime binary? Because the standard packages
> include too many things I don't want.
> Why did it consume my time and nerves? Because it can not be found on
> www.eclipse.org/downloads, I don't even know how to find the right site
> except by googling (and then bookmarking), and this download site is not as
> streamlined, so it makes me work to get to the right download link.

Now we are getting to where I can maybe be helpful ... maybe. I suspect you want something purely automatic, that can poll occasionally? If instead you'd be willing to enter one or two values, the rest of the "location" is all well known and predictable. I'll attach the script I use ... if you find it helpful, great ... if not, then I'm not sure what to recommend.
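The idea of entering one or two values and deriving the rest of the location can be sketched in Python as follows. This is not the attached bash script, whose contents are not shown in this thread. The base URL and the `R-4.3.1-201309111000` style of build identifier are taken from the URLs quoted above; `drop_id`, `drop_url`, and the `file_name` parameter are hypothetical, and the naming of files inside a drop directory is left to the caller because the thread does not specify it.

```python
# Base location of the "drops4" directories, as seen in the URLs in this bug.
BASE = "http://download.eclipse.org/eclipse/downloads/drops4"


def drop_id(build_type, version, timestamp):
    """Join the parts of a build id, e.g. ("R", "4.3.1", "201309111000")."""
    return f"{build_type}-{version}-{timestamp}"


def drop_url(build_id, file_name=""):
    """Return the URL of a drop directory, or of a file inside it.

    file_name is an assumption supplied by the caller; this sketch does
    not know how the files in a drop are named.
    """
    return f"{BASE}/{build_id}/{file_name}"
```

Building the URL directly avoids parsing the HTML at all, at the cost of breaking if the directory layout changes.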
Comment 7 David Williams CLA 2013-11-11 05:07:23 EST
Created attachment 237357 [details]
bash script to get a particular download

HTH ... you could "make it better" in many ways, such as by getting and checking the SHA1 sum, etc. ... but ... it is what I use. 

Good luck.
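The checksum improvement suggested above could look roughly like the Python sketch below, which hashes a downloaded file in chunks and compares the result with a published value. The thread does not say where or in what format the checksums are published, so the expected value is simply passed in as a hex string; `file_digest` and `verify` are hypothetical names.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex, algorithm="sha1"):
    """Compare a file's digest with an expected hex string.

    Surrounding whitespace and letter case in expected_hex are ignored.
    """
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Passing `algorithm="md5"` covers the MD5 case with the same code.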
Comment 8 aditsu CLA 2013-11-11 07:40:49 EST
(In reply to David Williams from comment #6)
> > You seem to immediately contradict yourself :)
> 
> Don't know what you mean ... but not sure it matters. The point is, no plan
> to fix things "generated in the past" ... but will try to improve things
> generated in the future.

If the things generated in the future have better html, then I would consider this bug fixed. I'm not hung up on specifically fixing this current format, although that would be the quickest solution.

> file name [...] will likely become second column of table
> in future, as well as other changes.

Yes, this kind of thing is expected. I don't intend to complain that you "broke my parser" when that happens :)

> I suspect you
> want something purely automatic, that can poll occasionally?

Yes, I've written it and it works. I just had to use a workaround for the broken html.

> If instead
> you'd be willing to enter one or two values, the rest of the "location" is
> all well known and predictable.

Hmm, that can be subject to change too, but I guess doing it this way avoids the issues with parsing html.

> I'll attach the script I use ... if you find
> it helpful, great ... if not, then I'm not sure what to recommend.

Thanks a lot, especially for trying your best to be helpful :) Among other things, the script assumes an ssh account at build.eclipse.org (which I don't think I have), but it's good for inspiration. Other than the script, I think it would be helpful to put the platform runtime binary on www.eclipse.org/downloads if that's something you could consider.

> you could "make it better" in many ways, such as by getting and checking the SHA1 sum

My program already checks md5 automatically :) It also detects the latest release version.
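Detecting the latest release can be sketched as below, assuming the program already has a list of drop directory names. This is not the reporter's program, which is not shown here. The pattern is inferred only from the `R-4.3.1-201309111000` identifier quoted in this bug and may not cover every release ever published; `release_key` and `latest_release` are hypothetical names.

```python
import re

# Release build ids as seen in this bug: "R-<version>-<12-digit timestamp>".
RELEASE_ID = re.compile(r"^R-(\d+(?:\.\d+)*)-(\d{12})$")


def release_key(build_id):
    """Return a sortable (version tuple, timestamp) key, or None if the
    id does not look like a release build."""
    match = RELEASE_ID.match(build_id)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group(1).split("."))
    return (version, match.group(2))


def latest_release(build_ids):
    """Return the release id with the highest version, or None if the
    list contains no release builds."""
    keyed = [(release_key(b), b) for b in build_ids]
    keyed = [(key, b) for key, b in keyed if key is not None]
    if not keyed:
        return None
    return max(keyed)[1]
```

Versions are compared numerically, component by component, so that 4.10 sorts after 4.9; stable (`S-`) and nightly (`N`) builds are ignored.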
Comment 9 David Williams CLA 2014-11-25 04:23:55 EST
Between the new and improved "Solstice theme" and some fixes I've made here and there, the main download page, at 
http://download.eclipse.org/eclipse/downloads/
is now pretty clean. 

The w3c validator reports a few warnings, but no errors. 

I've opened bug 453160 to address the specific "drop pages", such as 
http://download.eclipse.org/eclipse/downloads/drops4/N20141120-0700/
Comment 10 aditsu CLA 2014-11-25 05:42:24 EST
(In reply to David Williams from comment #9)
> the main download page, at 
> http://download.eclipse.org/eclipse/downloads/
> is now pretty clean. 
> 
> I've opened bug 453160 to address the specific "drop pages", such as 
> http://download.eclipse.org/eclipse/downloads/drops4/N20141120-0700/

Actually this bug was supposed to be mainly about the "drop pages", as seen in the description. Comment 1 was in reference to the summary line (yeah I could have worded it better). But I guess it's ok as long as we have a bug open to keep track of that.