| Summary: | download.eclipse.org has seriously broken html | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | aditsu <aditsu> | ||||
| Component: | Releng | Assignee: | David Williams <david_williams> | ||||
| Status: | RESOLVED FIXED | QA Contact: | |||||
| Severity: | normal | ||||||
| Priority: | P3 | CC: | david_williams, wayne.beaton | ||||
| Version: | 4.4 | ||||||
| Target Milestone: | 4.5 M4 | ||||||
| Hardware: | PC | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Bug Depends on: | |||||||
| Bug Blocks: | 434596, 451808 | ||||||
| Attachments: |
|
Description
aditsu
Actually download.eclipse.org seems to redirect to www.eclipse.org/downloads, I'm talking about http://download.eclipse.org/eclipse/downloads/ to be clear.

aditsu, I assume you mean errors in the sense of using a tool such as http://validator.w3.org/ ? Some of these could be in "our" generated code, but I know the "style sheet" from Eclipse.org introduces some. Some of them, such as http://download.eclipse.org/eclipse/downloads/drops4/S-4.4M3-201310302000/ are "machine generated" so have no plans to fix those by hand, but will take into account HTML validation when re-writing the Java code that produces them. If there are any errors that actually cause you problems, be sure to name those separately.

(In reply to David Williams from comment #2)
> aditsu, I assume you mean errors in the sense of using a tool such as
> http://validator.w3.org/ ?

Yes I did mention that in the description.

> Some of them, such as
> http://download.eclipse.org/eclipse/downloads/drops4/S-4.4M3-201310302000/
> are "machine generated" so have no plans to fix those by hand

Yes, the generator should be fixed, not just the output. Do you plan to fix it and regenerate the pages?

> If there are any errors that actually cause you problems, be sure to name
> those separately.

Well, the unclosed divs I mentioned are causing me problems, but not with the common browsers I usually use. I'm parsing the page using jtidy, and it attempts to fix the problem but does it in a bad way (this is also something to fix in jtidy). If you want to see what I'm talking about, you can try loading that page with this browser I made: http://tidysaucer.sourceforge.net/ (which uses jtidy). For now I did a string replace to work around the problem in my program, but I also think it should be fixed on your side.

(In reply to aditsu from comment #3)
> (In reply to David Williams from comment #2)
> Yes, the generator should be fixed, not just the output. Do you plan to fix
> it and regenerate the pages?

No.
In general, "screen scraping" is not something we support or encourage. I'm not sure what your ultimate goal is, but be forewarned that the format/content of these pages will change from time to time (and, probably, quite a bit over next few months). If you can state what you are trying to accomplish, I can keep that in mind as other changes are made, and see if there is some way I can help.

(In reply to David Williams from comment #4)
> > Do you plan to fix it and regenerate the pages?
>
> No [...] be forewarned that the
> format/content of these pages will change from time to time (and, probably,
> quite a bit over next few months).

You seem to immediately contradict yourself :) Unless the new format will use equally broken html (perhaps intentionally).

> If you can state what you are trying to accomplish, I can keep that in mind
> as other changes are made, and see if there is some way I can help.

My current goal is to easily download the latest stable platform runtime binary for my OS. I don't think I'm misusing the site by scraping, but just automating a task that always consumed my time and nerves in the past.

Why do I want the platform runtime binary? Because the standard packages include too many things I don't want.

Why did it consume my time and nerves? Because it can not be found on www.eclipse.org/downloads, I don't even know how to find the right site except by googling (and then bookmarking), and this download site is not as streamlined but makes me work to get to the right download link.

(In reply to aditsu from comment #5)
> (In reply to David Williams from comment #4)
> > > Do you plan to fix it and regenerate the pages?
> >
> > No [...] be forewarned that the
> > format/content of these pages will change from time to time (and, probably,
> > quite a bit over next few months).
>
> You seem to immediately contradict yourself :) Unless the new format will
> use equally broken html (perhaps intentionally).
>

Don't know what you mean ...
but not sure it matters. The point is, no plan to fix things "generated in the past" ... but will try and improve things generated in the future. And by "change", I mean now, file name is often in third column of certain tables ... will likely become second column of table in future, as well as other changes.

> > If you can state what you are trying to accomplish, I can keep that in mind
> > as other changes are made, and see if there is some way I can help.
>
> My current goal is to easily download the latest stable platform runtime
> binary for my OS. I don't think I'm misusing the site by scraping, but just
> automating a task that always consumed my time and nerves in the past.
> Why do I want the platform runtime binary? Because the standard packages
> include too many things I don't want.
> Why did it consume my time and nerves? Because it can not be found on
> www.eclipse.org/downloads, I don't even know how to find the right site
> except by googling (and then bookmarking), and this download site is not as
> streamlined but makes me work to get to the right download link.

Now we are getting to where I can maybe be helpful ... maybe. I suspect you want something purely automatic, that can poll occasionally? If instead you'd be willing to enter one or two values, the rest of the "location" is all well known and predictable. I'll attach the script I use ... if you find it helpful, get ... if not .. then not sure what to recommend.

Created attachment 237357 [details]
bash script to get a particular download
HTH ... you could "make it better" in many ways, such as by getting/check SHA1 sum, etc. ... but ... it's what I use.
Good luck.
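
For readers following along, here is a minimal Python sketch of the "enter one or two values, the rest is predictable" idea. It is not the attached bash script, which is not reproduced in this report. The drop directory form (`S-4.4M3-201310302000`) comes from the URL quoted earlier in the thread. The archive file name pattern, the default platform string, and all function names are assumptions made for illustration and may not match the real site. The checksum helper reflects the suggestion to check the SHA1 sum.

```python
import hashlib

# Base URL of the drop pages, as quoted in the discussion above.
DROPS_BASE = "http://download.eclipse.org/eclipse/downloads/drops4/"


def drop_url(build_type: str, label: str, timestamp: str) -> str:
    """Build a drop page URL of the form <type>-<label>-<timestamp>/.

    Only covers the S-4.4M3-201310302000 style seen in this report;
    other drop names (e.g. N20141120-0700) use a different layout.
    """
    return f"{DROPS_BASE}{build_type}-{label}-{timestamp}/"


def platform_archive_url(build_type: str, label: str, timestamp: str,
                         platform: str = "linux-gtk-x86_64",
                         ext: str = "tar.gz") -> str:
    """Build a URL for the platform runtime binary inside a drop.

    ASSUMPTION: the file name pattern below is a guess for illustration
    and is not confirmed anywhere in this report.
    """
    filename = f"eclipse-platform-{label}-{platform}.{ext}"
    return drop_url(build_type, label, timestamp) + filename


def sha1_matches(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA1 digest of downloaded bytes with a published value."""
    return hashlib.sha1(data).hexdigest() == expected_hex.strip().lower()
```

With this approach the caller supplies only the build label and timestamp, so no HTML parsing is involved. The trade-off, as noted in the thread, is that the location scheme itself can change.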
(In reply to David Williams from comment #6)
> > You seem to immediately contradict yourself :)
>
> Don't know what you mean ... but not sure it matters. The point is, no plan
> to fix things "generated in the past" ... but will try and improve things
> generated in the future.

If the things generated in the future will have better html, then I consider that as fixing this bug. I'm not hung up on specifically fixing this current format, although that would be the quickest solution.

> file name [...] will likely become second column of table
> in future, as well as other changes.

Yes, this kind of thing is expected. I don't intend to complain that you "broke my parser" when that happens :)

> I suspect you
> want something purely automatic, that can poll occasionally?

Yes, I've written it and it works. I just had to use a workaround for the broken html.

> If instead
> you'd be willing to enter one or two values, the rest of the "location" is
> all well known and predictable.

Hmm, that can be subject to change too, but I guess doing it this way avoids the issues with parsing html.

> I'll attach the script I use ... if you find
> it helpful, get ... if not .. then not sure what to recommend.

Thanks a lot, especially for trying your best to be helpful :) Among other things, the script assumes an ssh account at build.eclipse.org (which I don't think I have), but it's good for inspiration. Other than the script, I think it would be helpful to put the platform runtime binary on www.eclipse.org/downloads if that's something you could consider.

> you could "make it better" in many ways, such as by getting/check SHA1 sum

My program already checks md5 automatically :) It also detects the latest release version.

Between the new and improved "Solstice theme" and some fixes I've made here and there, the main download page, at http://download.eclipse.org/eclipse/downloads/ is now pretty clean. The w3c validator reports a few warnings, but no errors.
I've opened bug 453160 to address the specific "drop pages", such as http://download.eclipse.org/eclipse/downloads/drops4/N20141120-0700/

(In reply to David Williams from comment #9)
> the main download page, at
> http://download.eclipse.org/eclipse/downloads/
> is now pretty clean.
>
> I've opened bug 453160 to address the specific "drop pages", such as
> http://download.eclipse.org/eclipse/downloads/drops4/N20141120-0700/

Actually this bug was supposed to be mainly about the "drop pages", as seen in the description. Comment 1 was in reference to the summary line (yeah I could have worded it better). But I guess it's ok as long as we have a bug open to keep track of that. |
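
The unclosed `<div>` elements reported in this bug can be spotted with a small check like the one below. This is only a rough sketch using Python's standard `html.parser`. It counts `<div>` tags and nothing else, so it is no substitute for the W3C validator mentioned in the thread. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class DivBalanceChecker(HTMLParser):
    """Track <div> tags that are opened but never closed, and vice versa."""

    def __init__(self):
        super().__init__()
        self.open_divs = 0     # <div> elements currently left open
        self.stray_closes = 0  # </div> tags seen with no open <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.open_divs += 1

    def handle_endtag(self, tag):
        if tag == "div":
            if self.open_divs > 0:
                self.open_divs -= 1
            else:
                self.stray_closes += 1


def div_imbalance(html: str) -> tuple:
    """Return (unclosed_divs, stray_closing_divs) for an HTML string."""
    checker = DivBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.open_divs, checker.stray_closes
```

A result of `(0, 0)` means the `<div>` tags balance; any other result points at the kind of markup that tripped up the jtidy-based parser described above.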