Implementation

Author Zhou Renjian Create@ 2005-03-19 18:09

Environment:
IE6 + IE6's built-in proxy support
HTA + JavaScript + XMLHttp
XML DB?
HTML
Development environment: Windows XP + UltraEdit

Requirements:
1. Can be interrupted: the user may pause the process and resume it later;
2. Should report the processing status;
3. Should save processed data as soon as possible, so that a computer crash does not lose much processed data;
4. Display the processed data in HTML pages;
5. Support publishing to the website;
6. Support search?

Estimate:
1. The folder comic.sjtu.edu.cn/ebooks/ contains more than 30,000 books;
2. The ebook folder hierarchy is no more than 5 levels deep;
3. Each folder in the hierarchy has no more than 15 subfolders;
4. Each folder in the hierarchy has no more than 50 books;
In total, comic's ebook area holds no more than (15/2)*(15/2)*(15/2)*(15/2)=3,164.0625 folders, and in practice I can assume about 1,000 folders. So I can use ebook\d{4}.xml as the XML DB file name.
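
The naming scheme above can be sketched as a small helper. This is only an illustration: the function name xmlFileName and the idea of numbering folders sequentially from 0 are assumptions, not something the plan fixes.

```javascript
// Hypothetical helper: map a sequential folder index to a file name that
// matches the ebook\d{4}.xml pattern (ebook0000.xml .. ebook9999.xml).
function xmlFileName(index) {
  // Reject anything that is not an integer in the four-digit range.
  if (index < 0 || index > 9999 || index !== Math.floor(index)) {
    throw new Error("folder index out of range: " + index);
  }
  var digits = String(index);
  while (digits.length < 4) {
    digits = "0" + digits; // left-pad with zeros to four digits
  }
  return "ebook" + digits + ".xml";
}

// The upper bound from the estimate: (15/2) multiplied four times.
var estimatedFolders = Math.pow(15 / 2, 4);
```

Four digits leave room for 10,000 files, which covers both the 3,164 upper bound and the 1,000 expected folders.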

XML DB Implementation:
1. Each folder results in one XML file;
2. Each XML file contains the URLs of the folder's ebooks and the URLs of the new XML files for its descendant folders;
3. XML data format:
<current-folder name="...">
<folders>
  <folder name="" url="".../>
  ...
</folders>
<files>
  <file name="" url="".../>
  ...
</files>
</current-folder>
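
A minimal sketch of producing this format from JavaScript, assuming each entry carries only name and url (the extra attributes elided by "..." above are left out). The function names escapeXml, renderEntries and buildFolderXml are hypothetical.

```javascript
// Escape the five XML special characters so names and URLs are safe
// inside attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Render {name, url} entries as lines of the form <tag name="..." url="..."/>.
function renderEntries(tag, entries) {
  var lines = [];
  for (var i = 0; i < entries.length; i++) {
    lines.push('  <' + tag + ' name="' + escapeXml(entries[i].name) +
               '" url="' + escapeXml(entries[i].url) + '"/>');
  }
  return lines;
}

// Build the XML text for one folder: its subfolders, then its files.
function buildFolderXml(name, folders, files) {
  var lines = ['<current-folder name="' + escapeXml(name) + '">', "<folders>"];
  lines = lines.concat(renderEntries("folder", folders));
  lines.push("</folders>", "<files>");
  lines = lines.concat(renderEntries("file", files));
  lines.push("</files>", "</current-folder>");
  return lines.join("\n");
}

// Example with made-up entries: one subfolder pointing at its own XML file
// and one book.
var sampleXml = buildFolderXml(
  "root",
  [{ name: "novels", url: "ebook0001.xml" }],
  [{ name: "A & B", url: "books/a-and-b.txt" }]
);
```

Building the text by hand keeps the sketch independent of any XML library; escaping matters because book names may contain characters such as & or quotes.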

Schedule Steps:
1. Load Comic ebook's booklist.asp with XMLHttp;
2. Parse the booklist's HTML into folders and files;
3. Save the XML files;
4. Pop and push tasks;
5. Progress bar, and save the processing status every 10 minutes;
6. Statistics;
7. Support search function; ?
8. XSLT for HTML pages;
9. Publish to the website;
10. Exception handling;
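
Steps 4 and 5, together with the pause/resume requirement, could be sketched as a task stack whose state can be snapshotted and restored. Everything named here (createSpider, fetchFolder, the shape of the state object) is an assumption for illustration; fetchFolder stands in for steps 1 to 3 and is expected to return an object with folders (an array of URLs) and files.

```javascript
// Hypothetical crawl loop: a stack of pending folder URLs that can be
// snapshotted and restored, so a paused or crashed run can resume.
function createSpider(fetchFolder, savedState) {
  var pending = savedState ? savedState.pending.slice() : [];
  var done = savedState ? savedState.done : 0;
  var paused = false;

  return {
    push: function (url) { pending.push(url); },
    pause: function () { paused = true; },
    resume: function () { paused = false; },

    // Process one task. Returns false when paused or when nothing is left.
    step: function () {
      if (paused || pending.length === 0) { return false; }
      // Peek first: if fetchFolder throws, the task stays on the stack.
      var url = pending[pending.length - 1];
      var result = fetchFolder(url);
      pending.pop();
      for (var i = 0; i < result.folders.length; i++) {
        pending.push(result.folders[i]); // queue subfolders for later
      }
      done++;
      return true;
    },

    // Snapshot that can be saved and later passed back as savedState.
    state: function () { return { pending: pending.slice(), done: done }; },

    // Human-readable progress report.
    status: function () {
      return done + " folders processed, " + pending.length + " pending";
    }
  };
}
```

In an HTA, step() could be driven from a timer so the page stays responsive, with the snapshot from state() written to disk every 10 minutes; how it is persisted is left open here.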

Schedule Time:
Step 1. 2. 3. : 30 minutes;
Step 4. 5. : 30 minutes;
Step 6. : 30 minutes;
Step 7. : 40 minutes;
Step 8. : 30 minutes;
Step 9. : 40 minutes;
Step 10. : 10 minutes;
Running the spider will take about 5 hours to collect all the ebooks' URLs;
Total time: about 8 hours

 
