Broad Crawler
Overview
This project aims to crawl a wide variety of web pages (especially news pages) with one spider, a.k.a. a broad crawler.
Features
The broad crawler should support the following features:
- Title Extractor
- URL Extractor
- Date Extractor
- Main Content Extractor
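
To make the four extractors concrete, here is a minimal sketch of what each one could return for a single page. It is an illustration only, not this project's implementation: the names `PageExtractor`, `extract`, `DATE_RE`, and `min_len` are hypothetical, the date and main-content rules are deliberately naive heuristics, and it uses only the Python 3 standard library so it runs without the pinned dependencies (the project itself targets Python 2.7 with beautifulsoup4).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical heuristic: the first ISO-style date (YYYY-MM-DD) found in the page.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


class PageExtractor(HTMLParser):
    """Collects the title, absolute link URLs, and paragraph text of one page."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.title = ""
        self.urls = []
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buf = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.urls.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Collapse runs of whitespace inside the paragraph.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def extract(html, base_url, min_len=30):
    """Return title, URLs, date, and main content for one HTML document."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    match = DATE_RE.search(html)
    return {
        "title": parser.title.strip(),
        "urls": parser.urls,
        "date": match.group(0) if match else None,
        # Naive main-content rule: keep paragraphs of at least min_len characters.
        "content": "\n".join(p for p in parser.paragraphs if len(p) >= min_len),
    }


SAMPLE = """<html><head><title>Sample News</title></head>
<body>
<p>Published 2017-04-01</p>
<p>The city council approved a new budget on Saturday. <a href="/politics">More</a></p>
<a href="http://example.com/sports">Sports</a>
</body></html>"""

result = extract(SAMPLE, "http://example.com/news/1")
print(result)
```

A real broad crawler would need sturdier rules than these, for example many date formats and boilerplate removal for the main content, but the shape of the output (one record with a title, a list of URLs, a date, and the body text) is what the feature list describes.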
Requirements
- Python 2.7
- Scrapy 1.3.3
- beautifulsoup4 4.5.3
- scrapy-redis 0.6.8
- PyMySQL 0.7.11
- redis 2.10.5
- virtualenv (Optional)
Usage
License
Licensed under the GPL.