WebCollector is an open source web crawler framework based on Java. It provides simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes.
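To make the idea of a multi-threaded crawler concrete, here is a minimal sketch using only the JDK. It does not use WebCollector's API; the class `MiniCrawler`, the method `crawl`, and the `fetchLinks` callback are hypothetical names chosen for illustration. The sketch crawls breadth-first, fetching every page of one depth level in parallel on a fixed thread pool before moving to the next level.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration, not part of WebCollector.
class MiniCrawler {

    /**
     * Breadth-first crawl starting at seed.
     * fetchLinks is an assumed callback that downloads one URL and returns its outgoing links.
     * Returns the URLs that were fetched successfully, in crawl order.
     */
    static List<String> crawl(String seed, int maxDepth, int threads,
                              Function<String, List<String>> fetchLinks)
            throws InterruptedException {
        Set<String> seen = new HashSet<>();      // URLs already queued; touched only by the calling thread
        seen.add(seed);
        List<String> fetched = new ArrayList<>();
        List<String> frontier = List.of(seed);   // URLs to fetch at the current depth
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                // One task per URL; the pool runs them concurrently.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                // invokeAll blocks until all tasks finish and keeps task order.
                List<Future<List<String>>> results = pool.invokeAll(tasks);
                List<String> next = new ArrayList<>();
                for (int i = 0; i < results.size(); i++) {
                    try {
                        List<String> links = results.get(i).get();
                        fetched.add(frontier.get(i));
                        for (String link : links) {
                            if (seen.add(link)) {    // true only for URLs not queued before
                                next.add(link);
                            }
                        }
                    } catch (ExecutionException e) {
                        // The fetch for this URL failed; skip it and continue with the rest.
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return fetched;
    }
}
```

In a real crawler, `fetchLinks` would issue an HTTP request and parse the response for links; here it is left as a parameter so the scheduling logic can be tried against any link source, including an in-memory map.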

Information
Category: Java / Web Crawling
Watchers: 331
Star: 3k
Fork: 1.5k
Last update: Sep 22, 2023

Related Repos



USCDataScience A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc.
 

NandanDesai SocialInfo4J - fetch data from Facebook, Instagram and LinkedIn
 

code4craft A scalable crawler framework. It covers the whole crawler lifecycle: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler. Features:
 

fcannizzaro jsoup-annotations Jsoup Annotations POJO Gradle Dependency Step 1. Add the JitPack repository to your build file allprojects { repositories { ... maven { url 'https://jitpack.io'
 

internetarchive Heritrix Introduction Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix
 

dyweb scrala scrala is a web crawling framework for scala, which is inspired by scrapy. Installation From Docker gaocegege/scrala in dockerhub Create a Dockerfile in your project. FROM gaoceg
 

reggoodwin About Ferrit is an API driven web crawler service written in Scala using Akka, Spray and Cassandra. I created it to help me learn more about small service design using Akka and the Functional Reactive programming style.
 

YahooArchive nutch-anth Anthelion is a Nutch plugin for focused crawling of semantic data. The project is an open-source project released under the Apache License 2.0. Note: This project contains the complete Nutch 1.6 distribution. The plug
 

yasserg crawler4j crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.
 
Featured
2.7k

apache Apache Nutch README For the latest information about Nutch, please visit our website at: http://nutch.apache.org and our wiki, at: http://wiki.apache.org/nutch/ To get started using Nutch read Tutorial: http://wiki.apache.
 
10.3k

jhy jsoup: Java HTML parser that makes sense of real-world HTML soup. jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like
 

CrawlScript WebCollector WebCollector is an open source web crawler framework based on Java. It provides simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes. In addition to a general