Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Heritrix Introduction

Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix

Related Repos



NandanDesai SocialInfo4J - fetch data from Facebook, Instagram and LinkedIn
 

USCDataScience A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc.