Web Content Extracting

Libraries for extracting web content.

Newest releases

google-research-datasets WebRED is a large and diverse manually annotated dataset for extracting relationships from a variety of text found on the World Wide Web.

MayankPandey01 BrokenLinkHijacker (BLH) is a fast broken-link hijacking tool written in Python. It crawls a website and searches for all broken links. The tool is mainly designed for bug bounty hunters.
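
The crawl-and-check idea described above can be sketched with the standard library alone. This is a minimal illustration, not BrokenLinkHijacker's actual implementation: the names `LinkCollector` and `find_broken_external_links` are hypothetical, and the `is_alive` callback is an assumed stand-in for whatever HTTP check a real tool would perform.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs found in href and src attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def find_broken_external_links(html, base_url, is_alive):
    """Return external http(s) links for which is_alive(url) is false.

    is_alive is an assumed callback; a real tool would issue an HTTP
    request or a DNS lookup here.
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    broken = []
    for url in collector.links:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc != base_host and not is_alive(url):
            broken.append(url)
    return broken
```

Passing the liveness check in as a function keeps the sketch testable offline; swapping in a real network check is left to the caller.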

LaundroMat Extracts and indexes information about movies found in open directories posted on r/opendirectories.

jmcarp RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser

federicotdn wikiquote The wikiquote Python 3 module allows you to search and retrieve quotes from any Wikiquote article, and also retrieve the quote of the day. Please keep in mind that due to Wikiquote's varying HTML article

buriy python-readability Given an HTML document, it pulls out the main body text and cleans it up. This is a Python port of a Ruby port of arc90's readability project. Installation It's easy using pip, just

coleifer A small library for extracting rich content from urls. what does it do? micawber supplies a few methods for retrieving rich metadata about a variety of links, such as links to youtube videos. micawber also provides

michaelhelmick Lassie Lassie is a Python library for retrieving basic content from websites. Usage >>> import lassie >>> lassie.fetch('http://www.youtube.com/watch?v=dQw4w9WgXcQ') { 'des
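
One common way to retrieve "basic content" such as a title, description, and image from a page is to read its `<title>` and `<meta>` tags, including Open Graph properties. The sketch below shows that general idea with the standard library only. It is an assumption about the technique, not Lassie's API or implementation, and the names `MetaTagCollector` and `basic_page_info` are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta> name/property values and the <title> text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content is not None:
                # Keep the first value seen for each key.
                self.meta.setdefault(key, content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def basic_page_info(html):
    """Return title, description, and image, preferring Open Graph values."""
    collector = MetaTagCollector()
    collector.feed(html)
    meta = collector.meta
    return {
        "title": meta.get("og:title") or collector.title.strip(),
        "description": meta.get("og:description") or meta.get("description", ""),
        "image": meta.get("og:image", ""),
    }
```

The function works on an HTML string, so fetching the page is a separate step left to the caller.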

Alir3z4 html2text html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format). Usage: html2text
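
To make the HTML-to-Markdown idea concrete, here is a deliberately tiny converter built on the standard library's `html.parser`. It is a sketch of the general technique and not html2text's own code or API: the names `TinyMarkdownConverter` and `html_to_markdown` are hypothetical, and only headings, paragraphs, emphasis, and links are handled.

```python
import re
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class TinyMarkdownConverter(HTMLParser):
    """Translate a small subset of HTML tags into Markdown (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href_stack = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # <h2> becomes "## ", and so on.
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("em", "i"):
            self.parts.append("_")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag == "a":
            self.href_stack.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("em", "i"):
            self.parts.append("_")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag == "a" and self.href_stack:
            self.parts.append("](" + self.href_stack.pop() + ")")

    def handle_data(self, data):
        # Collapse whitespace runs, as a browser would when rendering.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    converter = TinyMarkdownConverter()
    converter.feed(html)
    converter.close()
    text = "".join(converter.parts)
    # Trim spaces around line breaks and drop surplus blank lines.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, `html_to_markdown('<h1>Title</h1><p>Hello <b>world</b></p>')` yields a `# Title` heading followed by a paragraph with `**world**` in bold. Lists, code blocks, images, and escaping of Markdown metacharacters are all out of scope for this sketch.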

datalib Libextract: extract data from websites