Main content extraction from HTML, written in Java. It extracts the article text without the surrounding clutter.
Extracting the main article content from HTML pages is still a challenging open problem. Many open-source algorithms and implementations are available. The aim of this project is to bring together some of the best content extraction algorithms implemented in Java.
My focus is mainly on the tuning parameters and on customizing/modifying these algorithms to fit my requirements.
readabilityBUNDLE performs on par with the individual algorithms, plus the extras listed below.
What's extra in readabilityBUNDLE
- Preserves the HTML tags in the extracted content.
- Keeps all possible images in the content instead of picking a single best image.
- Keeps all available videos.
- Better extraction of li, ul, and ol tags.
- Normalizes the extracted content.
- Incorporates three of the best-known extraction algorithms; you can choose one based on your requirements.
- Can append the extracted content of next pages to create a consolidated output.
- Adds many cleaner/formatter measures.
- Includes some core changes to the algorithms.
The main challenge I faced was extracting the main content while keeping all the images, videos, HTML tags, and some related div tags that most algorithms use for content/non-content classification.
Some HTML pages work very well with a particular algorithm and some do not. This is the main reason I put all the available algorithms under one roof: you can choose the algorithm that best suits your pages.
Author citations can be found in each Java file itself.
You need to specify which extraction algorithm to use. The three extraction algorithms are ReadabilitySnack, ReadabilityCore, and ReadabilityGoose. The default is ReadabilitySnack.
- Without next-page finding
```java
Article article = new Article();
ContentExtractor ce = new ContentExtractor();
HtmlFetcher htmlFetcher = new HtmlFetcher();
String html = htmlFetcher.getHtml("http://blogmaverick.com/2012/11/19/what-i-really-think-about-facebook/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Counterparties+%28Counterparties%29", 0);
article = ce.extractContent(html, "ReadabilitySnack");
System.out.println("Content : " + article.getCleanedArticleText());
```
- With next-page HTML sources
If you also need to extract and append content from next pages:

- Use [NextPageFinder](https://github.com/srijiths/NextPageFinder) to find all the next-page links.
- Fetch the HTML of each next page over the network as a List of String.
- Pass it to the content extractor:

```java
article = ce.extractContent(firstPageHtml, extractionAlgorithm, nextPagesHtmlSources);
```
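To illustrate how per-page results are consolidated into one output, here is a minimal plain-JDK sketch. The `extract` method is a deliberately trivial regex stand-in for the bundle's real extractor (which is far more sophisticated); it exists only to show the append step over a first page plus a list of next-page HTML sources:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiPageDemo {

    // Trivial stand-in for a content extractor: grab whatever sits
    // inside the first <article>...</article> element. The real
    // readabilityBUNDLE extractors do much more than this.
    static String extract(String html) {
        Matcher m = Pattern.compile("<article>(.*?)</article>", Pattern.DOTALL)
                           .matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String firstPageHtml =
            "<html><body><article>Page one text.</article></body></html>";

        // Next-page HTML sources collected as a List of String,
        // mirroring the nextPagesHtmlSources argument above.
        List<String> nextPagesHtmlSources = new ArrayList<>();
        nextPagesHtmlSources.add(
            "<html><body><article>Page two text.</article></body></html>");

        // Consolidate: first page's content, then each next page, in order.
        StringBuilder consolidated = new StringBuilder(extract(firstPageHtml));
        for (String pageHtml : nextPagesHtmlSources) {
            consolidated.append("\n").append(extract(pageHtml));
        }
        System.out.println(consolidated);
    }
}
```

This is only a sketch of the consolidation flow; in the bundle itself the per-algorithm extractors, not a regex, produce each page's content.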
Build with Maven:

```shell
mvn clean package
```
Apache License 2.0 - http://www.apache.org/licenses/LICENSE-2.0.html