DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

Information
Category: Java / Natural Language Processing
Watchers: 16
Stars: 40
Forks: 7
Last update: Nov 26, 2021

Related Repos



pemistahl Quick Info: this library tries to solve language detection of very short words and phrases, even shorter than tweets; makes use of both statistical and rule-based approaches; outperforms Apache Tika, Apache OpenNLP
 

scalanlp Chalk NOTE: This project is currently dormant with no current prospect for further development. Suggestion: check out OpenNLP or StanfordNLP for the JVM or spaCy for Python. (If anyone would like to do something like spaCy for Sc
 

mimno Mallet Website: http://mallet.cs.umass.edu/index.php MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine l
 

stanfordnlp Stanford CoreNLP Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words, their parts of speech, whether they are names of
 
