A simple implementation of simhash algorithm by java.
- compute the simhash of a string
- compute the similarity between all the strins by build smart index, so We can deal with big data.
How to use:
run Main with inputfile and outputfile.
The format of inputfile(see src/test_in): one doc eachline with the utf8 charset.
The format of outputfile(see src/test_out):
start //start flag
first line // doc
sencode lien // doc1\tdist the dist is the hamming distance between doc and doc1
end //end flag
- Build the project to a runnable jar.
- Improve the performace under big data.
- Before run Main.java, you should choose a better analyzer instead of BinaryWordSeg!