Easy LSH
This is a simple Locality-Sensitive Hash for document deduplication. Hashes are compared using their Hamming distance: the number of bit positions in which two hashes differ.
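As a rough illustration of the comparison step, the Hamming distance between two 64-bit hashes can be computed by XOR-ing them and counting the set bits. The class name `HammingDemo` is made up for this sketch and is not part of the library.

```java
public class HammingDemo {
    /** Number of bit positions in which two 64-bit hashes differ. */
    public static int hammingDistance(long a, long b) {
        // XOR leaves a 1 in every position where the inputs disagree
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = 0b1011L;
        long b = 0b0001L;
        // 1011 ^ 0001 = 1010, which has two set bits
        System.out.println(hammingDistance(a, b)); // 2
    }
}
```

A small distance suggests near-duplicate documents, while unrelated documents tend to land much further apart.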
Central Classes
Demo
Consider statistical distribution only
var lsh1 = new EasyLSH();
lsh1.addUnordered("lorem");
lsh1.addUnordered("ipsum");
lsh1.addUnordered("dolor");
lsh1.addUnordered("sit");
lsh1.addUnordered("amet");
long hash1 = lsh1.get();
var lsh2 = new EasyLSH();
lsh2.addUnordered("amet");
lsh2.addUnordered("ipsum");
lsh2.addUnordered("lorem");
lsh2.addUnordered("dolor");
lsh2.addUnordered("SEAT");
long hash2 = lsh2.get();
System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
// 1 -- these are similar
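To give an intuition for why insertion order does not matter in the unordered case, here is a minimal SimHash-style sketch. It is an assumption about the general technique, not the actual EasyLSH implementation: the class `SimHashSketch`, its methods, and the choice of bit mixer are all hypothetical. Each added item votes +1 or -1 on every bit position, and the final hash keeps the bits with a positive tally.

```java
public class SimHashSketch {
    // One signed tally per output bit (hypothetical layout, not EasyLSH's)
    private final int[] counts = new int[64];

    // Spread a 32-bit hashCode over 64 bits (SplitMix64 finalizer; any decent mixer would do)
    private static long mix(long h) {
        h += 0x9E3779B97F4A7C15L;
        h = (h ^ (h >>> 30)) * 0xBF58476D1CE4E5B9L;
        h = (h ^ (h >>> 27)) * 0x94D049BB133111EBL;
        return h ^ (h >>> 31);
    }

    /** Each item votes +1 on bits it has set and -1 on bits it has clear. */
    public void add(Object o) {
        long h = mix(o.hashCode());
        for (int i = 0; i < 64; i++) {
            counts[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
        }
    }

    /** Bits with a positive tally are set in the resulting hash. */
    public long get() {
        long result = 0;
        for (int i = 0; i < 64; i++) {
            if (counts[i] > 0) {
                result |= 1L << i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        var a = new SimHashSketch();
        a.add("lorem"); a.add("ipsum"); a.add("dolor");

        var b = new SimHashSketch();
        b.add("dolor"); b.add("lorem"); b.add("ipsum");

        // Addition is commutative, so the order of add() calls cannot change the tallies
        System.out.println(a.get() == b.get());
    }
}
```

Because the tallies are plain sums, swapping a single word only nudges each tally by a small amount, so most bits of the hash stay the same.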
Consider order as well as distribution
var lsh1 = new EasyLSH();
lsh1.addOrdered("lorem");
lsh1.addOrdered("ipsum");
lsh1.addOrdered("dolor");
lsh1.addOrdered("sit");
lsh1.addOrdered("amet");
long hash1 = lsh1.get();
var lsh2 = new EasyLSH();
lsh2.addOrdered("amet");
lsh2.addOrdered("ipsum");
lsh2.addOrdered("lorem");
lsh2.addOrdered("dolor");
lsh2.addOrdered("SEAT");
long hash2 = lsh2.get();
System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
// 5 -- these are not very similar
// note the value is relatively low because there are few words
// and there simply can't be very many differences
// it will approach 32 as documents grow larger
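The figure of 32 follows from the hash being 64 bits wide: if two hashes are effectively unrelated, each bit differs with probability one half, so the expected distance is 64 × 0.5 = 32. The sketch below checks this empirically with pseudo-random 64-bit values standing in for hashes of unrelated documents. The class name `ExpectedDistanceDemo`, the seed, and the sample count are arbitrary choices for the illustration.

```java
import java.util.Random;

public class ExpectedDistanceDemo {
    /** Average Hamming distance over the given number of random 64-bit pairs. */
    public static double averageDistance(long seed, int samples) {
        Random random = new Random(seed);
        long total = 0;
        for (int i = 0; i < samples; i++) {
            long a = random.nextLong();
            long b = random.nextLong();
            total += Long.bitCount(a ^ b);
        }
        return (double) total / samples;
    }

    public static void main(String[] args) {
        // Prints a value close to 32
        System.out.println(averageDistance(1234L, 100_000));
    }
}
```

In practice this means a distance well below 32 is evidence of similarity, while a distance near 32 is what unrelated long documents are expected to produce.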