Viktor Lofgren 1d34224416 (refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.		2024-02-23 16:13:40 +01:00

Easy LSH

This is a simple Locality-Sensitive Hash for document deduplication. Hashes are compared using their Hamming distance.
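To illustrate the comparison step: the Hamming distance between two 64-bit hashes is the number of bit positions in which they differ. The following is a minimal sketch of that idea, not necessarily how `EasyLSH` implements it; the class name `HammingDemo` is made up for this example.

```java
public class HammingDemo {
    /** Counts the bit positions in which two 64-bit hashes differ. */
    static int hammingDistance(long a, long b) {
        // XOR leaves a 1 in every position where the inputs disagree;
        // Long.bitCount then counts those 1 bits.
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = 0b1011L;
        long b = 0b0010L;
        // 1011 ^ 0010 = 1001, which has two bits set
        System.out.println(hammingDistance(a, b)); // 2
    }
}
```

A small distance means the two hashes, and so presumably the two documents, are similar.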

Central Classes

EasyLSH

Demo

Consider statistical distribution only

var lsh1 = new EasyLSH();
lsh1.addUnordered("lorem");
lsh1.addUnordered("ipsum");
lsh1.addUnordered("dolor");
lsh1.addUnordered("sit");
lsh1.addUnordered("amet");

long hash1 = lsh1.get();

var lsh2 = new EasyLSH();
lsh2.addUnordered("amet");
lsh2.addUnordered("ipsum");
lsh2.addUnordered("lorem");
lsh2.addUnordered("dolor");
lsh2.addUnordered("SEAT");

long hash2 = lsh2.get();

System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
// 1 -- these are similar

Consider order as well as distribution

var lsh1 = new EasyLSH();
lsh1.addOrdered("lorem");
lsh1.addOrdered("ipsum");
lsh1.addOrdered("dolor");
lsh1.addOrdered("sit");
lsh1.addOrdered("amet");

long hash1 = lsh1.get();

var lsh2 = new EasyLSH();
lsh2.addOrdered("amet");
lsh2.addOrdered("ipsum");
lsh2.addOrdered("lorem");
lsh2.addOrdered("dolor");
lsh2.addOrdered("SEAT");

long hash2 = lsh2.get();

System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
// 5 -- these are not very similar

// Note: the value is relatively low because there are so few words
// that there simply can't be very many differences.
// It will approach 32 (half of the 64 bits in the hash) as documents grow larger.