67 lines
1.4 KiB
Markdown
67 lines
1.4 KiB
Markdown
|
# Easy LSH
|
||
|
|
||
|
This a simple [Locality-Sensitive Hash](https://en.wikipedia.org/wiki/Locality-sensitive_hashing)
|
||
|
for document deduplication. Hashes are compared using their hamming distance.
|
||
|
|
||
|
## Central Classes
|
||
|
|
||
|
* [EasyLSH](src/main/java/nu/marginalia/lsh/EasyLSH.java)
|
||
|
|
||
|
## Demo
|
||
|
|
||
|
Consider statistical distribution only
|
||
|
|
||
|
```java
|
||
|
var lsh1 = new EasyLSH();
|
||
|
lsh1.addUnordered("lorem");
|
||
|
lsh1.addUnordered("ipsum");
|
||
|
lsh1.addUnordered("dolor");
|
||
|
lsh1.addUnordered("sit");
|
||
|
lsh1.addUnordered("amet");
|
||
|
|
||
|
long hash1 = lsh1.get();
|
||
|
|
||
|
var lsh2 = new EasyLSH();
|
||
|
lsh2.addUnordered("amet");
|
||
|
lsh2.addUnordered("ipsum");
|
||
|
lsh2.addUnordered("lorem");
|
||
|
lsh2.addUnordered("dolor");
|
||
|
lsh2.addUnordered("SEAT");
|
||
|
|
||
|
long hash2 = lsh2.get();
|
||
|
|
||
|
System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
|
||
|
// 1 -- these are similar
|
||
|
|
||
|
```
|
||
|
|
||
|
Consider order as well as distribution
|
||
|
|
||
|
```java
|
||
|
var lsh1 = new EasyLSH();
|
||
|
lsh1.addOrdered("lorem");
|
||
|
lsh1.addOrdered("ipsum");
|
||
|
lsh1.addOrdered("dolor");
|
||
|
lsh1.addOrdered("sit");
|
||
|
lsh1.addOrdered("amet");
|
||
|
|
||
|
long hash1 = lsh1.get();
|
||
|
|
||
|
var lsh2 = new EasyLSH();
|
||
|
lsh2.addOrdered("amet");
|
||
|
lsh2.addOrdered("ipsum");
|
||
|
lsh2.addOrdered("lorem");
|
||
|
lsh2.addOrdered("dolor");
|
||
|
lsh2.addOrdered("SEAT");
|
||
|
|
||
|
|
||
|
long hash2 = lsh2.get();
|
||
|
|
||
|
System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
|
||
|
// 5 -- these are not very similar
|
||
|
|
||
|
// note the value is relatively low because there are few words
|
||
|
// and there simply can't be very many differences
|
||
|
// it will approach 32 as documents grow larger
|
||
|
```
|