CatgirlIntelligenceAgency/code/features-convert/summary-extraction
java/nu/marginalia/summary
test/nu/marginalia/summary
test-resources/html
build.gradle
readme.md

Summary Extraction

This feature attempts to find a descriptive passage of text that summarizes what a search result "is about". It's the text you see below a search result.

It must solve two problems:

  1. Identify which part of the document contains "the text". The crux is that the document may date from anywhere between 1993 and the present, with era-appropriate formatting. It may be laid out with <center>ed <font>-tags, or with semantic HTML5.

  2. Identify which part of "the text" best describes the document.

It uses several naive heuristics to try to find something that makes sense, and there is probably room for improvement.
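
To give a feel for what a "naive heuristic" can look like, here is a minimal sketch. It is not the module's actual implementation: the class name `SummarySketch`, the thresholds and the prose test are all assumptions made up for illustration. It takes candidate passages that have already been pulled out of a document, prefers the first one that looks like prose, falls back to the longest passage, and trims the result at a word boundary.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; names and thresholds are illustrative assumptions,
// not taken from the summary-extraction module.
class SummarySketch {
    static final int MIN_LENGTH = 50;   // shorter passages are ignored by the prose heuristic
    static final int MAX_LENGTH = 200;  // longest summary we are willing to show

    /** Crude prose test: at least 80% of the characters are letters or spaces. */
    static boolean looksLikeProse(String text) {
        long plain = text.chars()
                .filter(c -> Character.isLetter(c) || c == ' ')
                .count();
        return plain * 10 >= text.length() * 8L;
    }

    /** Heuristic 1: the first passage that is long enough and reads like prose. */
    static Optional<String> firstProseLikePassage(List<String> passages) {
        return passages.stream()
                .map(String::strip)
                .filter(p -> p.length() >= MIN_LENGTH)
                .filter(SummarySketch::looksLikeProse)
                .findFirst();
    }

    /** Try heuristic 1, then fall back to the longest passage, then trim. */
    static String summarize(List<String> passages) {
        String chosen = firstProseLikePassage(passages)
                .orElseGet(() -> passages.stream()
                        .map(String::strip)
                        .max(Comparator.comparingInt(String::length))
                        .orElse(""));
        return truncate(chosen, MAX_LENGTH);
    }

    /** Cut at the last space at or before maxLength, so words stay whole. */
    static String truncate(String text, int maxLength) {
        if (text.length() <= maxLength) return text;
        int cut = text.lastIndexOf(' ', maxLength);
        if (cut <= 0) cut = maxLength; // no usable space: hard cut
        return text.substring(0, cut);
    }

    public static void main(String[] args) {
        List<String> passages = List.of(
                "[Home] [About] [Links] [Guestbook] [Webring] [FAQ] [News] [1] [2] [3]",
                "This page describes how to restore a vintage typewriter, "
                        + "from cleaning the keys to replacing the ribbon.");
        System.out.println(summarize(passages));
    }
}
```

The navigation bar in the example is rejected because too many of its characters are brackets and digits, so the descriptive sentence wins. Everything here is plain string work with no parsing or model inference, which keeps the per-document cost low.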

There are many good techniques for doing this, but sadly they have not proved particularly fast. Whatever solution is used needs to be able to summarize on the order of 100,000,000 documents with a time budget of a couple of hours.
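
To make that budget concrete, the arithmetic can be sketched as below. The figures are assumptions: "a couple of hours" is read as exactly two hours, and the thread count is a hypothetical parameter, not a description of the actual hardware.

```java
import java.time.Duration;

// Back-of-envelope budget calculation; all inputs are illustrative assumptions.
class SummaryBudget {
    /** Average processing time available per document on each thread, in nanoseconds. */
    static long nanosPerDocument(long documents, Duration budget, int threads) {
        return budget.toNanos() * threads / documents;
    }

    /** Overall throughput needed to finish within the budget (rounded down). */
    static long documentsPerSecond(long documents, Duration budget) {
        return documents / budget.getSeconds();
    }

    public static void main(String[] args) {
        long documents = 100_000_000L;
        Duration budget = Duration.ofHours(2); // assumption: "a couple of hours" = 2 h

        System.out.println(documentsPerSecond(documents, budget));   // 13888 docs/s
        System.out.println(nanosPerDocument(documents, budget, 1));  // 72000 ns = 72 µs
        System.out.println(nanosPerDocument(documents, budget, 16)); // 1152000 ns ≈ 1.15 ms
    }
}
```

Under these assumptions a single thread gets about 72 microseconds per document, and even 16 threads only raise that to a little over a millisecond each, which is why heavier summarization techniques do not fit.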

Central Classes