Clean up summary extractor module.

2023-03-18 10:28:48 +01:00 · 2023-03-18 10:28:48 +01:00 · 950c49d80f
commit 950c49d80f
parent 8def95e849
1 changed files with 9 additions and 0 deletions
--- a/code/features-convert/summary-extraction/readme.md
+++ b/code/features-convert/summary-extraction/readme.md
@@ -3,6 +3,14 @@
 This feature attempts to find a descriptive passage of text that summarizes
 what a search result "is about". It's the text you see below a search result.

+It must solve two problems:
+
+1.  Identify which part of the document contains "the text".
+The crux is that the document may date from anywhere between 1993 and the present, with era-appropriate
+formatting. Headings may be &lt;center&gt;ed &lt;font&gt;-tags, or semantic HTML5.
+
+2. Identify which part of "the text" best describes the document.
+
 It uses several naive heuristics to try to find something that makes sense,
 and there is probably room for improvement. 
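
As a rough illustration of the two problems, here is a minimal sketch that works on text blocks already pulled out of a document. It is not the actual logic in `SummaryExtractor`. The class name `SummarySketch`, the method names, and the thresholds are all hypothetical, and the "looks like prose" test is just one naive heuristic among many possible ones.

```java
import java.util.List;
import java.util.Optional;

class SummarySketch {
    // Hypothetical thresholds, chosen only for illustration.
    static final int MIN_WORDS = 8;
    static final int MAX_LENGTH = 200;

    /** Problem 1: crude test for running prose, as opposed to navigation or a heading. */
    static boolean looksLikeProse(String block) {
        String trimmed = block.strip();
        if (trimmed.isEmpty()) {
            return false;
        }
        int words = trimmed.split("\\s+").length;
        boolean hasSentencePunctuation = trimmed.contains(". ") || trimmed.endsWith(".");
        return words >= MIN_WORDS && hasSentencePunctuation;
    }

    /** Problem 2: take the first prose-like block and cut it to a displayable length. */
    static Optional<String> pickSummary(List<String> blocks) {
        return blocks.stream()
                .filter(SummarySketch::looksLikeProse)
                .map(String::strip)
                .map(s -> s.length() <= MAX_LENGTH ? s : s.substring(0, MAX_LENGTH))
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> blocks = List.of(
                "Home | About | Links",
                "WELCOME TO MY PAGE",
                "This site collects notes on restoring old mechanical typewriters. "
                        + "It covers cleaning, parts and repair.");
        System.out.println(pickSummary(blocks).orElse("(no summary found)"));
    }
}
```

Working on plain strings keeps the sketch independent of any particular HTML parser. In practice the blocks would come from whatever tags the era of the document happens to use.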

@@ -10,6 +18,7 @@ There are many good techniques for doing this, but they've sadly not proved
 particularly fast. Whatever solution is used needs to be able to summarize on
 the order of 100,000,000 documents with a time budget of a couple of hours.
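
To put that budget in per-document terms, here is a small back-of-the-envelope calculation. It assumes "a couple of hours" means two hours, and the thread count of 16 is purely hypothetical. The class and method names are made up for this sketch.

```java
class SummaryBudget {
    /** Average microseconds available per document when the work is spread evenly over the threads. */
    static double microsPerDocument(long documents, double hours, int threads) {
        double totalMicros = hours * 3600.0 * 1_000_000.0;
        return totalMicros * threads / documents;
    }

    public static void main(String[] args) {
        // Assumed figures: 100,000,000 documents in 2 hours.
        System.out.println(microsPerDocument(100_000_000L, 2.0, 1));  // prints 72.0
        System.out.println(microsPerDocument(100_000_000L, 2.0, 16)); // prints 1152.0
    }
}
```

Under these assumptions a single thread has about 72 microseconds per document, and each of 16 threads has a little over a millisecond, which is why heavier summarization techniques are hard to fit in.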

+
 ## Central Classes

 * [SummaryExtractor](src/main/java/nu/marginalia/summary/SummaryExtractor.java)