Language Processing
This library contains various tools used in language processing.
Central Classes
- SentenceExtractor - Creates a DocumentLanguageData from a text, containing its words, their stems, POS tags, and so on.
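
To give a rough feel for the kind of data such an extractor produces, here is a minimal, self-contained sketch. It is not this library's API: `LanguageDataSketch`, `SentenceData`, `extract` and `naiveStem` are hypothetical names made up for illustration, the sentence splitting is a plain regex, the stemmer is a crude suffix stripper, and POS tagging is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the SentenceExtractor from this library.
public class LanguageDataSketch {

    /** Hypothetical stand-in for one sentence's worth of language data. */
    public static final class SentenceData {
        private final List<String> words;
        private final List<String> stems;

        SentenceData(List<String> words, List<String> stems) {
            this.words = List.copyOf(words);
            this.stems = List.copyOf(stems);
        }

        public List<String> words() { return words; }
        public List<String> stems() { return stems; }
    }

    /** Splits text into sentences, then into lowercased words paired with naive stems. */
    public static List<SentenceData> extract(String text) {
        List<SentenceData> sentences = new ArrayList<>();
        // Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            List<String> words = new ArrayList<>();
            List<String> stems = new ArrayList<>();
            // Tokenize on anything that is not a letter or a digit.
            for (String token : sentence.split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) continue;
                String word = token.toLowerCase(Locale.ROOT);
                words.add(word);
                stems.add(naiveStem(word));
            }
            if (!words.isEmpty()) {
                sentences.add(new SentenceData(words, stems));
            }
        }
        return sentences;
    }

    /** Crude suffix stripping; a real stemmer handles far more cases. */
    static String naiveStem(String word) {
        if (word.length() > 4 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```

Calling `LanguageDataSketch.extract("Cats are running. Dogs bark.")` yields two sentences; each keeps its words and stems in parallel lists, so the stem at index i belongs to the word at index i.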
See Also
features-convert/keyword-extraction uses this code to identify which keywords are important.
features-qs/query-parser also does some language processing.