Viktor Lofgren
e710e057e2
(db) Remove EC_URL and EC_PAGE_DATA from mariadb database
2023-08-25 13:45:03 +02:00
Viktor Lofgren
460998d512
(index) Move index construction to separate process.
...
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
1e6800565a
(system) Remove EdgeId<T> and similar objects
...
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1
(search) Basic working integration of linkdb in search service
2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412
(index) Implement new URL ID coding scheme.
...
Also refactor along the way. Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
6a04cdfddf
(loader) Implement new linkdb in loader
...
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.
For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.
2023-08-24 13:07:54 +02:00
Viktor Lofgren
c70670bacb
(common) New UrlIdCodec class
...
Have a single class responsible for encoding and decoding URL ids, as it's a bit finicky and used all over.
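As a rough illustration of what such a codec might look like, here is a minimal sketch that packs a domain id and a document ordinal into a single 64-bit URL id. The class name `UrlIdSketch`, the method names, and the 26-bit ordinal width are all assumptions made for this example; they are not taken from the actual UrlIdCodec.

```java
// Hypothetical sketch only: the bit layout and names are assumptions,
// not the real UrlIdCodec implementation.
public final class UrlIdSketch {
    // Assumed layout: the low 26 bits hold the document ordinal,
    // the bits above them hold the domain id.
    private static final int ORDINAL_BITS = 26;
    private static final long ORDINAL_MASK = (1L << ORDINAL_BITS) - 1;

    private UrlIdSketch() {}

    /** Combine a domain id and a per-domain document ordinal into one id. */
    public static long encode(int domainId, int documentOrdinal) {
        if (domainId < 0) {
            throw new IllegalArgumentException("domainId must be non-negative");
        }
        if (documentOrdinal < 0 || documentOrdinal > ORDINAL_MASK) {
            throw new IllegalArgumentException("documentOrdinal out of range");
        }
        return ((long) domainId << ORDINAL_BITS) | documentOrdinal;
    }

    /** Extract the domain id from an encoded id. */
    public static int domainId(long urlId) {
        return (int) (urlId >>> ORDINAL_BITS);
    }

    /** Extract the document ordinal from an encoded id. */
    public static int documentOrdinal(long urlId) {
        return (int) (urlId & ORDINAL_MASK);
    }
}
```

Keeping the shifts and masks in one place means callers never touch the bit layout directly, which is presumably the point of having a single class for it.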
2023-08-24 11:41:07 +02:00
Viktor Lofgren
7bb3e44a76
(common) Deprecate EdgeId and similar
2023-08-24 11:16:28 +02:00
Viktor Lofgren
b958acb76a
(file-storage) New File Storage type for linkdb
2023-08-24 09:06:13 +02:00
Viktor Lofgren
b22f4fbb72
(linkdb) New Module for sqlite-backed document db
2023-08-24 09:06:13 +02:00
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
...
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
...
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
15912f31d0
(control-service) Basic GUI for deleting bad links from exploration mode
2023-08-21 18:35:26 +02:00
Viktor Lofgren
6cb784df75
(minor) Improve comment
2023-08-18 11:25:36 +02:00
Viktor Lofgren
c019a029ec
(flags) Documentation and preventative bugfix
2023-08-17 17:42:31 +02:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f
(valuation) Penalize wordpress style kebab case urls
2023-08-16 13:11:24 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
8210e49b4e
(control) Helpful tooltips for the Actor table.
2023-08-13 12:55:56 +02:00
Viktor Lofgren
d6b8b38955
(db) Add indices on SERVICE_EVENTLOG
2023-08-12 15:00:15 +02:00
Viktor Lofgren
6483308bb0
(sql) Update default value for DOMAIN_SELECTION_TYPE
2023-08-11 14:01:15 +02:00
Viktor Lofgren
7440da240d
(blacklist) Fix broken SQL migration
2023-08-11 13:33:35 +02:00
Viktor Lofgren
4f8048be31
(blacklist) Blacklist management
2023-08-10 15:40:07 +02:00
Viktor Lofgren
807fb2d052
(service) Task heartbeat creates event log entries
2023-08-09 15:15:16 +02:00
Viktor Lofgren
ce293029c7
(converter) Treat adtech tracking as advertisement.
2023-08-09 14:23:53 +02:00
Viktor Lofgren
b5ed21be21
(mq) MqPersistence no longer relies on autoCommit being enabled
2023-08-09 14:23:22 +02:00
Viktor Lofgren
251fc63b42
(*) Fix merge gore
2023-08-09 13:33:28 +02:00
Viktor Lofgren
afad4f5ebb
(*) last touches
2023-08-07 12:59:33 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program
2023-08-07 12:53:43 +02:00
Viktor Lofgren
715d61dfea
(mq) Fix bug in notice handling where they were registered on the wrong name
2023-08-05 14:45:04 +02:00
Viktor Lofgren
c2b45bec8d
(mq) Rename notify to sendNotice to avoid name clash with the java object function
2023-08-05 14:45:04 +02:00
Viktor Lofgren
cdfe284f9a
(file storage) File Storage Type for EXPORT data
2023-08-05 14:45:03 +02:00
Viktor Lofgren
624b78ec3a
(heartbeat) Task heartbeats
2023-08-04 14:40:06 +02:00
Viktor Lofgren
f01f608474
(blacklist) Support blacklists with subdomain
2023-08-03 17:58:52 +02:00
Viktor Lofgren
659d2134ba
(file-storage) Deprecate mustClean flag
2023-08-01 22:32:30 +02:00
Viktor Lofgren
867410c66b
(file-storage) Automatic file storage discovery via manifest file
2023-08-01 18:05:43 +02:00
Viktor Lofgren
58556af6c7
(db) Use Flyway for database migrations.
2023-08-01 17:08:42 +02:00
Viktor Lofgren
2e29038ecd
(db) Fix broken insert statement, move file storage defaults to a separate file.
2023-08-01 15:50:08 +02:00
Viktor Lofgren
c1ea60b399
(db) Default values for storage base
2023-08-01 15:05:04 +02:00
Viktor Lofgren
2f8488610a
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
2023-07-31 14:23:23 +02:00
Viktor Lofgren
d95f01b701
(control) Reduce log spam in control svc
2023-07-31 14:21:06 +02:00
Viktor Lofgren
c9d7635370
(control) Aborting an actor that waits on a process request terminates the running job.
2023-07-31 14:21:06 +02:00
Viktor Lofgren
6b5fb0f841
(control) Disable the start button for actors that aren't directly initializable.
2023-07-31 14:21:00 +02:00
Viktor Lofgren
5411950b87
(minor) Tidy up EdgeDomain class a bit, no functional difference
2023-07-31 10:31:29 +02:00
Viktor Lofgren
5c071ce4d3
(crawler) Clean up the code and remove unnecessary logging
2023-07-30 16:53:39 +02:00
Viktor Lofgren
866db6c63f
(control) Dialog for updating message state; clean up file view.
2023-07-28 22:02:05 +02:00
Viktor Lofgren
77d5e39fe0
Make processed data Serializable
2023-07-28 18:11:19 +02:00
Viktor Lofgren
27e781761d
(mq single shot inbox) Flag messages as OK if there is no recipient
2023-07-28 12:04:23 +02:00
Viktor Lofgren
92cac52813
(mq) Add indexes to MESSAGE_QUEUE
2023-07-28 12:03:51 +02:00
Viktor Lofgren
507f26ad47
(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
09fd0a1d0e
(converter) Automatically clean stale file storage records if they disappear on disk
2023-07-24 17:04:42 +02:00
Viktor Lofgren
7470c170b1
(minor) EdgeUrl.parse() should deal with null
2023-07-24 15:06:57 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
08ca6399ec
(converter) WIP
2023-07-19 17:14:45 +02:00
Viktor Lofgren
c0b5ea0e7d
Revert "Less spammy default log settings"
...
This reverts commit f6e2216b87.
2023-07-18 19:28:42 +02:00
Viktor Lofgren
f21a3983aa
Abortable processes
2023-07-18 18:40:12 +02:00
Viktor Lofgren
f6e2216b87
Less spammy default log settings
2023-07-17 21:42:13 +02:00
Viktor Lofgren
92ed513e4f
Less spammy default log settings
2023-07-17 21:41:56 +02:00
Viktor Lofgren
d7ab21fe34
(*) Refactor Control Service and processes
2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8
(*) Refactor MQ and MQSM
2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9
(control) Name change process->fsm, new fsm:s
...
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
c4dd9a0547
(control) Use MQFSMs to monitor and spawn processes when messages are sent to them
2023-07-16 11:58:47 +02:00
Viktor Lofgren
5ec10634d8
(mqfsm) Abortable state machine
2023-07-15 14:12:16 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
23169ad818
(db) Model for file storage areas
2023-07-14 11:40:05 +02:00
Viktor Lofgren
d36e36c8fd
(mq) Bugfix lastNMessages; use Lists.reverse properly
2023-07-14 11:39:15 +02:00
Viktor Lofgren
948d4d5f08
(control) Clean up the number of GUI views, abortable FSM tasks
2023-07-13 17:24:21 +02:00
Viktor Lofgren
1ec6f9cde2
(mq) More robust resume and recovery logic, protection against spurious state changes, minor bugfixes
2023-07-13 14:55:45 +02:00
Viktor Lofgren
a5118fe8f1
(minor) clean-up
2023-07-12 22:46:14 +02:00
Viktor Lofgren
6c88f00a9d
(mqsm) guard against spurious transitions from unexpected messages
2023-07-12 22:44:05 +02:00
Viktor Lofgren
8a53e107fa
(mq) Synchronous and Asynchronous inboxes.
2023-07-12 20:12:52 +02:00
Viktor Lofgren
0ed938545b
(mq) Add single-shot inbox
2023-07-12 18:41:27 +02:00
Viktor Lofgren
480abfe966
(minor) Add limit to poll count in MqPersistence, fix test
2023-07-12 18:16:23 +02:00
Viktor Lofgren
89e4343fdb
(minor) Fix test
2023-07-12 18:15:50 +02:00
Viktor Lofgren
8c16a2aede
(work-log, minor) Clean up code
2023-07-12 18:10:05 +02:00
Viktor Lofgren
5deec63667
(work-log) Better tests
2023-07-12 18:04:06 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor Lofgren
77261a38cd
(control, WIP) MQFSM and ProcessService are sitting in a tree
...
We're spawning processes from the MQFSM in control service now!
2023-07-11 17:08:43 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor Lofgren
ec7826659a
(minor) Javadoc comments for MqPersistence and MqMessageState
2023-07-10 21:52:25 +02:00
Viktor Lofgren
98b5f22104
(control) WIP control service
...
* Set messages to OK when received so they're cleaned up properly.
2023-07-10 21:33:57 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
d9e6c4f266
Trial integration of MQ-FSM into index service.
2023-07-06 18:04:16 +02:00
Viktor Lofgren
f0a8ca440f
MQFSM Usability WIP
2023-07-06 13:33:11 +02:00
Viktor Lofgren
d89db10645
MQFSM Usability WIP
2023-07-06 13:02:16 +02:00
Viktor Lofgren
2ae0b8c159
Message queue based state machine
2023-07-04 17:42:06 +02:00
Viktor Lofgren
31ae71c7d6
Message queue WIP
2023-07-04 14:28:14 +02:00
Viktor Lofgren
62cc9df206
Embryo of new control process
...
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
8274e8a953
JVM flags for disabling black and block-lists.
2023-06-30 17:07:47 +02:00
Viktor Lofgren
a6a66c6d8a
Improve site info for unknown domains:
...
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
f92d8a0975
EdgeUrl conversion to/from java.net.URL
2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192
Fix serialization bug with CompressedBigString
2023-06-27 10:57:54 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
54c2be893b
TRIVIAL: Remove unused import.
2023-06-22 17:21:47 +02:00
Viktor Lofgren
b5ef67ed28
Categorize generators by type
...
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
9455100907
Throw a custom exception when WMSA_HOME isn't found
2023-06-20 11:37:52 +02:00
Viktor Lofgren
d1a004bea6
(minor) Clean up StringPool
2023-06-19 17:58:19 +02:00
Viktor Lofgren
2cda57355a
More word metadata tests
2023-05-28 11:57:06 +02:00
Viktor Lofgren
d42ab19166
Issue 5: Fix bug where some IPv6 addresses blew up domain loading.
2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8
Bug fix for document metadata encoding that breaks year based queries.
2023-04-14 16:56:49 +02:00
Viktor
a278fc6296
Increase search result relevance ( #8 )
...
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
716ab35b4e
Search ranking debuggability improvements.
2023-04-02 13:43:24 +02:00
Viktor Lofgren
cc4e089a5d
Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc.
2023-03-30 15:46:15 +02:00
Viktor Lofgren
03bd892b95
Improve document processing in conversion.
...
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Cleaning up of related code.
2023-03-28 16:38:00 +02:00
Viktor Lofgren
c5f4cb34bf
Documentation for DB
2023-03-25 16:14:16 +01:00
Viktor
be3ba3ef37
Update readme.md
2023-03-25 15:27:11 +01:00
Viktor
ac1ac3ea57
Move database to a separate module
...
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor
d2a9e1b644
Add processes link to readme.md for code/common
2023-03-25 12:42:44 +01:00
Viktor Lofgren
2f2c86a9f5
Fix bug where WmsaHome wouldn't look in /var/lib/wmsa as a fallback
2023-03-25 10:20:52 +01:00
Viktor Lofgren
964014860a
Get suggestions working again
2023-03-22 15:11:22 +01:00
Viktor Lofgren
46f81aca2f
Break apart reverse index into a separate full index and priority index. Previously both were built using the same code. This will make the priority index about half as big, since it no longer needs to keep metadata.
2023-03-21 16:12:31 +01:00
Viktor Lofgren
ca22c287a5
Make use of DocumentFlags' flags
2023-03-21 16:03:15 +01:00
Viktor Lofgren
72115e490f
Put news into a database table instead of keeping them hardcoded, request counter on front page.
2023-03-19 12:54:58 +01:00
Viktor Lofgren
bdd2b4a43e
Put news into a database table instead of keeping them hardcoded.
2023-03-19 11:46:13 +01:00
Viktor Lofgren
2eb972dea1
Remove unrelated code, break tools into their own directory.
2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076
Yet more restructuring. Improved search result ranking.
2023-03-16 21:35:54 +01:00
Viktor Lofgren
616effdb3c
The refactoring will continue until morale improves.
2023-03-12 10:04:48 +01:00
Viktor Lofgren
4cec89da91
Fix bug where results would sometimes be presented solely based on the fact that the document is important on the site in general, regardless of whether it's important to the document.
2023-03-11 14:20:32 +01:00
Viktor Lofgren
2e2916cebe
Additional code restructuring to get rid of util and misc-style packages.
2023-03-11 13:53:36 +01:00
Viktor Lofgren
6d939175b1
Additional code restructuring to get rid of util and misc-style packages.
2023-03-11 13:48:40 +01:00
Viktor Lofgren
722ff3bffb
Word feature bit for words that appear in the URL, new search profile for plain text files, better plain text titles.
2023-03-10 16:46:56 +01:00
Viktor Lofgren
efb46cc703
Remove count from WordMetadata entirely.
2023-03-09 18:14:14 +01:00
Viktor Lofgren
8fb531c614
Word Metadata's count is hella broken, stopgap fix by bitCounting positions instead as this is messing with the search result ordering very badly.
2023-03-09 17:58:56 +01:00
Viktor Lofgren
9ece07d559
Chasing a result ranking bug
2023-03-09 17:52:35 +01:00
Viktor Lofgren
0ae4731cf1
Add invariant to WordMetadata
2023-03-09 17:27:07 +01:00
Viktor Lofgren
ad1be7c835
Move all code to a code directory.
2023-03-07 17:14:32 +01:00