Commit Graph

1509 Commits

Author SHA1 Message Date
Viktor Lofgren
6a80ac62a5 (doc) Amend crawling documentation 2023-11-17 11:16:06 +01:00
Viktor Lofgren
98efb08e17 (gradle) Make docker image registry and tag configurable
This is to enable running an external repository for production and test.

Use the ./gradle -Pdocker-registry=registry.foo.bar -Pdocker-tag=my-tag while building to accomplish this.  By default, use 'marginalia' for repository and 'latest' as tag.
2023-11-16 21:12:55 +01:00
Viktor Lofgren
f58a9f46be (loader) Don't truncate the entire links table on load
This behavior is an old vestige from the days of only having a single loader process.  We'd truncate the links table because doing inserts/updates was too slow.  This was also important because we had 32 bit ID, and there's a lot of links between domains to go around...

Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.

We also update the PRIMARY KEY to a BIGINT.  We'll need to load the data in excess of billion times to hit an ID rollover, so it'll be fine.
2023-11-16 10:30:12 +01:00
Viktor Lofgren
fd77e62a13 set registry to registry.marginalia.nu 2023-11-15 14:04:44 +01:00
Viktor Lofgren
376228e199 (docker) set registry to registry.marginalia.nu 2023-11-15 14:03:22 +01:00
Viktor Lofgren
8a5b853fae Fix experiment runner 2023-11-15 14:03:17 +01:00
Viktor Lofgren
1cbf23e7e7 (test) Don't fail test if atags.parquet is not in ~vlofgren 2023-11-15 09:11:38 +01:00
Viktor Lofgren
63554ba171 (explore2) Add robots.txt 2023-11-14 09:15:32 +01:00
Viktor Lofgren
5de37cb820 (converter) Set feature flags appropriately on stackexchange posts 2023-11-12 15:48:08 +01:00
Viktor Lofgren
e5cee1f46d (sideload) Fix sideloading so that it doesn't get disproportionately good rankings
Also add type flags so that e.g. wikipedia shows up in the wikis filter.
2023-11-12 14:57:57 +01:00
Viktor Lofgren
e9a01caa5c (index) Fix broken metrics 2023-11-11 12:53:47 +01:00
Viktor Lofgren
858357a246 (metrics) Get prometheus up out of disrepair
* Fix bad labels
* Add nodeId where appropriate
* Hopefully fix histogram buckets for index query times
2023-11-08 14:01:28 +01:00
Viktor Lofgren
ef16502159 (doc) Update readme 2023-11-07 16:00:18 +01:00
Viktor Lofgren
29e2c43e01 (gradle) Up to gradle 8.4 since it has better Java 21 compatibility 2023-11-07 16:00:08 +01:00
Viktor Lofgren
7aa2f80117 (domain) id.au should be treated as a TLD 2023-11-06 19:07:47 +01:00
Viktor
d29f9c4ffd
Merge pull request #59 from MarginaliaSearch/atags
Support for anchor tag keywords

* Added new (optional) model file in $WMSA_HOME/data/atags.parquet. Due to size limitations on github, this is available at https://downloads.marginalia.nu/exports
*  Converter gets a component for creating a projection of its domains onto the full atags parquet file
*  New WordFlag ExternalLink
* These terms are also for now flagged as title words
* The ranking algorithm was tweaked to make better use of ngram information as well as weighting the priority BM25
*  Fixed a bug where Title words aliased with UrlDomain words
*  Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
*  Crawler will also use the anchor tag file to prioritize crawling documents with external links.
2023-11-06 19:02:52 +01:00
Viktor Lofgren
7617b4cbc2 (crawler) Fix NPE in crawler caused by not having fetched the domains list yet 2023-11-06 18:16:38 +01:00
Viktor Lofgren
e0c769fd19 (converter) Integrate atags.parquet with the encyclopedia sideloader
Also clean up stackexchange and dirtree a bit.
2023-11-06 18:03:01 +01:00
Viktor Lofgren
ebd10a5f28 (crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized 2023-11-06 16:14:58 +01:00
Viktor Lofgren
2b77184281 (converter) Integrate atags with the topology field 2023-11-06 13:46:44 +01:00
Viktor Lofgren
e23976f6c4 (search) Fix card title overflow 2023-11-06 13:25:39 +01:00
Viktor Lofgren
0b8dc02eba (result-ranking) Nudge up results with ngram matches a tiny bit 2023-11-06 13:14:22 +01:00
Viktor Lofgren
fde1d0677e (search) Remove unnecessary dependencies 2023-11-06 12:56:32 +01:00
Viktor Lofgren
48986574ae (result-ranking) Use a weighted calculation of priority term importance 2023-11-06 12:56:21 +01:00
Viktor Lofgren
c7a6a71d07 (result-ranking) Use a weighted calculation of priority term importance 2023-11-06 12:48:23 +01:00
Viktor Lofgren
1847845151 Revert "(loader) Optimize INSERT statements"
This reverts commit 7cb92195d1.
2023-11-04 19:32:02 +01:00
Viktor Lofgren
7cb92195d1 (loader) Optimize INSERT statements
INSERT IGNORE is too slow.
2023-11-04 17:43:55 +01:00
Viktor Lofgren
72afa0341f duckdb connection may need to be synchronized? 2023-11-04 14:30:25 +01:00
Viktor Lofgren
0152004c42 Initial Commit Anchor Tags
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
30ca5046b5 (docker) Route screenshots to dating as well 2023-11-02 15:47:18 +01:00
Viktor Lofgren
8e9698c9a0 (control/search) Add ability to suggest removing a site from random exploration
This is what most complaints have been about.
2023-11-02 15:29:49 +01:00
Viktor Lofgren
3047e2dd7c (screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker 2023-11-01 16:38:55 +01:00
Viktor Lofgren
a8b9d21f2d (executor) Refine atag export logic
* Remove obviously uninteresting tags
* Omit URL schema for more sensible sorting
* Change the column order to put the source domain last
2023-11-01 13:23:14 +01:00
Viktor Lofgren
c77a5b7cb6 (control) GUI for atags export 2023-10-31 17:55:47 +01:00
Viktor Lofgren
23f2068e33 (executor) Actor for exporting anchor tag data from crawl data 2023-10-31 17:32:34 +01:00
Viktor Lofgren
ffadfb4149 (control) Use a partial template for the storage types tabs. 2023-10-31 17:12:14 +01:00
Viktor Lofgren
b7e38cfbae (control) Add exports view 2023-10-31 17:08:48 +01:00
Viktor Lofgren
659743b39c (executor) Export Data actor allocates its own storage 2023-10-31 17:04:07 +01:00
Viktor
cbac42bdd1
Merge pull request #55 from MarginaliaSearch/multinode-index
Multinode index, control GUI redesign
2023-10-31 16:37:00 +01:00
Viktor Lofgren
69758c5859 (control) Nicer redirects acknowledging actions 2023-10-31 16:26:29 +01:00
Viktor Lofgren
81bfd7e5fb (experiment) Utility for exporting atags 2023-10-31 16:10:21 +01:00
Viktor Lofgren
fd8a5e695d (build) Upgrade dependencies with CVEs 2023-10-31 16:09:58 +01:00
Viktor Lofgren
8f74dbdbb4 (crawler) Set more lenient parameters for recrawl 2023-10-30 11:35:30 +01:00
Viktor Lofgren
fd5a7eac87 (crawler) Exit crawler retriever on thread interrupted 2023-10-30 11:34:16 +01:00
Viktor Lofgren
6bac3c75cb (api) API documentation 2023-10-29 16:13:21 +01:00
Viktor Lofgren
5d6e0e3790 (log) Clean up logging
Don't log the PROCESS stream to executor's logs, as it will also be logged in the spawned process' log files.

Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.
2023-10-29 15:52:17 +01:00
Viktor Lofgren
2871a326e6 (ctrl/exe) Clean up UX and code 2023-10-29 14:09:39 +01:00
Viktor Lofgren
abb42f0f36 (crawler) Fix bug in SQL statement
Arguments were in the wrong order in inserting fetching sites submitted to be crawled
2023-10-29 13:19:17 +01:00
Viktor Lofgren
f6fcb04817 (experiment) Repair the experiment runner 2023-10-27 16:16:50 +02:00
Viktor Lofgren
b8796d825d (docs) Update documentation 2023-10-27 13:24:49 +02:00