Commit Graph

74 Commits

Author SHA1 Message Date
Clément Renault
d6fa9c0414
Index the intra documents word pair proximities 2020-09-22 14:04:33 +02:00
Clément Renault
e34437b2d7
Move the proximity function to a module 2020-09-22 10:54:59 +02:00
Kerollmops
5664c37539
Introduce an heed codec that reduce the size of small amount of serialized integers 2020-09-07 20:06:23 +02:00
Clément Renault
daa3673c1c
Invert the word docid positions key order 2020-09-06 10:30:53 +02:00
Clément Renault
dc88a86259
Store the word positions under the documents 2020-09-05 18:03:06 +02:00
Kerollmops
580ed1119a
Make the engine to return csv string records as documents and headers 2020-08-31 19:02:00 +02:00
Clément Renault
bad0663138
Come back to the old tokenizer 2020-08-31 13:34:38 +02:00
Clément Renault
ad5cafbfed
Introduce a database to store docids in groups of four positions 2020-08-29 17:42:55 +02:00
Clément Renault
3db517548d
Move the documents back into the LMDB database 2020-08-29 15:14:04 +02:00
Clément Renault
3fe497e129
Improve the Mtbl heed codec to only encode MTBL databases 2020-08-29 11:20:39 +02:00
Clément Renault
0a44ff86ab
Put the documents MTBL back into LMDB
We makes sure to write the documents into a file before
memory mapping it and putting it into LMDB, this way we avoid
moving it to RAM
2020-08-28 15:43:24 +02:00
Clément Renault
d784d87880
Remove the prefix LMDB databases 2020-08-28 14:41:43 +02:00
Clément Renault
7cde312f14
Introduce the StrBEU32Codec heed codec 2020-08-28 14:16:37 +02:00
Clément Renault
8806fcd545
Introduce a better query and document lexer 2020-08-16 14:36:54 +02:00
Clément Renault
1e358e3ae8
Introduce the AstarBagIter that iterates through best paths 2020-08-15 16:24:06 +02:00
Clément Renault
7dc594ba4d
Introduce the Search builder struct 2020-08-13 14:27:51 +02:00
Clément Renault
bfb46cbfbe
Introduce the Crtierion enum 2020-08-12 10:43:02 +02:00
Clément Renault
6d04a285dc
Retrieve and display the distances of the words found 2020-08-11 15:18:02 +02:00
Clément Renault
1bd37d213a
Lowercase quoted words 2020-08-10 14:49:09 +02:00
Clément Renault
883a8109c8
Show both database and documents database sizes 2020-08-10 14:37:18 +02:00
Clément Renault
394844062f
Move the documents MTBL database inside the Index 2020-08-10 13:47:19 +02:00
Clément Renault
91282c8b6a
Move the documents into another file 2020-08-07 13:11:31 +02:00
Clément Renault
fae694a102
Put the documents into an MTBL database 2020-08-07 12:14:40 +02:00
Clément Renault
d3b1096510
Compute the word attribute postings lists on each threads 2020-08-06 11:50:27 +02:00
Kerollmops
9ade00e27b
Highlight all the matching words 2020-07-14 11:53:21 +02:00
Kerollmops
3d144e62c4
Search for best proximities in multiple attributes 2020-07-13 19:06:56 +02:00
Kerollmops
576dd011a1
Compute the candidates but not by attribute 2020-07-13 18:16:05 +02:00
Kerollmops
6b14b20369
Introduce a method to retrieve the number of attributes of the documents 2020-07-13 17:50:16 +02:00
Kerollmops
12358476da
Use the log crate instead of stderr 2020-07-12 10:55:09 +02:00
Kerollmops
d31da26a51
Avoid cloning RoraringBitmaps when unecessary 2020-07-11 23:51:32 +02:00
Kerollmops
b12bfcb03b
Reduce the deepness of the word position document ids
This helps reduce the number of allocations.
2020-07-07 12:30:05 +02:00
Kerollmops
7178b6c2c4
First basic version using MTBL again 2020-07-07 11:32:33 +02:00
Kerollmops
ec1023e790
Intersect document ids by inverse popularity of the words
This reduces the worst request we had which took 56s to now took 3s ("the best of the do").
2020-07-05 19:33:51 +02:00
Kerollmops
2fcae719ad
Use another LRU impl which uses hashbrown 2020-06-29 22:26:06 +02:00
Kerollmops
f98b615bf3
Replace the LRU by an Arc cache 2020-06-29 20:48:57 +02:00
Kerollmops
5f0088594b
Index by writing directly into LMDB 2020-06-29 13:54:47 +02:00
Kerollmops
63cbeca64e
Skip all derived words when too short 2020-06-28 12:13:12 +02:00
Kerollmops
736f0f7560
Use the proximity instead of the attributes when searching for <= 7 proximities 2020-06-28 12:13:12 +02:00
Kerollmops
fe3be8f18a
Replace the HashMap by a Vec for attributes documents ids 2020-06-28 12:13:12 +02:00
Kerollmops
7e16afbdce
Ignore documents which are not part of the candidates when exploring with A* 2020-06-24 15:06:45 +02:00
Kerollmops
1c7a9a4132
Remove the found documents from the candidates list 2020-06-24 15:00:26 +02:00
Kerollmops
50169b9798
Compute the full list of ids we are willing to find by attribute 2020-06-24 14:48:04 +02:00
Kerollmops
374ec6773f
Introduce a database to store all docids for a word and attribute 2020-06-22 19:24:20 +02:00
Kerollmops
ba3e805981
Document the Index types and the internal LMDB databases 2020-06-22 18:09:22 +02:00
Kerollmops
2f0e1afd16
Introduce the roaring bitmap heed codec 2020-06-22 17:56:07 +02:00
Kerollmops
8148210860
Use the cache when retrieving the documents at the end 2020-06-21 12:25:19 +02:00
Kerollmops
1628a31efa
Cache the unions of the derived words positions 2020-06-20 15:38:10 +02:00
Kerollmops
115e0142d9
Add a feature flags to enable the export of stats 2020-06-20 13:25:42 +02:00
Kerollmops
beb49b24f6
Skip looking at connections for proximity 0 2020-06-20 13:19:03 +02:00
Kerollmops
55a8941922
Optimize things 2020-06-19 17:48:17 +02:00