Clément Renault
656a851830
Introduce the Transform struct transforming CSVs
...
This allows us to:
- Transform a CSV, JSON or JSON Lines payload into the same
Grenad x Obkv streamable data type and create the new FieldsIdsMap.
- Extract all the document user ids in advance to be able to delete
the existing documents before re-indexing them.
- Keep only the last document for each user id, avoiding duplicates
in the same request.
2020-10-24 13:37:38 +02:00
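The deduplication step described in this commit can be pictured with a minimal sketch. It is not the actual Transform implementation: the `Document` type and the `keep_last_by_user_id` function are hypothetical names introduced here, and documents are assumed to fit in memory rather than being streamed through Grenad.

```rust
use std::collections::HashMap;

// Hypothetical document type: a user-provided id and a raw payload.
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub user_id: String,
    pub payload: String,
}

/// Keeps only the last document seen for each user id.
/// Also returns the distinct user ids, so that a caller could delete the
/// already-indexed documents with those ids before re-indexing.
pub fn keep_last_by_user_id(docs: Vec<Document>) -> (Vec<String>, Vec<Document>) {
    // Maps each user id to the index of its slot in `kept`.
    let mut slots: HashMap<String, usize> = HashMap::new();
    let mut kept: Vec<Document> = Vec::new();

    for doc in docs {
        let slot = slots.get(&doc.user_id).copied();
        match slot {
            // A later document with the same user id overwrites the earlier one.
            Some(i) => kept[i] = doc,
            None => {
                slots.insert(doc.user_id.clone(), kept.len());
                kept.push(doc);
            }
        }
    }

    let user_ids = kept.iter().map(|d| d.user_id.clone()).collect();
    (user_ids, kept)
}
```

In this sketch the surviving documents keep the position of the first occurrence of their user id; the real ordering used by the Transform struct is not stated in the commit message.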
Clément Renault
8d82e37ec0
Introduce the AvailableDocumentsIds iterator
2020-10-23 12:07:01 +02:00
Clément Renault
566a7c3039
Make the FieldsIdsMap serialization more stable by using a BTreeMap
2020-10-22 14:53:20 +02:00
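As a rough illustration of why a BTreeMap helps here: it iterates in key order, so the serialized bytes do not depend on insertion order. The `encode_fields` function and its byte layout below are assumptions made for this sketch, not the actual FieldsIdsMap serialization.

```rust
use std::collections::BTreeMap;

/// Encodes each entry as: name length (u8), name bytes, field id (u8).
/// Hypothetical format; assumes every field name is shorter than 256 bytes.
pub fn encode_fields(fields: &BTreeMap<String, u8>) -> Vec<u8> {
    let mut out = Vec::new();
    // BTreeMap iteration is sorted by key, so the output is deterministic.
    for (name, id) in fields {
        out.push(name.len() as u8);
        out.extend_from_slice(name.as_bytes());
        out.push(*id);
    }
    out
}
```

Two maps holding the same entries produce identical bytes regardless of the order in which the fields were inserted, which a HashMap does not guarantee.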
Clément Renault
9133f38138
Introduce the FieldsIdsMap type
2020-10-22 12:56:35 +02:00
Clément Renault
5caf523fd9
Move the Index to its own module
2020-10-21 15:55:48 +02:00
Clément Renault
a122d3d466
Export the indexing part into a module
2020-10-20 14:22:09 +02:00
Clément Renault
871222aebd
Introduce some new routes to handle live indexing
2020-10-19 16:06:43 +02:00
Clément Renault
65e32fecb1
Move the binaries into one with subcommands
2020-10-19 13:44:17 +02:00
Clément Renault
83c1db8763
Introduce the UpdateStore
2020-10-18 15:26:57 +02:00
Kerollmops
a00f5850ee
Add support for placeholder search for empty queries
2020-10-06 20:19:50 +02:00
Clément Renault
ce8e56ee18
Rewrite the indexer to use one MTBL by database
...
This allows us to avoid prefixing keys and appending into LMDB databases
2020-10-04 17:04:33 +02:00
Kerollmops
007e647462
Introduce the Mdfs Iterator that explores the proximity graph using a mana DFS
2020-10-02 16:46:07 +02:00
Kerollmops
d0c73564b1
Use the CboRoaringBitmapCodec for the word pair proximity docids
2020-10-02 16:46:06 +02:00
Kerollmops
4eda149ffa
Rename the BoRoaringBitmap codec
2020-10-02 16:46:06 +02:00
Clément Renault
bc35c9a598
Introduce the size_of_database infos subcommand
2020-10-02 16:46:05 +02:00
Clément Renault
d6fa9c0414
Index the intra documents word pair proximities
2020-09-22 14:04:33 +02:00
Clément Renault
e34437b2d7
Move the proximity function to a module
2020-09-22 10:54:59 +02:00
Kerollmops
5664c37539
Introduce a heed codec that reduces the size of small amounts of serialized integers
2020-09-07 20:06:23 +02:00
Clément Renault
daa3673c1c
Invert the word docid positions key order
2020-09-06 10:30:53 +02:00
Clément Renault
dc88a86259
Store the word positions under the documents
2020-09-05 18:03:06 +02:00
Kerollmops
580ed1119a
Make the engine return CSV string records as documents and headers
2020-08-31 19:02:00 +02:00
Clément Renault
bad0663138
Come back to the old tokenizer
2020-08-31 13:34:38 +02:00
Clément Renault
ad5cafbfed
Introduce a database to store docids in groups of four positions
2020-08-29 17:42:55 +02:00
Clément Renault
3db517548d
Move the documents back into the LMDB database
2020-08-29 15:14:04 +02:00
Clément Renault
3fe497e129
Improve the Mtbl heed codec to only encode MTBL databases
2020-08-29 11:20:39 +02:00
Clément Renault
0a44ff86ab
Put the documents MTBL back into LMDB
...
We make sure to write the documents into a file before
memory mapping it and putting it into LMDB; this way we avoid
moving it to RAM
2020-08-28 15:43:24 +02:00
Clément Renault
d784d87880
Remove the prefix LMDB databases
2020-08-28 14:41:43 +02:00
Clément Renault
7cde312f14
Introduce the StrBEU32Codec heed codec
2020-08-28 14:16:37 +02:00
Clément Renault
8806fcd545
Introduce a better query and document lexer
2020-08-16 14:36:54 +02:00
Clément Renault
1e358e3ae8
Introduce the AstarBagIter that iterates through best paths
2020-08-15 16:24:06 +02:00
Clément Renault
7dc594ba4d
Introduce the Search builder struct
2020-08-13 14:27:51 +02:00
Clément Renault
bfb46cbfbe
Introduce the Criterion enum
2020-08-12 10:43:02 +02:00
Clément Renault
6d04a285dc
Retrieve and display the distances of the words found
2020-08-11 15:18:02 +02:00
Clément Renault
1bd37d213a
Lowercase quoted words
2020-08-10 14:49:09 +02:00
Clément Renault
883a8109c8
Show both database and documents database sizes
2020-08-10 14:37:18 +02:00
Clément Renault
394844062f
Move the documents MTBL database inside the Index
2020-08-10 13:47:19 +02:00
Clément Renault
91282c8b6a
Move the documents into another file
2020-08-07 13:11:31 +02:00
Clément Renault
fae694a102
Put the documents into an MTBL database
2020-08-07 12:14:40 +02:00
Clément Renault
d3b1096510
Compute the word attribute postings lists on each thread
2020-08-06 11:50:27 +02:00
Kerollmops
9ade00e27b
Highlight all the matching words
2020-07-14 11:53:21 +02:00
Kerollmops
3d144e62c4
Search for best proximities in multiple attributes
2020-07-13 19:06:56 +02:00
Kerollmops
576dd011a1
Compute the candidates but not by attribute
2020-07-13 18:16:05 +02:00
Kerollmops
6b14b20369
Introduce a method to retrieve the number of attributes of the documents
2020-07-13 17:50:16 +02:00
Kerollmops
12358476da
Use the log crate instead of stderr
2020-07-12 10:55:09 +02:00
Kerollmops
d31da26a51
Avoid cloning RoaringBitmaps when unnecessary
2020-07-11 23:51:32 +02:00
Kerollmops
b12bfcb03b
Reduce the depth of the word position document ids
...
This helps reduce the number of allocations.
2020-07-07 12:30:05 +02:00
Kerollmops
7178b6c2c4
First basic version using MTBL again
2020-07-07 11:32:33 +02:00
Kerollmops
ec1023e790
Intersect document ids by inverse popularity of the words
...
This reduces our worst request ("the best of the do") from 56s to 3s.
2020-07-05 19:33:51 +02:00
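The idea behind intersecting by inverse popularity can be sketched as follows: start from the rarest word's document ids so the running intersection stays small. This is an illustration only. It uses `BTreeSet<u32>` from the standard library as a stand-in for the bitmaps the engine uses, and `intersect_rarest_first` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Intersects the document id sets of all query words,
/// processing the smallest (least popular) sets first.
pub fn intersect_rarest_first(mut lists: Vec<BTreeSet<u32>>) -> BTreeSet<u32> {
    // Sort by ascending cardinality: rare words come first.
    lists.sort_by_key(|set| set.len());

    let mut iter = lists.into_iter();
    let mut acc = match iter.next() {
        Some(first) => first,
        None => return BTreeSet::new(),
    };

    for set in iter {
        // Once the intersection is empty, the remaining sets cannot change it.
        if acc.is_empty() {
            break;
        }
        // Only the ids still in `acc` are probed against the larger set.
        acc.retain(|id| set.contains(id));
    }
    acc
}
```

Because the accumulator can only shrink, the cost of each step is bounded by the size of the rarest set rather than by the most popular word's postings.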
Kerollmops
2fcae719ad
Use another LRU impl which uses hashbrown
2020-06-29 22:26:06 +02:00
Kerollmops
f98b615bf3
Replace the LRU with an Arc cache
2020-06-29 20:48:57 +02:00