639: Reduce the size of the word_pair_proximity database  r=loiclec a=loiclec

# Pull Request

## What does this PR do?
Fixes #634 

Now, the value corresponding to the key `prox word1 word2` in the `word_pair_proximity_docids` database contains the ids of the documents in which:
- `word1` is followed by `word2`
- the minimum number of words between `word1` and `word2` is `prox-1`

Before this PR, the `word_pair_proximity_docids` database had keys with the format `word1 word2 prox`, and the value contained the ids of the documents in which either:
- `word1` is followed by `word2` with a minimum of `prox-1` words in between them, or
- `word2` is followed by `word1` with a minimum of `prox-2` words in between them

As a consequence of this change, calls such as:
```
let docids = word_pair_proximity_docids.get(rtxn, (word1, word2, prox));
```
have to be replaced with:
```
let docids1 = word_pair_proximity_docids.get(rtxn, (prox, word1, word2));
let docids2 = word_pair_proximity_docids.get(rtxn, (prox-1, word2, word1));
let docids = docids1 | docids2;
```
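
As an illustration, the two lookups can be wrapped in one small helper. This is only a sketch of the pattern (the PR itself adds an equivalent helper, `word_pair_overall_proximity_docids`, in the search criteria code); it assumes milli's public `Index` struct, heed's `Database::get`, and the `roaring` crate:
```
use heed::RoTxn;
use roaring::RoaringBitmap;

/// Sketch: docids in which `word1` and `word2` appear within the given overall
/// proximity, regardless of which of the two words comes first in the document.
fn overall_proximity_docids(
    index: &milli::Index,
    rtxn: &RoTxn,
    word1: &str,
    word2: &str,
    prox: u8,
) -> heed::Result<RoaringBitmap> {
    // `word1` appears before `word2` at proximity `prox`.
    let forward = index.word_pair_proximity_docids.get(rtxn, &(prox, word1, word2))?;
    // `word2` appears before `word1`, which counts as one proximity closer,
    // so the swapped key only exists for `prox > 1`.
    let backward = if prox > 1 {
        index.word_pair_proximity_docids.get(rtxn, &(prox - 1, word2, word1))?
    } else {
        None
    };
    Ok(forward.unwrap_or_default() | backward.unwrap_or_default())
}
```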

## Phrase search

The PR also fixes two bugs in the `resolve_phrase` function. The first bug is that a phrase containing the same word twice (e.g. `"dog eats dog"`) would always return zero documents.

The second bug occurs with a phrase such as `"fox is smarter than a dog"` and a document containing the text:
```
fox or dog? a fox is smarter than a dog
```
In that case, the phrase search would not return the document because:
- we only have the key `fox dog 2` in `word_pair_proximity_docids`
- but `resolve_phrase` only looks for `fox dog 5`, which returns 0 documents

### New implementation of `resolve_phrase`
Given the phrase:
```
fox is smarter than a dog
```
We select the document ids corresponding to all of the following keys in `word_pair_proximity_docids`:
- `1 fox is`
- `1 is smarter`
- `1 smarter than`
- (etc.)
- `1 fox smarter` OR `2 fox smarter`
- `1 is than` OR `2 is than`
- ...
- `1 than dog` OR `2 than dog`
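
In other words, each pair of words inside a small sliding window contributes the union of its possible proximities, and all pairs are then intersected. Below is a minimal sketch of that strategy, with a hypothetical `pair_docids(word1, word2, prox)` closure standing in for a lookup of the key `prox word1 word2`; the real `resolve_phrase` also returns early as soon as the candidate set becomes empty:
```
use roaring::RoaringBitmap;

/// Sketch of the windowed phrase resolution described above.
fn resolve_phrase_sketch<F>(phrase: &[String], pair_docids: F) -> RoaringBitmap
where
    F: Fn(&str, &str, u8) -> Option<RoaringBitmap>,
{
    // Single-word phrases are resolved elsewhere; `windows(0)` would panic.
    assert!(phrase.len() >= 2);
    let mut candidates: Option<RoaringBitmap> = None;
    let winsize = phrase.len().min(3);
    for win in phrase.windows(winsize) {
        for (offset, w1) in win.iter().enumerate() {
            for (dist, w2) in win.iter().skip(offset + 1).enumerate() {
                // Union over every proximity the pair may have inside the
                // window: `1 w1 w2` OR `2 w1 w2` OR ... OR `dist+1 w1 w2`.
                let mut pair_bitmap = RoaringBitmap::new();
                for prox in 1..=(dist as u8 + 1) {
                    if let Some(docids) = pair_docids(w1.as_str(), w2.as_str(), prox) {
                        pair_bitmap |= docids;
                    }
                }
                if pair_bitmap.is_empty() {
                    // One pair never occurs close enough: the phrase cannot match.
                    return RoaringBitmap::new();
                }
                // Every pair of every window must match: intersect everything.
                candidates = Some(match candidates {
                    Some(acc) => acc & pair_bitmap,
                    None => pair_bitmap,
                });
            }
        }
    }
    candidates.unwrap_or_default()
}
```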

## Benchmark Results

Indexing:
```
group                                                                     indexing_main_d94339a8                 indexing_word-pair-proximity-docids-refactor_2983dd8e
-----                                                                     ----------------------                 -----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.19    40.7±11.28ms        ? ?/sec    1.00     34.3±4.16ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.62     11.3±3.77ms        ? ?/sec    1.00      7.0±1.56ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     12.5±2.62ms        ? ?/sec    1.07     13.4±4.24ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.26    50.2±12.63ms        ? ?/sec    1.00    39.8±20.25ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.83   269.1±16.11ms        ? ?/sec    1.00    146.8±6.12ms        ? ?/sec
indexing/Indexing geo_point                                               1.00      47.2±0.46s        ? ?/sec    1.00      47.3±0.56s        ? ?/sec
indexing/Indexing movies in three batches                                 1.42      12.7±0.13s        ? ?/sec    1.00       9.0±0.07s        ? ?/sec
indexing/Indexing movies with default settings                            1.40      10.2±0.07s        ? ?/sec    1.00       7.3±0.06s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.22       7.8±0.11s        ? ?/sec    1.00       6.4±0.13s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.24       7.3±0.07s        ? ?/sec    1.00       5.9±0.06s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.14      47.6±0.67s        ? ?/sec    1.00      41.8±0.63s        ? ?/sec
indexing/Indexing songs with default settings                             1.13      44.1±0.74s        ? ?/sec    1.00      38.9±0.76s        ? ?/sec
indexing/Indexing songs without any facets                                1.19      42.0±0.66s        ? ?/sec    1.00      35.2±0.48s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.20      44.3±1.40s        ? ?/sec    1.00      37.0±0.48s        ? ?/sec
indexing/Indexing wiki                                                    1.39     862.9±9.95s        ? ?/sec    1.00    622.6±27.11s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.40     934.4±5.97s        ? ?/sec    1.00     665.7±4.72s        ? ?/sec
indexing/Reindexing geo_point                                             1.01      15.9±0.39s        ? ?/sec    1.00      15.7±0.28s        ? ?/sec
indexing/Reindexing movies with default settings                          1.15   288.8±25.03ms        ? ?/sec    1.00    250.4±2.23ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.01       4.1±0.06s        ? ?/sec    1.00       4.1±0.03s        ? ?/sec
indexing/Reindexing wiki                                                  1.41   1484.7±20.59s        ? ?/sec    1.00   1052.0±19.89s        ? ?/sec
```

Search Wiki:
<details>
<pre>
group                                                                                    search_wiki_main_d94339a8              search_wiki_word-pair-proximity-docids-refactor_2983dd8e
-----                                                                                    -------------------------              --------------------------------------------------------
smol-wiki-articles.csv: basic placeholder/                                               1.02     25.8±0.21µs        ? ?/sec    1.00     25.4±0.19µs        ? ?/sec
smol-wiki-articles.csv: basic with quote/"film"                                          1.00    441.7±2.57µs        ? ?/sec    1.00    442.3±2.41µs        ? ?/sec
smol-wiki-articles.csv: basic with quote/"france"                                        1.00    357.0±2.63µs        ? ?/sec    1.00    358.3±2.65µs        ? ?/sec
smol-wiki-articles.csv: basic with quote/"japan"                                         1.00    239.4±2.24µs        ? ?/sec    1.00    240.2±1.82µs        ? ?/sec
smol-wiki-articles.csv: basic with quote/"machine"                                       1.00    180.3±2.40µs        ? ?/sec    1.00    180.0±1.08µs        ? ?/sec
smol-wiki-articles.csv: basic with quote/"miles" "davis"                                 1.00      9.1±0.03ms        ? ?/sec    1.03      9.3±0.04ms        ? ?/sec
smol-wiki-articles.csv: basic with quote/"mingus"                                        1.00      3.6±0.01ms        ? ?/sec    1.03      3.7±0.02ms        ? ?/sec
smol-wiki-articles.csv: basic with quote/"rock" "and" "roll"                             1.00     34.0±0.11ms        ? ?/sec    1.03     35.1±0.13ms        ? ?/sec
smol-wiki-articles.csv: basic with quote/"spain"                                         1.00    162.0±0.88µs        ? ?/sec    1.00    161.9±0.98µs        ? ?/sec
smol-wiki-articles.csv: basic without quote/film                                         1.01    164.4±1.46µs        ? ?/sec    1.00    163.1±1.58µs        ? ?/sec
smol-wiki-articles.csv: basic without quote/france                                       1.00   1698.3±7.37µs        ? ?/sec    1.00  1697.7±11.53µs        ? ?/sec
smol-wiki-articles.csv: basic without quote/japan                                        1.00  1154.0±23.61µs        ? ?/sec    1.00   1150.7±9.27µs        ? ?/sec
smol-wiki-articles.csv: basic without quote/machine                                      1.00    524.6±3.45µs        ? ?/sec    1.01    528.1±4.56µs        ? ?/sec
smol-wiki-articles.csv: basic without quote/miles davis                                  1.00     13.5±0.05ms        ? ?/sec    1.02     13.8±0.05ms        ? ?/sec
smol-wiki-articles.csv: basic without quote/mingus                                       1.00      4.1±0.02ms        ? ?/sec    1.03      4.2±0.01ms        ? ?/sec
smol-wiki-articles.csv: basic without quote/rock and roll                                1.00     49.0±0.19ms        ? ?/sec    1.03     50.4±0.22ms        ? ?/sec
smol-wiki-articles.csv: basic without quote/spain                                        1.00    412.2±3.35µs        ? ?/sec    1.00    412.9±2.81µs        ? ?/sec
smol-wiki-articles.csv: prefix search/c                                                  1.00    383.9±2.53µs        ? ?/sec    1.00    383.4±2.44µs        ? ?/sec
smol-wiki-articles.csv: prefix search/g                                                  1.00    433.4±2.53µs        ? ?/sec    1.00    432.8±2.52µs        ? ?/sec
smol-wiki-articles.csv: prefix search/j                                                  1.00    424.3±2.05µs        ? ?/sec    1.00    424.0±2.15µs        ? ?/sec
smol-wiki-articles.csv: prefix search/q                                                  1.00    154.0±1.93µs        ? ?/sec    1.00    153.5±1.04µs        ? ?/sec
smol-wiki-articles.csv: prefix search/t                                                  1.04   658.5±91.93µs        ? ?/sec    1.00    631.4±3.89µs        ? ?/sec
smol-wiki-articles.csv: prefix search/x                                                  1.00    446.2±2.09µs        ? ?/sec    1.00    445.6±3.13µs        ? ?/sec
smol-wiki-articles.csv: proximity/april paris                                            1.02      3.4±0.39ms        ? ?/sec    1.00      3.3±0.01ms        ? ?/sec
smol-wiki-articles.csv: proximity/diesel engine                                          1.00  1022.1±17.52µs        ? ?/sec    1.00   1017.7±8.16µs        ? ?/sec
smol-wiki-articles.csv: proximity/herald sings                                           1.01  1872.5±97.70µs        ? ?/sec    1.00   1862.2±8.57µs        ? ?/sec
smol-wiki-articles.csv: proximity/tea two                                                1.00   295.2±34.91µs        ? ?/sec    1.00    296.6±4.08µs        ? ?/sec
smol-wiki-articles.csv: typo/Disnaylande                                                 1.00      3.4±0.51ms        ? ?/sec    1.04      3.5±0.01ms        ? ?/sec
smol-wiki-articles.csv: typo/aritmetric                                                  1.00      3.6±0.01ms        ? ?/sec    1.00      3.7±0.01ms        ? ?/sec
smol-wiki-articles.csv: typo/linax                                                       1.00    167.5±1.28µs        ? ?/sec    1.00    167.1±2.65µs        ? ?/sec
smol-wiki-articles.csv: typo/migrosoft                                                   1.01    217.9±1.84µs        ? ?/sec    1.00    216.2±1.61µs        ? ?/sec
smol-wiki-articles.csv: typo/nympalidea                                                  1.00      2.9±0.01ms        ? ?/sec    1.10      3.1±0.01ms        ? ?/sec
smol-wiki-articles.csv: typo/phytogropher                                                1.00      3.0±0.23ms        ? ?/sec    1.08      3.3±0.01ms        ? ?/sec
smol-wiki-articles.csv: typo/sisan                                                       1.00    234.6±1.38µs        ? ?/sec    1.01    235.8±1.67µs        ? ?/sec
smol-wiki-articles.csv: typo/the fronce                                                  1.00    104.4±0.84µs        ? ?/sec    1.00    103.9±0.81µs        ? ?/sec
smol-wiki-articles.csv: words/Abraham machin                                             1.02    675.5±4.74µs        ? ?/sec    1.00    662.1±5.13µs        ? ?/sec
smol-wiki-articles.csv: words/Idaho Bellevue pizza                                       1.02  1004.5±11.07µs        ? ?/sec    1.00   989.5±13.08µs        ? ?/sec
smol-wiki-articles.csv: words/Kameya Tokujirō mingus monk                                1.00  1650.8±10.92µs        ? ?/sec    1.00  1643.2±10.77µs        ? ?/sec
smol-wiki-articles.csv: words/Ulrich Hensel meilisearch milli                            1.00      5.4±0.03ms        ? ?/sec    1.00      5.4±0.02ms        ? ?/sec
smol-wiki-articles.csv: words/the black saint and the sinner lady and the good doggo     1.00     32.9±0.10ms        ? ?/sec    1.00     32.8±0.10ms        ? ?/sec
</pre>
</details>

Search songs:
<details>
<pre>
group                                                                                                    search_songs_main_d94339a8             search_songs_word-pair-proximity-docids-refactor_2983dd8e
-----                                                                                                    --------------------------             ---------------------------------------------------------
smol-songs.csv: asc + default/Notstandskomitee                                                           1.00      3.0±0.01ms        ? ?/sec    1.01      3.0±0.04ms        ? ?/sec
smol-songs.csv: asc + default/charles                                                                    1.00      2.2±0.01ms        ? ?/sec    1.01      2.2±0.01ms        ? ?/sec
smol-songs.csv: asc + default/charles mingus                                                             1.00      3.1±0.01ms        ? ?/sec    1.01      3.1±0.01ms        ? ?/sec
smol-songs.csv: asc + default/david                                                                      1.00      2.9±0.01ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
smol-songs.csv: asc + default/david bowie                                                                1.00      4.5±0.02ms        ? ?/sec    1.00      4.5±0.02ms        ? ?/sec
smol-songs.csv: asc + default/john                                                                       1.00      3.1±0.01ms        ? ?/sec    1.01      3.2±0.01ms        ? ?/sec
smol-songs.csv: asc + default/marcus miller                                                              1.00      5.0±0.02ms        ? ?/sec    1.00      5.0±0.02ms        ? ?/sec
smol-songs.csv: asc + default/michael jackson                                                            1.00      4.7±0.02ms        ? ?/sec    1.00      4.7±0.02ms        ? ?/sec
smol-songs.csv: asc + default/tamo                                                                       1.00  1463.4±12.17µs        ? ?/sec    1.01   1481.5±8.83µs        ? ?/sec
smol-songs.csv: asc + default/thelonious monk                                                            1.00      4.4±0.01ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
smol-songs.csv: asc/Notstandskomitee                                                                     1.01      2.6±0.01ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
smol-songs.csv: asc/charles                                                                              1.00    473.6±3.70µs        ? ?/sec    1.01   476.8±22.17µs        ? ?/sec
smol-songs.csv: asc/charles mingus                                                                       1.01    780.1±3.90µs        ? ?/sec    1.00    773.6±4.60µs        ? ?/sec
smol-songs.csv: asc/david                                                                                1.00    757.6±4.50µs        ? ?/sec    1.00    760.7±5.20µs        ? ?/sec
smol-songs.csv: asc/david bowie                                                                          1.00   1131.2±8.68µs        ? ?/sec    1.00   1130.7±8.36µs        ? ?/sec
smol-songs.csv: asc/john                                                                                 1.00    668.9±6.48µs        ? ?/sec    1.00    669.9±2.78µs        ? ?/sec
smol-songs.csv: asc/marcus miller                                                                        1.00    959.8±7.10µs        ? ?/sec    1.00    958.9±4.72µs        ? ?/sec
smol-songs.csv: asc/michael jackson                                                                      1.01  1076.7±16.73µs        ? ?/sec    1.00   1070.8±7.34µs        ? ?/sec
smol-songs.csv: asc/tamo                                                                                 1.00     70.4±0.55µs        ? ?/sec    1.00     70.5±0.51µs        ? ?/sec
smol-songs.csv: asc/thelonious monk                                                                      1.01      2.9±0.01ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
smol-songs.csv: basic filter: <=/Notstandskomitee                                                        1.00    162.0±0.91µs        ? ?/sec    1.01    163.6±1.72µs        ? ?/sec
smol-songs.csv: basic filter: <=/charles                                                                 1.00     38.3±0.24µs        ? ?/sec    1.01     38.7±0.31µs        ? ?/sec
smol-songs.csv: basic filter: <=/charles mingus                                                          1.01     85.3±0.44µs        ? ?/sec    1.00     84.6±0.47µs        ? ?/sec
smol-songs.csv: basic filter: <=/david                                                                   1.01     32.4±0.25µs        ? ?/sec    1.00     32.1±0.24µs        ? ?/sec
smol-songs.csv: basic filter: <=/david bowie                                                             1.00     68.6±0.99µs        ? ?/sec    1.01     68.9±0.88µs        ? ?/sec
smol-songs.csv: basic filter: <=/john                                                                    1.04     26.1±0.37µs        ? ?/sec    1.00     25.1±0.22µs        ? ?/sec
smol-songs.csv: basic filter: <=/marcus miller                                                           1.00     76.7±0.39µs        ? ?/sec    1.01     77.3±0.61µs        ? ?/sec
smol-songs.csv: basic filter: <=/michael jackson                                                         1.00     95.5±0.66µs        ? ?/sec    1.01     96.3±0.79µs        ? ?/sec
smol-songs.csv: basic filter: <=/tamo                                                                    1.03     26.2±0.36µs        ? ?/sec    1.00     25.3±0.23µs        ? ?/sec
smol-songs.csv: basic filter: <=/thelonious monk                                                         1.00    140.7±1.36µs        ? ?/sec    1.01    142.7±0.88µs        ? ?/sec
smol-songs.csv: basic filter: TO/Notstandskomitee                                                        1.00    165.4±1.25µs        ? ?/sec    1.00    165.7±1.72µs        ? ?/sec
smol-songs.csv: basic filter: TO/charles                                                                 1.01     40.6±0.57µs        ? ?/sec    1.00     40.1±0.54µs        ? ?/sec
smol-songs.csv: basic filter: TO/charles mingus                                                          1.01     87.1±0.80µs        ? ?/sec    1.00     86.3±0.61µs        ? ?/sec
smol-songs.csv: basic filter: TO/david                                                                   1.02     34.5±0.26µs        ? ?/sec    1.00     33.7±0.24µs        ? ?/sec
smol-songs.csv: basic filter: TO/david bowie                                                             1.00     70.6±0.38µs        ? ?/sec    1.00     70.6±0.68µs        ? ?/sec
smol-songs.csv: basic filter: TO/john                                                                    1.02     27.5±0.77µs        ? ?/sec    1.00     26.9±0.21µs        ? ?/sec
smol-songs.csv: basic filter: TO/marcus miller                                                           1.01     79.8±0.76µs        ? ?/sec    1.00     79.3±1.27µs        ? ?/sec
smol-songs.csv: basic filter: TO/michael jackson                                                         1.00     98.3±0.54µs        ? ?/sec    1.00     98.0±0.88µs        ? ?/sec
smol-songs.csv: basic filter: TO/tamo                                                                    1.03     27.9±0.23µs        ? ?/sec    1.00     27.1±0.32µs        ? ?/sec
smol-songs.csv: basic filter: TO/thelonious monk                                                         1.00    142.5±1.36µs        ? ?/sec    1.02    145.2±0.98µs        ? ?/sec
smol-songs.csv: basic placeholder/                                                                       1.00     49.4±0.34µs        ? ?/sec    1.00     49.3±0.45µs        ? ?/sec
smol-songs.csv: basic with quote/"Notstandskomitee"                                                      1.00    190.5±1.60µs        ? ?/sec    1.01    191.8±2.10µs        ? ?/sec
smol-songs.csv: basic with quote/"charles"                                                               1.00    165.0±1.13µs        ? ?/sec    1.01    166.0±1.39µs        ? ?/sec
smol-songs.csv: basic with quote/"charles" "mingus"                                                      1.00  1149.4±15.78µs        ? ?/sec    1.02   1171.1±9.95µs        ? ?/sec
smol-songs.csv: basic with quote/"david"                                                                 1.00    236.5±1.61µs        ? ?/sec    1.00    236.9±1.73µs        ? ?/sec
smol-songs.csv: basic with quote/"david" "bowie"                                                         1.00   1384.8±9.02µs        ? ?/sec    1.01  1393.8±11.39µs        ? ?/sec
smol-songs.csv: basic with quote/"john"                                                                  1.00    358.3±4.85µs        ? ?/sec    1.00    358.9±1.75µs        ? ?/sec
smol-songs.csv: basic with quote/"marcus" "miller"                                                       1.00    281.4±1.79µs        ? ?/sec    1.01    285.6±3.24µs        ? ?/sec
smol-songs.csv: basic with quote/"michael" "jackson"                                                     1.00   1328.4±8.01µs        ? ?/sec    1.00   1334.6±8.00µs        ? ?/sec
smol-songs.csv: basic with quote/"tamo"                                                                  1.00    528.7±3.72µs        ? ?/sec    1.01    533.4±5.31µs        ? ?/sec
smol-songs.csv: basic with quote/"thelonious" "monk"                                                     1.00   1223.0±7.24µs        ? ?/sec    1.02  1245.7±12.04µs        ? ?/sec
smol-songs.csv: basic without quote/Notstandskomitee                                                     1.00      2.8±0.01ms        ? ?/sec    1.00      2.8±0.01ms        ? ?/sec
smol-songs.csv: basic without quote/charles                                                              1.00    273.3±2.06µs        ? ?/sec    1.01    275.9±1.76µs        ? ?/sec
smol-songs.csv: basic without quote/charles mingus                                                       1.00      2.3±0.01ms        ? ?/sec    1.02      2.4±0.01ms        ? ?/sec
smol-songs.csv: basic without quote/david                                                                1.00    434.3±3.86µs        ? ?/sec    1.01    436.7±2.47µs        ? ?/sec
smol-songs.csv: basic without quote/david bowie                                                          1.00      5.6±0.02ms        ? ?/sec    1.01      5.7±0.02ms        ? ?/sec
smol-songs.csv: basic without quote/john                                                                 1.00   1322.5±9.98µs        ? ?/sec    1.00  1321.2±17.40µs        ? ?/sec
smol-songs.csv: basic without quote/marcus miller                                                        1.02      2.4±0.02ms        ? ?/sec    1.00      2.4±0.01ms        ? ?/sec
smol-songs.csv: basic without quote/michael jackson                                                      1.00      3.8±0.02ms        ? ?/sec    1.01      3.9±0.01ms        ? ?/sec
smol-songs.csv: basic without quote/tamo                                                                 1.00    809.0±4.01µs        ? ?/sec    1.01    819.0±6.22µs        ? ?/sec
smol-songs.csv: basic without quote/thelonious monk                                                      1.00      3.8±0.02ms        ? ?/sec    1.02      3.9±0.02ms        ? ?/sec
smol-songs.csv: big filter/Notstandskomitee                                                              1.00      2.7±0.01ms        ? ?/sec    1.01      2.8±0.01ms        ? ?/sec
smol-songs.csv: big filter/charles                                                                       1.00    266.5±1.34µs        ? ?/sec    1.01    270.1±8.17µs        ? ?/sec
smol-songs.csv: big filter/charles mingus                                                                1.00    651.0±5.40µs        ? ?/sec    1.00    651.0±2.73µs        ? ?/sec
smol-songs.csv: big filter/david                                                                         1.00  1018.1±11.16µs        ? ?/sec    1.00   1022.3±8.94µs        ? ?/sec
smol-songs.csv: big filter/david bowie                                                                   1.00  1912.2±11.13µs        ? ?/sec    1.00   1919.8±8.30µs        ? ?/sec
smol-songs.csv: big filter/john                                                                          1.00    867.2±6.66µs        ? ?/sec    1.01    873.3±3.44µs        ? ?/sec
smol-songs.csv: big filter/marcus miller                                                                 1.00    717.7±2.86µs        ? ?/sec    1.01    721.5±3.89µs        ? ?/sec
smol-songs.csv: big filter/michael jackson                                                               1.00  1668.4±16.76µs        ? ?/sec    1.00  1667.9±10.11µs        ? ?/sec
smol-songs.csv: big filter/tamo                                                                          1.01    136.7±0.88µs        ? ?/sec    1.00    135.5±1.22µs        ? ?/sec
smol-songs.csv: big filter/thelonious monk                                                               1.03      3.1±0.02ms        ? ?/sec    1.00      3.0±0.01ms        ? ?/sec
smol-songs.csv: desc + default/Notstandskomitee                                                          1.00      3.0±0.01ms        ? ?/sec    1.00      3.0±0.01ms        ? ?/sec
smol-songs.csv: desc + default/charles                                                                   1.00  1599.5±13.07µs        ? ?/sec    1.01  1622.9±22.43µs        ? ?/sec
smol-songs.csv: desc + default/charles mingus                                                            1.00      2.3±0.01ms        ? ?/sec    1.01      2.4±0.03ms        ? ?/sec
smol-songs.csv: desc + default/david                                                                     1.00      5.7±0.02ms        ? ?/sec    1.00      5.7±0.02ms        ? ?/sec
smol-songs.csv: desc + default/david bowie                                                               1.00      9.0±0.04ms        ? ?/sec    1.00      9.0±0.03ms        ? ?/sec
smol-songs.csv: desc + default/john                                                                      1.00      4.5±0.01ms        ? ?/sec    1.00      4.5±0.02ms        ? ?/sec
smol-songs.csv: desc + default/marcus miller                                                             1.00      3.9±0.01ms        ? ?/sec    1.00      3.9±0.02ms        ? ?/sec
smol-songs.csv: desc + default/michael jackson                                                           1.00      6.6±0.03ms        ? ?/sec    1.00      6.6±0.03ms        ? ?/sec
smol-songs.csv: desc + default/tamo                                                                      1.00  1472.4±10.38µs        ? ?/sec    1.01   1484.2±8.07µs        ? ?/sec
smol-songs.csv: desc + default/thelonious monk                                                           1.00      4.4±0.02ms        ? ?/sec    1.00      4.4±0.05ms        ? ?/sec
smol-songs.csv: desc/Notstandskomitee                                                                    1.01      2.6±0.01ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
smol-songs.csv: desc/charles                                                                             1.00    475.9±3.38µs        ? ?/sec    1.00    475.9±2.64µs        ? ?/sec
smol-songs.csv: desc/charles mingus                                                                      1.00    775.3±4.30µs        ? ?/sec    1.00    778.9±3.52µs        ? ?/sec
smol-songs.csv: desc/david                                                                               1.00    757.9±4.10µs        ? ?/sec    1.01    763.4±3.27µs        ? ?/sec
smol-songs.csv: desc/david bowie                                                                         1.00  1129.0±11.87µs        ? ?/sec    1.01   1135.1±8.86µs        ? ?/sec
smol-songs.csv: desc/john                                                                                1.00    670.2±4.38µs        ? ?/sec    1.00    670.2±3.46µs        ? ?/sec
smol-songs.csv: desc/marcus miller                                                                       1.00    961.2±4.47µs        ? ?/sec    1.00    961.9±4.03µs        ? ?/sec
smol-songs.csv: desc/michael jackson                                                                     1.00   1076.5±6.61µs        ? ?/sec    1.00   1077.9±7.11µs        ? ?/sec
smol-songs.csv: desc/tamo                                                                                1.00     70.6±0.57µs        ? ?/sec    1.01     71.3±0.48µs        ? ?/sec
smol-songs.csv: desc/thelonious monk                                                                     1.01      2.9±0.01ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
smol-songs.csv: prefix search/a                                                                          1.00   1236.2±9.43µs        ? ?/sec    1.00  1232.0±12.07µs        ? ?/sec
smol-songs.csv: prefix search/b                                                                          1.00   1090.8±9.89µs        ? ?/sec    1.00   1090.8±9.43µs        ? ?/sec
smol-songs.csv: prefix search/i                                                                          1.00   1333.9±8.28µs        ? ?/sec    1.00  1334.2±11.21µs        ? ?/sec
smol-songs.csv: prefix search/s                                                                          1.00    810.5±3.69µs        ? ?/sec    1.00    806.6±3.50µs        ? ?/sec
smol-songs.csv: prefix search/x                                                                          1.00    290.5±1.88µs        ? ?/sec    1.00    291.0±1.85µs        ? ?/sec
smol-songs.csv: proximity/7000 Danses Un Jour Dans Notre Vie                                             1.00      4.7±0.02ms        ? ?/sec    1.00      4.7±0.02ms        ? ?/sec
smol-songs.csv: proximity/The Disneyland Sing-Along Chorus                                               1.01      5.6±0.02ms        ? ?/sec    1.00      5.6±0.03ms        ? ?/sec
smol-songs.csv: proximity/Under Great Northern Lights                                                    1.00      2.5±0.01ms        ? ?/sec    1.00      2.5±0.01ms        ? ?/sec
smol-songs.csv: proximity/black saint sinner lady                                                        1.00      4.8±0.02ms        ? ?/sec    1.00      4.8±0.02ms        ? ?/sec
smol-songs.csv: proximity/les dangeureuses 1960                                                          1.00      3.2±0.01ms        ? ?/sec    1.01      3.2±0.01ms        ? ?/sec
smol-songs.csv: typo/Arethla Franklin                                                                    1.00    388.7±5.16µs        ? ?/sec    1.00    390.0±2.11µs        ? ?/sec
smol-songs.csv: typo/Disnaylande                                                                         1.01      2.6±0.01ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
smol-songs.csv: typo/dire straights                                                                      1.00    125.9±1.22µs        ? ?/sec    1.00    126.0±0.71µs        ? ?/sec
smol-songs.csv: typo/fear of the duck                                                                    1.00    373.7±4.25µs        ? ?/sec    1.01   375.7±14.17µs        ? ?/sec
smol-songs.csv: typo/indochie                                                                            1.00    103.6±0.94µs        ? ?/sec    1.00    103.4±0.74µs        ? ?/sec
smol-songs.csv: typo/indochien                                                                           1.00    155.6±1.14µs        ? ?/sec    1.01    157.5±1.75µs        ? ?/sec
smol-songs.csv: typo/klub des loopers                                                                    1.00    160.6±2.98µs        ? ?/sec    1.01    161.7±1.96µs        ? ?/sec
smol-songs.csv: typo/michel depech                                                                       1.00     79.4±0.54µs        ? ?/sec    1.01     79.9±0.60µs        ? ?/sec
smol-songs.csv: typo/mongus                                                                              1.00    126.7±1.85µs        ? ?/sec    1.00    126.1±0.74µs        ? ?/sec
smol-songs.csv: typo/stromal                                                                             1.01    132.9±0.99µs        ? ?/sec    1.00    131.9±1.09µs        ? ?/sec
smol-songs.csv: typo/the white striper                                                                   1.00    287.8±2.88µs        ? ?/sec    1.00    286.5±1.91µs        ? ?/sec
smol-songs.csv: typo/thelonius monk                                                                      1.00    304.2±1.49µs        ? ?/sec    1.01    306.5±1.50µs        ? ?/sec
smol-songs.csv: words/7000 Danses / Le Baiser / je me trompe de mots                                     1.01     20.9±0.08ms        ? ?/sec    1.00     20.7±0.07ms        ? ?/sec
smol-songs.csv: words/Bring Your Daughter To The Slaughter but now this is not part of the title         1.00     48.9±0.13ms        ? ?/sec    1.00     48.9±0.11ms        ? ?/sec
smol-songs.csv: words/The Disneyland Children's Sing-Alone song                                          1.01     13.9±0.06ms        ? ?/sec    1.00     13.8±0.07ms        ? ?/sec
smol-songs.csv: words/les liaisons dangeureuses 1793                                                     1.01      3.7±0.01ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
smol-songs.csv: words/seven nation mummy                                                                 1.00  1054.2±14.49µs        ? ?/sec    1.00  1056.6±10.53µs        ? ?/sec
smol-songs.csv: words/the black saint and the sinner lady and the good doggo                             1.00     58.2±0.29ms        ? ?/sec    1.00     57.9±0.21ms        ? ?/sec
smol-songs.csv: words/whathavenotnsuchforth and a good amount of words to pop to match the first one     1.00     66.1±0.21ms        ? ?/sec    1.00     66.0±0.24ms        ? ?/sec
</pre>
</details>

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Merged by bors[bot] on 2022-10-25 10:42:04 +00:00 (committed by GitHub) as commit d11a6e187f.
22 changed files with 1002 additions and 622 deletions


```
@@ -15,4 +15,4 @@ pub use self::roaring_bitmap_length::{
     BoRoaringBitmapLenCodec, CboRoaringBitmapLenCodec, RoaringBitmapLenCodec,
 };
 pub use self::str_beu32_codec::StrBEU32Codec;
-pub use self::str_str_u8_codec::{StrStrU8Codec, UncheckedStrStrU8Codec};
+pub use self::str_str_u8_codec::{U8StrStrCodec, UncheckedU8StrStrCodec};
```


```
@@ -1,61 +1,57 @@
 use std::borrow::Cow;
 use std::str;

-pub struct StrStrU8Codec;
+pub struct U8StrStrCodec;

-impl<'a> heed::BytesDecode<'a> for StrStrU8Codec {
-    type DItem = (&'a str, &'a str, u8);
+impl<'a> heed::BytesDecode<'a> for U8StrStrCodec {
+    type DItem = (u8, &'a str, &'a str);

     fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
-        let (n, bytes) = bytes.split_last()?;
+        let (n, bytes) = bytes.split_first()?;
         let s1_end = bytes.iter().position(|b| *b == 0)?;
         let (s1_bytes, rest) = bytes.split_at(s1_end);
-        let rest = &rest[1..];
+        let s2_bytes = &rest[1..];
         let s1 = str::from_utf8(s1_bytes).ok()?;
-        let (_, s2_bytes) = rest.split_last()?;
         let s2 = str::from_utf8(s2_bytes).ok()?;
-        Some((s1, s2, *n))
+        Some((*n, s1, s2))
     }
 }

-impl<'a> heed::BytesEncode<'a> for StrStrU8Codec {
-    type EItem = (&'a str, &'a str, u8);
+impl<'a> heed::BytesEncode<'a> for U8StrStrCodec {
+    type EItem = (u8, &'a str, &'a str);

-    fn bytes_encode((s1, s2, n): &Self::EItem) -> Option<Cow<[u8]>> {
-        let mut bytes = Vec::with_capacity(s1.len() + s2.len() + 1 + 1);
+    fn bytes_encode((n, s1, s2): &Self::EItem) -> Option<Cow<[u8]>> {
+        let mut bytes = Vec::with_capacity(s1.len() + s2.len() + 1);
+        bytes.push(*n);
         bytes.extend_from_slice(s1.as_bytes());
         bytes.push(0);
         bytes.extend_from_slice(s2.as_bytes());
-        bytes.push(0);
-        bytes.push(*n);
         Some(Cow::Owned(bytes))
     }
 }

-pub struct UncheckedStrStrU8Codec;
+pub struct UncheckedU8StrStrCodec;

-impl<'a> heed::BytesDecode<'a> for UncheckedStrStrU8Codec {
-    type DItem = (&'a [u8], &'a [u8], u8);
+impl<'a> heed::BytesDecode<'a> for UncheckedU8StrStrCodec {
+    type DItem = (u8, &'a [u8], &'a [u8]);

     fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
-        let (n, bytes) = bytes.split_last()?;
+        let (n, bytes) = bytes.split_first()?;
         let s1_end = bytes.iter().position(|b| *b == 0)?;
         let (s1_bytes, rest) = bytes.split_at(s1_end);
-        let rest = &rest[1..];
-        let (_, s2_bytes) = rest.split_last()?;
-        Some((s1_bytes, s2_bytes, *n))
+        let s2_bytes = &rest[1..];
+        Some((*n, s1_bytes, s2_bytes))
     }
 }

-impl<'a> heed::BytesEncode<'a> for UncheckedStrStrU8Codec {
-    type EItem = (&'a [u8], &'a [u8], u8);
+impl<'a> heed::BytesEncode<'a> for UncheckedU8StrStrCodec {
+    type EItem = (u8, &'a [u8], &'a [u8]);

-    fn bytes_encode((s1, s2, n): &Self::EItem) -> Option<Cow<[u8]>> {
-        let mut bytes = Vec::with_capacity(s1.len() + s2.len() + 1 + 1);
+    fn bytes_encode((n, s1, s2): &Self::EItem) -> Option<Cow<[u8]>> {
+        let mut bytes = Vec::with_capacity(s1.len() + s2.len() + 1);
+        bytes.push(*n);
         bytes.extend_from_slice(s1);
         bytes.push(0);
         bytes.extend_from_slice(s2);
-        bytes.push(0);
-        bytes.push(*n);
         Some(Cow::Owned(bytes))
     }
 }
```
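
As a small illustration of the new key layout, here is a hypothetical round-trip check (assuming `U8StrStrCodec` from the diff above is in scope, and heed's `BytesEncode`/`BytesDecode` traits returning `Option` as in this version of the code). The proximity byte now leads the key, so LMDB keeps all pairs with the same proximity next to each other, and the old trailing `0` separator and trailing proximity byte are gone:
```
use heed::{BytesDecode, BytesEncode};

// Hypothetical round-trip check of the new key layout.
fn check_key_layout() {
    let key = (1u8, "fox", "dog");
    let bytes = U8StrStrCodec::bytes_encode(&key).unwrap();
    // Proximity byte, then "fox", a single 0 separator, then "dog".
    assert_eq!(bytes.as_ref(), &[1, b'f', b'o', b'x', 0, b'd', b'o', b'g']);
    let decoded = U8StrStrCodec::bytes_decode(&bytes).unwrap();
    assert_eq!(decoded, (1, "fox", "dog"));
}
```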


```
@@ -21,7 +21,7 @@ use crate::{
     default_criteria, BEU32StrCodec, BoRoaringBitmapCodec, CboRoaringBitmapCodec, Criterion,
     DocumentId, ExternalDocumentsIds, FacetDistribution, FieldDistribution, FieldId,
     FieldIdWordCountCodec, GeoPoint, ObkvCodec, Result, RoaringBitmapCodec, RoaringBitmapLenCodec,
-    Search, StrBEU32Codec, StrStrU8Codec, BEU16, BEU32,
+    Search, StrBEU32Codec, U8StrStrCodec, BEU16, BEU32,
 };

 pub const DEFAULT_MIN_WORD_LEN_ONE_TYPO: u8 = 5;
@@ -71,6 +71,7 @@ pub mod db_name {
     pub const DOCID_WORD_POSITIONS: &str = "docid-word-positions";
     pub const WORD_PAIR_PROXIMITY_DOCIDS: &str = "word-pair-proximity-docids";
     pub const WORD_PREFIX_PAIR_PROXIMITY_DOCIDS: &str = "word-prefix-pair-proximity-docids";
+    pub const PREFIX_WORD_PAIR_PROXIMITY_DOCIDS: &str = "prefix-word-pair-proximity-docids";
     pub const WORD_POSITION_DOCIDS: &str = "word-position-docids";
     pub const WORD_PREFIX_POSITION_DOCIDS: &str = "word-prefix-position-docids";
     pub const FIELD_ID_WORD_COUNT_DOCIDS: &str = "field-id-word-count-docids";
@@ -106,9 +107,11 @@ pub struct Index {
     pub docid_word_positions: Database<BEU32StrCodec, BoRoaringBitmapCodec>,
     /// Maps the proximity between a pair of words with all the docids where this relation appears.
-    pub word_pair_proximity_docids: Database<StrStrU8Codec, CboRoaringBitmapCodec>,
+    pub word_pair_proximity_docids: Database<U8StrStrCodec, CboRoaringBitmapCodec>,
     /// Maps the proximity between a pair of word and prefix with all the docids where this relation appears.
-    pub word_prefix_pair_proximity_docids: Database<StrStrU8Codec, CboRoaringBitmapCodec>,
+    pub word_prefix_pair_proximity_docids: Database<U8StrStrCodec, CboRoaringBitmapCodec>,
+    /// Maps the proximity between a pair of prefix and word with all the docids where this relation appears.
+    pub prefix_word_pair_proximity_docids: Database<U8StrStrCodec, CboRoaringBitmapCodec>,
     /// Maps the word and the position with the docids that corresponds to it.
     pub word_position_docids: Database<StrBEU32Codec, CboRoaringBitmapCodec>,
@@ -138,7 +141,7 @@ impl Index {
     pub fn new<P: AsRef<Path>>(mut options: heed::EnvOpenOptions, path: P) -> Result<Index> {
         use db_name::*;

-        options.max_dbs(17);
+        options.max_dbs(18);
         unsafe { options.flag(Flags::MdbAlwaysFreePages) };

         let env = options.open(path)?;
@@ -151,6 +154,8 @@ impl Index {
         let word_pair_proximity_docids = env.create_database(Some(WORD_PAIR_PROXIMITY_DOCIDS))?;
         let word_prefix_pair_proximity_docids =
             env.create_database(Some(WORD_PREFIX_PAIR_PROXIMITY_DOCIDS))?;
+        let prefix_word_pair_proximity_docids =
+            env.create_database(Some(PREFIX_WORD_PAIR_PROXIMITY_DOCIDS))?;
         let word_position_docids = env.create_database(Some(WORD_POSITION_DOCIDS))?;
         let field_id_word_count_docids = env.create_database(Some(FIELD_ID_WORD_COUNT_DOCIDS))?;
         let word_prefix_position_docids = env.create_database(Some(WORD_PREFIX_POSITION_DOCIDS))?;
@@ -175,6 +180,7 @@ impl Index {
             docid_word_positions,
             word_pair_proximity_docids,
             word_prefix_pair_proximity_docids,
+            prefix_word_pair_proximity_docids,
             word_position_docids,
             word_prefix_position_docids,
             field_id_word_count_docids,
```


```
@@ -37,7 +37,7 @@ pub use self::fields_ids_map::FieldsIdsMap;
 pub use self::heed_codec::{
     BEU32StrCodec, BoRoaringBitmapCodec, BoRoaringBitmapLenCodec, CboRoaringBitmapCodec,
     CboRoaringBitmapLenCodec, FieldIdWordCountCodec, ObkvCodec, RoaringBitmapCodec,
-    RoaringBitmapLenCodec, StrBEU32Codec, StrStrU8Codec, UncheckedStrStrU8Codec,
+    RoaringBitmapLenCodec, StrBEU32Codec, U8StrStrCodec, UncheckedU8StrStrCodec,
 };
 pub use self::index::Index;
 pub use self::search::{
```


```
@@ -7,7 +7,7 @@ use log::debug;
 use roaring::RoaringBitmap;

 use crate::search::criteria::{
-    resolve_query_tree, Context, Criterion, CriterionParameters, CriterionResult,
+    resolve_phrase, resolve_query_tree, Context, Criterion, CriterionParameters, CriterionResult,
 };
 use crate::search::query_tree::{Operation, PrimitiveQueryPart};
 use crate::{absolute_from_relative_position, FieldId, Result};
@@ -226,19 +226,7 @@
             }
             // compute intersection on pair of words with a proximity of 0.
             Phrase(phrase) => {
-                let mut bitmaps = Vec::with_capacity(phrase.len().saturating_sub(1));
-                for words in phrase.windows(2) {
-                    if let [left, right] = words {
-                        match ctx.word_pair_proximity_docids(left, right, 0)? {
-                            Some(docids) => bitmaps.push(docids),
-                            None => {
-                                bitmaps.clear();
-                                break;
-                            }
-                        }
-                    }
-                }
-                candidates |= intersection_of(bitmaps.iter().collect());
+                candidates |= resolve_phrase(ctx, phrase)?;
             }
         }
         parts_candidates_array.push(candidates);
```


```
@@ -71,6 +71,7 @@ pub trait Context<'c> {
     fn exact_word_docids(&self, word: &str) -> heed::Result<Option<RoaringBitmap>>;
     fn word_prefix_docids(&self, word: &str) -> heed::Result<Option<RoaringBitmap>>;
     fn exact_word_prefix_docids(&self, word: &str) -> heed::Result<Option<RoaringBitmap>>;
+
     fn word_pair_proximity_docids(
         &self,
         left: &str,
@@ -83,6 +84,12 @@
         right: &str,
         proximity: u8,
     ) -> heed::Result<Option<RoaringBitmap>>;
+    fn prefix_word_pair_proximity_docids(
+        &self,
+        prefix: &str,
+        right: &str,
+        proximity: u8,
+    ) -> heed::Result<Option<RoaringBitmap>>;
     fn words_fst<'t>(&self) -> &'t fst::Set<Cow<[u8]>>;
     fn in_prefix_cache(&self, word: &str) -> bool;
     fn docid_words_positions(
@@ -111,6 +118,68 @@ pub struct CriteriaBuilder<'t> {
     words_prefixes_fst: fst::Set<Cow<'t, [u8]>>,
 }

+/// Return the docids for the following word pairs and proximities using [`Context::word_pair_proximity_docids`].
+/// * `left, right, prox` (leftward proximity)
+/// * `right, left, prox-1` (rightward proximity)
+///
+/// ## Example
+/// For a document with the text `the good fox eats the apple`, we have:
+/// * `rightward_proximity(the, eats) = 3`
+/// * `leftward_proximity(eats, the) = 1`
+///
+/// So both the expressions `word_pair_overall_proximity_docids(ctx, the, eats, 3)`
+/// and `word_pair_overall_proximity_docids(ctx, the, eats, 2)` would return a bitmap containing
+/// the id of this document.
+fn word_pair_overall_proximity_docids(
+    ctx: &dyn Context,
+    left: &str,
+    right: &str,
+    prox: u8,
+) -> heed::Result<Option<RoaringBitmap>> {
+    let rightward = ctx.word_pair_proximity_docids(left, right, prox)?;
+    let leftward =
+        if prox > 1 { ctx.word_pair_proximity_docids(right, left, prox - 1)? } else { None };
+    if let Some(mut all) = rightward {
+        if let Some(leftward) = leftward {
+            all |= leftward;
+        }
+        Ok(Some(all))
+    } else {
+        Ok(leftward)
+    }
+}
+
+/// This function works identically to [`word_pair_overall_proximity_docids`] except that the
+/// right word is replaced by a prefix string.
+///
+/// It will return None if no documents were found or if the prefix does not exist in the
+/// `word_prefix_pair_proximity_docids` database.
+fn word_prefix_pair_overall_proximity_docids(
+    ctx: &dyn Context,
+    left: &str,
+    prefix: &str,
+    proximity: u8,
+) -> heed::Result<Option<RoaringBitmap>> {
+    // We retrieve the docids for the original and swapped word pairs:
+    // A: word1 prefix2 proximity
+    // B: prefix2 word1 proximity-1
+    let rightward = ctx.word_prefix_pair_proximity_docids(left, prefix, proximity)?;
+    let leftward = if proximity > 1 {
+        ctx.prefix_word_pair_proximity_docids(prefix, left, proximity - 1)?
+    } else {
+        None
+    };
+    if let Some(mut all) = rightward {
+        if let Some(leftward) = leftward {
+            all |= leftward;
+        }
+        Ok(Some(all))
+    } else {
+        Ok(leftward)
+    }
+}
+
 impl<'c> Context<'c> for CriteriaBuilder<'c> {
     fn documents_ids(&self) -> heed::Result<RoaringBitmap> {
         self.index.documents_ids(self.rtxn)
@@ -138,18 +207,24 @@ impl<'c> Context<'c> for CriteriaBuilder<'c> {
         right: &str,
         proximity: u8,
     ) -> heed::Result<Option<RoaringBitmap>> {
-        let key = (left, right, proximity);
-        self.index.word_pair_proximity_docids.get(self.rtxn, &key)
+        self.index.word_pair_proximity_docids.get(self.rtxn, &(proximity, left, right))
     }

     fn word_prefix_pair_proximity_docids(
         &self,
         left: &str,
+        prefix: &str,
+        proximity: u8,
+    ) -> heed::Result<Option<RoaringBitmap>> {
+        self.index.word_prefix_pair_proximity_docids.get(self.rtxn, &(proximity, left, prefix))
+    }
+
+    fn prefix_word_pair_proximity_docids(
+        &self,
+        prefix: &str,
         right: &str,
         proximity: u8,
     ) -> heed::Result<Option<RoaringBitmap>> {
-        let key = (left, right, proximity);
-        self.index.word_prefix_pair_proximity_docids.get(self.rtxn, &key)
+        self.index.prefix_word_pair_proximity_docids.get(self.rtxn, &(proximity, prefix, right))
     }

     fn words_fst<'t>(&self) -> &'t fst::Set<Cow<[u8]>> {
@@ -352,18 +427,31 @@ pub fn resolve_query_tree(
 pub fn resolve_phrase(ctx: &dyn Context, phrase: &[String]) -> Result<RoaringBitmap> {
     let mut candidates = RoaringBitmap::new();
     let mut first_iter = true;
-    let winsize = phrase.len().min(7);
+    let winsize = phrase.len().min(3);

     for win in phrase.windows(winsize) {
         // Get all the documents with the matching distance for each word pairs.
         let mut bitmaps = Vec::with_capacity(winsize.pow(2));
         for (offset, s1) in win.iter().enumerate() {
             for (dist, s2) in win.iter().skip(offset + 1).enumerate() {
-                match ctx.word_pair_proximity_docids(s1, s2, dist as u8 + 1)? {
-                    Some(m) => bitmaps.push(m),
-                    // If there are no document for this distance, there will be no
-                    // results for the phrase query.
-                    None => return Ok(RoaringBitmap::new()),
+                if dist == 0 {
+                    match ctx.word_pair_proximity_docids(s1, s2, 1)? {
+                        Some(m) => bitmaps.push(m),
+                        // If there are no document for this pair, there will be no
+                        // results for the phrase query.
+                        None => return Ok(RoaringBitmap::new()),
+                    }
+                } else {
+                    let mut bitmap = RoaringBitmap::new();
+                    for dist in 0..=dist {
+                        if let Some(m) = ctx.word_pair_proximity_docids(s1, s2, dist as u8 + 1)? {
+                            bitmap |= m
+                        }
+                    }
+                    if bitmap.is_empty() {
+                        return Ok(bitmap);
+                    } else {
+                        bitmaps.push(bitmap);
+                    }
                 }
             }
         }
@@ -387,7 +475,7 @@ pub fn resolve_phrase(ctx: &dyn Context, phrase: &[String]) -> Result<RoaringBitmap> {
     Ok(candidates)
 }

-fn all_word_pair_proximity_docids<T: AsRef<str>, U: AsRef<str>>(
+fn all_word_pair_overall_proximity_docids<T: AsRef<str>, U: AsRef<str>>(
     ctx: &dyn Context,
     left_words: &[(T, u8)],
     right_words: &[(U, u8)],
@@ -396,9 +484,9 @@ fn all_word_pair_proximity_docids<T: AsRef<str>, U: AsRef<str>>(
     let mut docids = RoaringBitmap::new();
     for (left, _l_typo) in left_words {
         for (right, _r_typo) in right_words {
-            let current_docids = ctx
-                .word_pair_proximity_docids(left.as_ref(), right.as_ref(), proximity)?
+            let current_docids =
+                word_pair_overall_proximity_docids(ctx, left.as_ref(), right.as_ref(), proximity)?
                 .unwrap_or_default();
             docids |= current_docids;
         }
     }
@@ -472,7 +560,8 @@ fn query_pair_proximity_docids(
     match (&left.kind, &right.kind) {
         (QueryKind::Exact { word: left, .. }, QueryKind::Exact { word: right, .. }) => {
             if prefix {
-                match ctx.word_prefix_pair_proximity_docids(
+                match word_prefix_pair_overall_proximity_docids(
+                    ctx,
                     left.as_str(),
                     right.as_str(),
                     proximity,
@@ -480,7 +569,12 @@
                     Some(docids) => Ok(docids),
                     None => {
                         let r_words = word_derivations(&right, true, 0, ctx.words_fst(), wdcache)?;
-                        all_word_pair_proximity_docids(ctx, &[(left, 0)], &r_words, proximity)
+                        all_word_pair_overall_proximity_docids(
+                            ctx,
+                            &[(left, 0)],
+                            &r_words,
+                            proximity,
+                        )
                     }
                 }
             } else {
@@ -495,7 +589,8 @@
             if prefix {
                 let mut docids = RoaringBitmap::new();
                 for (left, _) in l_words {
-                    let current_docids = match ctx.word_prefix_pair_proximity_docids(
+                    let current_docids = match word_prefix_pair_overall_proximity_docids(
+                        ctx,
                         left.as_str(),
                         right.as_str(),
                         proximity,
@@ -504,19 +599,24 @@
                         None => {
                             let r_words =
                                 word_derivations(&right, true, 0, ctx.words_fst(), wdcache)?;
-                            all_word_pair_proximity_docids(ctx, &[(left, 0)], &r_words, proximity)
+                            all_word_pair_overall_proximity_docids(
+                                ctx,
+                                &[(left, 0)],
+                                &r_words,
+                                proximity,
+                            )
                         }
                     }?;
                     docids |= current_docids;
                 }
                 Ok(docids)
             } else {
-                all_word_pair_proximity_docids(ctx, &l_words, &[(right, 0)], proximity)
+                all_word_pair_overall_proximity_docids(ctx, &l_words, &[(right, 0)], proximity)
            }
         }
         (QueryKind::Exact { word: left, .. }, QueryKind::Tolerant { typo, word: right }) => {
             let r_words = word_derivations(&right, prefix, *typo, ctx.words_fst(), wdcache)?;
-            all_word_pair_proximity_docids(ctx, &[(left, 0)], &r_words, proximity)
+            all_word_pair_overall_proximity_docids(ctx, &[(left, 0)], &r_words, proximity)
         }
         (
             QueryKind::Tolerant { typo: l_typo, word: left },
@@ -525,7 +625,7 @@ fn query_pair_proximity_docids(
             let l_words =
                 word_derivations(&left, false, *l_typo, ctx.words_fst(), wdcache)?.to_owned();
             let r_words = word_derivations(&right, prefix, *r_typo, ctx.words_fst(), wdcache)?;
-            all_word_pair_proximity_docids(ctx, &l_words, &r_words, proximity)
+            all_word_pair_overall_proximity_docids(ctx, &l_words, &r_words, proximity)
         }
     }
 }
@@ -552,6 +652,7 @@ pub mod test {
         exact_word_prefix_docids: HashMap<String, RoaringBitmap>,
         word_pair_proximity_docids: HashMap<(String, String, i32), RoaringBitmap>,
         word_prefix_pair_proximity_docids: HashMap<(String, String, i32), RoaringBitmap>,
+        prefix_word_pair_proximity_docids: HashMap<(String, String, i32), RoaringBitmap>,
        docid_words: HashMap<u32, Vec<String>>,
     }
@@ -588,13 +689,22 @@
         fn word_prefix_pair_proximity_docids(
             &self,
-            left: &str,
-            right: &str,
+            word: &str,
+            prefix: &str,
             proximity: u8,
         ) -> heed::Result<Option<RoaringBitmap>> {
-            let key = (left.to_string(), right.to_string(), proximity.into());
+            let key = (word.to_string(), prefix.to_string(), proximity.into());
             Ok(self.word_prefix_pair_proximity_docids.get(&key).cloned())
         }
+
+        fn prefix_word_pair_proximity_docids(
+            &self,
+            prefix: &str,
+            word: &str,
+            proximity: u8,
+        ) -> heed::Result<Option<RoaringBitmap>> {
+            let key = (prefix.to_string(), word.to_string(), proximity.into());
+            Ok(self.prefix_word_pair_proximity_docids.get(&key).cloned())
+        }

         fn words_fst<'t>(&self) -> &'t fst::Set<Cow<[u8]>> {
             &self.words_fst
@@ -708,6 +818,8 @@ pub mod test {
             let mut word_pair_proximity_docids = HashMap::new();
             let mut word_prefix_pair_proximity_docids = HashMap::new();
+            let mut prefix_word_pair_proximity_docids = HashMap::new();
+
             for (lword, lcandidates) in &word_docids {
                 for (rword, rcandidates) in &word_docids {
                     if lword == rword {
@@ -740,15 +852,19 @@
                         let lposition = docid_words.iter().position(|w| w == lword).unwrap();
                         let rposition =
                             docid_words.iter().position(|w| w.starts_with(pword)).unwrap();
-                        let key = if lposition < rposition {
-                            (s(lword), s(pword), (rposition - lposition) as i32)
+                        if lposition < rposition {
+                            let key = (s(lword), s(pword), (rposition - lposition) as i32);
+                            let docids = word_prefix_pair_proximity_docids
+                                .entry(key)
+                                .or_insert(RoaringBitmap::new());
+                            docids.push(candidate);
                         } else {
-                            (s(lword), s(pword), (lposition - rposition + 1) as i32)
+                            let key = (s(lword), s(pword), (lposition - rposition) as i32);
+                            let docids = prefix_word_pair_proximity_docids
+                                .entry(key)
+                                .or_insert(RoaringBitmap::new());
+                            docids.push(candidate);
                         };
-                        let docids = word_prefix_pair_proximity_docids
-                            .entry(key)
-                            .or_insert(RoaringBitmap::new());
-                        docids.push(candidate);
                     }
                 }
             }
@@ -766,6 +882,7 @@
                 exact_word_prefix_docids,
                 word_pair_proximity_docids,
                 word_prefix_pair_proximity_docids,
+                prefix_word_pair_proximity_docids,
                 docid_words,
             }
         }
```

View File

@ -203,7 +203,7 @@ impl<'a> Context for QueryTreeBuilder<'a> {
right_word: &str, right_word: &str,
proximity: u8, proximity: u8,
) -> heed::Result<Option<u64>> { ) -> heed::Result<Option<u64>> {
let key = (left_word, right_word, proximity); let key = (proximity, left_word, right_word);
self.index self.index
.word_pair_proximity_docids .word_pair_proximity_docids
.remap_data_type::<CboRoaringBitmapLenCodec>() .remap_data_type::<CboRoaringBitmapLenCodec>()

View File

@ -182,19 +182,28 @@ pub fn snap_docid_word_positions(index: &Index) -> String {
} }
pub fn snap_word_pair_proximity_docids(index: &Index) -> String { pub fn snap_word_pair_proximity_docids(index: &Index) -> String {
let snap = make_db_snap_from_iter!(index, word_pair_proximity_docids, |( let snap = make_db_snap_from_iter!(index, word_pair_proximity_docids, |(
(word1, word2, proximity), (proximity, word1, word2),
b, b,
)| { )| {
&format!("{word1:<16} {word2:<16} {proximity:<2} {}", display_bitmap(&b)) &format!("{proximity:<2} {word1:<16} {word2:<16} {}", display_bitmap(&b))
}); });
snap snap
} }
pub fn snap_word_prefix_pair_proximity_docids(index: &Index) -> String { pub fn snap_word_prefix_pair_proximity_docids(index: &Index) -> String {
let snap = make_db_snap_from_iter!(index, word_prefix_pair_proximity_docids, |( let snap = make_db_snap_from_iter!(index, word_prefix_pair_proximity_docids, |(
(word1, prefix, proximity), (proximity, word1, prefix),
b, b,
)| { )| {
&format!("{word1:<16} {prefix:<4} {proximity:<2} {}", display_bitmap(&b)) &format!("{proximity:<2} {word1:<16} {prefix:<4} {}", display_bitmap(&b))
});
snap
}
pub fn snap_prefix_word_pair_proximity_docids(index: &Index) -> String {
let snap = make_db_snap_from_iter!(index, prefix_word_pair_proximity_docids, |(
(proximity, prefix, word2),
b,
)| {
&format!("{proximity:<2} {prefix:<4} {word2:<16} {}", display_bitmap(&b))
}); });
snap snap
} }
@ -427,6 +436,9 @@ macro_rules! full_snap_of_db {
($index:ident, word_prefix_pair_proximity_docids) => {{ ($index:ident, word_prefix_pair_proximity_docids) => {{
$crate::snapshot_tests::snap_word_prefix_pair_proximity_docids(&$index) $crate::snapshot_tests::snap_word_prefix_pair_proximity_docids(&$index)
}}; }};
($index:ident, prefix_word_pair_proximity_docids) => {{
$crate::snapshot_tests::snap_prefix_word_pair_proximity_docids(&$index)
}};
($index:ident, word_position_docids) => {{ ($index:ident, word_position_docids) => {{
$crate::snapshot_tests::snap_word_position_docids(&$index) $crate::snapshot_tests::snap_word_position_docids(&$index)
}}; }};

View File

@ -25,6 +25,7 @@ impl<'t, 'u, 'i> ClearDocuments<'t, 'u, 'i> {
docid_word_positions, docid_word_positions,
word_pair_proximity_docids, word_pair_proximity_docids,
word_prefix_pair_proximity_docids, word_prefix_pair_proximity_docids,
prefix_word_pair_proximity_docids,
word_position_docids, word_position_docids,
field_id_word_count_docids, field_id_word_count_docids,
word_prefix_position_docids, word_prefix_position_docids,
@ -66,6 +67,7 @@ impl<'t, 'u, 'i> ClearDocuments<'t, 'u, 'i> {
docid_word_positions.clear(self.wtxn)?; docid_word_positions.clear(self.wtxn)?;
word_pair_proximity_docids.clear(self.wtxn)?; word_pair_proximity_docids.clear(self.wtxn)?;
word_prefix_pair_proximity_docids.clear(self.wtxn)?; word_prefix_pair_proximity_docids.clear(self.wtxn)?;
prefix_word_pair_proximity_docids.clear(self.wtxn)?;
word_position_docids.clear(self.wtxn)?; word_position_docids.clear(self.wtxn)?;
field_id_word_count_docids.clear(self.wtxn)?; field_id_word_count_docids.clear(self.wtxn)?;
word_prefix_position_docids.clear(self.wtxn)?; word_prefix_position_docids.clear(self.wtxn)?;

View File

@ -183,6 +183,7 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
word_pair_proximity_docids, word_pair_proximity_docids,
field_id_word_count_docids, field_id_word_count_docids,
word_prefix_pair_proximity_docids, word_prefix_pair_proximity_docids,
prefix_word_pair_proximity_docids,
word_position_docids, word_position_docids,
word_prefix_position_docids, word_prefix_position_docids,
facet_id_f64_docids, facet_id_f64_docids,
@ -327,26 +328,26 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
self.index.put_words_prefixes_fst(self.wtxn, &new_words_prefixes_fst)?; self.index.put_words_prefixes_fst(self.wtxn, &new_words_prefixes_fst)?;
} }
// We delete the documents ids from the word prefix pair proximity database docids for db in [word_prefix_pair_proximity_docids, prefix_word_pair_proximity_docids] {
// and remove the empty pairs too. // We delete the documents ids from the word prefix pair proximity database docids
let db = word_prefix_pair_proximity_docids.remap_key_type::<ByteSlice>(); // and remove the empty pairs too.
let mut iter = db.iter_mut(self.wtxn)?; let db = db.remap_key_type::<ByteSlice>();
while let Some(result) = iter.next() { let mut iter = db.iter_mut(self.wtxn)?;
let (key, mut docids) = result?; while let Some(result) = iter.next() {
let previous_len = docids.len(); let (key, mut docids) = result?;
docids -= &self.to_delete_docids; let previous_len = docids.len();
if docids.is_empty() { docids -= &self.to_delete_docids;
// safety: we don't keep references from inside the LMDB database. if docids.is_empty() {
unsafe { iter.del_current()? }; // safety: we don't keep references from inside the LMDB database.
} else if docids.len() != previous_len { unsafe { iter.del_current()? };
let key = key.to_owned(); } else if docids.len() != previous_len {
// safety: we don't keep references from inside the LMDB database. let key = key.to_owned();
unsafe { iter.put_current(&key, &docids)? }; // safety: we don't keep references from inside the LMDB database.
unsafe { iter.put_current(&key, &docids)? };
}
} }
} }
drop(iter);
// We delete the documents ids that are under the pairs of words, // We delete the documents ids that are under the pairs of words,
// it is faster and use no memory to iterate over all the words pairs than // it is faster and use no memory to iterate over all the words pairs than
// to compute the cartesian product of every words of the deleted documents. // to compute the cartesian product of every words of the deleted documents.

View File

@ -106,17 +106,6 @@ fn document_word_positions_into_sorter(
*p = cmp::min(*p, prox); *p = cmp::min(*p, prox);
}) })
.or_insert(prox); .or_insert(prox);
// We also compute the inverse proximity.
let prox = prox + 1;
if prox < MAX_DISTANCE {
word_pair_proximity
.entry((word.clone(), head.word.clone()))
.and_modify(|p| {
*p = cmp::min(*p, prox);
})
.or_insert(prox);
}
} }
} }
@ -151,11 +140,10 @@ fn document_word_positions_into_sorter(
let mut key_buffer = Vec::new(); let mut key_buffer = Vec::new();
for ((w1, w2), prox) in word_pair_proximity { for ((w1, w2), prox) in word_pair_proximity {
key_buffer.clear(); key_buffer.clear();
key_buffer.push(prox as u8);
key_buffer.extend_from_slice(w1.as_bytes()); key_buffer.extend_from_slice(w1.as_bytes());
key_buffer.push(0); key_buffer.push(0);
key_buffer.extend_from_slice(w2.as_bytes()); key_buffer.extend_from_slice(w2.as_bytes());
key_buffer.push(0);
key_buffer.push(prox as u8);
word_pair_proximity_docids_sorter.insert(&key_buffer, &document_id.to_ne_bytes())?; word_pair_proximity_docids_sorter.insert(&key_buffer, &document_id.to_ne_bytes())?;
} }
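
For reference, the key written by the extraction step now starts with the proximity byte, so all pairs sharing a proximity are contiguous in the sorter. A standalone sketch of that layout (plain `Vec<u8>`, no codecs):

```rust
// New key layout for word_pair_proximity_docids: [proximity][word1][0][word2].
fn word_pair_key(proximity: u8, word1: &str, word2: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(1 + word1.len() + 1 + word2.len());
    key.push(proximity);
    key.extend_from_slice(word1.as_bytes());
    key.push(0);
    key.extend_from_slice(word2.as_bytes());
    key
}

fn main() {
    assert_eq!(word_pair_key(2, "good", "dog"), b"\x02good\0dog".to_vec());
}
```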

View File

@ -36,8 +36,8 @@ use crate::documents::{obkv_to_object, DocumentsBatchReader};
use crate::error::UserError; use crate::error::UserError;
pub use crate::update::index_documents::helpers::CursorClonableMmap; pub use crate::update::index_documents::helpers::CursorClonableMmap;
use crate::update::{ use crate::update::{
self, Facets, IndexerConfig, UpdateIndexingStep, WordPrefixDocids, self, Facets, IndexerConfig, PrefixWordPairsProximityDocids, UpdateIndexingStep,
WordPrefixPairProximityDocids, WordPrefixPositionDocids, WordsPrefixesFst, WordPrefixDocids, WordPrefixPositionDocids, WordsPrefixesFst,
}; };
use crate::{Index, Result, RoaringBitmapCodec}; use crate::{Index, Result, RoaringBitmapCodec};
@ -522,12 +522,13 @@ where
if let Some(word_pair_proximity_docids) = word_pair_proximity_docids { if let Some(word_pair_proximity_docids) = word_pair_proximity_docids {
// Run the word prefix pair proximity docids update operation. // Run the word prefix pair proximity docids update operation.
let mut builder = WordPrefixPairProximityDocids::new(self.wtxn, self.index); PrefixWordPairsProximityDocids::new(
builder.chunk_compression_type = self.indexer_config.chunk_compression_type; self.wtxn,
builder.chunk_compression_level = self.indexer_config.chunk_compression_level; self.index,
builder.max_nb_chunks = self.indexer_config.max_nb_chunks; self.indexer_config.chunk_compression_type,
builder.max_memory = self.indexer_config.max_memory; self.indexer_config.chunk_compression_level,
builder.execute( )
.execute(
word_pair_proximity_docids, word_pair_proximity_docids,
&new_prefix_fst_words, &new_prefix_fst_words,
&common_prefix_fst_words, &common_prefix_fst_words,

View File

@ -6,10 +6,10 @@ pub use self::index_documents::{
DocumentAdditionResult, DocumentId, IndexDocuments, IndexDocumentsConfig, IndexDocumentsMethod, DocumentAdditionResult, DocumentId, IndexDocuments, IndexDocumentsConfig, IndexDocumentsMethod,
}; };
pub use self::indexer_config::IndexerConfig; pub use self::indexer_config::IndexerConfig;
pub use self::prefix_word_pairs::PrefixWordPairsProximityDocids;
pub use self::settings::{Setting, Settings}; pub use self::settings::{Setting, Settings};
pub use self::update_step::UpdateIndexingStep; pub use self::update_step::UpdateIndexingStep;
pub use self::word_prefix_docids::WordPrefixDocids; pub use self::word_prefix_docids::WordPrefixDocids;
pub use self::word_prefix_pair_proximity_docids::WordPrefixPairProximityDocids;
pub use self::words_prefix_position_docids::WordPrefixPositionDocids; pub use self::words_prefix_position_docids::WordPrefixPositionDocids;
pub use self::words_prefixes_fst::WordsPrefixesFst; pub use self::words_prefixes_fst::WordsPrefixesFst;
@ -19,9 +19,9 @@ mod delete_documents;
mod facets; mod facets;
mod index_documents; mod index_documents;
mod indexer_config; mod indexer_config;
mod prefix_word_pairs;
mod settings; mod settings;
mod update_step; mod update_step;
mod word_prefix_docids; mod word_prefix_docids;
mod word_prefix_pair_proximity_docids;
mod words_prefix_position_docids; mod words_prefix_position_docids;
mod words_prefixes_fst; mod words_prefixes_fst;

View File

@ -0,0 +1,241 @@
use std::borrow::Cow;
use std::collections::HashSet;
use std::io::BufReader;
use grenad::CompressionType;
use heed::types::ByteSlice;
use super::index_documents::{merge_cbo_roaring_bitmaps, CursorClonableMmap};
use crate::{Index, Result};
mod prefix_word;
mod word_prefix;
pub use prefix_word::index_prefix_word_database;
pub use word_prefix::index_word_prefix_database;
pub struct PrefixWordPairsProximityDocids<'t, 'u, 'i> {
wtxn: &'t mut heed::RwTxn<'i, 'u>,
index: &'i Index,
max_proximity: u8,
max_prefix_length: usize,
chunk_compression_type: CompressionType,
chunk_compression_level: Option<u32>,
}
impl<'t, 'u, 'i> PrefixWordPairsProximityDocids<'t, 'u, 'i> {
pub fn new(
wtxn: &'t mut heed::RwTxn<'i, 'u>,
index: &'i Index,
chunk_compression_type: CompressionType,
chunk_compression_level: Option<u32>,
) -> Self {
Self {
wtxn,
index,
max_proximity: 4,
max_prefix_length: 2,
chunk_compression_type,
chunk_compression_level,
}
}
/// Set the maximum proximity required to make a prefix be part of the words prefixes
/// database. If two words are further apart than this threshold, the associated documents will
/// not be part of the prefix database.
///
/// Default value is 4. This value must be lower than or equal to 7 and will be clamped
/// to this bound otherwise.
pub fn max_proximity(&mut self, value: u8) -> &mut Self {
self.max_proximity = value.min(7);
self
}
/// Set the maximum length the prefix of a word pair is allowed to have to be part of the words
/// prefixes database. If the prefix length is higher than the threshold, the associated documents
/// will not be part of the prefix database.
///
/// Default value is 2.
pub fn max_prefix_length(&mut self, value: usize) -> &mut Self {
self.max_prefix_length = value;
self
}
#[logging_timer::time("WordPrefixPairProximityDocids::{}")]
pub fn execute<'a>(
self,
new_word_pair_proximity_docids: grenad::Reader<CursorClonableMmap>,
new_prefix_fst_words: &'a [String],
common_prefix_fst_words: &[&'a [String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
) -> Result<()> {
index_word_prefix_database(
self.wtxn,
self.index.word_pair_proximity_docids,
self.index.word_prefix_pair_proximity_docids,
self.max_proximity,
self.max_prefix_length,
new_word_pair_proximity_docids.clone(),
new_prefix_fst_words,
common_prefix_fst_words,
del_prefix_fst_words,
self.chunk_compression_type,
self.chunk_compression_level,
)?;
index_prefix_word_database(
self.wtxn,
self.index.word_pair_proximity_docids,
self.index.prefix_word_pair_proximity_docids,
self.max_proximity,
self.max_prefix_length,
new_word_pair_proximity_docids,
new_prefix_fst_words,
common_prefix_fst_words,
del_prefix_fst_words,
self.chunk_compression_type,
self.chunk_compression_level,
)?;
Ok(())
}
}
// This is adapted from `sorter_into_lmdb_database`
pub fn insert_into_database(
wtxn: &mut heed::RwTxn,
database: heed::PolyDatabase,
new_key: &[u8],
new_value: &[u8],
) -> Result<()> {
let mut iter = database.prefix_iter_mut::<_, ByteSlice, ByteSlice>(wtxn, new_key)?;
match iter.next().transpose()? {
Some((key, old_val)) if new_key == key => {
let val =
merge_cbo_roaring_bitmaps(key, &[Cow::Borrowed(old_val), Cow::Borrowed(new_value)])
.map_err(|_| {
// TODO just wrap this error?
crate::error::InternalError::IndexingMergingKeys {
process: "get-put-merge",
}
})?;
// safety: we use the new_key, not the one from the database iterator, to avoid undefined behaviour
unsafe { iter.put_current(new_key, &val)? };
}
_ => {
drop(iter);
database.put::<_, ByteSlice, ByteSlice>(wtxn, new_key, new_value)?;
}
}
Ok(())
}
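
The function above is essentially a get-merge-put: if the key already exists in the database, the old and new values are merged with `merge_cbo_roaring_bitmaps`; otherwise the new value is written as-is. A toy, std-only sketch of that behaviour, where a `BTreeMap` of id sets stands in for the LMDB database and the bitmap merge:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-in for the get-merge-put: merge on an existing key, plain insert otherwise.
fn insert_into_database(
    db: &mut BTreeMap<Vec<u8>, BTreeSet<u32>>,
    new_key: &[u8],
    new_value: &BTreeSet<u32>,
) {
    db.entry(new_key.to_vec())
        .and_modify(|old| old.extend(new_value.iter().copied()))
        .or_insert_with(|| new_value.clone());
}

fn main() {
    let mut db = BTreeMap::new();
    let docids1: BTreeSet<u32> = [1, 2].into_iter().collect();
    let docids2: BTreeSet<u32> = [2, 3].into_iter().collect();
    insert_into_database(&mut db, b"\x01good\0do", &docids1);
    insert_into_database(&mut db, b"\x01good\0do", &docids2);
    let expected: BTreeSet<u32> = [1, 2, 3].into_iter().collect();
    assert_eq!(db[&b"\x01good\0do"[..]], expected);
}
```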
// This is adapted from `sorter_into_lmdb_database` and `write_into_lmdb_database`,
// but it uses `append` if the database is empty, and it assumes that the values in the
// writer don't conflict with values in the database.
pub fn write_into_lmdb_database_without_merging(
wtxn: &mut heed::RwTxn,
database: heed::PolyDatabase,
writer: grenad::Writer<std::fs::File>,
) -> Result<()> {
let file = writer.into_inner()?;
let reader = grenad::Reader::new(BufReader::new(file))?;
if database.is_empty(wtxn)? {
let mut out_iter = database.iter_mut::<_, ByteSlice, ByteSlice>(wtxn)?;
let mut cursor = reader.into_cursor()?;
while let Some((k, v)) = cursor.move_on_next()? {
// safety: the key comes from the grenad reader, not the database
unsafe { out_iter.append(k, v)? };
}
} else {
let mut cursor = reader.into_cursor()?;
while let Some((k, v)) = cursor.move_on_next()? {
database.put::<_, ByteSlice, ByteSlice>(wtxn, k, v)?;
}
}
Ok(())
}
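
Similarly, a toy sketch of the append-or-put choice made above: when the destination is empty, the sorted grenad entries can be bulk-appended; otherwise each entry is written with a plain put, relying on the stated assumption that the new keys do not conflict with existing ones. A `BTreeMap` stands in for LMDB and a sorted `Vec` for the grenad reader:

```rust
use std::collections::BTreeMap;

fn write_without_merging(
    db: &mut BTreeMap<Vec<u8>, Vec<u8>>,
    sorted_entries: Vec<(Vec<u8>, Vec<u8>)>,
) {
    if db.is_empty() {
        // Nothing to merge with: bulk-load the already sorted entries.
        db.extend(sorted_entries);
    } else {
        // The entries are assumed not to conflict with existing keys,
        // so a plain put per entry is enough.
        for (k, v) in sorted_entries {
            db.insert(k, v);
        }
    }
}

fn main() {
    let mut db = BTreeMap::new();
    write_without_merging(&mut db, vec![(b"a".to_vec(), b"1".to_vec())]);
    write_without_merging(&mut db, vec![(b"b".to_vec(), b"2".to_vec())]);
    assert_eq!(db.len(), 2);
}
```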
#[cfg(test)]
mod tests {
use std::io::Cursor;
use crate::db_snap;
use crate::documents::{DocumentsBatchBuilder, DocumentsBatchReader};
use crate::index::tests::TempIndex;
fn documents_with_enough_different_words_for_prefixes(prefixes: &[&str]) -> Vec<crate::Object> {
let mut documents = Vec::new();
for prefix in prefixes {
for i in 0..50 {
documents.push(
serde_json::json!({
"text": format!("{prefix}{i:x}"),
})
.as_object()
.unwrap()
.clone(),
)
}
}
documents
}
#[test]
fn test_update() {
let mut index = TempIndex::new();
index.index_documents_config.words_prefix_threshold = Some(50);
index.index_documents_config.autogenerate_docids = true;
index
.update_settings(|settings| {
settings.set_searchable_fields(vec!["text".to_owned()]);
})
.unwrap();
let batch_reader_from_documents = |documents| {
let mut builder = DocumentsBatchBuilder::new(Vec::new());
for object in documents {
builder.append_json_object(&object).unwrap();
}
DocumentsBatchReader::from_reader(Cursor::new(builder.into_inner().unwrap())).unwrap()
};
let mut documents = documents_with_enough_different_words_for_prefixes(&["a", "be"]);
// now we add some documents where the text should populate the word_prefix_pair_proximity_docids database
documents.push(
serde_json::json!({
"text": "At an amazing and beautiful house"
})
.as_object()
.unwrap()
.clone(),
);
documents.push(
serde_json::json!({
"text": "The bell rings at 5 am"
})
.as_object()
.unwrap()
.clone(),
);
let documents = batch_reader_from_documents(documents);
index.add_documents(documents).unwrap();
db_snap!(index, word_prefix_pair_proximity_docids, "initial");
let mut documents = documents_with_enough_different_words_for_prefixes(&["am", "an"]);
documents.push(
serde_json::json!({
"text": "At an extraordinary house"
})
.as_object()
.unwrap()
.clone(),
);
let documents = batch_reader_from_documents(documents);
index.add_documents(documents).unwrap();
db_snap!(index, word_pair_proximity_docids, "update");
db_snap!(index, word_prefix_pair_proximity_docids, "update");
db_snap!(index, prefix_word_pair_proximity_docids, "update");
}
}

View File

@ -0,0 +1,182 @@
use std::borrow::Cow;
use std::collections::{BTreeMap, HashSet};
use grenad::CompressionType;
use heed::types::ByteSlice;
use heed::BytesDecode;
use log::debug;
use crate::update::index_documents::{create_writer, CursorClonableMmap};
use crate::update::prefix_word_pairs::{
insert_into_database, write_into_lmdb_database_without_merging,
};
use crate::{CboRoaringBitmapCodec, Result, U8StrStrCodec, UncheckedU8StrStrCodec};
#[logging_timer::time]
pub fn index_prefix_word_database(
wtxn: &mut heed::RwTxn,
word_pair_proximity_docids: heed::Database<U8StrStrCodec, CboRoaringBitmapCodec>,
prefix_word_pair_proximity_docids: heed::Database<U8StrStrCodec, CboRoaringBitmapCodec>,
max_proximity: u8,
max_prefix_length: usize,
new_word_pair_proximity_docids: grenad::Reader<CursorClonableMmap>,
new_prefix_fst_words: &[String],
common_prefix_fst_words: &[&[String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
chunk_compression_type: CompressionType,
chunk_compression_level: Option<u32>,
) -> Result<()> {
let max_proximity = max_proximity - 1;
debug!("Computing and writing the word prefix pair proximity docids into LMDB on disk...");
let common_prefixes: Vec<_> = common_prefix_fst_words
.into_iter()
.map(|s| s.into_iter())
.flatten()
.map(|s| s.as_str())
.filter(|s| s.len() <= max_prefix_length)
.collect();
for proximity in 1..max_proximity {
for prefix in common_prefixes.iter() {
let mut prefix_key = vec![];
prefix_key.push(proximity);
prefix_key.extend_from_slice(prefix.as_bytes());
let mut cursor = new_word_pair_proximity_docids.clone().into_prefix_iter(prefix_key)?;
// This is the core of the algorithm
execute_on_word_pairs_and_prefixes(
proximity,
prefix.as_bytes(),
// the next two arguments tell how to iterate over the new word pairs
&mut cursor,
|cursor| {
if let Some((key, value)) = cursor.next()? {
let (_, _, word2) = UncheckedU8StrStrCodec::bytes_decode(key)
.ok_or(heed::Error::Decoding)?;
Ok(Some((word2, value)))
} else {
Ok(None)
}
},
// and this argument tells what to do with each new key (proximity, prefix, word2) and value (roaring bitmap)
|key, value| {
insert_into_database(
wtxn,
*prefix_word_pair_proximity_docids.as_polymorph(),
key,
value,
)
},
)?;
}
}
// Now we do the same thing with the new prefixes and all word pairs in the DB
let new_prefixes: Vec<_> = new_prefix_fst_words
.into_iter()
.map(|s| s.as_str())
.filter(|s| s.len() <= max_prefix_length)
.collect();
// Since we read the DB, we can't write to it directly, so we add each new (proximity, prefix, word2)
// element in an intermediary grenad
let mut writer =
create_writer(chunk_compression_type, chunk_compression_level, tempfile::tempfile()?);
for proximity in 1..max_proximity {
for prefix in new_prefixes.iter() {
let mut prefix_key = vec![];
prefix_key.push(proximity);
prefix_key.extend_from_slice(prefix.as_bytes());
let mut db_iter = word_pair_proximity_docids
.as_polymorph()
.prefix_iter::<_, ByteSlice, ByteSlice>(wtxn, prefix_key.as_slice())?
.remap_key_type::<UncheckedU8StrStrCodec>();
execute_on_word_pairs_and_prefixes(
proximity,
prefix.as_bytes(),
&mut db_iter,
|db_iter| {
db_iter
.next()
.transpose()
.map(|x| x.map(|((_, _, word2), value)| (word2, value)))
.map_err(|e| e.into())
},
|key, value| writer.insert(key, value).map_err(|e| e.into()),
)?;
drop(db_iter);
}
}
// and then we write the grenad into the DB
// Since the grenad contains only new prefixes, we know in advance that none
// of its elements already exist in the DB, thus there is no need to specify
// how to merge conflicting elements
write_into_lmdb_database_without_merging(
wtxn,
*prefix_word_pair_proximity_docids.as_polymorph(),
writer,
)?;
// All of the prefix-word pairs in the database whose prefix is
// contained in the `del_prefix_fst_words` set must be removed as well.
if !del_prefix_fst_words.is_empty() {
let mut iter =
prefix_word_pair_proximity_docids.remap_data_type::<ByteSlice>().iter_mut(wtxn)?;
while let Some(((_, prefix, _), _)) = iter.next().transpose()? {
if del_prefix_fst_words.contains(prefix.as_bytes()) {
// Delete this entry as its prefix is no longer in the words prefix fst.
unsafe { iter.del_current()? };
}
}
}
Ok(())
}
/// This is the core of the algorithm to initialise the Prefix Word Pair Proximity Docids database.
///
/// Its arguments are:
/// - an iterator over the words following the given `prefix` with the given `proximity`
/// - a closure to describe how to handle the new computed (proximity, prefix, word2) elements
fn execute_on_word_pairs_and_prefixes<I>(
proximity: u8,
prefix: &[u8],
iter: &mut I,
mut next_word2_and_docids: impl for<'a> FnMut(&'a mut I) -> Result<Option<(&'a [u8], &'a [u8])>>,
mut insert: impl for<'a> FnMut(&'a [u8], &'a [u8]) -> Result<()>,
) -> Result<()> {
let mut batch: BTreeMap<Vec<u8>, Vec<Cow<'static, [u8]>>> = BTreeMap::default();
// Memory usage check:
// The content of the loop will be called for each `word2` that follows a word beginning
// with `prefix` with the given proximity.
// In practice, I don't think the batch can ever get too big.
while let Some((word2, docids)) = next_word2_and_docids(iter)? {
let entry = batch.entry(word2.to_owned()).or_default();
entry.push(Cow::Owned(docids.to_owned()));
}
let mut key_buffer = Vec::with_capacity(512);
key_buffer.push(proximity);
key_buffer.extend_from_slice(prefix);
key_buffer.push(0);
let mut value_buffer = Vec::with_capacity(65_536);
for (word2, docids) in batch {
key_buffer.truncate(prefix.len() + 2);
value_buffer.clear();
key_buffer.extend_from_slice(&word2);
let data = if docids.len() > 1 {
CboRoaringBitmapCodec::merge_into(&docids, &mut value_buffer)?;
value_buffer.as_slice()
} else {
&docids[0]
};
insert(key_buffer.as_slice(), data)?;
}
Ok(())
}
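
To make the flow above concrete, here is a self-contained sketch (not the milli implementation) of what one `(proximity, prefix)` iteration computes: the docids of every `(word1, word2)` pair whose `word1` starts with the prefix, grouped by `word2`, then emitted under a `[proximity][prefix][0][word2]` key. Plain `BTreeMap`/`BTreeSet` values stand in for grenad, LMDB and the roaring bitmaps:

```rust
use std::collections::{BTreeMap, BTreeSet};

fn prefix_word_entries(
    proximity: u8,
    prefix: &str,
    // (proximity, word1, word2) -> docids, already sorted by key
    word_pairs: &BTreeMap<(u8, String, String), BTreeSet<u32>>,
) -> Vec<(Vec<u8>, BTreeSet<u32>)> {
    let mut batch: BTreeMap<String, BTreeSet<u32>> = BTreeMap::new();
    for ((prox, word1, word2), docids) in word_pairs {
        if *prox == proximity && word1.starts_with(prefix) {
            batch.entry(word2.clone()).or_default().extend(docids.iter().copied());
        }
    }
    batch
        .into_iter()
        .map(|(word2, docids)| {
            let mut key = vec![proximity];
            key.extend_from_slice(prefix.as_bytes());
            key.push(0);
            key.extend_from_slice(word2.as_bytes());
            (key, docids)
        })
        .collect()
}

fn main() {
    let mut pairs = BTreeMap::new();
    pairs.insert((1, "dog".to_string(), "house".to_string()), BTreeSet::from([7u32]));
    pairs.insert((1, "door".to_string(), "house".to_string()), BTreeSet::from([9u32]));
    let entries = prefix_word_entries(1, "do", &pairs);
    // Both pairs share the prefix "do", so their docids are merged under "house".
    assert_eq!(entries.len(), 1);
    assert_eq!(entries[0].1, BTreeSet::from([7u32, 9u32]));
}
```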

View File

@ -0,0 +1,26 @@
---
source: milli/src/update/prefix_word_pairs/mod.rs
---
1 5 a [101, ]
1 amazing a [100, ]
1 an a [100, ]
1 and b [100, ]
1 and be [100, ]
1 at a [100, ]
1 rings a [101, ]
1 the b [101, ]
1 the be [101, ]
2 amazing b [100, ]
2 amazing be [100, ]
2 an a [100, ]
2 at a [100, 101, ]
2 bell a [101, ]
3 an b [100, ]
3 an be [100, ]
3 at a [100, ]
3 rings a [101, ]
3 the a [101, ]
4 at b [100, ]
4 at be [100, ]
4 bell a [101, ]

View File

@ -0,0 +1,29 @@
---
source: milli/src/update/prefix_word_pairs/mod.rs
---
1 a 5 [101, ]
1 a amazing [100, ]
1 a an [100, 202, ]
1 a and [100, ]
1 a beautiful [100, ]
1 a extraordinary [202, ]
1 am and [100, ]
1 an amazing [100, ]
1 an beautiful [100, ]
1 an extraordinary [202, ]
1 b house [100, ]
1 b rings [101, ]
1 be house [100, ]
1 be rings [101, ]
2 a am [101, ]
2 a amazing [100, ]
2 a and [100, ]
2 a beautiful [100, ]
2 a extraordinary [202, ]
2 a house [100, 202, ]
2 am beautiful [100, ]
2 an and [100, ]
2 an house [100, 202, ]
2 b at [101, ]
2 be at [101, ]

View File

@ -0,0 +1,39 @@
---
source: milli/src/update/prefix_word_pairs/mod.rs
---
1 5 am [101, ]
1 amazing and [100, ]
1 an amazing [100, ]
1 an extraordinary [202, ]
1 and beautiful [100, ]
1 at 5 [101, ]
1 at an [100, 202, ]
1 beautiful house [100, ]
1 bell rings [101, ]
1 extraordinary house [202, ]
1 rings at [101, ]
1 the bell [101, ]
2 amazing beautiful [100, ]
2 an and [100, ]
2 an house [202, ]
2 and house [100, ]
2 at am [101, ]
2 at amazing [100, ]
2 at extraordinary [202, ]
2 bell at [101, ]
2 rings 5 [101, ]
2 the rings [101, ]
3 amazing house [100, ]
3 an beautiful [100, ]
3 at and [100, ]
3 at house [202, ]
3 bell 5 [101, ]
3 rings am [101, ]
3 the at [101, ]
4 an house [100, ]
4 at beautiful [100, ]
4 bell am [101, ]
4 the 5 [101, ]
5 at house [100, ]
5 the am [101, ]

View File

@ -0,0 +1,35 @@
---
source: milli/src/update/prefix_word_pairs/mod.rs
---
1 5 a [101, ]
1 5 am [101, ]
1 amazing a [100, ]
1 amazing an [100, ]
1 an a [100, ]
1 an am [100, ]
1 and b [100, ]
1 and be [100, ]
1 at a [100, 202, ]
1 at an [100, 202, ]
1 rings a [101, ]
1 the b [101, ]
1 the be [101, ]
2 amazing b [100, ]
2 amazing be [100, ]
2 an a [100, ]
2 an an [100, ]
2 at a [100, 101, ]
2 at am [100, 101, ]
2 bell a [101, ]
3 an b [100, ]
3 an be [100, ]
3 at a [100, ]
3 at an [100, ]
3 rings a [101, ]
3 rings am [101, ]
3 the a [101, ]
4 at b [100, ]
4 at be [100, ]
4 bell a [101, ]
4 bell am [101, ]

View File

@ -1,7 +1,6 @@
/*! /*!
## What is WordPrefixPairProximityDocids?
The word-prefix-pair-proximity-docids database is a database whose keys are of The word-prefix-pair-proximity-docids database is a database whose keys are of
the form (`word`, `prefix`, `proximity`) and the values are roaring bitmaps of the form `(proximity, word, prefix)` and the values are roaring bitmaps of
the documents which contain `word` followed by another word starting with the documents which contain `word` followed by another word starting with
`prefix` at a distance of `proximity`. `prefix` at a distance of `proximity`.
@ -23,127 +22,100 @@ dog
Note that only prefixes which correspond to more than a certain number of Note that only prefixes which correspond to more than a certain number of
different words from the database are included in this list. different words from the database are included in this list.
* a sorted list of word pairs and the distance between them (i.e. proximity), * a sorted list of proximities and word pairs (the proximity is the distance between the two words),
* associated with a roaring bitmap, such as: associated with a roaring bitmap, such as:
```text ```text
good dog 3 -> docids1: [2, 5, 6] 1 good doggo -> docids1: [8]
good doggo 1 -> docids2: [8] 1 good door -> docids2: [7, 19, 20]
good dogma 1 -> docids3: [7, 19, 20] 1 good ghost -> docids3: [1]
good ghost 2 -> docids4: [1] 2 good dog -> docids4: [2, 5, 6]
horror cathedral 4 -> docids5: [1, 2] 2 horror cathedral -> docids5: [1, 2]
``` ```
I illustrate a simplified version of the algorithm to create the word-prefix I illustrate a simplified version of the algorithm to create the word-prefix
pair-proximity database below: pair-proximity database below:
1. **Outer loop:** First, we iterate over each word pair and its proximity: 1. **Outer loop:** First, we iterate over each proximity and word pair:
```text ```text
proximity: 1
word1 : good word1 : good
word2 : dog word2 : doggo
proximity: 3
``` ```
2. **Inner loop:** Then, we iterate over all the prefixes of `word2` that are 2. **Inner loop:** Then, we iterate over all the prefixes of `word2` that are
in the list of sorted prefixes. And we insert the key (`prefix`, `proximity`) in the list of sorted prefixes. And we insert the key `prefix`
and the value (`docids`) to a sorted map which we call the batch. For example, and the value (`docids`) to a sorted map which we call the batch. For example,
at the end of the first inner loop, we may have: at the end of the first inner loop, we may have:
```text ```text
Outer loop 1: Outer loop 1:
------------------------------ ------------------------------
proximity: 1
word1 : good word1 : good
word2 : dog word2 : doggo
proximity: 3
docids : docids1 docids : docids1
prefixes: [d, do, dog] prefixes: [d, do, dog]
batch: [ batch: [
(d, 3) -> [docids1] d, -> [docids1]
(do, 3) -> [docids1] do -> [docids1]
(dog, 3) -> [docids1] dog -> [docids1]
] ]
``` ```
3. For illustration purpose, let's run through a second iteration of the outer loop: 3. For illustration purpose, let's run through a second iteration of the outer loop:
```text ```text
Outer loop 2: Outer loop 2:
------------------------------ ------------------------------
word1 : good
word2 : doggo
proximity: 1 proximity: 1
word1 : good
word2 : door
docids : docids2 docids : docids2
prefixes: [d, do, dog] prefixes: [d, do, doo]
batch: [ batch: [
(d, 1) -> [docids2] d -> [docids1, docids2]
(d, 3) -> [docids1] do -> [docids1, docids2]
(do, 1) -> [docids2] dog -> [docids1]
(do, 3) -> [docids1] doo -> [docids2]
(dog, 1) -> [docids2]
(dog, 3) -> [docids1]
]
```
Notice that the batch had to re-order some (`prefix`, `proximity`) keys: some
of the elements inserted in the second iteration of the outer loop appear
*before* elements from the first iteration.
4. And a third:
```text
Outer loop 3:
------------------------------
word1 : good
word2 : dogma
proximity: 1
docids : docids3
prefixes: [d, do, dog]
batch: [
(d, 1) -> [docids2, docids3]
(d, 3) -> [docids1]
(do, 1) -> [docids2, docids3]
(do, 3) -> [docids1]
(dog, 1) -> [docids2, docids3]
(dog, 3) -> [docids1]
] ]
``` ```
Notice that there were some conflicts which were resolved by merging the Notice that there were some conflicts which were resolved by merging the
conflicting values together. conflicting values together. Also, an additional prefix was added at the
end of the batch.
5. On the fourth iteration of the outer loop, we have: 4. On the third iteration of the outer loop, we have:
```text ```text
Outer loop 4: Outer loop 3:
------------------------------ ------------------------------
proximity: 1
word1 : good word1 : good
word2 : ghost word2 : ghost
proximity: 2
``` ```
Because `word2` begins with a different letter than the previous `word2`, Because `word2` begins with a different letter than the previous `word2`,
we know that: we know that all the prefixes of `word2` are greater than the prefixes of the previous word2
1. All the prefixes of `word2` are greater than the prefixes of the previous word2
2. And therefore, every instance of (`word2`, `prefix`) will be greater than
any element in the batch.
Therefore, we know that we can insert every element from the batch into the Therefore, we know that we can insert every element from the batch into the
database before proceeding any further. This operation is called database before proceeding any further. This operation is called
flushing the batch. Flushing the batch should also be done whenever `word1` flushing the batch. Flushing the batch should also be done whenever:
is different than the previous `word1`. * `proximity` is different than the previous `proximity`.
* `word1` is different than the previous `word1`.
* `word2` starts with a different letter than the previous word2
6. **Flushing the batch:** to flush the batch, we look at the `word1` and 6. **Flushing the batch:** to flush the batch, we iterate over its elements:
iterate over the elements of the batch in sorted order:
```text ```text
Flushing Batch loop 1: Flushing Batch loop 1:
------------------------------ ------------------------------
word1 : good proximity : 1
word2 : d word1 : good
proximity: 1 prefix : d
docids : [docids2, docids3] docids : [docids2, docids3]
``` ```
We then merge the array of `docids` (of type `Vec<Vec<u8>>`) using We then merge the array of `docids` (of type `Vec<Vec<u8>>`) using
`merge_cbo_roaring_bitmap` in order to get a single byte vector representing a `merge_cbo_roaring_bitmap` in order to get a single byte vector representing a
roaring bitmap of all the document ids where `word1` is followed by `prefix` roaring bitmap of all the document ids where `word1` is followed by `prefix`
at a distance of `proximity`. at a distance of `proximity`.
Once we have done that, we insert (`word1`, `prefix`, `proximity`) -> `merged_docids` Once we have done that, we insert `(proximity, word1, prefix) -> merged_docids`
into the database. into the database.
7. That's it! ... except... 7. That's it! ... except...
@ -166,7 +138,7 @@ inputs described above, which come from different places:
2. `word_pairs_db`, which is the list of word pairs from the database. 2. `word_pairs_db`, which is the list of word pairs from the database.
This list includes all elements in `new_word_pairs` since `new_word_pairs` This list includes all elements in `new_word_pairs` since `new_word_pairs`
was added to the database prior to calling the `WordPrefixPairProximityDocIds::execute` was added to the database prior to calling the `index_word_prefix_database`
function. function.
To update the prefix database correctly, we call the algorithm described earlier first To update the prefix database correctly, we call the algorithm described earlier first
@ -184,199 +156,146 @@ Note, also, that since we read data from the database when iterating over
`word_pairs_db`, we cannot insert the computed word-prefix-pair-proximity- `word_pairs_db`, we cannot insert the computed word-prefix-pair-proximity-
docids from the batch directly into the database (we would have a concurrent docids from the batch directly into the database (we would have a concurrent
reader and writer). Therefore, when calling the algorithm on reader and writer). Therefore, when calling the algorithm on
(`new_prefixes`, `word_pairs_db`), we insert the computed `(new_prefixes, word_pairs_db)`, we insert the computed
((`word`, `prefix`, `proximity`), `docids`) elements in an intermediary grenad `((proximity, word, prefix), docids)` elements in an intermediary grenad
Writer instead of the DB. At the end of the outer loop, we finally read from Writer instead of the DB. At the end of the outer loop, we finally read from
the grenad and insert its elements in the database. the grenad and insert its elements in the database.
*/ */
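
As a quick illustration of the flushing rule described in the documentation above, here is a toy, std-only sketch that scans a sorted `(proximity, word1, word2)` stream and reports where the batch would be flushed (hypothetical data and function name, not the milli types):

```rust
// Flush whenever the proximity, word1, or the first byte of word2 changes:
// at that point, every key derived from the current batch is already greater
// than anything flushed before, so it can be written out in sorted order.
fn flush_points(sorted_pairs: &[(u8, &str, &str)]) -> Vec<usize> {
    let mut flushes = Vec::new();
    let mut prev: Option<(u8, &str, u8)> = None;
    for (i, (proximity, word1, word2)) in sorted_pairs.iter().enumerate() {
        let word2_start = word2.as_bytes()[0];
        if let Some((prev_proximity, prev_word1, prev_start)) = prev {
            if prev_proximity != *proximity || prev_word1 != *word1 || prev_start != word2_start {
                flushes.push(i);
            }
        }
        prev = Some((*proximity, *word1, word2_start));
    }
    flushes
}

fn main() {
    let pairs: [(u8, &str, &str); 4] = [
        (1, "good", "doggo"),
        (1, "good", "door"),  // same proximity, word1 and first letter: no flush
        (1, "good", "ghost"), // word2 starts with a new letter: flush before this pair
        (2, "good", "dog"),   // proximity changed: flush again
    ];
    assert_eq!(flush_points(&pairs), vec![2usize, 3]);
}
```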
use std::borrow::Cow; use std::borrow::Cow;
use std::collections::HashSet; use std::collections::HashSet;
use std::io::BufReader;
use grenad::CompressionType; use grenad::CompressionType;
use heed::types::ByteSlice; use heed::types::ByteSlice;
use heed::BytesDecode; use heed::BytesDecode;
use log::debug; use log::debug;
use crate::update::index_documents::{ use crate::update::index_documents::{create_writer, CursorClonableMmap};
create_writer, merge_cbo_roaring_bitmaps, CursorClonableMmap, use crate::update::prefix_word_pairs::{
insert_into_database, write_into_lmdb_database_without_merging,
}; };
use crate::{CboRoaringBitmapCodec, Index, Result, UncheckedStrStrU8Codec}; use crate::{CboRoaringBitmapCodec, Result, U8StrStrCodec, UncheckedU8StrStrCodec};
pub struct WordPrefixPairProximityDocids<'t, 'u, 'i> { #[logging_timer::time]
wtxn: &'t mut heed::RwTxn<'i, 'u>, pub fn index_word_prefix_database(
index: &'i Index, wtxn: &mut heed::RwTxn,
pub(crate) chunk_compression_type: CompressionType, word_pair_proximity_docids: heed::Database<U8StrStrCodec, CboRoaringBitmapCodec>,
pub(crate) chunk_compression_level: Option<u32>, word_prefix_pair_proximity_docids: heed::Database<U8StrStrCodec, CboRoaringBitmapCodec>,
pub(crate) max_nb_chunks: Option<usize>,
pub(crate) max_memory: Option<usize>,
max_proximity: u8, max_proximity: u8,
max_prefix_length: usize, max_prefix_length: usize,
} new_word_pair_proximity_docids: grenad::Reader<CursorClonableMmap>,
new_prefix_fst_words: &[String],
common_prefix_fst_words: &[&[String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
chunk_compression_type: CompressionType,
chunk_compression_level: Option<u32>,
) -> Result<()> {
debug!("Computing and writing the word prefix pair proximity docids into LMDB on disk...");
impl<'t, 'u, 'i> WordPrefixPairProximityDocids<'t, 'u, 'i> { // Make a prefix trie from the common prefixes that are shorter than self.max_prefix_length
pub fn new( let prefixes = PrefixTrieNode::from_sorted_prefixes(
wtxn: &'t mut heed::RwTxn<'i, 'u>, common_prefix_fst_words
index: &'i Index, .into_iter()
) -> WordPrefixPairProximityDocids<'t, 'u, 'i> { .map(|s| s.into_iter())
WordPrefixPairProximityDocids { .flatten()
wtxn, .map(|s| s.as_str())
index, .filter(|s| s.len() <= max_prefix_length),
chunk_compression_type: CompressionType::None, );
chunk_compression_level: None,
max_nb_chunks: None,
max_memory: None,
max_proximity: 4,
max_prefix_length: 2,
}
}
/// Set the maximum proximity required to make a prefix be part of the words prefixes // If the prefix trie is not empty, then we can iterate over all new
/// database. If two words are too far from the threshold the associated documents will // word pairs to look for new (proximity, word1, common_prefix) elements
/// not be part of the prefix database. // to insert in the DB
/// if !prefixes.is_empty() {
/// Default value is 4. This value must be lower or equal than 7 and will be clamped let mut cursor = new_word_pair_proximity_docids.into_cursor()?;
/// to this bound otherwise. // This is the core of the algorithm
pub fn max_proximity(&mut self, value: u8) -> &mut Self { execute_on_word_pairs_and_prefixes(
self.max_proximity = value.max(7); // the first two arguments tell how to iterate over the new word pairs
self &mut cursor,
} |cursor| {
if let Some((key, value)) = cursor.move_on_next()? {
/// Set the maximum length the prefix of a word pair is allowed to have to be part of the words let (proximity, word1, word2) =
/// prefixes database. If the prefix length is higher than the threshold, the associated documents UncheckedU8StrStrCodec::bytes_decode(key).ok_or(heed::Error::Decoding)?;
/// will not be part of the prefix database. Ok(Some(((proximity, word1, word2), value)))
/// } else {
/// Default value is 2. Ok(None)
pub fn max_prefix_length(&mut self, value: usize) -> &mut Self {
self.max_prefix_length = value;
self
}
#[logging_timer::time("WordPrefixPairProximityDocids::{}")]
pub fn execute<'a>(
self,
new_word_pair_proximity_docids: grenad::Reader<CursorClonableMmap>,
new_prefix_fst_words: &'a [String],
common_prefix_fst_words: &[&'a [String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
) -> Result<()> {
debug!("Computing and writing the word prefix pair proximity docids into LMDB on disk...");
// Make a prefix trie from the common prefixes that are shorter than self.max_prefix_length
let prefixes = PrefixTrieNode::from_sorted_prefixes(
common_prefix_fst_words
.iter()
.flat_map(|s| s.iter())
.map(|s| s.as_str())
.filter(|s| s.len() <= self.max_prefix_length),
);
// If the prefix trie is not empty, then we can iterate over all new
// word pairs to look for new (word1, common_prefix, proximity) elements
// to insert in the DB
if !prefixes.is_empty() {
let mut cursor = new_word_pair_proximity_docids.into_cursor()?;
// This is the core of the algorithm
execute_on_word_pairs_and_prefixes(
// the first two arguments tell how to iterate over the new word pairs
&mut cursor,
|cursor| {
if let Some((key, value)) = cursor.move_on_next()? {
let (word1, word2, proximity) = UncheckedStrStrU8Codec::bytes_decode(key)
.ok_or(heed::Error::Decoding)?;
Ok(Some(((word1, word2, proximity), value)))
} else {
Ok(None)
}
},
&prefixes,
self.max_proximity,
// and this argument tells what to do with each new key (word1, prefix, proximity) and value (roaring bitmap)
|key, value| {
insert_into_database(
self.wtxn,
*self.index.word_prefix_pair_proximity_docids.as_polymorph(),
key,
value,
)
},
)?;
}
// Now we do the same thing with the new prefixes and all word pairs in the DB
let prefixes = PrefixTrieNode::from_sorted_prefixes(
new_prefix_fst_words
.iter()
.map(|s| s.as_str())
.filter(|s| s.len() <= self.max_prefix_length),
);
if !prefixes.is_empty() {
let mut db_iter = self
.index
.word_pair_proximity_docids
.remap_key_type::<UncheckedStrStrU8Codec>()
.remap_data_type::<ByteSlice>()
.iter(self.wtxn)?;
// Since we read the DB, we can't write to it directly, so we add each new (word1, prefix, proximity)
// element in an intermediary grenad
let mut writer = create_writer(
self.chunk_compression_type,
self.chunk_compression_level,
tempfile::tempfile()?,
);
execute_on_word_pairs_and_prefixes(
&mut db_iter,
|db_iter| db_iter.next().transpose().map_err(|e| e.into()),
&prefixes,
self.max_proximity,
|key, value| writer.insert(key, value).map_err(|e| e.into()),
)?;
drop(db_iter);
// and then we write the grenad into the DB
// Since the grenad contains only new prefixes, we know in advance that none
// of its elements already exist in the DB, thus there is no need to specify
// how to merge conflicting elements
write_into_lmdb_database_without_merging(
self.wtxn,
*self.index.word_prefix_pair_proximity_docids.as_polymorph(),
writer,
)?;
}
// All of the word prefix pairs in the database that have a w2
// that is contained in the `suppr_pw` set must be removed as well.
if !del_prefix_fst_words.is_empty() {
let mut iter = self
.index
.word_prefix_pair_proximity_docids
.remap_data_type::<ByteSlice>()
.iter_mut(self.wtxn)?;
while let Some(((_, w2, _), _)) = iter.next().transpose()? {
if del_prefix_fst_words.contains(w2.as_bytes()) {
// Delete this entry as the w2 prefix is no more in the words prefix fst.
unsafe { iter.del_current()? };
} }
},
&prefixes,
max_proximity,
// and this argument tells what to do with each new key (proximity, word1, prefix) and value (roaring bitmap)
|key, value| {
insert_into_database(
wtxn,
*word_prefix_pair_proximity_docids.as_polymorph(),
key,
value,
)
},
)?;
}
// Now we do the same thing with the new prefixes and all word pairs in the DB
let prefixes = PrefixTrieNode::from_sorted_prefixes(
new_prefix_fst_words
.into_iter()
.map(|s| s.as_str())
.filter(|s| s.len() <= max_prefix_length),
);
if !prefixes.is_empty() {
let mut db_iter = word_pair_proximity_docids
.remap_key_type::<UncheckedU8StrStrCodec>()
.remap_data_type::<ByteSlice>()
.iter(wtxn)?;
// Since we read the DB, we can't write to it directly, so we add each new (proximity, word1, prefix)
// element in an intermediary grenad
let mut writer =
create_writer(chunk_compression_type, chunk_compression_level, tempfile::tempfile()?);
execute_on_word_pairs_and_prefixes(
&mut db_iter,
|db_iter| db_iter.next().transpose().map_err(|e| e.into()),
&prefixes,
max_proximity,
|key, value| writer.insert(key, value).map_err(|e| e.into()),
)?;
drop(db_iter);
// and then we write the grenad into the DB
// Since the grenad contains only new prefixes, we know in advance that none
// of its elements already exist in the DB, thus there is no need to specify
// how to merge conflicting elements
write_into_lmdb_database_without_merging(
wtxn,
*word_prefix_pair_proximity_docids.as_polymorph(),
writer,
)?;
}
// All of the word-prefix pairs in the database whose prefix is
// contained in the `del_prefix_fst_words` set must be removed as well.
if !del_prefix_fst_words.is_empty() {
let mut iter =
word_prefix_pair_proximity_docids.remap_data_type::<ByteSlice>().iter_mut(wtxn)?;
while let Some(((_, _, prefix), _)) = iter.next().transpose()? {
if del_prefix_fst_words.contains(prefix.as_bytes()) {
// Delete this entry as its prefix is no longer in the words prefix fst.
unsafe { iter.del_current()? };
} }
} }
Ok(())
} }
Ok(())
} }
/// This is the core of the algorithm to initialise the Word Prefix Pair Proximity Docids database. /// This is the core of the algorithm to initialise the Word Prefix Pair Proximity Docids database.
/// ///
/// Its main arguments are: /// Its main arguments are:
/// 1. a sorted iterator over ((word1, word2, proximity), docids) elements /// 1. a sorted iterator over ((proximity, word1, word2), docids) elements
/// 2. a prefix trie /// 2. a prefix trie
/// 3. a closure to describe how to handle the new computed (word1, prefix, proximity) elements /// 3. a closure to describe how to handle the new computed (proximity, word1, prefix) elements
/// ///
/// For more information about what this function does, read the module documentation. /// For more information about what this function does, read the module documentation.
fn execute_on_word_pairs_and_prefixes<I>( fn execute_on_word_pairs_and_prefixes<I>(
@ -384,7 +303,7 @@ fn execute_on_word_pairs_and_prefixes<I>(
mut next_word_pair_proximity: impl for<'a> FnMut( mut next_word_pair_proximity: impl for<'a> FnMut(
&'a mut I, &'a mut I,
) -> Result< ) -> Result<
Option<((&'a [u8], &'a [u8], u8), &'a [u8])>, Option<((u8, &'a [u8], &'a [u8]), &'a [u8])>,
>, >,
prefixes: &PrefixTrieNode, prefixes: &PrefixTrieNode,
max_proximity: u8, max_proximity: u8,
@ -402,10 +321,10 @@ fn execute_on_word_pairs_and_prefixes<I>(
let mut prefix_buffer = Vec::with_capacity(8); let mut prefix_buffer = Vec::with_capacity(8);
let mut merge_buffer = Vec::with_capacity(65_536); let mut merge_buffer = Vec::with_capacity(65_536);
while let Some(((word1, word2, proximity), data)) = next_word_pair_proximity(iter)? { while let Some(((proximity, word1, word2), data)) = next_word_pair_proximity(iter)? {
// skip this iteration if the proximity is over the threshold // stop indexing if the proximity is over the threshold
if proximity > max_proximity { if proximity > max_proximity {
continue; break;
}; };
let word2_start_different_than_prev = word2[0] != prev_word2_start; let word2_start_different_than_prev = word2[0] != prev_word2_start;
// if there were no potential prefixes for the previous word2 based on its first letter, // if there were no potential prefixes for the previous word2 based on its first letter,
@ -415,16 +334,21 @@ fn execute_on_word_pairs_and_prefixes<I>(
continue; continue;
} }
// if word1 is different than the previous word1 OR if the start of word2 is different // if the proximity is different to the previous one, OR
// than the previous start of word2, then we'll need to flush the batch // if word1 is different than the previous word1, OR
// if the start of word2 is different than the previous start of word2,
// THEN we'll need to flush the batch
let prox_different_than_prev = proximity != batch.proximity;
let word1_different_than_prev = word1 != batch.word1; let word1_different_than_prev = word1 != batch.word1;
if word1_different_than_prev || word2_start_different_than_prev { if prox_different_than_prev || word1_different_than_prev || word2_start_different_than_prev
{
batch.flush(&mut merge_buffer, &mut insert)?; batch.flush(&mut merge_buffer, &mut insert)?;
// don't forget to reset the value of batch.word1 and prev_word2_start // don't forget to reset the value of batch.word1 and prev_word2_start
if word1_different_than_prev { if word1_different_than_prev {
prefix_search_start.0 = 0; prefix_search_start.0 = 0;
batch.word1.clear(); batch.word1.clear();
batch.word1.extend_from_slice(word1); batch.word1.extend_from_slice(word1);
batch.proximity = proximity;
} }
if word2_start_different_than_prev { if word2_start_different_than_prev {
// word2_start_different_than_prev == true // word2_start_different_than_prev == true
@ -436,74 +360,70 @@ fn execute_on_word_pairs_and_prefixes<I>(
if !empty_prefixes { if !empty_prefixes {
// All conditions are satisfied, we can now insert each new prefix of word2 into the batch // All conditions are satisfied, we can now insert each new prefix of word2 into the batch
prefix_buffer.clear();
prefixes.for_each_prefix_of( prefixes.for_each_prefix_of(
word2, word2,
&mut prefix_buffer, &mut prefix_buffer,
&prefix_search_start, &prefix_search_start,
|prefix_buffer| { |prefix_buffer| {
let prefix_len = prefix_buffer.len(); batch.insert(&prefix_buffer, data.to_vec());
prefix_buffer.push(0);
prefix_buffer.push(proximity);
batch.insert(prefix_buffer, data.to_vec());
prefix_buffer.truncate(prefix_len);
}, },
); );
prefix_buffer.clear();
} }
} }
batch.flush(&mut merge_buffer, &mut insert)?; batch.flush(&mut merge_buffer, &mut insert)?;
Ok(()) Ok(())
} }
/** /**
A map structure whose keys are (prefix, proximity) and whose values are vectors of bitstrings (serialized roaring bitmaps). A map structure whose keys are prefixes and whose values are vectors of bitstrings (serialized roaring bitmaps).
The keys are sorted and conflicts are resolved by merging the vectors of bitstrings together. The keys are sorted and conflicts are resolved by merging the vectors of bitstrings together.
It is used to ensure that all ((word1, prefix, proximity), docids) are inserted into the database in sorted order and efficiently. It is used to ensure that all ((proximity, word1, prefix), docids) are inserted into the database in sorted order and efficiently.
The batch is flushed as often as possible, when we are sure that every (word1, prefix, proximity) key derived from its content The batch is flushed as often as possible, when we are sure that every (proximity, word1, prefix) key derived from its content
can be inserted into the database in sorted order. When it is flushed, it calls a user-provided closure with the following arguments: can be inserted into the database in sorted order. When it is flushed, it calls a user-provided closure with the following arguments:
- key : (word1, prefix, proximity) as bytes - key : (proximity, word1, prefix) as bytes
- value : merged roaring bitmaps from all values associated with (prefix, proximity) in the batch, serialised to bytes - value : merged roaring bitmaps from all values associated with prefix in the batch, serialised to bytes
*/ */
#[derive(Default)] #[derive(Default)]
struct PrefixAndProximityBatch { struct PrefixAndProximityBatch {
proximity: u8,
word1: Vec<u8>, word1: Vec<u8>,
batch: Vec<(Vec<u8>, Vec<Cow<'static, [u8]>>)>, batch: Vec<(Vec<u8>, Vec<Cow<'static, [u8]>>)>,
} }
impl PrefixAndProximityBatch { impl PrefixAndProximityBatch {
/// Insert the new key and value into the batch /// Insert the new key and value into the batch
///
/// The key must either exist in the batch or be greater than all existing keys
fn insert(&mut self, new_key: &[u8], new_value: Vec<u8>) { fn insert(&mut self, new_key: &[u8], new_value: Vec<u8>) {
match self.batch.binary_search_by_key(&new_key, |(k, _)| k.as_slice()) { match self.batch.iter_mut().find(|el| el.0 == new_key) {
Ok(position) => { Some((_prefix, docids)) => docids.push(Cow::Owned(new_value)),
self.batch[position].1.push(Cow::Owned(new_value)); None => self.batch.push((new_key.to_vec(), vec![Cow::Owned(new_value)])),
}
Err(position) => {
self.batch.insert(position, (new_key.to_vec(), vec![Cow::Owned(new_value)]));
}
} }
} }
/// Empties the batch, calling `insert` on each element. /// Empties the batch, calling `insert` on each element.
/// ///
/// The key given to `insert` is `(word1, prefix, proximity)` and the value is the associated merged roaring bitmap. /// The key given to `insert` is `(proximity, word1, prefix)` and the value is the associated merged roaring bitmap.
fn flush( fn flush(
&mut self, &mut self,
merge_buffer: &mut Vec<u8>, merge_buffer: &mut Vec<u8>,
insert: &mut impl for<'buffer> FnMut(&'buffer [u8], &'buffer [u8]) -> Result<()>, insert: &mut impl for<'buffer> FnMut(&'buffer [u8], &'buffer [u8]) -> Result<()>,
) -> Result<()> { ) -> Result<()> {
let PrefixAndProximityBatch { word1, batch } = self; let PrefixAndProximityBatch { proximity, word1, batch } = self;
if batch.is_empty() { if batch.is_empty() {
return Ok(()); return Ok(());
} }
merge_buffer.clear(); merge_buffer.clear();
let mut buffer = Vec::with_capacity(word1.len() + 1 + 6 + 1); let mut buffer = Vec::with_capacity(word1.len() + 1 + 6);
buffer.push(*proximity);
buffer.extend_from_slice(word1); buffer.extend_from_slice(word1);
buffer.push(0); buffer.push(0);
for (key, mergeable_data) in batch.drain(..) { for (key, mergeable_data) in batch.drain(..) {
buffer.truncate(word1.len() + 1); buffer.truncate(1 + word1.len() + 1);
buffer.extend_from_slice(key.as_slice()); buffer.extend_from_slice(key.as_slice());
let data = if mergeable_data.len() > 1 { let data = if mergeable_data.len() > 1 {
@ -520,61 +440,6 @@ impl PrefixAndProximityBatch {
} }
} }
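
A toy sketch of how values accumulate per prefix inside one batch (plain `Vec`s stand in for the real key and bitmap types): inserting the same prefix again simply queues another serialized bitmap to be merged when the batch is flushed, while a new prefix adds a new entry.

```rust
// `batch` maps each prefix seen so far (for the current proximity and word1)
// to the list of serialized bitmaps that will be merged at flush time.
fn insert(batch: &mut Vec<(Vec<u8>, Vec<Vec<u8>>)>, prefix: &[u8], value: Vec<u8>) {
    match batch.iter_mut().find(|(key, _)| key.as_slice() == prefix) {
        Some((_, values)) => values.push(value),
        None => batch.push((prefix.to_vec(), vec![value])),
    }
}

fn main() {
    let mut batch = Vec::new();
    insert(&mut batch, b"do", b"docids1".to_vec());
    insert(&mut batch, b"do", b"docids2".to_vec()); // same prefix: queued for merging
    insert(&mut batch, b"dog", b"docids1".to_vec());
    assert_eq!(batch.len(), 2);
    assert_eq!(batch[0].1.len(), 2);
}
```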
// This is adapted from `sorter_into_lmdb_database`
fn insert_into_database(
wtxn: &mut heed::RwTxn,
database: heed::PolyDatabase,
new_key: &[u8],
new_value: &[u8],
) -> Result<()> {
let mut iter = database.prefix_iter_mut::<_, ByteSlice, ByteSlice>(wtxn, new_key)?;
match iter.next().transpose()? {
Some((key, old_val)) if new_key == key => {
let val =
merge_cbo_roaring_bitmaps(key, &[Cow::Borrowed(old_val), Cow::Borrowed(new_value)])
.map_err(|_| {
// TODO just wrap this error?
crate::error::InternalError::IndexingMergingKeys {
process: "get-put-merge",
}
})?;
// safety: we use the new_key, not the one from the database iterator, to avoid undefined behaviour
unsafe { iter.put_current(new_key, &val)? };
}
_ => {
drop(iter);
database.put::<_, ByteSlice, ByteSlice>(wtxn, new_key, new_value)?;
}
}
Ok(())
}
// This is adapted from `sorter_into_lmdb_database` and `write_into_lmdb_database`,
// but it uses `append` if the database is empty, and it assumes that the values in the
// writer don't conflict with values in the database.
pub fn write_into_lmdb_database_without_merging(
wtxn: &mut heed::RwTxn,
database: heed::PolyDatabase,
writer: grenad::Writer<std::fs::File>,
) -> Result<()> {
let file = writer.into_inner()?;
let reader = grenad::Reader::new(BufReader::new(file))?;
if database.is_empty(wtxn)? {
let mut out_iter = database.iter_mut::<_, ByteSlice, ByteSlice>(wtxn)?;
let mut cursor = reader.into_cursor()?;
while let Some((k, v)) = cursor.move_on_next()? {
// safety: the key comes from the grenad reader, not the database
unsafe { out_iter.append(k, v)? };
}
} else {
let mut cursor = reader.into_cursor()?;
while let Some((k, v)) = cursor.move_on_next()? {
database.put::<_, ByteSlice, ByteSlice>(wtxn, k, v)?;
}
}
Ok(())
}
/** A prefix trie. Used to iterate quickly over the prefixes of a word that are /** A prefix trie. Used to iterate quickly over the prefixes of a word that are
within a set. within a set.
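
A simplified, std-only stand-in for what the trie is used for, namely collecting the prefixes of a word that belong to the prefix set. The real code walks the trie instead of probing a set and works on bytes; ASCII input is assumed here for simplicity:

```rust
use std::collections::BTreeSet;

// Return the prefixes of `word` that are in `prefix_set`, shortest first.
fn prefixes_of<'a>(prefix_set: &'a BTreeSet<String>, word: &str) -> Vec<&'a str> {
    (1..=word.len())
        .filter_map(|len| prefix_set.get(&word[..len]).map(String::as_str))
        .collect()
}

fn main() {
    let prefix_set: BTreeSet<String> =
        ["d", "do", "dog", "ho"].into_iter().map(String::from).collect();
    assert_eq!(prefixes_of(&prefix_set, "doggo"), vec!["d", "do", "dog"]);
}
```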
@ -619,7 +484,7 @@ impl PrefixTrieNode {
fn set_search_start(&self, word: &[u8], search_start: &mut PrefixTrieNodeSearchStart) -> bool { fn set_search_start(&self, word: &[u8], search_start: &mut PrefixTrieNodeSearchStart) -> bool {
let byte = word[0]; let byte = word[0];
if self.children[search_start.0].1 == byte { if self.children[search_start.0].1 == byte {
true return true;
} else { } else {
match self.children[search_start.0..].binary_search_by_key(&byte, |x| x.1) { match self.children[search_start.0..].binary_search_by_key(&byte, |x| x.1) {
Ok(position) => { Ok(position) => {
@ -637,7 +502,7 @@ impl PrefixTrieNode {
fn from_sorted_prefixes<'a>(prefixes: impl Iterator<Item = &'a str>) -> Self { fn from_sorted_prefixes<'a>(prefixes: impl Iterator<Item = &'a str>) -> Self {
let mut node = PrefixTrieNode::default(); let mut node = PrefixTrieNode::default();
for prefix in prefixes { for prefix in prefixes {
node.insert_sorted_prefix(prefix.as_bytes().iter()); node.insert_sorted_prefix(prefix.as_bytes().into_iter());
} }
node node
} }
@ -701,90 +566,10 @@ impl PrefixTrieNode {
} }
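As a rough, self-contained illustration of what a prefix trie is used for here (not milli's actual `PrefixTrieNode` implementation), the sketch below builds a tiny byte-trie from sorted prefixes and reports every prefix of a word that belongs to the set:

```
use std::collections::BTreeMap;

/// Simplified stand-in for a prefix trie (illustrative only).
#[derive(Default)]
struct Trie {
    children: BTreeMap<u8, Trie>,
    is_prefix: bool,
}

impl Trie {
    fn from_sorted_prefixes<'a>(prefixes: impl Iterator<Item = &'a str>) -> Self {
        let mut root = Trie::default();
        for prefix in prefixes {
            let mut node = &mut root;
            for &byte in prefix.as_bytes() {
                node = node.children.entry(byte).or_default();
            }
            node.is_prefix = true;
        }
        root
    }

    /// Calls `f` for every prefix of `word` that is in the set.
    fn for_each_prefix_of(&self, word: &str, mut f: impl FnMut(&str)) {
        let mut node = self;
        for (i, &byte) in word.as_bytes().iter().enumerate() {
            match node.children.get(&byte) {
                Some(child) => {
                    node = child;
                    if node.is_prefix {
                        f(&word[..=i]);
                    }
                }
                None => break,
            }
        }
    }
}

fn main() {
    let trie = Trie::from_sorted_prefixes(["arb", "cat"].into_iter());
    let mut found = Vec::new();
    trie.for_each_prefix_of("cats", |p| found.push(p.to_owned()));
    assert_eq!(found, ["cat"]);
}
```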
#[cfg(test)]
mod tests {
use std::io::Cursor;
use roaring::RoaringBitmap;
use super::*;
use crate::documents::{DocumentsBatchBuilder, DocumentsBatchReader};
use crate::{CboRoaringBitmapCodec, U8StrStrCodec};
use crate::index::tests::TempIndex;
use crate::{db_snap, CboRoaringBitmapCodec, StrStrU8Codec};
fn documents_with_enough_different_words_for_prefixes(prefixes: &[&str]) -> Vec<crate::Object> {
let mut documents = Vec::new();
for prefix in prefixes {
for i in 0..50 {
documents.push(
serde_json::json!({
"text": format!("{prefix}{i:x}"),
})
.as_object()
.unwrap()
.clone(),
)
}
}
documents
}
#[test]
fn test_update() {
let mut index = TempIndex::new();
index.index_documents_config.words_prefix_threshold = Some(50);
index.index_documents_config.autogenerate_docids = true;
index
.update_settings(|settings| {
settings.set_searchable_fields(vec!["text".to_owned()]);
})
.unwrap();
let batch_reader_from_documents = |documents| {
let mut builder = DocumentsBatchBuilder::new(Vec::new());
for object in documents {
builder.append_json_object(&object).unwrap();
}
DocumentsBatchReader::from_reader(Cursor::new(builder.into_inner().unwrap())).unwrap()
};
let mut documents = documents_with_enough_different_words_for_prefixes(&["a", "be"]);
// now we add some documents where the text should populate the word_prefix_pair_proximity_docids database
documents.push(
serde_json::json!({
"text": "At an amazing and beautiful house"
})
.as_object()
.unwrap()
.clone(),
);
documents.push(
serde_json::json!({
"text": "The bell rings at 5 am"
})
.as_object()
.unwrap()
.clone(),
);
let documents = batch_reader_from_documents(documents);
index.add_documents(documents).unwrap();
db_snap!(index, word_prefix_pair_proximity_docids, "initial");
let mut documents = documents_with_enough_different_words_for_prefixes(&["am", "an"]);
documents.push(
serde_json::json!({
"text": "At an extraordinary house"
})
.as_object()
.unwrap()
.clone(),
);
let documents = batch_reader_from_documents(documents);
index.add_documents(documents).unwrap();
db_snap!(index, word_prefix_pair_proximity_docids, "update");
}
fn check_prefixes(
trie: &PrefixTrieNode,
@@ -883,58 +668,40 @@ mod tests {
CboRoaringBitmapCodec::serialize_into(&bitmap_ranges, &mut serialised_bitmap_ranges);
let word_pairs = [
((1, "healthy", "arbres"), &serialised_bitmap123),
((1, "healthy", "boat"), &serialised_bitmap123),
((1, "healthy", "ca"), &serialised_bitmap123),
((1, "healthy", "cats"), &serialised_bitmap456),
((1, "healthy", "cattos"), &serialised_bitmap123),
((1, "jittery", "cat"), &serialised_bitmap123),
((1, "jittery", "cata"), &serialised_bitmap456),
((1, "jittery", "catb"), &serialised_bitmap789),
((1, "jittery", "catc"), &serialised_bitmap_ranges),
((2, "healthy", "arbre"), &serialised_bitmap123),
((2, "healthy", "arbres"), &serialised_bitmap456),
((2, "healthy", "cats"), &serialised_bitmap789),
((2, "healthy", "cattos"), &serialised_bitmap_ranges),
((3, "healthy", "arbre"), &serialised_bitmap456),
((3, "healthy", "arbres"), &serialised_bitmap789),
];
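// With the max proximity of 2 passed to `execute_on_word_pairs_and_prefixes` below,
// the (3, ..) pairs above should not contribute to any key, and pairs whose second
// word starts with a known prefix are merged per (proximity, word1, prefix) key:
// e.g. "cats" and "cattos" both feed (1, "healthy", "cat"), while only "cattos"
// feeds (1, "healthy", "catto").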
let expected_result = [
((1, "healthy", "arb"), bitmap123.clone()),
((1, "healthy", "arbre"), bitmap123.clone()),
((1, "healthy", "cat"), &bitmap456 | &bitmap123),
((1, "healthy", "catto"), bitmap123.clone()),
((1, "jittery", "cat"), (&bitmap123 | &bitmap456 | &bitmap789 | &bitmap_ranges)),
((2, "healthy", "arb"), &bitmap123 | &bitmap456),
((2, "healthy", "arbre"), &bitmap123 | &bitmap456),
((2, "healthy", "cat"), &bitmap789 | &bitmap_ranges),
((2, "healthy", "catto"), bitmap_ranges.clone()),
];
let mut result = vec![];
let mut iter =
IntoIterator::into_iter(word_pairs).map(|((proximity, word1, word2), data)| {
((proximity, word1.as_bytes(), word2.as_bytes()), data.as_slice())
});
execute_on_word_pairs_and_prefixes(
&mut iter,
@@ -942,17 +709,17 @@ mod tests {
&prefixes,
2,
|k, v| {
let (proximity, word1, prefix) = U8StrStrCodec::bytes_decode(k).unwrap();
let bitmap = CboRoaringBitmapCodec::bytes_decode(v).unwrap();
result.push(((proximity.to_owned(), word1.to_owned(), prefix.to_owned()), bitmap));
Ok(())
},
)
.unwrap();
for (x, y) in result.into_iter().zip(IntoIterator::into_iter(expected_result)) {
let ((actual_proximity, actual_word1, actual_prefix), actual_bitmap) = x;
let ((expected_proximity, expected_word1, expected_prefix), expected_bitmap) = y;
assert_eq!(actual_word1, expected_word1);
assert_eq!(actual_prefix, expected_prefix);

@@ -1,46 +0,0 @@
---
source: milli/src/update/word_prefix_pair_proximity_docids.rs
---
5 a 1 [101, ]
5 a 2 [101, ]
5 b 4 [101, ]
5 be 4 [101, ]
am a 3 [101, ]
amazing a 1 [100, ]
amazing a 2 [100, ]
amazing a 3 [100, ]
amazing b 2 [100, ]
amazing be 2 [100, ]
an a 1 [100, ]
an a 2 [100, ]
an b 3 [100, ]
an be 3 [100, ]
and a 2 [100, ]
and a 3 [100, ]
and a 4 [100, ]
and b 1 [100, ]
and be 1 [100, ]
at a 1 [100, ]
at a 2 [100, 101, ]
at a 3 [100, ]
at b 3 [101, ]
at b 4 [100, ]
at be 3 [101, ]
at be 4 [100, ]
beautiful a 2 [100, ]
beautiful a 3 [100, ]
beautiful a 4 [100, ]
bell a 2 [101, ]
bell a 4 [101, ]
house a 3 [100, ]
house a 4 [100, ]
house b 2 [100, ]
house be 2 [100, ]
rings a 1 [101, ]
rings a 3 [101, ]
rings b 2 [101, ]
rings be 2 [101, ]
the a 3 [101, ]
the b 1 [101, ]
the be 1 [101, ]

@@ -1,4 +0,0 @@
---
source: milli/src/update/word_prefix_pair_proximity_docids.rs
---
5ed4bf83317b10962a55ade353427bdd