Add newlines in documentation of word_prefix_pair_proximity_docids

2024-11-27 04:25:06 +08:00 · 2022-07-19 07:03:30 +02:00 · 2022-07-19 07:03:30 +02:00 · 34c991ea02
commit 34c991ea02
parent 06f3fd8c6d
1 changed files with 71 additions and 24 deletions
--- a/milli/src/update/word_prefix_pair_proximity_docids.rs
+++ b/milli/src/update/word_prefix_pair_proximity_docids.rs
@@ -1,8 +1,12 @@
 /*!
 ## What is WordPrefixPairProximityDocids?
-The word-prefix-pair-proximity-docids database is a database whose keys are of the form (`word`, `prefix`, `proximity`) and the values are roaring bitmaps of the documents which contain `word` followed by another word starting with `prefix` at a distance of `proximity`.
+The word-prefix-pair-proximity-docids database is a database whose keys are of
+the form (`word`, `prefix`, `proximity`) and the values are roaring bitmaps of
+the documents which contain `word` followed by another word starting with
+`prefix` at a distance of `proximity`.

-The prefixes present in this database are only those that correspond to many different words in the documents.
+The prefixes present in this database are only those that correspond to many
+different words in the documents.

 ## How is it created/updated? (simplified version)
 To compute it, we have access to (mainly) two inputs:
@@ -16,9 +20,11 @@ d
 do
 dog
 ```
-Note that only prefixes which correspond to more than a certain number of different words from the database are included in this list.
+Note that only prefixes which correspond to more than a certain number of
+different words from the database are included in this list.

-* a sorted list of word pairs and the distance between them (i.e. proximity), associated with a roaring bitmap, such as:
+* a sorted list of word pairs and the distance between them (i.e. proximity),
+associated with a roaring bitmap, such as:
 ```
 good dog   3         -> docids1: [2, 5, 6]
 good doggo 1         -> docids2: [8]
@@ -27,7 +33,8 @@ good ghost 2         -> docids4: [1]
 horror cathedral 4   -> docids5: [1, 2]
 ```

-I illustrate a simplified version of the algorithm to create the word-prefix-pair-proximity database below:
+I illustrate a simplified version of the algorithm to create the
+word-prefix-pair-proximity database below:

 1. **Outer loop:** First, we iterate over each word pair and its proximity:
 ```
@@ -35,7 +42,10 @@ word1    : good
 word2    : dog
 proximity: 3
 ```
-2. **Inner loop:** Then, we iterate over all the prefixes of `word2` that are in the list of sorted prefixes. And we insert the key (`prefix`, `proximity`) and the value (`docids`) to a sorted map which we call the “batch”. For example, at the end of the first inner loop, we may have:
+2. **Inner loop:** Then, we iterate over all the prefixes of `word2` that are
+in the list of sorted prefixes, and we insert the key (`prefix`, `proximity`)
+and the value (`docids`) into a sorted map which we call the “batch”. For
+example, at the end of the first inner loop, we may have:
 ```
 Outer loop 1:
 ------------------------------
@@ -72,7 +82,9 @@ batch: [
    (dog, 3) -> [docids1]
 ]
 ```
-Notice that the batch had to re-order some (`prefix`, `proximity`) keys: some of the elements inserted in the second iteration of the outer loop appear *before* elements from the first iteration.
+Notice that the batch had to re-order some (`prefix`, `proximity`) keys: some
+of the elements inserted in the second iteration of the outer loop appear
+*before* elements from the first iteration.

 4. And a third:
 ```
@@ -94,7 +106,8 @@ batch: [
    (dog, 3) -> [docids1]
 ]
 ```
-Notice that there were some conflicts which were resolved by merging the conflicting values together.
+Notice that there were some conflicts which were resolved by merging the
+conflicting values together.

 5. On the fourth iteration of the outer loop, we have:
 ```
@@ -104,12 +117,20 @@ word1    : good
 word2    : ghost
 proximity: 2
 ```
-Because `word2` begins with a different letter than the previous `word2`, we know that:
-1. All the prefixes of `word2` are greater than the prefixes of the previous word2
-2. And therefore, every instance of (`word2`, `prefix`) will be greater than any element in the batch.
-Therefore, we know that we can insert every element from the batch into the database before proceeding any further. This operation is called “flushing the batch”. Flushing the batch should also be done whenever `word1` is different than the previous `word1`.
+Because `word2` begins with a different letter than the previous `word2`,
+we know that:

-6. **Flushing the batch:** to flush the batch, we look at the `word1` and iterate over the elements of the batch in sorted order:
+1. All the prefixes of `word2` are greater than the prefixes of the previous `word2`
+2. And therefore, every instance of (`word2`, `prefix`) will be greater than
+any element in the batch.
+
+Therefore, we know that we can insert every element from the batch into the
+database before proceeding any further. This operation is called
+“flushing the batch”. Flushing the batch should also be done whenever `word1`
+differs from the previous `word1`.
+
+6. **Flushing the batch:** to flush the batch, we look at the `word1` and
+iterate over the elements of the batch in sorted order:
 ```
 Flushing Batch loop 1:
 ------------------------------
@@ -118,29 +139,55 @@ word2    : d
 proximity: 1
 docids   : [docids2, docids3]
 ```
-We then merge the array of `docids` (of type `Vec<Vec<u8>>`) using `merge_cbo_roaring_bitmap` in order to get a single byte vector representing a roaring bitmap of all the document ids where `word1` is followed by `prefix` at a distance of `proximity`.
-Once we have done that, we insert (`word1`, `prefix`, `proximity`) -> `merged_docids` into the database.
+We then merge the array of `docids` (of type `Vec<Vec<u8>>`) using
+`merge_cbo_roaring_bitmap` in order to get a single byte vector representing a
+roaring bitmap of all the document ids where `word1` is followed by `prefix`
+at a distance of `proximity`.
+Once we have done that, we insert (`word1`, `prefix`, `proximity`) -> `merged_docids`
+into the database.

 7. That's it! ... except...

 ## How is it created/updated (continued)

-I lied a little bit about the input data. In reality, we get two sets of the inputs described above, which come from different places:
+I lied a little bit about the input data. In reality, we get two sets of the
+inputs described above, which come from different places:

 * For the list of sorted prefixes, we have:
-    * `new_prefixes`, which are all the prefixes that were not present in the database before the insertion of the new documents
-    * `common_prefixes` which are the prefixes that are present both in the database and in the newly added documents
+    1. `new_prefixes`, which are all the prefixes that were not present in the
+    database before the insertion of the new documents
+
+    2. `common_prefixes` which are the prefixes that are present both in the
+    database and in the newly added documents

 * For the list of word pairs and proximities, we have:
-    * `new_word_pairs`, which is the list of word pairs and their proximities present in the newly added documents
-    * `word_pairs_db`, which is the list of word pairs from the database. **This list includes all elements in `new_word_pairs`** since `new_word_pairs` was added to the database prior to calling the `WordPrefixPairProximityDocIds::execute` function.
+    1. `new_word_pairs`, which is the list of word pairs and their proximities
+    present in the newly added documents

-To update the prefix database correctly, we call the algorithm described earlier first on (`common_prefixes`, `new_word_pairs`) and then on (`new_prefixes`, `word_pairs_db`). Thus:
+    2. `word_pairs_db`, which is the list of word pairs from the database.
+    **This list includes all elements in `new_word_pairs`** since `new_word_pairs`
+    was added to the database prior to calling the `WordPrefixPairProximityDocIds::execute`
+    function.

-1. For all the word pairs that were already present in the DB, we insert them again with the `new_prefixes`. Calling the algorithm on them with the `common_prefixes` would not result in any new data.
-3. For all the new word pairs, we insert them twice: first with the `common_prefixes`, and then, because they are part of `word_pairs_db`, with the `new_prefixes`.
+To update the prefix database correctly, we call the algorithm described earlier first
+on (`common_prefixes`, `new_word_pairs`) and then on (`new_prefixes`, `word_pairs_db`).
+Thus:

-Note, also, that since we read data from the database when iterating over `word_pairs_db`, we cannot insert the computed word-prefix-pair-proximity-docids from the batch directly into the database (we would have a concurrent reader and writer). Therefore, when calling the algorithm on (`new_prefixes`, `word_pairs_db`), we insert the computed ((`word`, `prefix`, `proximity`), `docids`) elements in an intermediary grenad Writer instead of the DB. At the end of the outer loop, we finally read from the grenad and insert its elements in the database.
+1. For all the word pairs that were already present in the DB, we insert them
+again with the `new_prefixes`. Calling the algorithm on them with the
+`common_prefixes` would not result in any new data.
+
+2. For all the new word pairs, we insert them twice: first with the `common_prefixes`,
+and then, because they are part of `word_pairs_db`, with the `new_prefixes`.
+
+Note, also, that since we read data from the database when iterating over
+`word_pairs_db`, we cannot insert the computed
+word-prefix-pair-proximity-docids from the batch directly into the database
+(we would have a concurrent reader and writer). Therefore, when calling the
+algorithm on (`new_prefixes`, `word_pairs_db`), we insert the computed
+((`word`, `prefix`, `proximity`), `docids`) elements in an intermediary grenad
+Writer instead of the DB. At the end of the outer loop, we finally read from
+the grenad and insert its elements in the database.
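
The simplified algorithm described in this documentation (outer loop, inner loop, batch, flush) can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not the code in `word_prefix_pair_proximity_docids.rs`: the names `DocIds`, `Batch`, `Output`, `flush` and `word_prefix_pairs` are hypothetical, a `BTreeSet<u32>` stands in for a serialized roaring bitmap, a set union stands in for `merge_cbo_roaring_bitmap`, and a `Vec` stands in for the database.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Assumption: a BTreeSet<u32> stands in for a roaring bitmap of document ids.
type DocIds = BTreeSet<u32>;
// The "batch": a sorted map from (prefix, proximity) to document ids.
type Batch = BTreeMap<(String, u8), DocIds>;
// Assumption: a Vec of ((word1, prefix, proximity), docids) stands in for the database.
type Output = Vec<((String, String, u8), DocIds)>;

/// Flush the batch: emit (word1, prefix, proximity) -> docids in sorted key
/// order, leaving the batch empty.
fn flush(word1: &str, batch: &mut Batch, out: &mut Output) {
    for ((prefix, proximity), docids) in std::mem::take(batch) {
        out.push(((word1.to_string(), prefix, proximity), docids));
    }
}

/// `word_pairs` must be sorted by (word1, word2, proximity).
fn word_prefix_pairs(
    word_pairs: &[((&str, &str, u8), DocIds)],
    prefixes: &[&str],
) -> Output {
    let mut out = Output::new();
    let mut batch = Batch::new();
    let mut prev: Option<(&str, &str)> = None;

    // Outer loop: each word pair and its proximity.
    for ((word1, word2, proximity), docids) in word_pairs {
        if let Some((prev_w1, prev_w2)) = prev {
            // Flush when word1 changes, or when word2 starts with a
            // different letter than the previous word2.
            let new_letter = prev_w2.chars().next() != word2.chars().next();
            if prev_w1 != *word1 || new_letter {
                flush(prev_w1, &mut batch, &mut out);
            }
        }
        // Inner loop: every listed prefix that word2 starts with.
        for prefix in prefixes.iter().filter(|p| word2.starts_with(**p)) {
            // Conflicting keys are resolved by merging their docids.
            batch
                .entry((prefix.to_string(), *proximity))
                .or_default()
                .extend(docids.iter().copied());
        }
        prev = Some((*word1, *word2));
    }
    // Flush whatever is left after the last word pair.
    if let Some((word1, _)) = prev {
        flush(word1, &mut batch, &mut out);
    }
    out
}
```

Calling `word_prefix_pairs` with the pairs `good dog 3`, `good doggo 1`, `good ghost 2` and the prefixes `d`, `do`, `dog` produces the `good` entries in sorted (`prefix`, `proximity`) order, with the proximity-1 keys from the second pair placed before the proximity-3 keys from the first, as described in step 3.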