590: Optimise facets indexing r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Fixes #589 

## Notes
I added documentation for the whole module which attempts to explain the shape of the databases and their purpose. However, I realise there is already some documentation about this, so I am not sure if we want to keep it.

## Benchmarks

We get a ~1.15x speed up on the geo_point benchmark.

```
group                                                                     indexing_main_57042355                  indexing_optimise-facets-indexation_5728619a
-----                                                                     ----------------------                  --------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1862.7±294.45µs        ? ?/sec    1.58      2.9±1.32ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.11      8.9±2.44ms        ? ?/sec     1.00      8.0±1.42ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     12.8±3.32ms        ? ?/sec     1.32     16.9±6.98ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.09     43.8±4.78ms        ? ?/sec     1.00     40.3±3.79ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.08   287.4±28.72ms        ? ?/sec     1.00    264.9±9.46ms        ? ?/sec
indexing/Indexing geo_point                                               1.14      61.2±0.39s        ? ?/sec     1.00      53.8±0.57s        ? ?/sec
indexing/Indexing movies in three batches                                 1.00      16.6±0.12s        ? ?/sec     1.00      16.5±0.10s        ? ?/sec
indexing/Indexing movies with default settings                            1.00      14.1±0.30s        ? ?/sec     1.00      14.0±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.9±0.50s        ? ?/sec     1.00      10.0±0.10s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.01       9.6±0.23s        ? ?/sec     1.00       9.5±0.06s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.07      66.3±0.55s        ? ?/sec     1.00      61.8±0.63s        ? ?/sec
indexing/Indexing songs with default settings                             1.03      58.8±0.82s        ? ?/sec     1.00      57.1±1.22s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      53.6±1.09s        ? ?/sec     1.01      54.0±0.58s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.02      58.0±1.29s        ? ?/sec     1.00      57.1±1.43s        ? ?/sec
indexing/Indexing wiki                                                    1.00   1064.1±21.20s        ? ?/sec     1.00   1068.0±20.49s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00    1182.5±9.62s        ? ?/sec     1.01   1191.2±10.96s        ? ?/sec
indexing/Reindexing geo_point                                             1.12      68.0±0.21s        ? ?/sec     1.00      60.5±0.82s        ? ?/sec
indexing/Reindexing movies with default settings                          1.01      14.1±0.21s        ? ?/sec     1.00      14.0±0.26s        ? ?/sec
indexing/Reindexing songs with default settings                           1.04      61.6±0.57s        ? ?/sec     1.00      59.2±0.87s        ? ?/sec
indexing/Reindexing wiki                                                  1.00   1734.0±11.38s        ? ?/sec     1.01   1746.6±22.48s        ? ?/sec
```


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
This commit is contained in:
bors[bot] 2022-08-17 11:46:55 +00:00 committed by GitHub
commit 39869be23b
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -1,10 +1,136 @@
/*!
This module initialises the databases that are used to quickly get the list
of documents with a faceted field value falling within a certain range. For
example, they can be used to implement filters such as `x >= 3`.
These databases are `facet_id_string_docids` and `facet_id_f64_docids`.
## Example with numbers
In the case of numbers, we start with a sorted list whose keys are
`(field_id, number_value)` and whose value is a roaring bitmap of the document ids
which contain the value `number_value` for the faceted field `field_id`.
From this list, we want to compute two things:
1. the bitmap of all documents that contain **any** number for each faceted field
2. a structure that allows us to use a (sort of) binary search to find all documents
containing numbers inside a certain range for a faceted field
To achieve goal (2), we recursively split the list into chunks. Every time we split it, we
create a new "level" that is several times smaller than the level below it. The base level,
level 0, is the starting list. Level 1 is composed of chunks of up to N elements. Each element
contains a range and a bitmap of docids. Level 2 is composed of chunks up to N^2 elements, etc.
For example, let's say we have 26 documents which we identify through the letters a-z.
We will focus on a single faceted field. When there are multiple faceted fields, the structure
described below is simply repeated for each field.
What we want to obtain is the following structure for each faceted field:
```text
all [a, b, c, d, e, f, g, u, y, z]
1.2 2 3.4 100 102 104
Level 2
[a, b, d, f, z] [c, d, e, f, g] [u, y]
1.2 1.3 1.6 2 3.4 12 12.3 100 102 104
Level 1
[a, b, d, z] [a, b, f] [c, d, g] [e, f] [u, y]
1.2 1.3 1.6 2 3.4 12 12.3 100 102 104
Level 0
[a, b] [d, z] [b, f] [a, f] [c, d] [g] [e] [e, f] [y] [u]
```
You can read more about this structure (for strings) in `[crate::search::facet::facet_strings]`.
To create the levels, we use a recursive algorithm which makes sure that we only need to iterate
over the elements of level 0 once. It is implemented by [`recursive_compute_levels`].
## Encoding
### Numbers
For numbers we use the same encoding for level 0 and the other levels.
The key is given by `FacetLevelValueF64Codec`. It consists of:
1. The field id : u16
2. The height of the level : u8
3. The start bound : f64
4. The end bound : f64
Note that at level 0, we have start bound == end bound.
The value is a serialised `RoaringBitmap`.
### Strings
For strings, we use a different encoding for level 0 and the other levels.
At level 0, the key is given by `FacetStringLevelZeroCodec`. It consists of:
1. The field id : u16
2. The height of the level : u8 <-- always == 0
3. The normalised string value : &str
And the value is given by `FacetStringLevelZeroValueCodec`. It consists of:
1. The original string
2. A serialised `RoaringBitmap`
At level 1, the key is given by `FacetLevelValueU32Codec`. It consists of:
1. The field id : u16
2. The height of the level : u8 <-- always >= 1
3. The start bound : u32
4. The end bound : u32
where the bounds are indices inside level 0.
The value is given by `FacetStringZeroBoundsValueCodec<CboRoaringBitmapCodec>`.
If the level is 1, then it consists of:
1. The normalised string of the start bound
2. The normalised string of the end bound
3. A serialised `RoaringBitmap`
If the level is higher, then it consists only of the serialised roaring bitmap.
The distinction between the value encoding of level 1 and the levels above it
is to allow us to retrieve the value in level 0 quickly by reading the key of
level 1 (we obtain the string value of the bound and execute a prefix search
in the database).
Therefore, for strings, the structure for a single faceted field looks more like this:
```text
all [a, b, c, d, e, f, g, u, y, z]
0 3 4 7 8 9
Level 2
[a, b, d, f, z] [c, d, e, f, g] [u, y]
0 1 2 3 4 5 6 7 8 9
Level 1 "ab" "ac" "ba" "bac" "gaf" "gal" "form" "wow" "woz" "zz"
[a, b, d, z] [a, b, f] [c, d, g] [e, f] [u, y]
"ab" "ac" "ba" "bac" "gaf" "gal" "form" "wow" "woz" "zz"
Level 0 "AB" " Ac" "ba " "Bac" " GAF" "gal" "Form" " wow" "woz" "ZZ"
[a, b] [d, z] [b, f] [a, f] [c, d] [g] [e] [e, f] [y] [u]
The first line in a cell is its key (without the field id and level height) and the last two
lines are its values.
```
*/
use std::cmp;
use std::fs::File;
use std::num::{NonZeroU8, NonZeroUsize};
use std::{cmp, mem};
use std::ops::RangeFrom;
use grenad::{CompressionType, Reader, Writer};
use heed::types::{ByteSlice, DecodeIgnore};
use heed::{BytesEncode, Error};
use heed::{BytesDecode, BytesEncode, Error};
use log::debug;
use roaring::RoaringBitmap;
use time::OffsetDateTime;
@ -39,11 +165,15 @@ impl<'t, 'u, 'i> Facets<'t, 'u, 'i> {
}
}
/// The number of elements from the level below that are represented by a single element in the level above
///
/// This setting is always greater than or equal to 2.
pub fn level_group_size(&mut self, value: NonZeroUsize) -> &mut Self {
self.level_group_size = NonZeroUsize::new(cmp::max(value.get(), 2)).unwrap();
self
}
/// The minimum number of elements that a level is allowed to have.
pub fn min_level_size(&mut self, value: NonZeroUsize) -> &mut Self {
self.min_level_size = value;
self
@ -65,14 +195,7 @@ impl<'t, 'u, 'i> Facets<'t, 'u, 'i> {
field_id,
)?;
// Compute and store the faceted strings documents ids.
let string_documents_ids = compute_faceted_strings_documents_ids(
self.wtxn,
self.index.facet_id_string_docids.remap_key_type::<ByteSlice>(),
field_id,
)?;
let facet_string_levels = compute_facet_string_levels(
let (facet_string_levels, string_documents_ids) = compute_facet_strings_levels(
self.wtxn,
self.index.facet_id_string_docids,
self.chunk_compression_type,
@ -82,56 +205,367 @@ impl<'t, 'u, 'i> Facets<'t, 'u, 'i> {
field_id,
)?;
// Clear the facet number levels.
clear_field_number_levels(self.wtxn, self.index.facet_id_f64_docids, field_id)?;
// Compute and store the faceted numbers documents ids.
let number_documents_ids = compute_faceted_numbers_documents_ids(
self.wtxn,
self.index.facet_id_f64_docids.remap_key_type::<ByteSlice>(),
field_id,
)?;
let facet_number_levels = compute_facet_number_levels(
self.wtxn,
self.index.facet_id_f64_docids,
self.chunk_compression_type,
self.chunk_compression_level,
self.level_group_size,
self.min_level_size,
field_id,
)?;
self.index.put_string_faceted_documents_ids(
self.wtxn,
field_id,
&string_documents_ids,
)?;
for facet_strings_level in facet_string_levels {
write_into_lmdb_database(
self.wtxn,
*self.index.facet_id_string_docids.as_polymorph(),
facet_strings_level,
|_, _| {
Err(InternalError::IndexingMergingKeys { process: "facet string levels" })?
},
)?;
}
// Clear the facet number levels.
clear_field_number_levels(self.wtxn, self.index.facet_id_f64_docids, field_id)?;
let (facet_number_levels, number_documents_ids) = compute_facet_number_levels(
self.wtxn,
self.index.facet_id_f64_docids,
self.chunk_compression_type,
self.chunk_compression_level,
self.level_group_size,
self.min_level_size,
field_id,
)?;
self.index.put_number_faceted_documents_ids(
self.wtxn,
field_id,
&number_documents_ids,
)?;
write_into_lmdb_database(
self.wtxn,
*self.index.facet_id_f64_docids.as_polymorph(),
facet_number_levels,
|_, _| Err(InternalError::IndexingMergingKeys { process: "facet number levels" })?,
)?;
write_into_lmdb_database(
self.wtxn,
*self.index.facet_id_string_docids.as_polymorph(),
facet_string_levels,
|_, _| Err(InternalError::IndexingMergingKeys { process: "facet string levels" })?,
)?;
for facet_number_level in facet_number_levels {
write_into_lmdb_database(
self.wtxn,
*self.index.facet_id_f64_docids.as_polymorph(),
facet_number_level,
|_, _| {
Err(InternalError::IndexingMergingKeys { process: "facet number levels" })?
},
)?;
}
}
Ok(())
}
}
/// Compute the content of the database levels from its level 0 for the given field id.
///
/// ## Returns:
/// 1. a vector of grenad::Reader. The reader at index `i` corresponds to the elements of level `i + 1`
/// that must be inserted into the database.
/// 2. a roaring bitmap of all the document ids present in the database
fn compute_facet_number_levels<'t>(
rtxn: &'t heed::RoTxn,
db: heed::Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
compression_type: CompressionType,
compression_level: Option<u32>,
level_group_size: NonZeroUsize,
min_level_size: NonZeroUsize,
field_id: FieldId,
) -> Result<(Vec<Reader<File>>, RoaringBitmap)> {
let first_level_size = db
.remap_key_type::<ByteSlice>()
.prefix_iter(rtxn, &field_id.to_be_bytes())?
.remap_types::<DecodeIgnore, DecodeIgnore>()
.fold(Ok(0usize), |count, result| result.and(count).map(|c| c + 1))?;
let level_0_start = (field_id, 0, f64::MIN, f64::MIN);
// Groups sizes are always a power of the original level_group_size and therefore a group
// always maps groups of the previous level and never splits previous levels groups in half.
let group_size_iter = (1u8..)
.map(|l| (l, level_group_size.get().pow(l as u32)))
.take_while(|(_, s)| first_level_size / *s >= min_level_size.get())
.collect::<Vec<_>>();
let mut number_document_ids = RoaringBitmap::new();
if let Some((top_level, _)) = group_size_iter.last() {
let subwriters =
recursive_compute_levels::<FacetLevelValueF64Codec, CboRoaringBitmapCodec, f64>(
rtxn,
db,
compression_type,
compression_level,
*top_level,
level_0_start,
&(level_0_start..),
first_level_size,
level_group_size,
&mut |bitmaps, _, _| {
for bitmap in bitmaps {
number_document_ids |= bitmap;
}
Ok(())
},
&|_i, (_field_id, _level, left, _right)| *left,
&|bitmap| bitmap,
&|writer, level, left, right, docids| {
write_number_entry(writer, field_id, level.get(), left, right, &docids)?;
Ok(())
},
)?;
Ok((subwriters, number_document_ids))
} else {
let mut documents_ids = RoaringBitmap::new();
for result in db.range(rtxn, &(level_0_start..))?.take(first_level_size) {
let (_key, docids) = result?;
documents_ids |= docids;
}
Ok((vec![], documents_ids))
}
}
/// Compute the content of the database levels from its level 0 for the given field id.
///
/// ## Returns:
/// 1. a vector of grenad::Reader. The reader at index `i` corresponds to the elements of level `i + 1`
/// that must be inserted into the database.
/// 2. a roaring bitmap of all the document ids present in the database
fn compute_facet_strings_levels<'t>(
rtxn: &'t heed::RoTxn,
db: heed::Database<FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec>,
compression_type: CompressionType,
compression_level: Option<u32>,
level_group_size: NonZeroUsize,
min_level_size: NonZeroUsize,
field_id: FieldId,
) -> Result<(Vec<Reader<File>>, RoaringBitmap)> {
let first_level_size = db
.remap_key_type::<ByteSlice>()
.prefix_iter(rtxn, &field_id.to_be_bytes())?
.remap_types::<DecodeIgnore, DecodeIgnore>()
.fold(Ok(0usize), |count, result| result.and(count).map(|c| c + 1))?;
let level_0_start = (field_id, "");
// Groups sizes are always a power of the original level_group_size and therefore a group
// always maps groups of the previous level and never splits previous levels groups in half.
let group_size_iter = (1u8..)
.map(|l| (l, level_group_size.get().pow(l as u32)))
.take_while(|(_, s)| first_level_size / *s >= min_level_size.get())
.collect::<Vec<_>>();
let mut strings_document_ids = RoaringBitmap::new();
if let Some((top_level, _)) = group_size_iter.last() {
let subwriters = recursive_compute_levels::<
FacetStringLevelZeroCodec,
FacetStringLevelZeroValueCodec,
(u32, &str),
>(
rtxn,
db,
compression_type,
compression_level,
*top_level,
level_0_start,
&(level_0_start..),
first_level_size,
level_group_size,
&mut |bitmaps, _, _| {
for bitmap in bitmaps {
strings_document_ids |= bitmap;
}
Ok(())
},
&|i, (_field_id, value)| (i as u32, *value),
&|value| value.1,
&|writer, level, start_bound, end_bound, docids| {
write_string_entry(writer, field_id, level, start_bound, end_bound, docids)?;
Ok(())
},
)?;
Ok((subwriters, strings_document_ids))
} else {
let mut documents_ids = RoaringBitmap::new();
for result in db.range(rtxn, &(level_0_start..))?.take(first_level_size) {
let (_key, (_original_value, docids)) = result?;
documents_ids |= docids;
}
Ok((vec![], documents_ids))
}
}
/**
Compute a level from the levels below it, with the elements of level 0 already existing in the given `db`.
This function is generic to work with both numbers and strings. The generic type parameters are:
* `KeyCodec`/`ValueCodec`: the codecs used to read the elements of the database.
* `Bound`: part of the range in the levels structure. For example, for numbers, the `Bound` is `f64`
because each chunk in a level contains a range such as (1.2 ..= 4.5).
## Arguments
* `rtxn` : LMDB read transaction
* `db`: a database which already contains a `level 0`
* `compression_type`/`compression_level`: parameters used to create the `grenad::Writer` that
will contain the new levels
* `level` : the height of the level to create, or `0` to read elements from level 0.
* `level_0_start` : a key in the database that points to the beginning of its level 0
* `level_0_range` : equivalent to `level_0_start..`
* `level_0_size` : the number of elements in level 0
* `level_group_size` : the number of elements from the level below that are represented by a
single element of the new level
* `computed_group_bitmap` : a callback that is called whenever at most `level_group_size` elements
from the level below were read/created. Its arguments are:
0. the list of bitmaps from each read/created element of the level below
1. the start bound corresponding to the first element
2. the end bound corresponding to the last element
* `bound_from_db_key` : finds the `Bound` from a key in the database
* `bitmap_from_db_value` : finds the `RoaringBitmap` from a value in the database
* `write_entry` : writes an element of a level into the writer. The arguments are:
0. the writer
1. the height of the level
2. the start bound
3. the end bound
4. the docids of all elements between the start and end bound
## Return
A vector of grenad::Reader. The reader at index `i` corresponds to the elements of level `i + 1`
that must be inserted into the database.
*/
fn recursive_compute_levels<'t, KeyCodec, ValueCodec, Bound>(
rtxn: &'t heed::RoTxn,
db: heed::Database<KeyCodec, ValueCodec>,
compression_type: CompressionType,
compression_level: Option<u32>,
level: u8,
level_0_start: <KeyCodec as BytesDecode<'t>>::DItem,
level_0_range: &'t RangeFrom<<KeyCodec as BytesDecode<'t>>::DItem>,
level_0_size: usize,
level_group_size: NonZeroUsize,
computed_group_bitmap: &mut dyn FnMut(&[RoaringBitmap], Bound, Bound) -> Result<()>,
bound_from_db_key: &dyn for<'a> Fn(usize, &'a <KeyCodec as BytesDecode<'t>>::DItem) -> Bound,
bitmap_from_db_value: &dyn Fn(<ValueCodec as BytesDecode<'t>>::DItem) -> RoaringBitmap,
write_entry: &dyn Fn(&mut Writer<File>, NonZeroU8, Bound, Bound, RoaringBitmap) -> Result<()>,
) -> Result<Vec<Reader<File>>>
where
KeyCodec: for<'a> BytesEncode<'a>
+ for<'a> BytesDecode<'a, DItem = <KeyCodec as BytesEncode<'a>>::EItem>,
for<'a> <KeyCodec as BytesEncode<'a>>::EItem: Sized,
ValueCodec: for<'a> BytesEncode<'a>
+ for<'a> BytesDecode<'a, DItem = <ValueCodec as BytesEncode<'a>>::EItem>,
for<'a> <ValueCodec as BytesEncode<'a>>::EItem: Sized,
Bound: Copy,
{
if level == 0 {
// base case for the recursion
// we read the elements one by one and
// 1. keep track of the start and end bounds
// 2. fill the `bitmaps` vector to give it to level 1 once `level_group_size` elements were read
let mut bitmaps = vec![];
let mut start_bound = bound_from_db_key(0, &level_0_start);
let mut end_bound = bound_from_db_key(0, &level_0_start);
let mut first_iteration_for_new_group = true;
for (i, db_result_item) in db.range(rtxn, level_0_range)?.take(level_0_size).enumerate() {
let (key, value) = db_result_item?;
let bound = bound_from_db_key(i, &key);
let docids = bitmap_from_db_value(value);
if first_iteration_for_new_group {
start_bound = bound;
first_iteration_for_new_group = false;
}
end_bound = bound;
bitmaps.push(docids);
if bitmaps.len() == level_group_size.get() {
computed_group_bitmap(&bitmaps, start_bound, end_bound)?;
first_iteration_for_new_group = true;
bitmaps.clear();
}
}
// don't forget to give the leftover bitmaps as well
if !bitmaps.is_empty() {
computed_group_bitmap(&bitmaps, start_bound, end_bound)?;
bitmaps.clear();
}
// level 0 is already stored in the DB
return Ok(vec![]);
} else {
// level >= 1
// we compute each element of this level based on the elements of the level below it
// once we have computed `level_group_size` elements, we give the start and end bounds
// of those elements, and their bitmaps, to the level above
let mut cur_writer =
create_writer(compression_type, compression_level, tempfile::tempfile()?);
let mut range_for_bitmaps = vec![];
let mut bitmaps = vec![];
// compute the levels below
// in the callback, we fill `cur_writer` with the correct elements for this level
let mut sub_writers = recursive_compute_levels(
rtxn,
db,
compression_type,
compression_level,
level - 1,
level_0_start,
level_0_range,
level_0_size,
level_group_size,
&mut |sub_bitmaps: &[RoaringBitmap], start_range, end_range| {
let mut combined_bitmap = RoaringBitmap::default();
for bitmap in sub_bitmaps {
combined_bitmap |= bitmap;
}
range_for_bitmaps.push((start_range, end_range));
bitmaps.push(combined_bitmap);
if bitmaps.len() == level_group_size.get() {
let start_bound = range_for_bitmaps.first().unwrap().0;
let end_bound = range_for_bitmaps.last().unwrap().1;
computed_group_bitmap(&bitmaps, start_bound, end_bound)?;
for (bitmap, (start_bound, end_bound)) in
bitmaps.drain(..).zip(range_for_bitmaps.drain(..))
{
write_entry(
&mut cur_writer,
NonZeroU8::new(level).unwrap(),
start_bound,
end_bound,
bitmap,
)?;
}
}
Ok(())
},
bound_from_db_key,
bitmap_from_db_value,
write_entry,
)?;
// don't forget to insert the leftover elements into the writer as well
if !bitmaps.is_empty() {
let start_range = range_for_bitmaps.first().unwrap().0;
let end_range = range_for_bitmaps.last().unwrap().1;
computed_group_bitmap(&bitmaps, start_range, end_range)?;
for (bitmap, (left, right)) in bitmaps.drain(..).zip(range_for_bitmaps.drain(..)) {
write_entry(&mut cur_writer, NonZeroU8::new(level).unwrap(), left, right, bitmap)?;
}
}
sub_writers.push(writer_into_reader(cur_writer)?);
return Ok(sub_writers);
}
}
fn clear_field_number_levels<'t>(
wtxn: &'t mut heed::RwTxn,
db: heed::Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
@ -143,68 +577,15 @@ fn clear_field_number_levels<'t>(
db.delete_range(wtxn, &range).map(drop)
}
fn compute_facet_number_levels<'t>(
rtxn: &'t heed::RoTxn,
db: heed::Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
compression_type: CompressionType,
compression_level: Option<u32>,
level_group_size: NonZeroUsize,
min_level_size: NonZeroUsize,
fn clear_field_string_levels<'t>(
wtxn: &'t mut heed::RwTxn,
db: heed::Database<ByteSlice, DecodeIgnore>,
field_id: FieldId,
) -> Result<Reader<File>> {
let first_level_size = db
.remap_key_type::<ByteSlice>()
.prefix_iter(rtxn, &field_id.to_be_bytes())?
.remap_types::<DecodeIgnore, DecodeIgnore>()
.fold(Ok(0usize), |count, result| result.and(count).map(|c| c + 1))?;
// It is forbidden to keep a cursor and write in a database at the same time with LMDB
// therefore we write the facet levels entries into a grenad file before transfering them.
let mut writer = create_writer(compression_type, compression_level, tempfile::tempfile()?);
let level_0_range = {
let left = (field_id, 0, f64::MIN, f64::MIN);
let right = (field_id, 0, f64::MAX, f64::MAX);
left..=right
};
// Groups sizes are always a power of the original level_group_size and therefore a group
// always maps groups of the previous level and never splits previous levels groups in half.
let group_size_iter = (1u8..)
.map(|l| (l, level_group_size.get().pow(l as u32)))
.take_while(|(_, s)| first_level_size / *s >= min_level_size.get());
for (level, group_size) in group_size_iter {
let mut left = 0.0;
let mut right = 0.0;
let mut group_docids = RoaringBitmap::new();
for (i, result) in db.range(rtxn, &level_0_range)?.enumerate() {
let ((_field_id, _level, value, _right), docids) = result?;
if i == 0 {
left = value;
} else if i % group_size == 0 {
// we found the first bound of the next group, we must store the left
// and right bounds associated with the docids.
write_number_entry(&mut writer, field_id, level, left, right, &group_docids)?;
// We save the left bound for the new group and also reset the docids.
group_docids = RoaringBitmap::new();
left = value;
}
// The right bound is always the bound we run through.
group_docids |= docids;
right = value;
}
if !group_docids.is_empty() {
write_number_entry(&mut writer, field_id, level, left, right, &group_docids)?;
}
}
writer_into_reader(writer)
) -> heed::Result<()> {
let left = (field_id, NonZeroU8::new(1).unwrap(), u32::MIN, u32::MIN);
let right = (field_id, NonZeroU8::new(u8::MAX).unwrap(), u32::MAX, u32::MAX);
let range = left..=right;
db.remap_key_type::<FacetLevelValueU32Codec>().delete_range(wtxn, &range).map(drop)
}
fn write_number_entry(
@ -221,108 +602,6 @@ fn write_number_entry(
writer.insert(&key, &data)?;
Ok(())
}
fn compute_faceted_strings_documents_ids(
rtxn: &heed::RoTxn,
db: heed::Database<ByteSlice, FacetStringLevelZeroValueCodec>,
field_id: FieldId,
) -> Result<RoaringBitmap> {
let mut documents_ids = RoaringBitmap::new();
for result in db.prefix_iter(rtxn, &field_id.to_be_bytes())? {
let (_key, (_original_value, docids)) = result?;
documents_ids |= docids;
}
Ok(documents_ids)
}
fn compute_faceted_numbers_documents_ids(
rtxn: &heed::RoTxn,
db: heed::Database<ByteSlice, CboRoaringBitmapCodec>,
field_id: FieldId,
) -> Result<RoaringBitmap> {
let mut documents_ids = RoaringBitmap::new();
for result in db.prefix_iter(rtxn, &field_id.to_be_bytes())? {
let (_key, docids) = result?;
documents_ids |= docids;
}
Ok(documents_ids)
}
fn clear_field_string_levels<'t>(
wtxn: &'t mut heed::RwTxn,
db: heed::Database<ByteSlice, DecodeIgnore>,
field_id: FieldId,
) -> heed::Result<()> {
let left = (field_id, NonZeroU8::new(1).unwrap(), u32::MIN, u32::MIN);
let right = (field_id, NonZeroU8::new(u8::MAX).unwrap(), u32::MAX, u32::MAX);
let range = left..=right;
db.remap_key_type::<FacetLevelValueU32Codec>().delete_range(wtxn, &range).map(drop)
}
fn compute_facet_string_levels<'t>(
rtxn: &'t heed::RoTxn,
db: heed::Database<FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec>,
compression_type: CompressionType,
compression_level: Option<u32>,
level_group_size: NonZeroUsize,
min_level_size: NonZeroUsize,
field_id: FieldId,
) -> Result<Reader<File>> {
let first_level_size = db
.remap_key_type::<ByteSlice>()
.prefix_iter(rtxn, &field_id.to_be_bytes())?
.remap_types::<DecodeIgnore, DecodeIgnore>()
.fold(Ok(0usize), |count, result| result.and(count).map(|c| c + 1))?;
// It is forbidden to keep a cursor and write in a database at the same time with LMDB
// therefore we write the facet levels entries into a grenad file before transfering them.
let mut writer = create_writer(compression_type, compression_level, tempfile::tempfile()?);
// Groups sizes are always a power of the original level_group_size and therefore a group
// always maps groups of the previous level and never splits previous levels groups in half.
let group_size_iter = (1u8..)
.map(|l| (l, level_group_size.get().pow(l as u32)))
.take_while(|(_, s)| first_level_size / *s >= min_level_size.get());
for (level, group_size) in group_size_iter {
let level = NonZeroU8::new(level).unwrap();
let mut left = (0, "");
let mut right = (0, "");
let mut group_docids = RoaringBitmap::new();
// Because we know the size of the level 0 we can use a range iterator that starts
// at the first value of the level and goes to the last by simply counting.
for (i, result) in db.range(rtxn, &((field_id, "")..))?.take(first_level_size).enumerate() {
let ((_field_id, value), (_original_value, docids)) = result?;
if i == 0 {
left = (i as u32, value);
} else if i % group_size == 0 {
// we found the first bound of the next group, we must store the left
// and right bounds associated with the docids. We also reset the docids.
let docids = mem::take(&mut group_docids);
write_string_entry(&mut writer, field_id, level, left, right, docids)?;
// We save the left bound for the new group.
left = (i as u32, value);
}
// The right bound is always the bound we run through.
group_docids |= docids;
right = (i as u32, value);
}
if !group_docids.is_empty() {
let docids = mem::take(&mut group_docids);
write_string_entry(&mut writer, field_id, level, left, right, docids)?;
}
}
writer_into_reader(writer)
}
fn write_string_entry(
writer: &mut Writer<File>,
field_id: FieldId,