563: Improve the `estimatedNbHits` when a `distinctAttribute` is specified r=irevoire a=Kerollmops

This PR is related to https://github.com/meilisearch/meilisearch/issues/2532 but it doesn't fix it entirely. It improves the situation by computing the excluded documents (the ones with an already-seen distinct value) before breaking out of the loop; I think the previous behavior was a mistake and it should always have been this way.

The reason it doesn't fully fix the issue is that Meilisearch is lazy, precisely to avoid computing too many things and taking too long to answer. When we deduplicate the documents by their distinct value, we must do it as we go: every time we see a new document, we check that its distinct value doesn't collide with that of an already-returned document.
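The streaming deduplication described above can be sketched as follows. This is a minimal, hypothetical model (the real code works on candidate bitmaps through a `Distinct` iterator, not on a slice of `(id, value)` pairs):

```rust
use std::collections::HashSet;

// Minimal sketch (hypothetical types): deduplicate a stream of documents by
// their distinct value as we iterate, stopping as soon as `limit` results
// have been kept. This is the lazy behavior: documents past the stop point
// are never examined.
fn distinct_take(docs: &[(u32, &str)], limit: usize) -> Vec<u32> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut kept = Vec::new();
    for &(doc_id, distinct_value) in docs {
        // Skip any document whose distinct value was already returned.
        if seen.insert(distinct_value.to_string()) {
            kept.push(doc_id);
            if kept.len() == limit {
                break; // lazy: stop once the page is full
            }
        }
    }
    kept
}
```

Because of the early `break`, any duplicates located after the last kept document are never subtracted from the candidate count, which is exactly why the estimate can be too high.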

The reason we can see the correct result when enough documents are fetched is that we were lucky enough to encounter every distinct value present in the dataset: all of the deduplication was already done, so no further document can be returned.

If we wanted a correct `estimatedNbHits` every time, we would have to do a pass over the whole set of possible distinct values for the distinct attribute and compute a big intersection, which could cost a lot of CPU cycles.
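For illustration, here is a hedged sketch of that exhaustive approach the PR deliberately avoids (hypothetical signature; the real candidates are bitmaps of document ids, not `(id, value)` pairs). The exact hit count after deduplication is the number of distinct values among all candidates, which requires touching every candidate:

```rust
use std::collections::HashSet;

// Hypothetical sketch of the "exact" approach: scan every candidate document
// and count the distinct values of the distinct attribute. The result is the
// true number of hits after deduplication, but it costs a full pass over the
// candidate set instead of stopping at the page limit.
fn exact_nb_hits(candidates: &[(u32, &str)]) -> usize {
    let distinct_values: HashSet<&str> =
        candidates.iter().map(|&(_, v)| v).collect();
    distinct_values.len()
}
```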

Co-authored-by: Kerollmops <clement@meilisearch.com>
commit d546f6f40e
bors[bot] 2022-06-22 12:39:44 +00:00 (committed by GitHub)


@@ -223,7 +223,6 @@ impl<'a> Search<'a> {
             debug!("Number of candidates found {}", candidates.len());
             let excluded = take(&mut excluded_candidates);
             let mut candidates = distinct.distinct(candidates, excluded);
             initial_candidates |= bucket_candidates;
@@ -236,10 +235,12 @@ impl<'a> Search<'a> {
             for candidate in candidates.by_ref().take(self.limit - documents_ids.len()) {
                 documents_ids.push(candidate?);
             }
+            excluded_candidates |= candidates.into_excluded();
             if documents_ids.len() == self.limit {
                 break;
             }
-            excluded_candidates = candidates.into_excluded();
         }
         Ok(SearchResult {
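The key point of the diff is that the exclusions are now accumulated with a union (`|=`) *before* the early `break`, instead of being overwritten with `=` after it. A toy model of that accumulation, using a `HashSet` in place of the real candidate bitmaps (hypothetical names and types):

```rust
use std::collections::HashSet;

// Toy illustration: union each bucket's exclusions into the accumulator
// before deciding to stop. With the old code (plain assignment placed after
// the `break`), the exclusions gathered while filling the last page were
// dropped, so the excluded documents still inflated `estimatedNbHits`.
fn accumulate_exclusions(per_bucket: &[Vec<u32>], stop_after: usize) -> HashSet<u32> {
    let mut excluded: HashSet<u32> = HashSet::new();
    for (i, bucket) in per_bucket.iter().enumerate() {
        // Union the bucket's exclusions *before* checking the stop condition.
        excluded.extend(bucket.iter().copied());
        if i + 1 == stop_after {
            break;
        }
    }
    excluded
}
```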