320 Commits

Author SHA1 Message Date
Louis Dureuil
4310928803
Fixes #3912 2023-07-12 10:08:56 +02:00
Louis Dureuil
74315b4ea8
Fixes #3911 2023-07-12 10:08:29 +02:00
Louis Dureuil
55cd7738b9
Update snapshots 2023-07-04 16:31:01 +02:00
Louis Dureuil
48409c9183
Add missing exactness.matchingWords, exactness.maxMatchingWords 2023-07-04 16:31:01 +02:00
Louis Dureuil
324d448236
Format let-else ❤️ 🎉 2023-07-03 10:20:28 +02:00
ManyTheFish
6ec7541026 Update insta snapshots 2023-06-29 17:18:39 +02:00
ManyTheFish
84845de9ef Update Charabia 2023-06-29 15:56:32 +02:00
meili-bors[bot]
d4f10800f2
Merge #3834
3834: Define searchable fields at runtime r=Kerollmops a=ManyTheFish

## Summary
This feature allows the end-user to search in one or multiple attributes using the search parameter `attributesToSearchOn`:

```json
{
  "q": "Captain Marvel",
  "attributesToSearchOn": ["title"]
}
```

This feature acts like a filter, forcing Meilisearch to return only the documents that contain the requested words in the attributes-to-search-on. Note that with the matching strategy `last`, Meilisearch only ensures that the first word is in the attributes-to-search-on; the retrieved documents are still ordered by taking into account the words contained in the attributes-to-search-on.

## Trying the prototype

A dedicated docker image has been released for this feature:

#### latest prototype version:

```bash
docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-1
```

#### other prototype versions:

```bash
docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-0
```

## Technical Detail

The attributes-to-search-on list is given to the search context, which then queries the `fid_word_docids` database using only the allowed field ids instead of the global `word_docids` database. The same applies to the prefix databases.
The database cache is updated with the merged values, meaning that the union of the field-id-database values is only computed when the requested key is missing from the cache.
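
As a rough illustration of the caching behaviour described above, here is a minimal sketch. It is not the actual milli code: `FieldWordDb`, `SearchContext`, `word_docids`, `unions_computed` and `sample_db` are hypothetical names, and a `BTreeSet<u32>` stands in for the real docids bitmaps.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins for the real types (assumptions, not milli's API).
type FieldId = u16;
type DocIds = BTreeSet<u32>;

/// Toy per-field word database: (field id, word) -> docids.
struct FieldWordDb {
    entries: HashMap<(FieldId, String), DocIds>,
}

/// Toy search context restricted to a list of allowed field ids.
struct SearchContext<'a> {
    db: &'a FieldWordDb,
    allowed_fields: Vec<FieldId>,
    /// Cache of merged docids, keyed by word.
    cache: HashMap<String, DocIds>,
    /// Counts how many unions were actually computed (for illustration only).
    unions_computed: usize,
}

impl<'a> SearchContext<'a> {
    fn new(db: &'a FieldWordDb, allowed_fields: Vec<FieldId>) -> Self {
        Self { db, allowed_fields, cache: HashMap::new(), unions_computed: 0 }
    }

    /// Returns the docids of `word` in the allowed fields only.
    /// The union over the fields is computed only on a cache miss.
    fn word_docids(&mut self, word: &str) -> &DocIds {
        if !self.cache.contains_key(word) {
            let mut merged = DocIds::new();
            for fid in &self.allowed_fields {
                if let Some(ids) = self.db.entries.get(&(*fid, word.to_string())) {
                    merged.extend(ids.iter().copied());
                }
            }
            self.unions_computed += 1;
            self.cache.insert(word.to_string(), merged);
        }
        &self.cache[word]
    }
}

/// Sample data: field 0 plays the role of "title", field 1 the role of "desc".
fn sample_db() -> FieldWordDb {
    let mut entries = HashMap::new();
    entries.insert((0, "captain".to_string()), DocIds::from([1, 2]));
    entries.insert((1, "captain".to_string()), DocIds::from([1]));
    entries.insert((1, "shazam".to_string()), DocIds::from([2]));
    FieldWordDb { entries }
}
```

With `SearchContext::new(&db, vec![0])`, looking up `captain` twice computes the union once, and `shazam` yields an empty set because it only appears in the ignored field.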

### Relevancy limits

Almost all ranking rules behave as expected when ordering the documents.
Only `proximity` could misorder documents: this happens when all the searched words are in the restricted attribute but a better proximity is found in an ignored attribute of a document that should be ranked lower. I put a failing test below showing it:
```rust
#[actix_rt::test]
async fn proximity_ranking_rule_order() {
    let server = Server::new().await;
    let index = index_with_documents(
        &server,
        &json!([
        {
            "title": "Captain super mega cool. A Marvel story",
            // Perfect distance between words in an ignored attribute
            "desc": "Captain Marvel",
            "id": "1",
        },
        {
            "title": "Captain America from Marvel",
            "desc": "a Shazam ersatz",
            "id": "2",
        }]),
    )
    .await;

    // Document 2 should appear before document 1.
    index
        .search(json!({"q": "Captain Marvel", "attributesToSearchOn": ["title"], "attributesToRetrieve": ["id"]}), |response, code| {
            assert_eq!(code, 200, "{}", response);
            assert_eq!(
                response["hits"],
                json!([
                    {"id": "2"},
                    {"id": "1"},
                ])
            );
        })
        .await;
}
```

Fixing this would force us to create `fid_word_pair_proximity_docids` and `fid_word_prefix_pair_proximity_docids` databases, which may multiply the number of keys of `word_pair_proximity_docids` and `word_prefix_pair_proximity_docids` by the number of attributes in the searchable_attributes list. If we think we should fix this test, I suggest doing it in another PR.

## Related

Fixes #3772

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-06-28 08:19:23 +00:00
Clément Renault
29d8268c94
Fix the vector query part by using the correct universe 2023-06-27 12:32:43 +02:00
Kerollmops
ab9f2269aa
Normalize the vectors during indexation and search 2023-06-27 12:32:41 +02:00
Kerollmops
3b560ef7d0
Make clippy happy 2023-06-27 12:32:40 +02:00
Kerollmops
3c31e1cdd1
Support more pages but in an ugly way 2023-06-27 12:32:39 +02:00
Kerollmops
c79e82c62a
Move back to the hnsw crate
This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.
2023-06-27 12:32:39 +02:00
Kerollmops
268a9ef416
Move to the hgg crate 2023-06-27 12:32:38 +02:00
Clément Renault
642b0f3a1b
Expose a new vector field on the search route 2023-06-27 12:32:38 +02:00
ManyTheFish
63ca25290b Take into account small Review requests 2023-06-26 14:56:19 +02:00
ManyTheFish
59f64a5256 Return an error when an attribute is not searchable 2023-06-26 14:56:19 +02:00
ManyTheFish
42709ea9a5 Fix clippy warnings 2023-06-26 14:55:57 +02:00
ManyTheFish
fb8fa07169 Restrict field ids in search context 2023-06-26 14:55:57 +02:00
ManyTheFish
0ccf1e2e40 Allow the search cache to store owned values 2023-06-26 14:55:57 +02:00
ManyTheFish
461b5118bd Add API search setting 2023-06-26 14:55:14 +02:00
Louis Dureuil
d26e9a96ec
Add score details to new search tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
49c8bc4de6
Fix tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
da833eb095
Expose the scores and detailed scores in the API 2023-06-22 12:39:14 +02:00
Louis Dureuil
701d44bd91
Store the scores for each bucket
Remove optimization where ranking rules are not executed on buckets of a single document
when the score needs to be computed
2023-06-22 12:39:14 +02:00
Louis Dureuil
c621a250a7
Score for graph based ranking rules
Count phrases in matchingWords and maxMatchingWords
2023-06-22 12:39:14 +02:00
Louis Dureuil
8939e85f60
Add rank_to_score for graph based ranking rules 2023-06-22 12:39:14 +02:00
Louis Dureuil
fa41d2489e
Score for sort 2023-06-22 12:39:14 +02:00
Louis Dureuil
59c5b992c2
Score for geosort 2023-06-22 12:39:14 +02:00
Louis Dureuil
2ea8194c18
Score for exact_attributes 2023-06-22 12:39:14 +02:00
Louis Dureuil
421df64602
RankingRuleOutput now contains a Score 2023-06-22 12:39:14 +02:00
Louis Dureuil
f050634b1e
add virtual conditions to fid and position to always have the max cost 2023-06-20 10:07:18 +02:00
Louis Dureuil
becf1f066a
Change how the cost of removing words is computed 2023-06-20 09:45:43 +02:00
Louis Dureuil
701d299369
Remove out-of-date comment 2023-06-20 09:45:42 +02:00
Louis Dureuil
a20e4d447c
Position now takes into account the distance to the position of the word in the query
it used to be based on the distance to the position 0
2023-06-20 09:45:42 +02:00
Louis Dureuil
af57c3c577
Proximity costs 0 for documents that are perfectly matching 2023-06-20 09:45:42 +02:00
Louis Dureuil
0c40ef6911
Fix sort id 2023-06-20 09:45:42 +02:00
Loïc Lecrenier
2da86b31a6 Remove comments and add documentation 2023-06-14 12:39:42 +02:00
Louis Dureuil
a2a3b8c973
Fix offset difference between query and indexing for hard separators 2023-06-08 12:07:12 +02:00
Louis Dureuil
1dfc4038ab
Add test that fails before PR and passes now 2023-05-29 11:58:26 +02:00
Louis Dureuil
73198179f1
Consistently use wrapping add to avoid overflow in debug when query starts with a separator 2023-05-29 11:54:12 +02:00
meili-bors[bot]
2e49d6aec1
Merge #3768
3768: Fix bugs in graph-based ranking rules + make `words` a graph-based ranking rule r=dureuill a=loiclec

This PR contains three changes:

## 1. Don't call the `words` ranking rule if the term matching strategy is `All`

This is because the purpose of `words` is only to remove nodes from the query graph. It would never do any useful work when the matching strategy was `All`. Remember that the universe was already computed before by computing all the docids corresponding to the "maximally reduced" query graph, which, in the case of `All`, is equal to the original graph.
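
A small sketch of this idea, assuming simplified types: the enums and the `active_ranking_rules` function below are hypothetical and only illustrate dropping `words` from the list when the strategy is `All`.

```rust
// Hypothetical, simplified types: not the actual milli definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TermsMatchingStrategy {
    Last,
    All,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RankingRule {
    Words,
    Typo,
    Proximity,
}

/// Returns the ranking rules that should actually run.
/// `Words` is skipped with `All`, since no word may be removed from the query
/// and the rule would have nothing useful to do.
fn active_ranking_rules(
    configured: &[RankingRule],
    strategy: TermsMatchingStrategy,
) -> Vec<RankingRule> {
    configured
        .iter()
        .copied()
        .filter(|rule| !(*rule == RankingRule::Words && strategy == TermsMatchingStrategy::All))
        .collect()
}
```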

## 2. The `words` ranking rule is replaced by a graph-based ranking rule

This is for three reasons:

1. **performance**: graph-based ranking rules benefit from a lot of optimisations by default, which ensures that they are never too slow. The previous implementation of `words` could call `compute_query_graph_docids` many times if some words had to be removed from the query, which would be quite expensive. I was especially worried about its performance in cases where it is placed right after the `sort` ranking rule. Furthermore, `compute_query_graph_docids` would clone a lot of bitmaps many times unnecessarily.

2. **consistency**: every other ranking rule (except `sort`) is graph-based. It makes sense to implement `words` like that as well. It will automatically benefit from all the features, optimisations, and bug fixes that all the other ranking rules get.

3. **surfacing bugs**: as the first ranking rule to be called (most of the time), I'd like `words` to behave the same as the other ranking rules so that we can quickly detect bugs in our graph algorithms. This actually already happened, which is why this PR also contains a bug fix.

## 3. Fix the `update_all_costs_before_nodes` function

It is a bit difficult to explain what was wrong, but I'll try. The bug happened when we had graphs like:
<img width="730" alt="Screenshot 2023-05-16 at 10 58 57" src="https://github.com/meilisearch/meilisearch/assets/6040237/40db1a68-d852-4e89-99d5-0d65757242a7">
and we gave the node `is` as argument.

Then, we'd walk backwards from the node breadth-first. We'd update the costs of:
1. `sun`
2. `thesun`
3. `start`
4. `the`

which is an incorrect order. The correct order is:

1. `sun`
2. `thesun`
3. `the`
4. `start`

That is, we can only update the cost of a node when all of its successors have either already been visited or were not affected by the update to the node passed as argument. To solve this bug, I factored out the graph-traversal logic into a `traverse_breadth_first_backward` function.
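
To make the ordering rule concrete, here is a minimal sketch of a backward traversal that only visits a node once all of its affected successors have been visited. It is not the `traverse_breadth_first_backward` function from this PR: `backward_update_order` and `sample_graph` are hypothetical names, and the graph shape (`start -> the -> sun -> is` plus `start -> thesun -> is`) is an assumption reconstructed from the node names above. It also assumes the graph is acyclic.

```rust
use std::collections::VecDeque;

/// Returns the nodes that can reach `from`, ordered so that a node appears
/// only after all of its successors that are affected by the update.
/// `successors[n]` lists the nodes directly reachable from node `n`.
fn backward_update_order(successors: &[Vec<usize>], from: usize) -> Vec<usize> {
    let n = successors.len();

    // Build the predecessor lists from the successor lists.
    let mut predecessors = vec![Vec::new(); n];
    for (node, succs) in successors.iter().enumerate() {
        for &s in succs {
            predecessors[s].push(node);
        }
    }

    // Mark the affected nodes: `from` and every node that can reach it.
    let mut affected = vec![false; n];
    affected[from] = true;
    let mut stack = vec![from];
    while let Some(node) = stack.pop() {
        for &p in &predecessors[node] {
            if !affected[p] {
                affected[p] = true;
                stack.push(p);
            }
        }
    }

    // For each node, count the affected successors that are still unvisited.
    let mut pending: Vec<usize> = successors
        .iter()
        .map(|succs| succs.iter().filter(|&&s| affected[s]).count())
        .collect();

    // Walk backwards, releasing a node only when its counter reaches zero.
    let mut order = Vec::new();
    let mut queue = VecDeque::from([from]);
    while let Some(node) = queue.pop_front() {
        if node != from {
            order.push(node);
        }
        for &p in &predecessors[node] {
            pending[p] -= 1;
            if pending[p] == 0 {
                queue.push_back(p);
            }
        }
    }
    order
}

/// Assumed example graph: 0 = start, 1 = the, 2 = sun, 3 = thesun, 4 = is.
fn sample_graph() -> (Vec<&'static str>, Vec<Vec<usize>>) {
    let names = vec!["start", "the", "sun", "thesun", "is"];
    let successors = vec![vec![1, 3], vec![2], vec![4], vec![4], vec![]];
    (names, successors)
}
```

Calling `backward_update_order(&successors, 4)` on the sample graph visits `sun`, `thesun`, `the`, then `start`: `start` is held back until both `the` and `thesun` have been handled.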


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-23 13:28:08 +00:00
Louis Dureuil
51043f78f0
Remove trailing whitespace 2023-05-23 15:27:25 +02:00
Louis Dureuil
a490a11325
Add explanatory comment on the way we're recomputing costs 2023-05-23 15:24:24 +02:00
Loïc Lecrenier
ec8f685d84 Fix bug in cheapest path algorithm 2023-05-16 17:01:30 +02:00
Loïc Lecrenier
5758268866 Don't compute split_words for phrases 2023-05-16 17:01:18 +02:00
Loïc Lecrenier
3e19702de6 Update snapshot tests 2023-05-16 12:22:46 +02:00
Loïc Lecrenier
f6524a6858 Adjust costs of edges in position ranking rule
To ensure good performance
2023-05-16 11:28:56 +02:00
meili-bors[bot]
65ad8cce36
Merge #3741
3741: Add ngram support to the highlighter r=ManyTheFish a=loiclec

This PR fixes a bug introduced by the search refactor, where ngrams were not highlighted. 

The solution was to add the ngrams to the vector of `LocatedQueryTerm` that is given to the `MatchingWords` structure.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-05-16 09:03:31 +00:00
Loïc Lecrenier
a37da36766 Implement words as a graph-based ranking rule and fix some bugs 2023-05-16 10:42:11 +02:00
Loïc Lecrenier
85d96d35a8 Highlight ngram matches as well 2023-05-16 10:39:36 +02:00
Loïc Lecrenier
4d352a21ac Compute split words derivations of terms that don't accept typos 2023-05-10 13:31:19 +02:00
Loïc Lecrenier
3625389057 Highlight ngram matches as well 2023-05-08 15:35:41 +02:00
meili-bors[bot]
eace6df91b
Merge #3726
3726: Fix prefix highlighting r=loiclec a=ManyTheFish

The prefix queries were not properly highlighted. This PR now highlights only the start of a word when it matches a prefix.
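
A minimal sketch of the intended behaviour, under simplifying assumptions: `highlight_prefix` is a hypothetical helper, matching is exact and case-sensitive, and `<em>` tags are used as the highlight markers. The real highlighter works on tokens and is more involved.

```rust
/// Wraps only the matched prefix of `word` in `<em>` tags.
/// Returns `None` when `word` does not start with `prefix` or `prefix` is empty.
fn highlight_prefix(word: &str, prefix: &str) -> Option<String> {
    if prefix.is_empty() {
        return None;
    }
    // `strip_prefix` returns the remainder of the word after the prefix.
    let rest = word.strip_prefix(prefix)?;
    Some(format!("<em>{}</em>{}", prefix, rest))
}
```

For example, `highlight_prefix("captain", "cap")` yields `<em>cap</em>tain` rather than highlighting the whole word.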

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-05-08 07:46:46 +00:00
Loïc Lecrenier
83ab8cf4e5 Remove dbg!(..) expression in highlighter tests 2023-05-08 09:45:23 +02:00
ManyTheFish
cd2573fcc3 Fix prefix highlighting 2023-05-04 16:53:50 +02:00
Jakub Jirutka
13f1277637 Allow to disable specialized tokenizations (again)
In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai`
feature flags to allow meilisearch to be built without huge specialized
tokenizations that took up 90% of the meilisearch binary size.
Unfortunately, due to some recent changes, this doesn't work anymore.
The problem lies in excessive use of the `default` feature flag, which
infects the dependency graph.

Instead of adding `default-features = false` here and there, it's easier
and more future-proof to not declare `default` in `milli` and
`meilisearch-types`. I've renamed it to `all-tokenizers`, which also
makes it a bit clearer what it's about.
2023-05-04 15:45:40 +02:00
Louis Dureuil
f8f190cd40
Update exactness tests following charabia camelCase tokenization 2023-05-03 14:45:09 +02:00
Louis Dureuil
1aaf24ccbf
Cargo fmt 2023-05-03 12:21:58 +02:00
Louis Dureuil
342c4ff85d
geosort: Remove rtree unwrap 2023-05-03 09:52:16 +02:00
Tamo
c85392ce40
make the descending geosort fast 2023-05-03 09:13:12 +02:00
Tamo
8875d24a48
deserialize the rtree only when it's needed, and keep it in memory once it has been deserialized 2023-05-03 09:13:12 +02:00
Tamo
c470b67fa2
revamp the test to use execute_iterative_and_rtree_returns_the_same 2023-05-03 09:13:12 +02:00
Louis Dureuil
b60840ebff
Remove self.iterating from words 2023-05-02 18:54:23 +02:00
Louis Dureuil
fdc1763838
Use MultiOps for resolve_query_graph 2023-05-02 18:54:09 +02:00
Louis Dureuil
75819bc940
Remove too many arguments on resolve_maximally_reduced_query_graph 2023-05-02 18:53:40 +02:00
Louis Dureuil
7b8cc25625
rename located_query_terms_from_string -> located_query_terms_from_tokens 2023-05-02 18:53:01 +02:00
Loïc Lecrenier
aa63091752 Fix bug in exact_attribute 2023-05-02 10:48:32 +02:00
Loïc Lecrenier
1b514517f5 Fix bug in computation of query term at a position 2023-05-02 10:48:32 +02:00
Loïc Lecrenier
11f814821d Minor cleanup 2023-05-02 10:48:32 +02:00
Loïc Lecrenier
30fb1153cc Speed up graph based ranking rule when a lot of different costs exist 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
3b2c8b9f25 Improve performance of position rr 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
2a7f9adf78 Build query graph more correctly from paths
Update snapshots
2023-05-02 09:59:42 +02:00
Loïc Lecrenier
608ceea440 Fix bug in position rr 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
79001b9c97 Improve performance of the cheapest path finder algorithm 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
59b12fca87 Fix errors, clippy warnings, and add review comments 2023-04-29 11:48:11 +02:00
Loïc Lecrenier
48f5bb1693 Implements the geo-sort ranking rule 2023-04-29 11:02:16 +02:00
Loïc Lecrenier
bc4efca611 Add more tests for the attribute ranking rule 2023-04-29 10:56:48 +02:00
Loïc Lecrenier
899baa0ea5 Update forgotten snapshot from previous commit 2023-04-27 13:43:04 +02:00
Loïc Lecrenier
374095d42c Add tests for stop words and fix a couple of bugs 2023-04-27 13:30:09 +02:00
Louis Dureuil
b41a6cbd7a
Check sort criteria also in placeholder search 2023-04-26 16:28:17 +02:00
Louis Dureuil
c8af572697
Add tests for exact words and exact attributes 2023-04-26 16:13:01 +02:00
Loïc Lecrenier
b448aca49c Add more tests for exactness rr 2023-04-26 11:04:18 +02:00
Loïc Lecrenier
55bad07c16 Fix bug in exact_attribute rr implementation 2023-04-26 10:40:05 +02:00
Loïc Lecrenier
3421125a55 Prevent the exactness ranking rule from removing random words
Make it strictly follow the term matching strategy
2023-04-26 09:09:19 +02:00
Loïc Lecrenier
d3a94e8b25 Fix bugs and add tests to exactness ranking rule 2023-04-25 16:49:08 +02:00
Loïc Lecrenier
8f2e971879 Add tests for "exactness" rr, make correct universe computation 2023-04-24 16:57:34 +02:00
Loïc Lecrenier
d1fdbb63da Make all search tests pass, fix distinctAttribute bug 2023-04-24 12:12:08 +02:00
Loïc Lecrenier
84d9c731f8 Fix bug in encoding of word_position_docids and word_fid_docids 2023-04-24 09:59:30 +02:00
Loïc Lecrenier
bd9aba4d77 Add "position" part of the attribute ranking rule 2023-04-13 10:46:09 +02:00
Loïc Lecrenier
8edad8291b Add logger to attribute rr, fix a bug 2023-04-13 10:25:00 +02:00
Kerollmops
d9cebff61c Add a simple test to check that attributes are ranking correctly 2023-04-13 08:27:09 +02:00
Loïc Lecrenier
30f7bd03f6 Fix compiler warning/errors caused by previous merge 2023-04-13 08:27:09 +02:00
Kerollmops
df0d9bb878 Introduce the attribute ranking rule in the list of ranking rules 2023-04-13 08:27:09 +02:00
Kerollmops
5230ddb3ea Resolve the attribute ranking rule conditions 2023-04-13 08:27:09 +02:00
Kerollmops
d6a7c28e4d Implement the attribute ranking rule edge computation 2023-04-13 08:27:09 +02:00
Kerollmops
e55efc419e Introduce a new cache for the words fids 2023-04-13 08:27:09 +02:00
Loïc Lecrenier
644e136aee Merge branch 'search-refactor-typo-attributes' into search-refactor 2023-04-13 08:26:56 +02:00
Louis Dureuil
38b7b31beb Decide to use prefix DB if the word is not an ngram 2023-04-12 16:45:38 +02:00
Louis Dureuil
7a01f20df7 Use word_prefix_docids, make get_word_prefix_docids private 2023-04-12 16:45:38 +02:00