Commit Graph

137 Commits

Author SHA1 Message Date
ad hoc
ac975cc747
cache context's exact words 2022-05-24 09:43:17 +02:00
bors[bot]
ea4bb9402f
Merge #483
483: Enhance matching words r=Kerollmops a=ManyTheFish

# Summary

Enhance milli word-matcher making it handle match computing and cropping.

# Implementation

## Computing best matches for cropping

Before we were considering that the first match of the attribute was the best one, this was accurate when only one word was searched but was missing the target when more than one word was searched.

Now we are searching for the best matches interval to crop around, the chosen interval is the one:
1) that have the highest count of unique matches
> for example, if we have a query `split the world`, then the interval `the split the split the` has 5 matches but only 2 unique matches (1 for `split` and 1 for `the`) where the interval `split of the world` has 3 matches and 3 unique matches. So the interval `split of the world` is considered better.
2) that have the minimum distance between matches
> for example, if we have a query `split the world`, then the interval `split of the world` has a distance of 3 (2 between `split` and `the`, and 1 between `the` and `world`) where the interval `split the world` has a distance of 2. So the interval `split the world` is considered better.
3) that have the highest count of ordered matches
> for example, if we have a query `split the world`, then the interval `the world split` has 2 ordered words where the interval `split the world` has 3. So the interval `split the world` is considered better.

## Cropping around the best matches interval

Before we were cropping around the interval without checking the context.

Now we are cropping around words in the same context as matching words.
This means that we will keep words that are farther from the matching words but are in the same phrase, than words that are nearer but separated by a dot.

> For instance, for the matching word `Split` the text:
`Natalie risk her future. Split The World is a book written by Emily Henry. I never read it.`
will be cropped like:
`…. Split The World is a book written by Emily Henry. …`
and  not like:
`Natalie risk her future. Split The World is a book …`


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-04-19 11:42:32 +00:00
ad hoc
dda28d7415
exclude excluded canditates from search result candidates 2022-04-13 12:10:35 +02:00
ad hoc
bbb6728d2f
add distinct attributes to cli 2022-04-13 12:10:35 +02:00
ManyTheFish
5809d3ae0d Add first benchmarks on formatting 2022-04-12 16:31:58 +02:00
ManyTheFish
827cedcd15 Add format option structure 2022-04-12 13:42:14 +02:00
Irevoire
4f3ce6d9cd
nested fields 2022-04-07 16:58:46 +02:00
ManyTheFish
3bb1e35ada Fix match count 2022-04-05 17:48:45 +02:00
ManyTheFish
b3f0f39106 Make some cleaning 2022-04-05 17:41:32 +02:00
ManyTheFish
734d0899d3 Publish Matcher 2022-04-05 17:41:32 +02:00
ManyTheFish
d96e72e5dc Create formater with some tests 2022-04-05 17:41:32 +02:00
ad hoc
9fe40df960
add word derivations tests 2022-04-01 11:05:18 +02:00
ad hoc
d5ddc6b080
fix 2 typos word derivation bug 2022-04-01 10:51:22 +02:00
ad hoc
6ef3bb9d83
fmt 2022-03-31 14:06:23 +02:00
ad hoc
f782fe2062
add authorize_typo_test 2022-03-31 10:08:39 +02:00
ad hoc
c4653347fd
add authorize typo setting 2022-03-31 10:05:44 +02:00
ad hoc
3f24555c3d
custom fst automatons 2022-03-15 17:38:35 +01:00
ad hoc
628c835a22
fix tests 2022-03-15 17:38:34 +01:00
mpostma
7541ab99cd
review changes 2022-02-02 12:59:01 +01:00
mpostma
d0aabde502
optimize 2 typos case 2022-02-02 12:56:09 +01:00
mpostma
55e6cb9c7b
typos on first letter counts as 2 2022-02-02 12:56:09 +01:00
Tamo
6831c23449
merge with main 2021-11-06 16:34:30 +01:00
Tamo
a58bc5bebb
update milli with the new parser_filter 2021-11-04 15:02:36 +01:00
many
ed6db19681
Fix PR comments 2021-10-28 11:18:32 +02:00
Clémentine Urquizar
208903ddde
Revert "Replacing pest with nom " 2021-10-25 11:58:00 +02:00
Tamo
e25ca9776f
start updating the exposed function to makes other modules happy 2021-10-22 17:23:22 +02:00
Tamo
c27870e765
integrate a first version without any error handling 2021-10-22 14:33:18 +02:00
Tamo
01dedde1c9
update some names and move some parser out of the lib.rs 2021-10-22 01:59:38 +02:00
刘瀚骋
7a90a101ee reorganize parser logic 2021-10-12 13:30:40 +08:00
刘瀚骋
f7796edc7e remove everything about pest 2021-10-12 13:30:40 +08:00
Tamo
47ee93b0bd
return an error when _geoPoint is used but _geo is not sortable 2021-09-22 16:37:41 +02:00
Tamo
257e621d40
create an asc_desc module 2021-09-22 16:37:41 +02:00
Tamo
13c78e5aa2
Implement the _geoPoint in the sortable 2021-09-08 18:24:09 +02:00
Kerollmops
fd3daa4423
Throw a query time error when a sort param is used but sort ranking rule is missing 2021-09-07 11:02:00 +02:00
Alexey Shekhirin
0e379558a1
fix(search): get sortable_fields only if criteria present 2021-08-31 21:35:41 +03:00
Clément Renault
89d0758713
Revert "Revert "Sort at query time"" 2021-08-24 11:55:16 +02:00
Clémentine Urquizar
922f9fd4d5
Revert "Sort at query time" 2021-08-20 18:09:17 +02:00
Kerollmops
1b7f6ea1e7
Return a new error when the sort criteria is not sortable 2021-08-18 15:04:07 +02:00
Kerollmops
407f53872a
Add a sort_criteria method to the Search builder struct 2021-08-18 15:04:07 +02:00
Kerollmops
687cd2e205
Introduce the new Sort criterion and AscDesc enum 2021-08-18 15:04:07 +02:00
Kerollmops
7aa6cc9b04
Do not insert fields in the map when changing the settings 2021-07-22 18:40:12 +02:00
Kerollmops
f858f64b1f
Move the facet number iterators into their own module 2021-07-21 16:59:37 +02:00
Kerollmops
32b7bd366f
Remove the roaring operation functions warnings 2021-06-30 14:12:56 +02:00
Tamo
3d90b03d7b
fix the limit
There was no check on the limit and thus, if a user especified a very large number this line could causes a panic
2021-06-22 14:52:13 +02:00
Tamo
9716fb3b36
format the whole project 2021-06-16 18:33:33 +02:00
Kerollmops
7ac441e473
Fix small typos 2021-06-16 11:03:37 +02:00
Kerollmops
a7d6930905
Replace the panicking expect by tracked Errors 2021-06-15 11:51:32 +02:00
Kerollmops
312c2d1d8e
Use the Error enum everywhere in the project 2021-06-14 16:58:38 +02:00
Kerollmops
3c304c89d4
Make sure that we generate the faceted database when required 2021-06-02 16:24:58 +02:00
Kerollmops
3b1cd4c4b4
Rename the FacetCondition into FilterCondition 2021-06-02 16:24:58 +02:00
Marin Postma
1e366dae3e
remove useless lifetime on Distinct Trait 2021-06-02 16:24:58 +02:00
Kerollmops
187c713de5
Remove the MapDistinct struct as now distinct attributes are faceted 2021-06-02 16:24:57 +02:00
Kerollmops
2a3f9b32ff
Rename the faceted fields into filterable fields 2021-06-02 16:24:57 +02:00
many
1df68d342a
Make the MatchingWords return the number of matching bytes 2021-05-31 18:22:29 +02:00
Clément Renault
02c655ff1a
Refine the facet distribution to use both databases 2021-05-25 11:30:00 +02:00
Clément Renault
f7efde11d9
Refine the facet condition to use both facet databases 2021-05-25 11:30:00 +02:00
Clément Renault
bd7b285bae
Split the update side to use the number and the strings facet databases 2021-05-25 11:30:00 +02:00
many
a3f8686fbf
Introduce exactness criterion 2021-05-06 14:28:30 +02:00
many
ee09e50e7f
Remove excluded document in criteria iterations
- pass excluded document to criteria to remove them in higher levels of the bucket-sort
- merge already returned document with excluded documents to avoid duplicas

Related to #125 and #112
Fix #170
2021-04-29 12:09:38 +02:00
Clément Renault
658f316511
Introduce the Initial Criterion 2021-04-27 14:35:43 +02:00
Alexey Shekhirin
6fa00c61d2
feat(search): support words_limit 2021-04-20 12:22:04 +03:00
Marin Postma
75464a1baa
review fixes 2021-04-15 16:25:56 +02:00
Marin Postma
45c45e11dd
implement distinct attribute
distinct can return error

facet distinct on numbers

return distinct error

review fixes

make get_facet_value more generic

fixes
2021-04-15 16:25:55 +02:00
tamo
dcb00b2e54
test a new implementation of the stop_words 2021-04-12 18:35:33 +02:00
tamo
a2f46029c7
implement a first version of the stop_words
The front must provide a BTreeSet containing the stop words
The stop_words are set at None if an empty Set is provided
add the stop-words in the http-ui interface

Use maplit in the test
and remove all the useless drop(rtxn) at the end of all tests
2021-04-01 13:57:55 +02:00
mpostma
9c27183876
fix broken offset 2021-03-15 20:23:50 +01:00
Kerollmops
d48008339e
Introduce two new optional_words and authorize_typos Search options 2021-03-10 11:16:30 +01:00
many
62a70c300d
Optimize words criterion 2021-03-10 10:42:53 +01:00
Clément Renault
5fcaedb880
Introduce a WordDerivationsCache struct 2021-03-08 16:00:53 +01:00
Kerollmops
9b6b35d9b7
Clean up some comments 2021-03-03 18:19:10 +01:00
Kerollmops
f118d7e067
build criteria from settings 2021-03-03 15:45:03 +01:00
Kerollmops
daf126a638
Introduce the final Fetcher criterion 2021-03-03 15:45:03 +01:00
Kerollmops
5af63c74e0
Speed-up the MatchingWords highlighting struct 2021-03-03 15:45:03 +01:00
Kerollmops
4510bbccca
Add a lot of debug 2021-03-03 15:43:44 +01:00
Kerollmops
9bc9b36645
Introduce the Proximity criterion 2021-03-03 15:43:44 +01:00
Kerollmops
22b84fe543
Use the words criterion in the search module 2021-03-03 15:43:44 +01:00
Clément Renault
14f9f85c4b
Introduce the AscDesc criterion 2021-03-03 15:43:44 +01:00
Kerollmops
e174ccbd8e
Use the words criterion in the search module 2021-03-03 15:43:43 +01:00
many
d92ad5640a
remove option on bucket_candidates 2021-03-03 15:43:43 +01:00
many
64688b3786
fix query tree builder 2021-03-03 15:43:43 +01:00
many
a273c46559
clean warnings 2021-03-03 15:43:42 +01:00
Clément Renault
fea9ffc46a
Use the bucket candidates in the search module 2021-03-03 15:43:42 +01:00
Clément Renault
5344abc008
Introduce the CriterionResult return type 2021-03-03 15:43:41 +01:00
many
98e69e63d2
implement Context trait for criteria 2021-03-03 15:43:41 +01:00
Clément Renault
f091f370d0
Use the Typo criteria in the search module 2021-03-03 15:43:41 +01:00
Kerollmops
79a143b32f
Introduce the query tree data structure 2021-03-03 13:40:18 +01:00
Clément Renault
e8639517da
Change the project to become a workspace with milli as a default-member 2021-02-12 16:15:09 +01:00