Commit Graph

107 Commits

Author SHA1 Message Date
ManyTheFish
86ac8568e6 Use Charabia in milli 2022-06-02 16:59:11 +02:00
bors[bot]
ea4bb9402f
Merge #483
483: Enhance matching words r=Kerollmops a=ManyTheFish

# Summary

Enhance milli word-matcher making it handle match computing and cropping.

# Implementation

## Computing best matches for cropping

Before we were considering that the first match of the attribute was the best one, this was accurate when only one word was searched but was missing the target when more than one word was searched.

Now we are searching for the best matches interval to crop around, the chosen interval is the one:
1) that have the highest count of unique matches
> for example, if we have a query `split the world`, then the interval `the split the split the` has 5 matches but only 2 unique matches (1 for `split` and 1 for `the`) where the interval `split of the world` has 3 matches and 3 unique matches. So the interval `split of the world` is considered better.
2) that have the minimum distance between matches
> for example, if we have a query `split the world`, then the interval `split of the world` has a distance of 3 (2 between `split` and `the`, and 1 between `the` and `world`) where the interval `split the world` has a distance of 2. So the interval `split the world` is considered better.
3) that have the highest count of ordered matches
> for example, if we have a query `split the world`, then the interval `the world split` has 2 ordered words where the interval `split the world` has 3. So the interval `split the world` is considered better.

## Cropping around the best matches interval

Before we were cropping around the interval without checking the context.

Now we are cropping around words in the same context as matching words.
This means that we will keep words that are farther from the matching words but are in the same phrase, than words that are nearer but separated by a dot.

> For instance, for the matching word `Split` the text:
`Natalie risk her future. Split The World is a book written by Emily Henry. I never read it.`
will be cropped like:
`…. Split The World is a book written by Emily Henry. …`
and  not like:
`Natalie risk her future. Split The World is a book …`


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-04-19 11:42:32 +00:00
ManyTheFish
5809d3ae0d Add first benchmarks on formatting 2022-04-12 16:31:58 +02:00
ManyTheFish
827cedcd15 Add format option structure 2022-04-12 13:42:14 +02:00
Irevoire
4f3ce6d9cd
nested fields 2022-04-07 16:58:46 +02:00
ManyTheFish
b3f0f39106 Make some cleaning 2022-04-05 17:41:32 +02:00
ManyTheFish
29c5f76d7f Use new matcher in http-ui 2022-04-05 17:41:32 +02:00
psvnl sai kumar
5e08fac729 fixes for rustfmt pass 2022-03-14 19:22:41 +05:30
psvnl sai kumar
92e2e09434 exporting heed to avoid having different versions of Heed in Meilisearch 2022-03-14 01:01:58 +05:30
Tamo
98a365aaae
store the geopoint in three dimensions 2021-12-14 12:21:24 +01:00
many
35f9499638
Export tokenizer from milli 2021-11-18 16:57:12 +01:00
Tamo
a58bc5bebb
update milli with the new parser_filter 2021-11-04 15:02:36 +01:00
Tamo
e25ca9776f
start updating the exposed function to makes other modules happy 2021-10-22 17:23:22 +02:00
Tamo
c27870e765
integrate a first version without any error handling 2021-10-22 14:33:18 +02:00
bors[bot]
59cc59e93e
Merge #358
358: Replacing pest with nom  r=Kerollmops a=CNLHC



Co-authored-by: 刘瀚骋 <cn_lhc@qq.com>
2021-10-16 20:44:38 +00:00
many
360c5ff3df
Remove limit of 1000 position per attribute
Instead of using an arbitrary limit we encode the absolute position in a u32
using one strong u16 for the field id and a weak u16 for the relative position in the attribute.
2021-10-12 10:10:50 +02:00
刘瀚骋
f7796edc7e remove everything about pest 2021-10-12 13:30:40 +08:00
many
3296bb243c
Simplify word level position DB into a word position DB 2021-10-05 12:15:02 +02:00
Tamo
c7cb816ae1
simplify the error handling of the sort syntax for meilisearch 2021-09-27 19:07:22 +02:00
Tamo
023446ecf3
create a smaller and easier to maintain CriterionError type 2021-09-22 16:37:41 +02:00
Tamo
86e272856a
create an asc_desc error type that is never supposed to be returned to the end user 2021-09-22 16:37:41 +02:00
Tamo
257e621d40
create an asc_desc module 2021-09-22 16:37:41 +02:00
mpostma
aa6c5df0bc Implement documents format
document reader transform

remove update format

support document sequences

fix document transform

clean transform

improve error handling

add documents! macro

fix transform bug

fix tests

remove csv dependency

Add comments on the transform process

replace search cli

fmt

review edits

fix http ui

fix clippy warnings

Revert "fix clippy warnings"

This reverts commit a1ce3cd96e603633dbf43e9e0b12b2453c9c5620.

fix review comments

remove smallvec in transform loop

review edits
2021-09-21 16:58:33 +02:00
Tamo
cfc62a1c15
use geoutils instead of haversine 2021-09-09 18:11:38 +02:00
Tamo
3fc145c254
if we have no rtree we return all other provided documents 2021-09-09 17:44:09 +02:00
Irevoire
a84f3a8b31
Apply suggestions from code review
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-09-09 15:09:35 +02:00
Tamo
e5ef0cad9a
use meters in the filters 2021-09-08 18:24:09 +02:00
Tamo
f0b74637dc
fix all the tests 2021-09-08 18:24:09 +02:00
Irevoire
8d9c2c4425
create a new db with getters and setters 2021-09-08 17:51:07 +02:00
many
1d314328f0
Plug new indexer 2021-09-01 16:48:36 +02:00
many
3aaf1d62f3
Publish grenad CompressionType type in milli 2021-09-01 16:42:08 +02:00
Kerollmops
af65485ba7
Reexport the grenad CompressionType from milli 2021-08-24 18:15:31 +02:00
Clément Renault
89d0758713
Revert "Revert "Sort at query time"" 2021-08-24 11:55:16 +02:00
Clémentine Urquizar
922f9fd4d5
Revert "Sort at query time" 2021-08-20 18:09:17 +02:00
many
d1df0d20f9
Add integration test of SortBy criterion 2021-08-18 16:21:51 +02:00
Kerollmops
838ed1cd32
Use an u16 field id instead of one byte 2021-07-06 11:58:03 +02:00
Clémentine Urquizar
daef43f504
Rename FieldsDistribution into FieldDistribution 2021-06-21 15:57:41 +02:00
Tamo
d08cfda796
convert the field_distribution to a BTreeMap and avoid counting twice the same documents 2021-06-17 18:31:54 +02:00
Tamo
969adaefdf
rename fields_distribution in field_distribution 2021-06-17 15:16:20 +02:00
marin
70bee7d405
re-export remaining error types
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-06-17 11:49:03 +02:00
marin postma
abbebad669
change sub errors visibility 2021-06-17 11:44:01 +02:00
Tamo
9716fb3b36
format the whole project 2021-06-16 18:33:33 +02:00
Kerollmops
f0e804afd5
Rename the FieldIdMapMissingEntry from_db_name field into process 2021-06-15 11:13:04 +02:00
Kerollmops
312c2d1d8e
Use the Error enum everywhere in the project 2021-06-14 16:58:38 +02:00
Kerollmops
23fcf7920e
Introduce a basic version of the InternalError struct 2021-06-14 16:48:51 +02:00
Kerollmops
103dddba2f
Move the UpdateStore into the http-ui crate 2021-06-08 17:59:51 +02:00
Kerollmops
3b1cd4c4b4
Rename the FacetCondition into FilterCondition 2021-06-02 16:24:58 +02:00
many
4ddf008be2
add field id word count database 2021-05-31 16:27:28 +02:00
Kerollmops
89ee2cf576
Introduce the TreeLevel struct 2021-04-27 14:25:35 +02:00
Kerollmops
b0a417f342
Introduce the word_level_position_docids Index database 2021-04-27 14:25:34 +02:00
Alexey Shekhirin
2658c5c545
feat(index): update fields distribution in clear & delete operations
fixes after review

bump the version of the tokenizer

implement a first version of the stop_words

The front must provide a BTreeSet containing the stop words
The stop_words are set at None if an empty Set is provided
add the stop-words in the http-ui interface

Use maplit in the test
and remove all the useless drop(rtxn) at the end of all tests

Integrate the stop_words in the querytree

remove the stop_words from the querytree except if it was a prefix or a typo

more fixes after review
2021-04-01 19:12:35 +03:00
Kerollmops
6bf6b40495
Remove unused files 2021-03-03 15:45:03 +01:00
Kerollmops
5af63c74e0
Speed-up the MatchingWords highlighting struct 2021-03-03 15:45:03 +01:00
Kerollmops
8d710c5130
Introduce heed codecs to retrieve the length of roaring bitmaps 2021-02-18 14:30:47 +01:00
Clément Renault
f365de636f
Compute and write the word-prefix-docids database 2021-02-17 11:12:38 +01:00
Clément Renault
fecf3d6fc1
Move the command lines helpers into different crates 2021-02-14 18:55:15 +01:00
Clément Renault
e8639517da
Change the project to become a workspace with milli as a default-member 2021-02-12 16:15:09 +01:00