Commit Graph

445 Commits

Author SHA1 Message Date
Tamo
3b6544db6d Implement the experimental log mode cli flag 2024-02-13 18:09:15 +01:00
Louis Dureuil
ef994d84d0
Change error messages and fix tests 2024-02-08 15:04:06 +01:00
Tamo
f70a615ed9
update the github discussion links 2024-02-08 15:04:05 +01:00
Tamo
7ff722b72e
get rids of the log dependencies everywhere 2024-02-08 15:04:05 +01:00
Tamo
e23ec4886d
fix the tests and add tests on the experimental features 2024-02-08 15:04:03 +01:00
Tamo
7793ba67a4
hide the route logs behind a feature flag 2024-02-08 15:03:33 +01:00
Louis Dureuil
02e6c8a440
Add tracing to index-scheduler 2024-02-08 15:03:31 +01:00
Louis Dureuil
05edd85d75
Stabilize scoreDetails 2024-02-06 11:15:19 +01:00
meili-bors[bot]
1ccde9bf0b
Merge #4316
4316: Autobatch the task deletions r=curquiza a=irevoire

# Pull Request

## Related issue
Fix part of https://github.com/meilisearch/meilisearch-support/issues/69
Fix #4315 

## What does this PR do?
- Autobatch the task deletions

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-01-15 17:54:50 +00:00
Tamo
b4d7d80ad9
autobatch the task deletions 2024-01-11 14:58:07 +01:00
Louis Dureuil
97bb1ff9e2
Move currently_updating_index to IndexMapper 2024-01-09 15:37:27 +01:00
meili-bors[bot]
658ec6e0a4
Merge #4279
4279: Check experimental feature on setting update query rather than in the task. r=ManyTheFish a=dureuill

Improve the UX by checking for the vector store feature and returning an error synchronously when sending a setting update, rather than in the indexing task.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-12-22 11:36:12 +00:00
Louis Dureuil
ee54d3171e
Check experimental feature at query time 2023-12-21 15:26:12 +01:00
Clément Renault
fa2b96b9a5
Add an Authorization Header along with the webhook calls 2023-12-19 12:18:45 +01:00
Tamo
4fb25b8782
fix clippy 2023-12-19 10:35:51 +01:00
Tamo
c83a33017e
stream and chunk the data 2023-12-19 10:35:51 +01:00
Tamo
be72326c0a
gzip the tasks 2023-12-19 10:35:51 +01:00
Tamo
0b2fff27f2
update and fix the test 2023-12-19 10:35:51 +01:00
Tamo
3adbc2b942
return a task view instead of a task 2023-12-19 10:35:51 +01:00
Tamo
fbea721378
add a first working test with actixweb 2023-12-19 10:35:51 +01:00
Tamo
d78ad51082
Implement the webhook 2023-12-19 10:35:50 +01:00
Many the fish
9e1b458010
Merge branch 'main' into change-proximity-precision-settings 2023-12-18 09:08:47 +01:00
Louis Dureuil
e0cc775dc4
Various changes
- DistributionShift in Search object (to be set from model in embed?)
- Fix issue where embedder index wasn't computed at search time
- Accept as default embedder either the "default" one, or the only embedder when there is only one
2023-12-14 16:08:41 +01:00
Louis Dureuil
922a640188
WIP multi embedders
fixed template bugs
2023-12-14 16:08:41 +01:00
Louis Dureuil
abbe131084
Cosmetic change 2023-12-14 16:08:41 +01:00
Louis Dureuil
13c2c6c16b
Small commit to add hybrid search and autoembedding 2023-12-14 16:07:48 +01:00
ManyTheFish
35e1981488 Remove proximityPrecision form the experimental feature 2023-12-14 15:52:42 +01:00
Clément Renault
7e259cb0d2
Expose the --max-number-of-batched-tasks argument 2023-12-11 16:08:39 +01:00
ManyTheFish
1f4fc9c229 Make the feature experimental 2023-12-06 15:49:05 +01:00
Clément Renault
ec9b52d608
Rename copy_to_path to copy_to_file 2023-11-28 14:32:30 +01:00
Clément Renault
34c67ac389
Remove the possibility to fail fetching the env info 2023-11-28 14:31:23 +01:00
Clément Renault
0dbf1a16ff
Make clippy happy 2023-11-23 14:11:38 +01:00
Clément Renault
462b4c0080
Fix the tests 2023-11-23 12:07:35 +01:00
Clément Renault
0d4482625a
Make the changes to use heed v0.20-alpha.6 2023-11-23 11:43:58 +01:00
Clément Renault
7cb7e37ba8
Merge branch 'main' into tmp-release-v1.5.0 2023-11-21 16:30:46 +01:00
meili-bors[bot]
33b7c574ea
Merge #4090
4090: Diff indexing r=ManyTheFish a=ManyTheFish

This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing in LMDB.

## Why focus on reducing the writings in LMDB?

The indexing in Meilisearch is split into 3 main phases:
1) The computing or the extraction of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)

see below:
![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)


Because the writing is mono-threaded, it represents a bottleneck in the indexing, reducing the number of writes in LMDB will reduce the pressure on the main thread and should reduce the global time spent on the indexing.

## Give Feedback

We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.

## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the time spent on indexing only the filterable/sortable fields of documents, for example:
  - Updating the number of "likes" or "stars" of a song or a movie
  - Updating the "stock count" or the "price" of a product

b) Aims to reduce the time spent on writing in LMDB which should reduce the global indexing time for the highly multi-threaded machines by reducing the writing bottleneck.

c) Aims to reduce the average time spent to delete documents without having to keep the soft-deleted documents implementation

- [x] Create a preprocessing function that creates the diff-based documents chuck (`OBKV<fid, OBKV<AddDel, value>>`)
  - [x] and clearly separate the faceted fields and the searchable fields in two different chunks
- Change the parameters of the input extractor by taking an `OBKV<fid, OBKV<AddDel, value>>` instead of  `OBKV<fid, value>`.
  - [x] extract_docid_word_positions
  - [x] extract_geo_points
  - [x] extract_vector_points
  - [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chucks
  - [x] extract_fid_word_count_docids
  - [x] extract_word_pair_proximity_docids
  - [x] extract_word_position_docids
  - [x] extract_word_docids
- Adapt the facet extractors to the new diff-chucks
  - [x] extract_facet_number_docids
  - [x] extract_facet_string_docids
  - [x] extract_fid_docid_facet_values
  - [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️ ⚠️ 
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
  - [x] remove `new_documents_ids` entirely and `replaced_documents_ids`
  - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
  - [x] Remove deletion pipeline after autobatcher
  - [x] remove autobatcher deletion pipeline
    - [x] everything uses `IndexOperation::DocumentOperation`
    - [x] repair deletion by internal id for filter by delete
    - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.  
- [x] Remove soft-deleted documents

#### FIXME

- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization

### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification on any size of machine from the mono-threaded one to the highly multi-threaded one.

- [ ] Make the preprocessing function only send the fields that changed to the extractors
- [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] Adapt the prefix database extractors ⚠️ ⚠️

## Technical Concerns
- The part 1 implementation could increase the indexing time for the smallest machines (with few threads) by increasing the extracting time (multi-threaded) more than the writing time (mono-threaded)
- The part 2 implementation needs to change the databases which could have a significant impact on the search performances
- The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-21 09:44:38 +00:00
Tamo
5b57fbab08 makes the dump cancellable 2023-11-14 11:23:13 +01:00
Louis Dureuil
a2d6dc8571
Fix typo, remove caching for the change of index 2023-11-13 10:44:36 +01:00
Louis Dureuil
492fc086f0
cargo fmt 2023-11-12 21:53:11 +01:00
Louis Dureuil
a2d0c73b41
Save the currently updating index so that the search can access it at all times 2023-11-10 10:52:03 +01:00
Louis Dureuil
f8289cd974
Use it from delete-by-filter 2023-11-09 14:23:15 +01:00
Louis Dureuil
ef6fa10f7a
Remove IndexOperation::DocumentDeletion 2023-11-06 12:16:15 +01:00
Louis Dureuil
cbaa54cafd
Fix clippy issues 2023-11-06 11:19:31 +01:00
Clément Renault
e507ef5932
Slow the logging down 2023-11-01 13:49:32 +01:00
Clément Renault
13416ccbf7
Introduce a new meilitool to help the cloud team 2023-10-30 14:30:20 +01:00
Clément Renault
dfab6293c9
Use an LMDB database to store the external documents ids 2023-10-30 11:41:23 +01:00
Louis Dureuil
652ac3052d
use new iterator in batch 2023-10-30 11:41:22 +01:00
Louis Dureuil
c534a1b687
Stop using delete documents pipeline in batch runner 2023-10-30 11:41:22 +01:00
Louis Dureuil
cf8dad1ca0
index_scheduler.features() is no longer fallible 2023-10-23 10:38:56 +02:00
bwbonanno
dd619913da Use RwLock to never persist cli state to db 2023-10-19 12:45:57 -07:00