meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2025-02-20 17:45:54 +08:00

meili-bors[bot] 33b7c574ea

4090: Diff indexing r=ManyTheFish a=ManyTheFish

This pull request aims to reduce the indexing time by computing the difference between the data added to the index and the data removed from the index before writing to LMDB.

## Why focus on reducing the writes to LMDB?

The indexing in Meilisearch is split into 3 main phases:
1) The computing or the extraction of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)

see below:
![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)


Because the writing is mono-threaded, it is a bottleneck in the indexing process. Reducing the number of writes to LMDB relieves the pressure on the main thread and should reduce the overall time spent on indexing.
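
To illustrate why a difference helps, here is a minimal sketch, not the code of this pull request. It assumes a hypothetical `FakeDb` in which a `HashMap` stands in for an LMDB database and a `BTreeSet<u32>` stands in for a set of document ids. Deletions and additions for a key are applied in one pass, and the write is skipped when the stored set would not change.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for an LMDB database mapping a word to document ids.
#[derive(Default)]
struct FakeDb {
    entries: HashMap<String, BTreeSet<u32>>,
    /// Number of writes performed, counted to make the saving visible.
    writes: usize,
}

impl FakeDb {
    /// Applies the deletions and the additions for one key in a single pass
    /// and skips the write when the stored set would not change.
    fn apply_diff(&mut self, key: &str, del: &BTreeSet<u32>, add: &BTreeSet<u32>) {
        let current = self.entries.get(key).cloned().unwrap_or_default();

        // updated = (current - del) | add
        let mut updated: BTreeSet<u32> = current.difference(del).copied().collect();
        updated.extend(add.iter().copied());

        if updated == current {
            return; // nothing changed: no write needed
        }
        if updated.is_empty() {
            self.entries.remove(key);
        } else {
            self.entries.insert(key.to_string(), updated);
        }
        self.writes += 1;
    }
}
```

In this sketch, updating a document that still contains the same word produces identical `del` and `add` sets for that word, so `apply_diff` returns without writing anything.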

## Give Feedback

We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.

## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the indexing time when only the filterable/sortable fields of documents change, for example:
  - Updating the number of "likes" or "stars" of a song or a movie
  - Updating the "stock count" or the "price" of a product

b) Aims to reduce the time spent on writing to LMDB, which should reduce the global indexing time on highly multi-threaded machines by reducing the writing bottleneck.

c) Aims to reduce the average time spent deleting documents without having to keep the soft-deleted documents implementation

- [x] Create a preprocessing function that creates the diff-based documents chunk (`OBKV<fid, OBKV<AddDel, value>>`)
  - [x] and clearly separate the faceted fields and the searchable fields in two different chunks
- Change the parameters of the input extractors to take an `OBKV<fid, OBKV<AddDel, value>>` instead of `OBKV<fid, value>`.
  - [x] extract_docid_word_positions
  - [x] extract_geo_points
  - [x] extract_vector_points
  - [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chunks
  - [x] extract_fid_word_count_docids
  - [x] extract_word_pair_proximity_docids
  - [x] extract_word_position_docids
  - [x] extract_word_docids
- Adapt the facet extractors to the new diff-chunks
  - [x] extract_facet_number_docids
  - [x] extract_facet_string_docids
  - [x] extract_fid_docid_facet_values
  - [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️ ⚠️ 
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
  - [x] remove `new_documents_ids` and `replaced_documents_ids` entirely
  - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
  - [x] Remove deletion pipeline after autobatcher
  - [x] remove autobatcher deletion pipeline
    - [x] everything uses `IndexOperation::DocumentOperation`
    - [x] repair deletion by internal id for delete by filter
    - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.  
- [x] Remove soft-deleted documents
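
As a rough illustration of the `OBKV<fid, OBKV<AddDel, value>>` shape, here is a minimal sketch under stated assumptions. It uses a `BTreeMap` instead of the real OBKV format, `String` values instead of raw bytes, and hypothetical names (`FieldId`, `DelAdd`, `build_del_add`) that are not taken from the codebase. Each field carries the value to delete and the value to add, so an addition, a replacement and a deletion can all go through the same pipeline.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a field id.
type FieldId = u16;

/// For one field: the value to remove and/or the value to add.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct DelAdd {
    del: Option<String>,
    add: Option<String>,
}

/// Builds a diff-based document from the old and the new version.
/// `old == None` is a pure addition, `new == None` is a pure deletion.
fn build_del_add(
    old: Option<&BTreeMap<FieldId, String>>,
    new: Option<&BTreeMap<FieldId, String>>,
) -> BTreeMap<FieldId, DelAdd> {
    let mut diff: BTreeMap<FieldId, DelAdd> = BTreeMap::new();
    // Every field of the old version goes on the deletion side.
    for (fid, value) in old.into_iter().flatten() {
        diff.entry(*fid).or_default().del = Some(value.clone());
    }
    // Every field of the new version goes on the addition side.
    for (fid, value) in new.into_iter().flatten() {
        diff.entry(*fid).or_default().add = Some(value.clone());
    }
    diff
}
```

In this sketch, unchanged fields still appear with identical `del` and `add` sides; an extractor can then compare both sides and decide what to emit.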

#### FIXME

- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization

### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification on machines of any size, from mono-threaded to highly multi-threaded.

- [ ] Make the preprocessing function only send the fields that changed to the extractors
- [ ] remove the `word_docids` and `exact_word_docids` databases and adapt the search (⚠️ could impact search performance)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact search performance)
- [ ] Adapt the prefix database extractors ⚠️ ⚠️
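
The first item of this list could look like the following minimal sketch. It is an assumption about a step that is not implemented yet, with hypothetical names (`FieldId`, `DelAdd`, `keep_changed_fields`) and `String` values standing in for the real encoded values: a field is forwarded to the extractors only when its deletion side differs from its addition side.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a field id.
type FieldId = u16;

/// For one field: the value to remove and/or the value to add.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DelAdd {
    del: Option<String>,
    add: Option<String>,
}

/// Keeps only the fields whose value actually changed,
/// so that unchanged fields never reach the extractors.
fn keep_changed_fields(diff: BTreeMap<FieldId, DelAdd>) -> BTreeMap<FieldId, DelAdd> {
    diff.into_iter()
        .filter(|(_, entry)| entry.del != entry.add)
        .collect()
}
```

With such a filter, updating only a "likes" counter would send a single field to the extractors instead of the whole document.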

## Technical Concerns
- The part 1 implementation could increase the indexing time on the smallest machines (with few threads) by increasing the extraction time (multi-threaded) more than it reduces the writing time (mono-threaded)
- The part 2 implementation needs to change the databases, which could have a significant impact on search performance
- The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>

2023-11-21 09:44:38 +00:00

.github

Add the benchmark name to the bot message

2023-11-15 13:56:54 +01:00

assets

Introduce a PROFILING.md tutorial to profile Meilisearch

2023-07-18 17:38:13 +02:00

benchmarks

Use more efficient method for deletion in benchmarks

2023-11-09 16:13:15 +01:00

dump

Remove unused snapshots

2023-10-31 10:12:49 +01:00

file-store

Upgrade the compatible versions of the dependencies

2023-04-24 17:50:52 +02:00

filter-parser

Use the unescaper crate to unescape any char sequence

2023-09-06 13:59:45 +02:00

flatten-serde-json

Update criterion to 0.5.1 to remove the atty dependency

2023-07-03 18:51:42 +02:00

fuzzers

upgrade fastrand = "2.0.0"

2023-08-10 18:09:02 +02:00

index-scheduler

Merge #4090

2023-11-21 09:44:38 +00:00

json-depth-checker

Update criterion to 0.5.1 to remove the atty dependency

2023-07-03 18:51:42 +02:00

meili-snap

enable the multi-snapshot attribute in insta. This will let us use insta in loops

2023-08-08 16:28:38 +02:00

meilisearch

Slow the logging down

2023-11-01 13:49:32 +01:00

meilisearch-auth

implement the snapshots on demand

2023-09-11 12:35:57 +02:00

meilisearch-types

Remove soft-deleted related methods from Index

2023-10-30 11:41:22 +01:00

milli

Make into_del_add_obkv parameters more human readable

2023-11-20 16:10:39 +01:00

permissive-json-pointer

Refactor empty arrays/objects should return empty instead of null

2023-09-11 15:56:15 +03:00

.dockerignore

Revert "Improve docker cache"

2023-05-25 11:48:26 +02:00

.gitignore

edit gitignore to ignore .idea and .vscode folders

2023-02-10 11:42:19 +04:00

.rustfmt.toml

Introduce a rustfmt file

2022-10-27 11:35:05 +02:00

bors.toml

Remove macos-latest and windows-latest usages

2022-12-20 11:10:09 +01:00

Cargo.lock

Cleanup TOML

2023-11-01 14:03:04 +01:00

Cargo.toml

Update version for the next release (v1.4.1) in Cargo.toml

2023-10-10 09:01:45 +00:00

CODE_OF_CONDUCT.md

Create CODE_OF_CONDUCT.md

2020-04-30 20:16:02 +02:00

config.toml

Merge branch 'main' into tmp-release-v1.2.0

2023-06-05 18:36:28 +02:00

CONTRIBUTING.md

Update links of the docs

2023-05-03 19:14:57 +02:00

Cross.toml

Cross build with action-rs

2021-10-10 02:21:30 +08:00

Dockerfile

Revert "Improve docker cache"

2023-05-25 11:48:26 +02:00

download-latest.sh

Update links of the docs

2023-05-03 19:14:57 +02:00

LICENSE

Update LICENSE

2022-02-15 15:54:45 +01:00

PROFILING.md

Update the PROFILING.md file

2023-10-13 13:11:30 +02:00

README.md

Update README.md

2023-11-02 17:40:18 +01:00

SECURITY.md

docs(security): Fix Supported

2022-05-31 14:21:34 -05:00

README.md

Website | Roadmap | Meilisearch Cloud | Blog | Documentation | FAQ | Discord

⚡ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

🔥 Try it! 🔥

✨ Features

Search-as-you-type: find search results in less than 50 milliseconds
Typo tolerance: get relevant matches even when queries contain typos and misspellings
Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
Sorting: sort results based on price, date, or pretty much anything else your users need
Synonym support: configure synonyms to include more relevant content in your search results
Geosearch: filter and sort documents based on geographic data
Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
Security management: control which users can access what data with API keys that allow fine-grained permissions handling
Multi-Tenancy: personalize search results for any number of application tenants
Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.

You may also want to check out Meilisearch 101 for an introduction to some of Meilisearch's most popular features.

⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

⚙️ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at privacy@meilisearch.com. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

For feature requests, please visit our product repository
Found a bug? Open an issue!
Want to be part of our Discord community? Join us!

Thank you for your support!

👩‍💻 Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, crates in this repository are not currently available on crates.io and do not follow SemVer conventions.