4655: Remove `exportPuffinReport` experimental feature r=Kerollmops a=Kerollmops
This PR fixes#4605 by removing every trace of Puffin. Puffin is a great tool, but we use a better approach to measuring performance.
Co-authored-by: Clément Renault <clement@meilisearch.com>
4646: Reduce `Transform`'s disk usage r=Kerollmops a=Kerollmops
This PR implements what is described in #4485. It reduces the number of disk writes and disk usage.
Co-authored-by: Clément Renault <clement@meilisearch.com>
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill
# Pull Request
## Related issue
Fixes#4606
## What does this PR do?
[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)
- Extends the shape of the special `_vectors` field in documents.
- previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings.
- In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
2. `userProvided`: a boolean indicating if the vector was provided by the user.
- The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
```json
{
"embeddings": embedder_or_array_of_embedders,
"userProvided": true
}
```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.
### Tests
This PR adds the following tests
- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
4644: Revert "Stream documents" and keep heed+arroy to the latest verion r=Kerollmops a=irevoire
Reverts meilisearch/meilisearch#4544
Fixes https://github.com/meilisearch/meilisearch/issues/4641
I didn’t realize that some http clients were not handling chunked http requests like you would expect (if you ask the body, it gives you the body), which made the previous PR breaking.
There is no way to provide a good fix to the issue we initially wanted to fix without breaking meilisearch and that’s not planned for now.
Co-authored-by: Tamo <irevoire@protonmail.ch>
Co-authored-by: Tamo <tamo@meilisearch.com>
4544: Stream documents r=curquiza a=irevoire
# Pull Request
## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4383
### Perf
2M hackernews:
main:
Time to retrieve: 7s
RAM consumption: 2+GiB
stream:
Time to retrieve: 4.7s
RAM consumption: Too small
Co-authored-by: Tamo <tamo@meilisearch.com>