Update the README

2024-11-26 12:05:05 +08:00 · 2020-06-28 12:40:08 +02:00 · 2020-06-28 12:40:08 +02:00 · 8453828a65
commit 8453828a65
parent 63cbeca64e
1 changed files with 59 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,61 @@
 # mega-mini-indexer
 A prototype of concurrent indexing, only contains postings ids
+
+## Introduction
+
+This engine is a prototype; do not use it in production.
+This is one of the most advanced search engines I have worked on.
+It currently only supports the proximity criterion.
+
+### Compile all the binaries
+
+```bash
+cargo build --release --bins
+```
+
+## Indexing
+
+It can index large numbers of documents quickly. So far I have managed to index:
+ - 109m songs (song and artist name) in 21min, taking 29GB on disk.
+ - 12m cities (name, timezone and country ID) in 3min13s, taking 3.3GB on disk.
+
+All of that on a $39/month machine with 4 cores.
+
+### Index your documents
+
+You first need to split your CSV yourself, as the engine cannot currently split it on its own.
+The bigger the split size, the faster the engine will index your documents, but the higher the RAM usage will be.
+
+Here we use [the awesome xsv tool](https://github.com/BurntSushi/xsv) to split our big dataset.
+
+```bash
+cat my-data.csv | xsv split -s 2000000 my-data-split/
+```
+
+Once your data is ready you can feed it to the engine, which will spawn one thread per CSV part, up to the number of cores.
+
+```bash
+./target/release/indexer --db my-data.mmdb ../my-data-split/*
+```
+
+## Querying
+
+The engine is designed to handle very frequent words just as well as words of any other frequency.
+This is why you can search for "asia dubai" (the most common timezone) in the countries dataset in no time (59ms), even with 12m documents.
+
+We haven't modified the algorithm to handle queries that are scattered over multiple attributes; this is an open issue (#4).
+
+### Exposing a website to request the database
+
+Once you've indexed the dataset you will be able to access it with your browser.
+
+```bash
+./target/release/serve -l 0.0.0.0:8700 --db my-data.mmdb
+```
+
+## Gaps
+
+There are many ways to make the engine search for too long and consume too much CPU.
+For example, this can be achieved by querying the engine for "the best of the do" on the songs and subreddits datasets.
+
+There are plenty of ways to improve the algorithms, and there are and will be new issues explaining potential improvements.
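
The Indexing section of this README says the indexer spawns one thread per CSV part, capped at the number of cores. The following is a minimal Rust sketch of that rule, not the indexer's actual code: `worker_count`, `index_parts` and the `index_part` callback are hypothetical names introduced only for illustration, and the shared queue is an assumption about how parts might be handed out.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Number of worker threads: one per CSV part, capped at the core count.
fn worker_count(parts: usize, cores: usize) -> usize {
    parts.min(cores.max(1))
}

/// Hypothetical driver: calls `index_part` once for every CSV part, using at
/// most `cores` threads, and returns how many parts were processed.
fn index_parts<F>(parts: Vec<String>, cores: usize, index_part: F) -> usize
where
    F: Fn(&str) + Sync,
{
    let workers = worker_count(parts.len(), cores);
    let queue = Mutex::new(VecDeque::from(parts));
    let done = Mutex::new(0usize);

    // Scoped threads may borrow `queue`, `done` and `index_part` directly.
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Take the next part; the lock is released before indexing starts.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(path) => {
                        index_part(path.as_str());
                        *done.lock().unwrap() += 1;
                    }
                    None => break,
                }
            });
        }
    });

    done.into_inner().unwrap()
}
```

In a real program the `cores` argument could come from `std::thread::available_parallelism()`. With 10 parts on a 4-core machine this sketch starts 4 workers that drain the queue; with 3 parts it starts only 3.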