From e68e6056c3cf7218007ed7bde383719dace551b0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Cl=C3=A9ment=20Renault?= Date: Sun, 21 Oct 2018 18:21:04 +0200 Subject: [PATCH] doc: Add a deep dive in Pentium --- README.md | 20 +++++++++-- deep-dive.md | 70 ++++++++++++++++++++++++++++++++++++++ misc/doc-indexes.png | Bin 0 -> 5664 bytes src/rank/ranked_stream.rs | 1 + 4 files changed, 89 insertions(+), 2 deletions(-) create mode 100644 deep-dive.md create mode 100644 misc/doc-indexes.png diff --git a/README.md b/README.md index 361d183af..a7c408ad1 100644 --- a/README.md +++ b/README.md @@ -2,10 +2,26 @@ A search engine based on the [blog posts serie](https://blog.algolia.com/inside-the-algolia-engine-part-1-indexing-vs-search/) of the great Algolia company. +If you want to get involved in the project you can [read the deep dive](deep-dive.md). + This is a library, this means that binary are not part of this repository but since I'm still nice I have made some examples for you in the `examples/` folder. -## Usage + + +## Performance + +We ran some tests on remote machines and found that, on a server that costs $5/month with 1 vCPU and 1 GB of RAM, on the same index and with a simple query, we can handle: + +- nearly 190 users with an average response time of 90ms +- 150 users with an average response time of 70ms +- 100 users with an average response time of 45ms + +Network time is included in the measurements: the servers are located in Amsterdam and the tests are made between two different datacenters. + + + +## Usage and examples Pentium work with an index like most of the search engines. So to test the library you can create one by indexing a simple csv file. @@ -15,7 +31,7 @@ cargo build --release --example csv-indexer time ./target/release/examples/csv-indexer --stop-words misc/en.stopwords.txt misc/kaggle.csv ``` -The `en.stopwords.txt` file here is a simple file that contains one stop word by line (e.g. or, and...). 
+The `en.stopwords.txt` file here is a simple file that contains one stop word by line (e.g. or, and). Once the command finished indexing you will have 3 files that compose the index: - The `xxx.map` represent the fst map. diff --git a/deep-dive.md b/deep-dive.md new file mode 100644 index 000000000..a5d9618e5 --- /dev/null +++ b/deep-dive.md @@ -0,0 +1,70 @@ +# A deep dive in pentium + +On the 21st of October 2018. + +Pentium is a full-text search engine based on a finite state transducer named [fst](https://github.com/BurntSushi/fst) and a key-value store named [RocksDB](https://github.com/facebook/rocksdb). The goal of a search engine is to store data and to respond to queries as accurately and quickly as possible. To achieve this it must save the data as an [inverted index](https://en.wikipedia.org/wiki/Inverted_index). + + + +## What is an index? + +For pentium, an index is composed of a finite state transducer, a document indexes file and some key-values. + +### The finite state transducer + +This is the first entry point of the engine. You can read more about how it works in burntsushi's beautiful blog post [Index 1,600,000,000 Keys with Automata and Rust](https://blog.burntsushi.net/transducers/). + +In short, it is a powerful way to store all the words that are present in the indexed documents. You construct it by giving all the words you want to index, each associated with a value that, for the moment, can only be a `u64`. When you want to search in it you can provide any automaton you want; in pentium [a custom levenshtein automaton](https://github.com/tantivy-search/levenshtein-automata/) is used. + +Note that the number under each word is auto-incremental: each new word has a new number that is greater than the previous one. 
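As a rough illustration of the numbering described above, the sketch below assigns an auto-incremented `u64` to each distinct word. It uses only the standard library rather than the `fst` crate, and `number_words` is a hypothetical helper name that does not exist in pentium. It also assumes the numbers are handed out in sorted order, since the `fst` builder expects its keys in lexicographic order.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: gives every distinct word an auto-incremented `u64`.
/// The `BTreeSet` both removes duplicates and yields the words in
/// lexicographic order, which is the order an fst builder expects.
fn number_words(words: &[&str]) -> Vec<(String, u64)> {
    let sorted: BTreeSet<&str> = words.iter().copied().collect();
    sorted
        .into_iter()
        .enumerate()
        .map(|(number, word)| (word.to_string(), number as u64))
        .collect()
}
```

Each `(word, number)` pair produced this way could then be fed, in order, to whatever structure stores the word-to-number mapping.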
+ +Another powerful feature of `fst` is that it can almost entirely avoid using RAM and be streamed to disk. The drawback is that keys must always be added in lexicographic order, so you must sort them beforehand; for the moment pentium uses a [BTreeMap](https://github.com/Kerollmops/raptor-rs/blob/8abdb0a228e2808fe1814a6a0641a4b72d158579/src/metadata/doc_indexes.rs#L107-L112). + +### The document indexes + +As specified above, the `fst` can only store a number (a `u64`) under a word, but the goal of the search engine is to retrieve a match in a document when a query is made. You want it to return some sort of position in an attribute of a document: information about where the given word matches. + +To make this possible, a custom data structure has been developed: the document indexes are stored in a file. This file is composed of two arrays; the first holds a range (i.e. start and end) that gives a view of where to read all the [DocIndexes]() corresponding to this number/word. The data structure is pretty simple [to construct](https://github.com/Kerollmops/raptor-rs/blob/8abdb0a228e2808fe1814a6a0641a4b72d158579/src/metadata/doc_indexes.rs#L152-L200) and [to read](https://github.com/Kerollmops/raptor-rs/blob/8abdb0a228e2808fe1814a6a0641a4b72d158579/src/metadata/doc_indexes.rs#L48-L104). Another advantage is that the slices are accessible in `O(1)` when you know the number associated with the word. + +![doc-indexes](misc/doc-indexes.png) + +### The key-value file + +When the engine handles a query, the result that the requester wants is a document: not only the [match](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/lib.rs#L51-L79) associated with it, but also fields of the original document must be returned. + +So pentium is backed by a key-value store named [RocksDB](https://github.com/facebook/rocksdb). At index time, the key-values of the documents are stored (if marked to be stored) using a key structure of the form `{document id}-{field name}`. 
We wanted the index to be manipulable, and RocksDB has a [file format](https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files) that allows us to compute the index in advance. + +The SST file has the same disadvantage as the fst: it needs its keys to be ordered. + + + +## How is a query handled? + +Now that we have our index we are able to return results based on a query. In the pentium universe a query is a single string. + +As described above, the data structures are nested as follows: the fst is queried with an automaton, this automaton returns words each associated with a number, and this number gives us document indexes. We will not talk about the key-value store here. + +### Query lexemes + +The first step in querying the underlying structures is to split the query into words. For that we use a [custom tokenizer](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/tokenizer/mod.rs) that is not finished for the moment; [there is an open issue](https://github.com/Kerollmops/pentium/issues/3). Note that a tokenizer depends on a specific language, which makes this hard. + +### Automatons and query index + +To query the fst we need an automaton. In pentium we use a [levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton), which is constructed from a string and a maximum distance. Following [Algolia's blog post](https://blog.algolia.com/inside-the-algolia-engine-part-3-query-processing/#algolia%e2%80%99s-way-of-searching-for-alternatives) we [create the DFAs](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/automaton.rs#L62-L75) with different settings. 
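To make "a string and a maximum distance" concrete, here is a minimal sketch that accepts the same words a Levenshtein automaton would, but computes the classic dynamic-programming edit distance instead of building a DFA. The length thresholds in `max_distance` are illustrative assumptions, not the settings pentium actually uses, and all function names are hypothetical.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the first i-1 chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // cur[0] = i: deleting the first i chars of `a` gives the empty string.
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j] + 1) // deletion
                .min(cur[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Assumed rule for illustration only: longer query words tolerate more typos.
fn max_distance(query: &str) -> usize {
    match query.chars().count() {
        0..=4 => 0,
        5..=8 => 1,
        _ => 2,
    }
}

/// True when `candidate` is within the allowed distance of `query`,
/// i.e. when an automaton built from `query` would accept it.
fn accepts(query: &str, candidate: &str) -> bool {
    levenshtein(query, candidate) <= max_distance(query)
}
```

A real automaton avoids computing the distance against every indexed word: it is intersected with the fst so that only matching keys are visited, which is why the DFA approach is preferred over this brute-force check.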
+ +Thanks to the power of the fst library it is possible to union multiple automatons on the same index, which allows us to know which [automaton returns a word according to its index](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/metadata/ops.rs#L111). The `Stream` is able to return all the numbers associated with the words in the fst. + +We use the number to [find the whole list of `DocIndexes` associated](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/metadata/ops.rs#L129-L131) and [do a set operation](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/metadata/ops.rs#L135). For the moment, the only one that is used is the union of all the `DocIndexes` (all set operations are supported by `sdset`). This means that only positive indexes are supported, not negative ones. + +With all this information it is possible to reconstruct a list of all the [DocIndexes](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/lib.rs#L25-L40) associated with the words queried. + +### Sort by criteria + +Now that we are able to get a big list of `DocIndexes`, that alone is not enough to sort them by criteria: we need more information, like the levenshtein distance or whether the word matches exactly. So we enrich it a little, and [aggregate all these Matches for each document](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/rank/ranked_stream.rs#L55-L78). This way it will be easy to sort a simple vector of documents using a bunch of functions. + +With this big list of documents and associated matches we are able to sort only the part of the slice that we want using bucket sorting; [currently the algorithm is not optimal](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/rank/ranked_stream.rs#L84-L101). 
Each [criterion](https://github.com/Kerollmops/pentium/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/rank/criterion/mod.rs#L62-L72) is evaluated on each subslice without copy, thanks to [GroupByMut](https://github.com/Kerollmops/group-by/blob/master/src/lib.rs#L177-L182) which, I hope, [will soon be merged](https://github.com/rust-lang/rfcs/pull/2477). + + + +🎉 pentium work is over 🎉 + diff --git a/misc/doc-indexes.png b/misc/doc-indexes.png new file mode 100644 index 0000000000000000000000000000000000000000..2e6330a73dfa803f2c52787122f8c9d9227fe99d GIT binary patch literal 5664 zcmc&&cT`i`mxc&R5g#3q77#=P5~WFK22klmdM_G!2_(VL2{Zji{(V zdcbuo9SHaoS({7(zNpbg8fsLP1l~p9hTcch5=}+L$VPcnQ)S}M11gMeCKebAJza%+ zNH2&Z0(sX7;_u}HNK;WM`6~dIUQQTCu)mk5H(J48S?EMU0l20tLxsR6A{Y;4Aqzc2 zusRau1eS%!K&}d@Fo3~eB^1J0!RY4gztn*{Wg%A##zz4P#bU7#tTY6Pa)Dlh!C=s< zQcx)=NkBpp9pH^|^q2HTU;NX^KkVFeLf=EV`C#0T-e8Jd$Gb>BjIxjr#nC@sf5wS% zbN-hnZ}eZV0Dw@+4)hx2D)gVW0aYc+s)CUd8tLgrF>mVahEb7HIuZW+^1r|ZMXzV>%{1C*N+fRExv<=S8J|GxHjc|%tu1_^MDa=WMHjd4N&&i>N= zM}Ysc#9y|QpcK0PjlMr~bFvB~M}oc+=3eVaxoF*0#_;ta3>Z!f?N$%}TbmrWMAjoV?3Lm0K5az?KCBj(?4R3OWhKlO0T$j5HVmY*B&I7@5fuqD55y zk8Z->r5T!NWSFP)g+ z_I|zVnBRgqV29W6@wS;)$VbJYH(SP^7YVTB|H3DNUAF$<>pbBc{PA|XE8YVQM*7?5 zvWeyH-fkS5kv&hhZ zOmzduZXh(e%;L##B)ZeE=BroA0=rrcY`0ki>6B>KmqIGe19v~+lP9$gJI;o<>(db*xs{!!)QWDmI#3!5a^ zae^uA+c=SDm&d{P{Arqn0K_H&AI-t<|`8MbPz@yP;vfo}iO+kQo1jttC36t~~iw z`!hR06&cXoT)wxJUDa^3NqRGIv?h@DVp?1ka&>9qW5&IjXrI{!rW?1aKIkbL$PgD} z!8z!_KR60P5ze%ZBw@N5_MVJGVe`t;XXu=<3LWDF#y9~ZxWE7#%fMBc=V%7J0 z?VwSY4KI(irk@m3k~tS=nqffG^$OzB5u>7J8^ObN50d&I=^9ZsgtPHghoZl3h={ z4dIURij{VW`Rt6ad~2nUHC7Xm75Z}}A}gBB@`|BdWo3ifKu%RxoGG%k6htlS`#e)| zd{}Br!^`z+X+Mk9)q&bLnaaw+@Bv!HE+VVT1Wc_u@v)bxO-MgI_8fZTIcF zi+OU=`GgTbCfD)RVPUo2S)UqTH&$BLFPTF{_B!u+mni$)Y<~!nk`Q_0r{U1?v7?*V zy98BZ)2EL+_FYMQ*sJ&XLs1rl?T@>Ov=GlU#uxp5I$W3}4%6R1tq)O7&Ww*!r%F6H zNaqVPHonn-ZZ+_ zx2LQ_`nqW&py4@sIbUWIW!@T zi#K9=A5npBv6jSi9c|d`z3F?`)>=MWZ{90&ir()GyR98(>+j$Sm)%oeNYC`?Ne$jG 
z_mse2PCLD#1LV*HFAfGV`cM>8@&FN}9P~m5a>ur5i3M_gwA)%cXN*`Hc5K~p9E$FH zdV{m@U=yY2ie~W&-u5_tC%(aT>9O}>+9Xe z={I`kiu6XVIh5E`pCowJ)0-(*3LHMf?v=<$qkS< zv*xAbWx3iHB{YrIIR=Sm_Ss=QwJ#o}w+{GDZ@wPfM*PUajz+TV|JUq%A91z*!CI>T zt}bJ;J=}fQTe+wAQkYchZnnZKT?jtWri)i)Duk(^i;msCJ$glyjbvnvr7<{2JTA%W zM5ga*_p^{mtn+LJC6=#8n*;W754Q*-Ro2x3aCj1oTm4z@=sc*XKS!r{ryR0XbS6`a zjoBx$om$mn|=Y)Ps((Qh2~T5$Br z4eAVYu)V3N`#T!hfa)W)?;M&*qH}KJckb6v%iC}V+PObHCV+}bcmj@H|kCJ zEzL_uU& z^ML5ivt0K7nS3-DI`i$68Y}5&Z+!bli=H1{oC>7YF(z^UrGW+r)@AK{u4%{2)V9Jv zJni7RCqHuntzn$9nWR<6g5@8WOM2^ZaQC@%zqXd^TZV6{bqE_fi`*?+$S-bre3HE^ z<}~J=mK_eLJ4eWRwU@QA9`v0y>dJz;zKW=` zX;IqJRV>q)e8`&ca0+o%a@U4&zBhpRayjn!spEiZWl()IPgsUUqw}$BeJ9tkc<=zm za!l~*enYgS`BGWQOK#?c^38T+Ze_&#q{NCZd&l*!U5<}stMC#Z^B=2- z*c^%HL!@;e>%U>5SZ+UYSUAq!tof|{VGLqm>``Z5W~RN|oDE0JncIBgY`r_8sjn%9 z2g5b^hsGn4>?6q89$#fvX63>|4n}pLjTOVAT;WNqdo$I9R`rAKY#Y2=eU(QclC-|B zNM!RpXy272(JWZ_2DoD_TIEy`jT1D=}a8 z9Z&<0ioogm!X%i=qDAwMPl^}Pn(u@XJNNl6)s=%_v-l?}>D{$G0ta(lrQ>!ZgTmZ) z&J${5(p!1?w4`_r%Ndq<9n2WXH&M=JtKsX ztxtax>u;kb_e1s5i-Ww5%NF3raT4BQI?g@nW=?J;e#@iNSaI$)aGCn6Mq4=NZc>WA zT)@+Yv*oGQFHvEY)tflyYhgdA&b6wx-wwBEUfeG@{&>XHm!`3n$-}F7MgE7zk@|j* zghGf{KcncgD%&VR&gUj`J$wl#)XcbO;Zf4(tdP0@U4la!uD~~`JO>?jM2@j(PbtLD zZ_Fy)CmhA)ou)%@Xjabfo`NR97LDlteY1C?zmFy&Iv0_< zeV8{NM>kN5)!!xyudtgi#nzb*yK#IyT{IF{pt-|+6|R?AP<0CYd4Ff03wV6U+rx!y z4h8r(1z_@~P39uTC^P}9nbS_Mp@UE6$f^C{6a1|#wl7VwO|Y}vG424}SjfQk*coCQPzt$dP|~5iBks9U1Dx^Ua8O&KS&wXg zb&lR$hJdnjB7|CySSpg|!f01);jqQ0C|o*iL^xuA(=B(bK3!MPpxjE_xaaWw={}#- zg_Q#np>OkbJRUr|*Gix23dSo}8#*VN+h8*`4-gCi4|D|uCJZ>cbVFbD=E*{}YkSDi z?>w=B2pT$U-O8Gnced;3wI{TnfhCStpGhLZCRb2X`S}0wi3u6bl z?eZ=O)XkLcc6ui)18Us!KvW>7#EvUyp|_gx(nA6)-|?F`!8_cv_F~R!?Xgp2rz89s zg~UrKaqtoL&91_=uJ6ig0)ZB;zZW_V$`J7AvDZO&y<)?9+VSjmSvNHBeUm#YCwseax&stf5!*e@Hveh_DzJ}BIHN+Aq;!0lm8QFrr_4$qu*DV2Ai#}Ei+i2GFG$Q ziq`Pj6_3S6q6>!xv)o1UWfVKRRO#f^z#G1P&jVYOszhu39=~OPdC`=YnS2PcM`WfE!XfeBreI>3G 
zJMpw^WXh9cLkl=g8p2gZ*G0^mLj(VM85&?8}UatmQ9MK2enf#sreG{wKqLud={oU_3D8C#A~=`WOd3a@vLF+d!AR$(-kwev@RgZ5W4_!iQ~k*j6BkTCxUi9RK`433P@;nB z7hcPL7Ow?ArvZ2`=KvEjwnYdTG$bENQCk7MeCB`LZ^DvWJ7CE;-Z;Rug*aTrWjI~l zg*D+l3s9#Xv^`|qeb)cQdXDs>4qlqR=Nmu?t`?Lu_=-Q8`|d2BTRH9);9r6(AvkVF za^Nx>xqrb|;?J=O5FsG(uL%EUgM_{m67`Qp}4E`se{{w(< z&b{o0zef}>>zopWcHWlf1j^~94d6of+TUC-Jyptn&vH2u4h&0F3GcO-*9+i-*~-e3 zm9sL@^5(oQ+P#v@Q8!U*9sHETkPWh_?`L~5O<|43Mg)#Wn_}-+Qsw~RNlrJ49rgquHL6)irFU>m?xP|-*P%Ju?Y++=IOhBoK#vRE11yu zqN#+!AH`Kiqte?6zLS$+3Y-LUKdw{?*~~r0IQewZ^Gzqk+Mtcwk+db+j+vy%rd0UO tmjg9KN3ECbNs5d%q(7NZ&`2=vLGM^ET{>+&PD2^nYTeSmS*hj_@!tXV6z~85 literal 0 HcmV?d00001 diff --git a/src/rank/ranked_stream.rs b/src/rank/ranked_stream.rs index c0e39f36f..d7c6c2dee 100644 --- a/src/rank/ranked_stream.rs +++ b/src/rank/ranked_stream.rs @@ -81,6 +81,7 @@ impl<'m, C, F> RankedStream<'m, C, F> { impl<'a, C, F> RankedStream<'a, C, F> where C: Criterion { + // TODO don't sort to much documents, we can skip useless sorts pub fn retrieve_documents(mut self, range: Range) -> Vec { let mut documents = self.retrieve_all_documents(); let mut groups = vec![documents.as_mut_slice()];