Inverted_index

反向索引

Elasticsearch uses a structure called an inverted index which is designedto allow very fast full text searches. An inverted index consists of a listof all the unique words that appear in any document, and for each word, a listof the documents in which it appears.

For example, let"s say we have two documents, each with a content fieldcontaining:

  1. ``The quick brown fox jumped over the lazy dog""
  2. ``Quick brown foxes leap over lazy dogs in summer""

To create an inverted index, we first split the content field of eachdocument into separate words (which we call terms or tokens), create asorted list of all the unique terms, then list in which document each termappears. The result looks something like this:

Term      Doc_1  Doc_2-------------------------Quick   |       |  XThe     |   X   |brown   |   X   |  Xdog     |   X   |dogs    |       |  Xfox     |   X   |foxes   |       |  Xin      |       |  Xjumped  |   X   |lazy    |   X   |  Xleap    |       |  Xover    |   X   |  Xquick   |   X   |summer  |       |  Xthe     |   X   |------------------------

Now, if we want to search for "quick brown" we just need to find thedocuments in which each term appears:

Term      Doc_1  Doc_2-------------------------brown   |   X   |  Xquick   |   X   |------------------------Total   |   2   |  1

Both documents match, but the first document has more matches than the second.If we apply a naive similarity algorithm which just counts the number ofmatching terms, then we can say that the first document is a better match --is more relevant to our query -- than the second document.

But there are a few problems with our current inverted index:

  1. "Quick" and "quick" appear as separate terms, while the user probablythinks of them as the same word.

  2. "fox" and "foxes" are pretty similar, as are "dog" and "dogs"-- they share the same root word.

  3. "jumped" and "leap", while not from the same root word, are similarin meaning -- they are synonyms.

With the above index, a search for "+Quick +fox" wouldn"t match anydocuments. (Remember, a preceding + means that the word must be present).Both the term "Quick" and the term "fox" have to be in the same documentin order to satisfy the query, but the first doc contains "quick fox" andthe second doc contains "Quick foxes".

Our user could reasonably expect both documents to match the query. We can dobetter.

If we normalize the terms into a standard format, then we can find documentsthat contain terms that are not exactly the same as the user requested, butare similar enough to still be relevant. For instance:

  1. "Quick" can be lowercased to become "quick".

  2. "foxes" can be stemmed -- reduced to its root form -- tobecome "fox". Similarly "dogs" could be stemmed to "dog".

  3. "jumped" and "leap" are synonyms and can be indexed as just thesingle term "jump".

Now the index looks like this:

Term      Doc_1  Doc_2-------------------------brown   |   X   |  Xdog     |   X   |  Xfox     |   X   |  Xin      |       |  Xjump    |   X   |  Xlazy    |   X   |  Xover    |   X   |  Xquick   |   X   |  Xsummer  |       |  Xthe     |   X   |  X------------------------

But we"re not there yet. Our search for "+Quick +fox" would still fail,because we no longer have the exact term "Quick" in our index. However, ifwe apply the same normalization rules that we used on the content field toour query string, it would become a query for "+quick +fox", which wouldmatch both documents!

IMPORTANT: This is very important. You can only find terms that actually exist in yourindex, so: both the indexed text and and query string must be normalizedinto the same form.

This process of tokenization and normalization is called analysis, which wediscuss in the next section.

文章导航