Chinese search optimization with Elasticsearch

    February 24, 2022

    If you have played with Elasticsearch, you already know that analysis and tokenization are the most important steps while indexing content; without them, your relevancy will be poor, your users unhappy and your results badly sorted.

    Even with English content you can lose relevance with bad stemming, miss some documents when not performing proper elision, and so on. And it's worse if you are indexing another language: the default analyzers are not all-purpose.

    When dealing with Chinese documents, everything is even more complex, even if we consider only Mandarin, which is the official language in China and the most spoken language worldwide. Let's dig into Chinese content tokenization and expose the best ways of doing it with Elasticsearch.

    Chinese characters are logograms: each one represents a word or a morpheme (the smallest meaningful unit of language). Put together, their meaning can change and represent a whole new word. Another difficulty is that there is no space between words or sentences, making it very hard for a computer to know where a word starts or ends.

    There are tens of thousands of Chinese characters, even if, in practice, written Chinese only requires knowledge of three to four thousand of them. Let's see an example: the word “volcano” (火山) is in fact the combination of:

    • 火: fire
    • 山: mountain

    Our tokenizer must be clever enough to avoid separating those two logograms, because the meaning changes when they are separated.

    Another difficulty is the spelling variants used:

    • simplified Chinese: 书法;
    • traditional Chinese, more complex and richer: 書法;
    • and pinyin, a Romanized form of Mandarin: shū fǎ.

    Analyzing Chinese content

    At the time of this writing, here are the solutions available with Elasticsearch: the default chinese analyzer, the paoding plugin, the cjk analyzer, the smart chinese plugin, and the ICU plugin.

    These analyzers are very different, and we will compare how well they perform with a simple test word: 手机. It means “cell phone” and is composed of two logograms, which mean “hand” and “machine” respectively. The 机 logogram also appears in a lot of other words:

    • 机票: plane ticket
    • 机器人: robot
    • 机枪: machine gun
    • 机遇: opportunity

    Our tokenization must not split those logograms, because if I search for “Cell phone”, I do not want any documents about Rambo owning a machine gun and looking bad-ass.

    We are going to test our solutions with the great _analyze API:

    curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=paoding_analyzer1' -d '手机'
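
    If the analyzer is set up correctly, the response lists the tokens it produced. For our test word we would hope for a single token covering both logograms, roughly along these lines (this is the general shape of the ES 1.x _analyze response; the exact type and position values depend on the tokenizer):

    {
      "tokens": [
        { "token": "手机", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 }
      ]
    }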

    Also, did I mention this awesome cheat sheet for Elasticsearch yet?

    The default Chinese analyzer

    Already available on your Elasticsearch instance, this analyzer uses the ChineseTokenizer class of Lucene, which only separates all logograms into tokens. So we are getting two tokens: 手 and 机.

    The Elasticsearch standard analyzer produces the exact same output. For this reason, the chinese analyzer is deprecated and soon to be replaced by standard, and you should avoid it.
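
    You can check this yourself with the same _analyze call as before, pointing it at the built-in standard analyzer, which needs no plugin; it should return the two single-logogram tokens 手 and 机:

    curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=standard' -d '手机'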

    The paoding plugin

    Paoding is almost an industry standard and is known as an elegant solution. Sadly, the plugin for Elasticsearch is unmaintained and I only managed to make it work on version 1.0.1, after some modifications. Here is how to install it manually:

    git clone git@github.com:damienalexandre/elasticsearch-analysis-paoding.git /tmp/elasticsearch-analysis-paoding
    cd /tmp/elasticsearch-analysis-paoding
    mvn clean package
    sudo /usr/share/elasticsearch/bin/plugin -url file:/tmp/elasticsearch-analysis-paoding/target/releases/elasticsearch-analysis-paoding-1.2.2.zip -install elasticsearch-analysis-paoding
    # Copy all the dic config files to the ES config path - make sure to set the permissions right, ES needs to write in /etc/elasticsearch/config/paoding!
    sudo cp -r config/paoding /etc/elasticsearch/config/

    After this clumsy installation process (to be done on all your nodes), we now have a new paoding tokenizer and two collectors: max_word_len and most_word. No analyzer is exposed by default so we have to declare a new one:

    PUT /chinese_test
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
          "tokenizer": {
            "paoding1": {
              "type": "paoding",
              "collector": "most_word"
            },
            "paoding2": {
              "type": "paoding",
              "collector": "max_word_len"
            }
          },
          "analyzer": {
            "paoding_analyzer1": {
              "type": "custom",
              "tokenizer": "paoding1",
              "filter": ["standard"]
            },
            "paoding_analyzer2": {
              "type": "custom",
              "tokenizer": "paoding2",
              "filter": ["standard"]
            }
          }
        }
      }
    }

    Both configurations provide good results, with a clean and unique token. Behavior is also very good with more complex sentences.
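
    For example, re-running the earlier _analyze call against each of the two declared analyzers lets you compare the collectors directly; on the longer word 元宵节, most_word keeps both 元宵 and 元宵节 while max_word_len keeps only 元宵节 (see the results table further down):

    curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=paoding_analyzer1' -d '元宵节'
    curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=paoding_analyzer2' -d '元宵节'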

    The cjk analyzer

    A very straightforward analyzer: it simply transforms any text into bi-grams. “Batman” becomes a list of meaningless tokens: Ba, at, tm, ma, an. For Asian languages, this tokenizer is a good and very simple solution, at the price of a bigger index and sometimes not perfectly relevant results.

    In our case, with a two-logogram word, only 手机 is indexed, which looks good, but with a longer word like 元宵节 (Lantern Festival), two tokens are generated: 元宵 and 宵节, meaning respectively Lantern and Xiao Festival.
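
    Since cjk is built into Elasticsearch, there is nothing to configure before trying it; the bi-grams above can be reproduced with a direct _analyze call:

    curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=cjk' -d '元宵节'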

    The smart chinese plugin

    Very easy to install thanks to the guys at Elasticsearch maintaining it:

    bin/plugin -install elasticsearch/elasticsearch-analysis-smartcn/2.3.0

    It exposes a new smartcn analyzer, as well as the smartcn_tokenizer tokenizer, using the SmartChineseAnalyzer from Lucene.

    It uses a probabilistic approach based on a Hidden Markov Model and a large number of training texts to find an optimal segmentation into words. An embedded training dictionary is therefore already included, and it is quite good on common text – our example is properly tokenized.
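
    Once the plugin is installed, smartcn can be used like any other analyzer, for instance by assigning it to a field in a mapping. A minimal sketch, where the product type and title field are just placeholders:

    # "product" and "title" are hypothetical names for this example
    PUT /chinese_test/_mapping/product
    {
      "product": {
        "properties": {
          "title": {
            "type": "string",
            "analyzer": "smartcn"
          }
        }
      }
    }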

    The ICU plugin

    Another official plugin, which brings support for the “International Components for Unicode” (ICU) libraries to Elasticsearch.

    bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1

    This plugin is also recommended if you deal with any language other than English; I use it all the time for French content!

    It exposes an icu_tokenizer tokenizer that we will use, as well as a lot of great analysis tools like icu_normalizer, icu_folding, icu_collation, etc.

    It works with a dictionary for Chinese and Japanese texts, containing word-frequency information used to deduce logogram groups. On 手机, everything is fine and works as expected, but on 元宵节, two tokens are produced: 元宵 and 节 – that's because lantern and festival are more common than Lantern festival.
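
    The icu_tokenizer can be wrapped in a custom analyzer exactly like paoding above; a minimal sketch, with an arbitrary analyzer name and the optional icu_folding filter added:

    # the index and analyzer names below are arbitrary
    PUT /chinese_icu_test
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "chinese_icu": {
              "type": "custom",
              "tokenizer": "icu_tokenizer",
              "filter": ["icu_folding"]
            }
          }
        }
      }
    }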

    Results breakdown

    Analyzer              | 手机 (cell phone) | 元宵节 (Lantern festival) | 元宵節 (Lantern festival, traditional)
    chinese               | [手] [机]         | [元] [宵] [节]            | [元] [宵] [節]
    paoding most_word     | [手机]            | [元宵] [元宵节]           | [元宵] [節]
    paoding max_word_len  | [手机]            | [元宵节]                  | [元宵] [節]
    cjk                   | [手机]            | [元宵] [宵节]             | [元宵] [宵節]
    smartcn               | [手机]            | [元宵节]                  | [元宵] [節]
    icu_tokenizer         | [手机]            | [元宵] [节]               | [元宵節]

    These tests have been done with Elasticsearch 1.3.2 except for Paoding under ES 1.0.1.

    From my point of view, paoding and smartcn get the best results. The chinese tokenizer is very bad and the icu_tokenizer is a bit disappointing on 元宵节, but handles traditional Chinese very well.

    Support for traditional Chinese

    As stated in the introduction, you may have to deal with traditional Chinese, either in your documents or in users' search requests. You need a normalization step to translate those traditional inputs into simplified Chinese, because plugins like smartcn or paoding can't handle them correctly.

    You can do so from your application, or try to handle it inside Elasticsearch directly with the elasticsearch-analysis-stconvert plugin. It can convert words between traditional and simplified Chinese, in both directions. Sadly, you will have to compile it manually, much like the paoding plugin shown above.
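
    If you go the stconvert route, one way to use it is as a char filter that turns traditional characters into simplified ones before a Chinese tokenizer runs. This is only a sketch, assuming the plugin exposes an stconvert char filter with a convert_type option – check the plugin's README for the exact names of its filters and parameters:

    # "chinese_st_test", "t2s_char_filter" and "chinese_t2s" are arbitrary names;
    # "stconvert" / "convert_type" are assumed from the plugin's documentation
    PUT /chinese_st_test
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "t2s_char_filter": {
              "type": "stconvert",
              "convert_type": "t2s"
            }
          },
          "analyzer": {
            "chinese_t2s": {
              "type": "custom",
              "char_filter": ["t2s_char_filter"],
              "tokenizer": "smartcn_tokenizer"
            }
          }
        }
      }
    }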

    The last solution is to use cjk: if you can't tokenize the input correctly, you still have a good chance of catching the documents you need, and you can then improve relevancy with a signal based on the icu_tokenizer, which is quite good too.

    Going further with Chinese?

    There is no perfect one-size-fits-all solution for analyzing with Elasticsearch, regardless of the content you deal with, and that's true for Chinese as well. You have to compose and build your own analyzer with the information you get. For example, I'm going with cjk and smartcn tokenization on my search fields, using multi-fields and the multi-match query.
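
    As a sketch of that setup, the same text field can be indexed through multi-fields, once with cjk and once with smartcn, and then queried with a multi_match; the name field, the product type and the boost value below are just placeholders:

    # "product", "name" and the ^2 boost are hypothetical choices for this example
    PUT /chinese_test/_mapping/product
    {
      "product": {
        "properties": {
          "name": {
            "type": "string",
            "analyzer": "cjk",
            "fields": {
              "smartcn": {
                "type": "string",
                "analyzer": "smartcn"
              }
            }
          }
        }
      }
    }

    GET /chinese_test/_search
    {
      "query": {
        "multi_match": {
          "query": "手机",
          "fields": ["name", "name.smartcn^2"]
        }
      }
    }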

    To learn more about Chinese, I recommend Chineasy, which is a great way to get some basic reading skills! Learning such a rich language is not easy, and you should also read this article before going for it, just so you know what you're getting into! Happy coding!

    Translated from: https://www.sitepoint.com/efficient-chinese-search-elasticsearch/
