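ElasticSearch Basics: Getting-Started Study Notes

Date: 2020-02-20 Ryan.Miao

Preface

These notes record my process of learning ElasticSearch from scratch, following the official documentation and running some tests of my own.

Installation

As a beginner, I simply followed the official installation guide. My earlier post, "ELK入门使用" (getting started with ELK), mostly covered installation, so I won't repeat it here.

Tutorial

I searched for a long time, and most of the material is fairly old. Even the official guide is written against 2.x, while the latest release on the official site has already reached 6. It is still good enough for the basics, so what follows tracks the official guide.

After installing ElasticSearch and Kibana, open localhost:5601 and select Dev Tools.

Indexing (storing) employee documents

The test data is a company's list of employee records. Each employee's record is called a document, and adding a record is called indexing a document.

Enter the following in the console: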

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

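megacorp is the index name
employee is the type name
1 is the document id, which is also the employee's id

Place the cursor on the first line and click the green button to run it.

This is a convenient shorthand for storing a document; underneath, it goes through the REST API that ES exposes. The same request can be sent with Postman or curl against the ES address, localhost:9200. Postman does not allow a body on GET requests; for requests that need one, POST can be used instead.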
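
As a rough sketch of what the console shortcut amounts to, the first request could be built with Python's standard library as shown here. The address localhost:9200 comes from the text; the helper name build_index_request is made up for illustration, and the request is only constructed, not sent.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # ES address mentioned in the text


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = f"{ES_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "megacorp",
    "employee",
    1,
    {
        "first_name": "John",
        "last_name": "Smith",
        "age": 25,
        "about": "I love to go rock climbing",
        "interests": ["sports", "music"],
    },
)
print(request.get_method(), request.full_url)
```

Against a running node, passing the object to urllib.request.urlopen(request) would send it.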

That stores one document in ES. Next, store a few more:

PUT /megacorp/employee/2
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}

PUT /megacorp/employee/3
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}

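Then we can look at the data: click Management, Index Patterns, Configure an index pattern, enter megacorp, and confirm.

Click Discover to see the documents we stored.

Retrieving a document

Now that the data is stored, we want to read it back. Fetch the employee whose id is 1:

GET /megacorp/employee/1

Response: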
{
  "_index": "megacorp",
  "_type": "employee",
  "_id": "1",
  "_version": 5,
  "found": true,
  "_source": {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": [
      "sports",
      "music"
    ]
  }
}

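The only difference from saving a record is the HTTP method.

PUT: add
GET: retrieve
DELETE: delete
HEAD: check whether the document exists
To update a document, PUT it again

Lightweight search

Besides findById, the most common operation is a conditional query.

First, list everything:

GET /megacorp/employee/_search

The number of records can also be checked with _count:

GET /megacorp/employee/_count

To find employees whose last_name is Smith:

GET /megacorp/employee/_search?q=last_name:Smith

Add a q parameter and query in the form field:value.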
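
Outside the console, the same query-string URL can be assembled programmatically. The following is a minimal sketch using only the standard library; the function name build_query_string_url is hypothetical, and urlencode percent-encodes the colon, which is equivalent on the wire.

```python
from urllib.parse import urlencode

ES_URL = "http://localhost:9200"  # ES address mentioned in the text


def build_query_string_url(index, doc_type, field, value):
    """Return the _search URL for a `field:value` query-string search."""
    params = urlencode({"q": f"{field}:{value}"})  # ':' becomes %3A
    return f"{ES_URL}/{index}/{doc_type}/_search?{params}"


url = build_query_string_url("megacorp", "employee", "last_name", "Smith")
print(url)
```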

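Query DSL

Query-string search is handy for ad hoc searches from the command line, but it has its limitations (see Lightweight search). Elasticsearch provides a rich, flexible query language called the query DSL, which supports building more complex and robust queries.

The domain-specific language (DSL) is specified as a JSON request body. We can rewrite the earlier search for all Smiths like this: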

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

More complex queries

Continue modifying the previous query:

GET /megacorp/employee/_search
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "smith" 
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            }
        }
    }
}

This adds a range filter requiring age to be greater than 30.

Result:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      }
    ]
  }
}
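
To make the must/filter combination concrete, here is a small in-memory sketch of the same logic over the three sample employees. It only illustrates the semantics and is not how ES evaluates the query: lowercasing and splitting on whitespace stand in for the real analyzer, and scoring is ignored (in ES, the filter clause narrows the result set without contributing to _score).

```python
EMPLOYEES = {
    "1": {"last_name": "Smith", "age": 25},
    "2": {"last_name": "Smith", "age": 32},
    "3": {"last_name": "Fir", "age": 35},
}


def matches_term(text, term):
    """Simplified match: the lowercased term is one of the lowercased words."""
    return term.lower() in text.lower().split()


def smiths_older_than(employees, min_age):
    """Ids of documents satisfying both the must and the filter clause."""
    return [
        doc_id
        for doc_id, doc in employees.items()
        if matches_term(doc["last_name"], "smith")  # "must": match last_name
        and doc["age"] > min_age  # "filter": range with gt
    ]


result = smiths_older_than(EMPLOYEES, 30)
print(result)
```

Only employee 2 satisfies both clauses, which agrees with the single hit in the response above.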

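Full-text search

The searches so far have been simple: a single name, filtered by age. Now let's try something slightly more advanced, full-text search, a task that traditional databases really struggle with.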

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

Result:

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.53484553,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.53484553,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.26742277,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      }
    ]
  }
}

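The hits are sorted by their relevance score, _score. Note that the document matching only one of the two words (rock) is returned as well, with a lower score. To match the exact phrase, use a different query type.

match_phrase matches the phrase exactly: all the terms must appear, adjacent and in order.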

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
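
The difference between the two query types can be sketched in plain Python. This is a simplification under stated assumptions: tokenization is just lowercasing and splitting on whitespace, relevance scoring is left out, and the function names mirror the query names purely for readability.

```python
ABOUT = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
    "3": "I like to build cabinets",
}


def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def match(text, query):
    """True if at least one query term occurs in the text (OR semantics)."""
    terms = tokenize(text)
    return any(term in terms for term in tokenize(query))


def match_phrase(text, query):
    """True if the query terms occur consecutively and in order."""
    terms, phrase = tokenize(text), tokenize(query)
    return any(
        terms[i:i + len(phrase)] == phrase
        for i in range(len(terms) - len(phrase) + 1)
    )


match_ids = [i for i, about in ABOUT.items() if match(about, "rock climbing")]
phrase_ids = [i for i, about in ABOUT.items() if match_phrase(about, "rock climbing")]
print(match_ids, phrase_ids)
```

With the sample data, match selects employees 1 and 2, while match_phrase selects only employee 1.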

When we search on Baidu, the matched keywords are highlighted; ES can also return the highlighted fragments:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        },
        "highlight": {
          "about": [
            "I love to go <em>rock</em> <em>climbing</em>"
          ]
        }
      }
    ]
  }
}
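
The shape of the highlight fragment can be imitated with a few lines of Python. This sketch is only meant to show what the <em> tags wrap; it is not ES's highlighter, and the whitespace-based word matching and the function name highlight are assumptions made for illustration.

```python
def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word of `text` that equals a query term in pre/post tags."""
    query_terms = set(query.lower().split())
    return " ".join(
        f"{pre}{word}{post}" if word.lower() in query_terms else word
        for word in text.split()
    )


highlighted = highlight("I love to go rock climbing", "rock climbing")
print(highlighted)
```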

Aggregations (GROUP BY)

SQL often involves statistical calculations such as sum, count, and avg. In ES it looks like this:

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

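aggs declares an aggregation, all_interests is the name under which the result is returned, and terms groups documents by the distinct values of a field and counts the documents in each group. This request therefore counts documents per value of interests. ES may then return: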

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "megacorp",
        "node": "iqHCjOUkSsWM2Hv6jT-xUQ",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    }
  },
  "status": 400
}

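The message says that aggregating on a text field requires enabling a setting, fielddata=true (or, alternatively, aggregating on a keyword field instead).

This means changing the index settings, which is comparable to altering a table schema in a relational database.

Modifying the index mapping

First, look at the current mapping:

GET /megacorp/employee/_mapping

Result: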

{
  "megacorp": {
    "mappings": {
      "employee": {
        "properties": {
          "about": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "age": {
            "type": "long"
          },
          "first_name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "interests": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "last_name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

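It is easy to see that this defines the type of each field. To solve the problem above, we need to add one setting to the interests field:

"fielddata": true

Update the mapping as follows: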


PUT /megacorp/employee/_mapping
{
  "properties": {
    "about": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "age": {
      "type": "long"
    },
    "first_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "interests": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      },
      "fielddata": true
    },
    "last_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

{
  "acknowledged": true
}

This means the update succeeded. We can now go back to the aggregation.
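
The error message also offers a second route: "Alternatively use a keyword field instead." The mapping shown earlier already defines a keyword sub-field under interests, so an aggregation aimed at interests.keyword should work without enabling fielddata. The sketch below merely builds such a request body as a Python dict; this variant was not tried in these notes, and the function name is hypothetical.

```python
import json


def interests_terms_aggregation(field="interests.keyword"):
    """Build a terms aggregation body that targets the keyword sub-field."""
    return {
        "size": 0,  # only the aggregation result is wanted, no hits
        "aggs": {"all_interests": {"terms": {"field": field}}},
    }


body = interests_terms_aggregation()
print(json.dumps(body, indent=2))
```

The printed JSON could be pasted under GET /megacorp/employee/_search in the console.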

Aggregation: GROUP BY with COUNT

The SQL equivalent would be roughly:

select interests, count(*) from index_xxx
where last_name = 'smith'
group by interests;

In ES the query looks like this:

GET /megacorp/employee/_search
{
  "_source": false,
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "size": 0,
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}

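"_source": false suppresses the stored fields of each hit; the default is true.

"size": 0 means no hits are returned. By default the top 10 hits come back; we don't need them, only the aggregation result. (With size set to 0 there are no hits at all, so "_source": false is redundant here.)

aggs declares an aggregation.

all_interests is a user-chosen name; any name will do.

terms buckets the documents by the values of the interests field and counts the documents in each bucket.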

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "all_interests": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "music",
          "doc_count": 2
        },
        {
          "key": "sports",
          "doc_count": 1
        }
      ]
    }
  }
}
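
For readers who think in terms of GROUP BY, the buckets above can be reproduced over the sample data with a Counter. This is a plain-Python illustration of what the terms aggregation computes, not of how ES computes it; the case-insensitive comparison on last_name is a stand-in for the analyzed match query.

```python
from collections import Counter

EMPLOYEES = [
    {"last_name": "Smith", "interests": ["sports", "music"]},
    {"last_name": "Smith", "interests": ["music"]},
    {"last_name": "Fir", "interests": ["forestry"]},
]


def count_interests(employees, last_name):
    """Count, per interest, the documents whose last_name matches."""
    counts = Counter()
    for employee in employees:
        if employee["last_name"].lower() == last_name.lower():
            # set() ensures each document is counted once per interest
            counts.update(set(employee["interests"]))
    return counts.most_common()  # highest doc_count first


buckets = count_interests(EMPLOYEES, "smith")
print(buckets)
```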

This gives the count for each value of the field. sum and avg can be computed in the same way:


GET /megacorp/employee/_search
{
    "_source": false, 
    "size": 0, 
    "aggs" : {
        "avg_age" : {
            "avg" : { "field" : "age" }
        },
        "sum_age" : {
            "sum" : { "field" : "age" }
        }
    }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "avg_age": {
      "value": 30.666666666666668
    },
    "sum_age": {
      "value": 92
    }
  }
}
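
The two values are easy to check by hand against the three sample ages; a short Python sanity check, with the ages copied from the documents indexed earlier:

```python
ages = [25, 32, 35]  # ages of employees 1, 2 and 3

sum_age = sum(ages)
avg_age = sum_age / len(ages)
print(sum_age, avg_age)
```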

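Summary

The above corresponds to the first part of the official guide, the basic introduction. I mostly copied the examples and ran them myself, without going beyond them, but added my own understanding along the way. It is enough to get a basic idea of how to use ES; more detailed syntax will come later.

References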

https://www.elastic.co/guide/cn/elasticsearch/guide/current/_search_with_query_dsl.html
