Elasticsearch 7.6 Study Notes 1: Getting Started with Elasticsearch
Ryan.Miao
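Preface
The Chinese edition of the Definitive Guide only covers 2.x, but Elasticsearch is already at 7.6, so I installed the latest version to learn it.
Installation
This is an installation for learning purposes; a production installation follows a different process.
Windows
Elasticsearch download URL: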
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip
Kibana download URL:
https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip
7.6.0 is currently the latest official release, but the download is painfully slow. With Xunlei (Thunder) the download speed can reach xM.
bin\elasticsearch.bat
bin\kibana.bat
Double-click each .bat file to start it.
Docker installation
For testing and learning, using the official Docker images is faster and more convenient.
For installation steps, see: https://www.cnblogs.com/woshimrf/p/docker-es7.html
The following content is based on:
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html
Index some documents
These tests use Kibana directly; you can of course also access localhost:9200 with curl or Postman.
Open localhost:5601 and click Dev Tools.
Create a customer index
PUT /{index-name}/_doc/{id}
PUT /customer/_doc/1
{
"name": "John Doe"
}
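PUT is the HTTP method. If the index customer does not exist in Elasticsearch, it is created and a document is inserted with id 1 and name=John Doe.
If the document already exists, it is updated. Note that the update is a full overwrite: whatever the body JSON contains is the final result.
The response is: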
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 7,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 6,
"_primary_term" : 1
}
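- _index is the index name.
- _type is always _doc.
- _id is the document's primary key, i.e. the pk of a record.
- _version is the number of times this _id has been written (created or updated); mine is already at 7.
- _shards reports the result per shard copy. We have two nodes deployed here, and the write succeeded on both.
In Kibana, under Management > Index Management, you can view the status of the index. For example, this document is held in two shard copies, a primary and a replica.
Once the document is saved, it can be read back immediately: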
GET /customer/_doc/1
Returns:
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 15,
"_seq_no" : 14,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "John Doe"
}
}
_source is the content of the document we stored.
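The overwrite behaviour and the version counter described above can be illustrated with a small in-memory model. This is only a toy sketch, not the Elasticsearch client: `ToyIndex` and the `age` field are hypothetical names introduced for illustration, and real Elasticsearch versioning has more cases (deletes, for instance) than this model covers.

```python
# Toy in-memory model of "index a document by id"; NOT the Elasticsearch client.
class ToyIndex:
    def __init__(self):
        self._docs = {}  # _id -> (version, source)

    def put(self, doc_id, source):
        """Store `source` under `doc_id`, replacing any previous document entirely."""
        old_version = self._docs[doc_id][0] if doc_id in self._docs else 0
        version = old_version + 1
        self._docs[doc_id] = (version, dict(source))
        return {"_id": doc_id, "_version": version,
                "result": "created" if version == 1 else "updated"}

    def get(self, doc_id):
        """Return the stored document, or found=False if the id is unknown."""
        if doc_id not in self._docs:
            return {"_id": doc_id, "found": False}
        version, source = self._docs[doc_id]
        return {"_id": doc_id, "_version": version, "found": True,
                "_source": dict(source)}


index = ToyIndex()
first = index.put("1", {"name": "John Doe", "age": 30})  # `age` is a made-up field
second = index.put("1", {"name": "John Doe"})            # full overwrite: `age` is gone
fetched = index.get("1")
print(first["result"], second["result"], fetched["_source"])
```

The second `put` does not merge with the first body; the stored `_source` is exactly the last body sent.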
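Bulk insert
When there are many documents to insert, we can insert them in bulk. Download the prepared sample file, then import it into Elasticsearch with an HTTP request.
Create an index named bank. The number of primary shards cannot be changed after the index is created (short of reindexing), so it has to be configured at creation time; number_of_replicas, by contrast, can still be updated later. Here we configure 3 shards and 2 replicas.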
PUT /bank
{
"settings": {
"index": {
"number_of_shards": "3",
"number_of_replicas": "2"
}
}
}
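As a rough worked example of what these two settings imply: each primary shard plus its replicas forms a group of copies, and two copies of the same shard are never placed on the same node. The sketch below is a simplified calculation under that assumption; the function names are hypothetical and the node counts are only examples.

```python
def shard_copies(number_of_shards, number_of_replicas):
    """Total shard copies = primaries * (1 primary + its replicas)."""
    return number_of_shards * (1 + number_of_replicas)


def assignable_copies(number_of_shards, number_of_replicas, nodes):
    """Copies that can be allocated when copies of one shard never share a node."""
    per_shard = min(nodes, 1 + number_of_replicas)
    return number_of_shards * per_shard


total = shard_copies(3, 2)                          # settings used for the bank index
on_two_nodes = assignable_copies(3, 2, nodes=2)     # example cluster sizes
on_three_nodes = assignable_copies(3, 2, nodes=3)
print(total, on_two_nodes, on_three_nodes)
```

With 3 shards and 2 replicas there are 9 copies in total, so at least 3 nodes are needed before every copy can be assigned.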
Sample file URL: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json
After downloading it, send the file with curl or Postman:
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
Each record has the following format:
{
"_index": "bank",
"_type": "_doc",
"_id": "1",
"_version": 1,
"_score": 0,
"_source": {
"account_number": 1,
"balance": 39225,
"firstname": "Amber",
"lastname": "Duke",
"age": 32,
"gender": "M",
"address": "880 Holmes Lane",
"employer": "Pyrami",
"email": "amberduke@pyrami.com",
"city": "Brogan",
"state": "IL"
}
}
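In Kibana Monitoring, choose self monitoring, then find the bank index under Indices to see how the imported data is distributed.
You can see that there are 3 shards spread across different nodes, each with 2 replicas.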
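The body sent to _bulk is newline-delimited JSON: an action line followed by the document itself, repeated per document, with a trailing newline at the end. A minimal sketch of building such a body in Python is shown below; `build_bulk_body` is a hypothetical helper, and the second sample record is made up purely for illustration.

```python
import json


def build_bulk_body(docs, id_field="account_number"):
    """Return an NDJSON bulk body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        action = {"index": {"_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


sample = [
    {"account_number": 1, "balance": 39225, "firstname": "Amber", "lastname": "Duke"},
    # Hypothetical record, not part of accounts.json:
    {"account_number": 9001, "balance": 100, "firstname": "Test", "lastname": "User"},
]
body = build_bulk_body(sample)
print(body)
```

The resulting string could be saved to a file and posted with the curl command above in place of accounts.json.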
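Start searching
With some data bulk-inserted, we can start learning how to query. As seen above, the data is a table of bank accounts; let's query all accounts sorted by account number.
The equivalent SQL:
select * from bank order by account_number asc limit 3 offset 2
Query DSL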
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
],
"size": 3,
"from": 2
}
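- _search means this is a search request.
- query is the search condition; here it matches all documents.
- size is the number of hits returned per request, i.e. the page size. If omitted, it defaults to 10. The hits appear in the hits array of the response.
- from is the 0-based offset of the first hit to return.
Returns: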
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "2",
"_score" : null,
"_source" : {
"account_number" : 2,
"balance" : 28838,
"firstname" : "Roberta",
"lastname" : "Bender",
"age" : 22,
"gender" : "F",
"address" : "560 Kingsway Place",
"employer" : "Chillium",
"email" : "robertabender@chillium.com",
"city" : "Bennett",
"state" : "LA"
},
"sort" : [
2
]
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"account_number" : 3,
"balance" : 44947,
"firstname" : "Levine",
"lastname" : "Burks",
"age" : 26,
"gender" : "F",
"address" : "328 Wilson Avenue",
"employer" : "Amtap",
"email" : "levineburks@amtap.com",
"city" : "Cochranville",
"state" : "HI"
},
"sort" : [
3
]
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "4",
"_score" : null,
"_source" : {
"account_number" : 4,
"balance" : 27658,
"firstname" : "Rodriquez",
"lastname" : "Flores",
"age" : 31,
"gender" : "F",
"address" : "986 Wyckoff Avenue",
"employer" : "Tourmania",
"email" : "rodriquezflores@tourmania.com",
"city" : "Eastvale",
"state" : "HI"
},
"sort" : [
4
]
}
]
}
}
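The response provides the following information:
- took: how long Elasticsearch took to run the query, in milliseconds.
- timed_out: whether the search timed out.
- _shards: how many shards were searched, and how many succeeded, failed, or were skipped. A shard is simply a partition of the data: the documents in one index are split into several pieces, much like partitioning a table by id.
- max_score: the score of the most relevant document.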
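The from/size paging used in the query above can be sketched on a plain sorted list. This is an illustration only: the helper names are hypothetical, and the list of account numbers 0..999 is an assumption that is consistent with the hits returned above (accounts 2, 3, 4 for from=2, size=3).

```python
def page_to_from_size(page, page_size):
    """Translate a 1-based page number into `from`/`size` search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {"from": (page - 1) * page_size, "size": page_size}


def apply_from_size(sorted_hits, from_, size):
    """Select what `from`/`size` would return from an already sorted hit list."""
    return sorted_hits[from_:from_ + size]


# Account numbers sorted ascending stand in for the sorted bank index (assumed 0..999).
account_numbers = list(range(1000))
window = apply_from_size(account_numbers, from_=2, size=3)
second_page = page_to_from_size(page=2, page_size=10)
print(window, second_page)
```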
Next, we can try queries with conditions.
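Full-text match query
Find addresses that contain mill or lane (the search text is analyzed into terms, and a match on any one term counts).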
GET /bank/_search
{
"query": { "match": { "address": "mill lane" } },
"size": 2
}
Returns:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 19,
"relation" : "eq"
},
"max_score" : 9.507477,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "136",
"_score" : 9.507477,
"_source" : {
"account_number" : 136,
"balance" : 45801,
"firstname" : "Winnie",
"lastname" : "Holland",
"age" : 38,
"gender" : "M",
"address" : "198 Mill Lane",
"employer" : "Neteria",
"email" : "winnieholland@neteria.com",
"city" : "Urie",
"state" : "IL"
}
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "970",
"_score" : 5.4032025,
"_source" : {
"account_number" : 970,
"balance" : 19648,
"firstname" : "Forbes",
"lastname" : "Wallace",
"age" : 28,
"gender" : "M",
"address" : "990 Mill Road",
"employer" : "Pheast",
"email" : "forbeswallace@pheast.com",
"city" : "Lopezo",
"state" : "AK"
}
}
]
}
}
- I asked for 2 hits, but 19 documents actually matched.
Phrase match query
GET /bank/_search
{
"query": { "match_phrase": { "address": "mill lane" } }
}
Now only one document matches the whole phrase:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 9.507477,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "136",
"_score" : 9.507477,
"_source" : {
"account_number" : 136,
"balance" : 45801,
"firstname" : "Winnie",
"lastname" : "Holland",
"age" : 38,
"gender" : "M",
"address" : "198 Mill Lane",
"employer" : "Neteria",
"email" : "winnieholland@neteria.com",
"city" : "Urie",
"state" : "IL"
}
}
]
}
}
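The difference between match and match_phrase can be approximated with a toy model in Python. This is a simplified sketch and not how Elasticsearch is implemented: `tokenize` is only a rough stand-in for the standard analyzer, scoring is ignored, and all function names are hypothetical. The addresses are taken from the sample records shown in this post.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(field_value, query):
    """Model of `match` with the default OR operator: at least one query term occurs."""
    terms = set(tokenize(field_value))
    return any(term in terms for term in tokenize(query))


def match_phrase(field_value, query):
    """Model of `match_phrase`: all query terms occur adjacent and in order."""
    tokens, wanted = tokenize(field_value), tokenize(query)
    n = len(wanted)
    if n == 0:
        return False
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


addresses = ["198 Mill Lane", "990 Mill Road", "880 Holmes Lane", "560 Kingsway Place"]
any_hits = [a for a in addresses if match_any(a, "mill lane")]
phrase_hits = [a for a in addresses if match_phrase(a, "mill lane")]
print(any_hits, phrase_hits)
```

In this model "990 Mill Road" and "880 Holmes Lane" satisfy match because each contains one of the two terms, while only "198 Mill Lane" contains the full phrase.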
Multi-condition queries
In practice, queries usually combine several conditions:
GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "age": "40" } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}
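- bool combines multiple query clauses.
- must, should, and must_not are the sub-clauses of a boolean query.
- must and should contribute to the relevance score; results are sorted by score by default.
- must_not acts as a filter: it affects which documents are returned but not the score; it only removes documents from the results.
You can also explicitly specify arbitrary filters to include or exclude documents based on structured data.
For example, find accounts whose balance is between 20000 and 30000 (inclusive):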
GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
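Which documents a bool query keeps can be modelled as plain predicate logic. The sketch below covers only the inclusion and exclusion rules, not scoring, and `bool_search` is a hypothetical helper. The sample rows are trimmed from records shown earlier in this post; since none of them has age 40, the first example uses age 31 instead.

```python
def bool_search(docs, must=(), must_not=(), filters=()):
    """Keep docs that satisfy every `must` and filter predicate and no `must_not` predicate."""
    hits = []
    for doc in docs:
        required = all(p(doc) for p in must) and all(p(doc) for p in filters)
        excluded = any(p(doc) for p in must_not)
        if required and not excluded:
            hits.append(doc)
    return hits


accounts = [
    {"account_number": 1, "age": 32, "state": "IL", "balance": 39225},
    {"account_number": 2, "age": 22, "state": "LA", "balance": 28838},
    {"account_number": 3, "age": 26, "state": "HI", "balance": 44947},
    {"account_number": 4, "age": 31, "state": "HI", "balance": 27658},
    {"account_number": 970, "age": 28, "state": "AK", "balance": 19648},
]

# must: age == 31, must_not: state == "ID"
aged_31 = bool_search(accounts,
                      must=[lambda d: d["age"] == 31],
                      must_not=[lambda d: d["state"] == "ID"])

# must: match_all, filter: 20000 <= balance <= 30000 (gte/lte are inclusive)
mid_balance = bool_search(accounts,
                          must=[lambda d: True],
                          filters=[lambda d: 20000 <= d["balance"] <= 30000])

print([d["account_number"] for d in aged_31],
      [d["account_number"] for d in mid_balance])
```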
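Aggregations (group by)
Count the accounts per state
In SQL this might be written as:
select state AS group_by_state, count(*) from tbl_bank group by state order by count(*) desc limit 3;
The corresponding Elasticsearch request is: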
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 3
}
}
}
}
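- size=0 suppresses the hits: Elasticsearch would otherwise return the matching documents, and we only want the aggregation values.
- aggs is the keyword that introduces aggregations.
- group_by_state is the name of one aggregation result; the name is up to you.
- terms buckets documents by the exact value of a field; here it is the field to group on.
- state.keyword: state is a text field, and a string field that is aggregated or grouped on must be of type keyword.
- size=3 limits the number of buckets returned, here the top 3. The default is the top 10, and the system-wide maximum is 10000, which can be changed via search.max_buckets. Note that with multiple shards the counts can be inaccurate; more on that later.
Response: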
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 26,
"sum_other_doc_count" : 928,
"buckets" : [
{
"key" : "MD",
"doc_count" : 28
},
{
"key" : "ID",
"doc_count" : 23
},
{
"key" : "TX",
"doc_count" : 21
}
]
}
}
}
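- hits: the documents matching the query; because size=0 it is []. total shows that this query matched 1000 documents.
- aggregations: the aggregation results.
- group_by_state: the name we gave the aggregation in the request.
- doc_count_error_upper_bound: an upper bound on the document count of any term that was not returned by this aggregation but might belong in it. As the name suggests, it is the worst-case estimate of what was left out of the final result. The larger the value, the more likely the result is inaccurate. A value of 0 means the result is exact, but a non-zero value does not necessarily mean the result is wrong.
- sum_other_doc_count: the number of documents not covered by the returned buckets.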
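As a quick sanity check on the response above, sum_other_doc_count is what remains of the 1000 matched documents after subtracting the three returned buckets. The helper below is a hypothetical sketch of that arithmetic; it assumes every document has exactly one state value.

```python
def sum_other_doc_count(total_hits, buckets):
    """Documents that fall outside the returned buckets."""
    return total_hits - sum(b["doc_count"] for b in buckets)


# Buckets copied from the response above.
returned = [
    {"key": "MD", "doc_count": 28},
    {"key": "ID", "doc_count": 23},
    {"key": "TX", "doc_count": 21},
]
others = sum_other_doc_count(1000, returned)
print(others)
```

The value agrees with the 928 reported in the response.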
It is worth asking whether this top 3 is accurate. doc_count_error_upper_bound is non-zero, so the counts may well be off; the top 3 we got are 28, 23, and 21. Let's add one more request parameter and compare:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 3,
"shard_size": 60
}
}
}
}
-----------------------------------------
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 915,
"buckets" : [
{
"key" : "TX",
"doc_count" : 30
},
{
"key" : "MD",
"doc_count" : 28
},
{
"key" : "ID",
"doc_count" : 27
}
]
}
}
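shard_size is the number of buckets each shard computes. A terms aggregation is computed per shard, and the per-shard results are then merged into the final result. Because data is not evenly distributed across shards, each shard's top N differs, so the merged result may miss part of the count, which is why doc_count_error_upper_bound is non-zero. The default shard_size is size*1.5+10; for size=3 that is 14.5 (shard_size is an integer, so effectively 14), and I verified that passing shard_size=14.5 explicitly returns the same result as omitting it. With shard_size=60 the error finally drops to 0, which guarantees that these 3 really are the top 3. In other words, set shard_size generously for aggregations, for example 20 times size, keeping in mind that a larger shard_size means more work per shard.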
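Why a small shard_size undercounts can be reproduced with a toy simulation: each shard reports only its own top terms, and the coordinating step sums whatever it receives. This is a simplified sketch of the idea rather than Elasticsearch's actual implementation; the per-shard counts are made up, and `default_shard_size` and `terms_agg` are hypothetical helpers.

```python
from collections import Counter


def default_shard_size(size):
    """Per-shard bucket count following the size*1.5+10 rule, truncated to an integer."""
    return int(size * 1.5 + 10)


def terms_agg(shards, size, shard_size):
    """Merge each shard's top `shard_size` terms, then return the overall top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Made-up per-shard document counts; true totals are TX=15, MD=13, ID=11.
shards = [
    {"TX": 5, "MD": 6, "ID": 1},
    {"TX": 5, "MD": 1, "ID": 6},
    {"TX": 5, "MD": 6, "ID": 4},
]
too_small = terms_agg(shards, size=1, shard_size=1)
exact = terms_agg(shards, size=3, shard_size=3)
print(default_shard_size(3), too_small, exact)
```

With shard_size=1, TX is never the top term on any single shard, so it is missing from the merged result even though it has the highest total, and MD is reported as 12 instead of 13. Once every shard reports all of its terms, the counts are exact.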
Count per state and compute the average balance
To see the average balance per state, the SQL might be:
select
state, avg(balance) AS average_balance, count(*) AS group_by_state
from tbl_bank
group by state
limit 3
In Elasticsearch it can be queried like this:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 3,
"shard_size": 60
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
},
"sum_balance": {
"sum": {
"field": "balance"
}
}
}
}
}
}
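- The second aggs computes metrics within each state bucket.
- average_balance is a name of our choosing; its value is the avg of balance over documents with the same state.
- sum_balance is a name of our choosing; its value is the sum of balance over documents with the same state.
The result: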
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 915,
"buckets" : [
{
"key" : "TX",
"doc_count" : 30,
"sum_balance" : {
"value" : 782199.0
},
"average_balance" : {
"value" : 26073.3
}
},
{
"key" : "MD",
"doc_count" : 28,
"sum_balance" : {
"value" : 732523.0
},
"average_balance" : {
"value" : 26161.535714285714
}
},
{
"key" : "ID",
"doc_count" : 27,
"sum_balance" : {
"value" : 657957.0
},
"average_balance" : {
"value" : 24368.777777777777
}
}
]
}
}
}
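The nested aggregation above corresponds to a group-by with per-group metrics. A minimal in-memory sketch in Python follows; `group_by_state` is a hypothetical helper, the rows are trimmed from sample records shown earlier in this post, and ties in doc_count are broken by key in this sketch.

```python
from collections import defaultdict


def group_by_state(docs, size):
    """Bucket docs by state; per bucket compute doc_count, sum_balance and average_balance."""
    balances = defaultdict(list)
    for doc in docs:
        balances[doc["state"]].append(doc["balance"])
    buckets = [
        {"key": state,
         "doc_count": len(vals),
         "sum_balance": float(sum(vals)),
         "average_balance": sum(vals) / len(vals)}
        for state, vals in balances.items()
    ]
    # Order by doc_count descending, then by key, and keep the top `size` buckets.
    buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets[:size]


accounts = [
    {"account_number": 1, "state": "IL", "balance": 39225},
    {"account_number": 2, "state": "LA", "balance": 28838},
    {"account_number": 3, "state": "HI", "balance": 44947},
    {"account_number": 4, "state": "HI", "balance": 27658},
    {"account_number": 136, "state": "IL", "balance": 45801},
    {"account_number": 970, "state": "AK", "balance": 19648},
]
top2 = group_by_state(accounts, size=2)
print(top2)
```

The same relation holds in the response above: for TX, sum_balance 782199.0 divided by doc_count 30 gives the reported average_balance of 26073.3.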
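Count per state, ordered by average balance
A terms aggregation sorts by count descending by default. If we want a different order, the SQL might look like this: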
select
state, avg(balance) AS average_balance, count(*) AS group_by_state
from tbl_bank
group by state
order by average_balance desc
limit 3
The corresponding Elasticsearch query:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"average_balance": "desc"
},
"size": 3
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
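The top 3 returned is no longer the same as before (doc_count_error_upper_bound is -1 because the error bound cannot be computed when buckets are ordered by a sub-aggregation):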
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 983,
"buckets" : [
{
"key" : "DE",
"doc_count" : 2,
"average_balance" : {
"value" : 39040.5
}
},
{
"key" : "RI",
"doc_count" : 5,
"average_balance" : {
"value" : 36035.4
}
},
{
"key" : "NE",
"doc_count" : 10,
"average_balance" : {
"value" : 35648.8
}
}
]
}
}
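The effect of the order parameter can be seen by sorting the same buckets two ways. The sketch below uses the six buckets that appear in the two responses in this post (not the full set of states), and `top_buckets` is a hypothetical helper.

```python
# Buckets copied from the responses shown in this post.
buckets = [
    {"key": "TX", "doc_count": 30, "average_balance": 26073.3},
    {"key": "MD", "doc_count": 28, "average_balance": 26161.535714285714},
    {"key": "ID", "doc_count": 27, "average_balance": 24368.777777777777},
    {"key": "DE", "doc_count": 2, "average_balance": 39040.5},
    {"key": "RI", "doc_count": 5, "average_balance": 36035.4},
    {"key": "NE", "doc_count": 10, "average_balance": 35648.8},
]


def top_buckets(buckets, order_by, size):
    """Return the first `size` buckets ordered by `order_by`, descending."""
    return sorted(buckets, key=lambda b: b[order_by], reverse=True)[:size]


by_count = [b["key"] for b in top_buckets(buckets, "doc_count", 3)]
by_average = [b["key"] for b in top_buckets(buckets, "average_balance", 3)]
print(by_count, by_average)
```

Ordering by the average lets small buckets such as DE, with only 2 documents, rank above the most populous states.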
References
- Chinese community: https://elasticsearch.cn/
- Elasticsearch official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
- Elasticsearch official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.html
- Inaccurate terms aggregation counts: https://www.dongwm.com/post/elasticsearch-terms-agg-is-not-accurate/