[Elasticsearch 7.0] Document APIs: the termvectors API

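Basic syntax

Returns information and statistics on the terms in the fields of a particular document. The document can be stored in the index or supplied artificially by the user. Term vectors are real-time by default, not near real-time; set realtime=false to change this. For example: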

curl -XGET "http://127.0.0.1:9200/test/_termvectors/1?pretty"

Optionally, a URL parameter can specify the fields to retrieve information for:

curl -XGET "http://127.0.0.1:9200/test/_termvectors/1?pretty&fields=message"
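
As a small illustration, the request URLs above can also be assembled programmatically. The sketch below uses only Python's standard library; the helper name termvectors_url is hypothetical, and the host, index and document ID are simply the ones from the curl examples.

```python
from urllib.parse import urlencode


def termvectors_url(base, index, doc_id=None, **params):
    """Build a _termvectors URL; leave doc_id out when no stored document is addressed."""
    path = f"{base}/{index}/_termvectors"
    if doc_id is not None:
        path += f"/{doc_id}"
    # Booleans must appear as lowercase true/false in the query string.
    query = {k: str(v).lower() if isinstance(v, bool) else v
             for k, v in params.items()}
    return f"{path}?{urlencode(query)}" if query else path


print(termvectors_url("http://127.0.0.1:9200", "test", 1,
                      fields="message", realtime=False))
```

Keeping the parameters as keyword arguments makes it easy to switch realtime or fields per call without hand-editing the query string.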

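Return values

Three types of values can be requested: term information, term statistics and field statistics. By default, term information and field statistics are returned, but term statistics are not.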

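Term information

1. Term frequency in the field (always returned)
2. Term positions (positions set to true)
3. Start and end offsets (offsets set to true)
4. Term payloads (payloads set to true), returned as base64-encoded bytes
If the requested information is not stored in the index, it is computed on the fly where possible. Term vectors can even be computed for documents that do not exist in the index but are supplied by the user instead.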

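Term statistics

Setting term_statistics to true (the default is false) adds the following to the result:
1. Total term frequency (how often a term occurs in all documents)
2. Document frequency (the number of documents containing the current term)
These values are not returned by default because term statistics can have a serious performance impact.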

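Field statistics

Setting field_statistics to false (the default is true) omits the following:
1. Document count (how many documents contain this field)
2. Sum of document frequencies (the sum of the document frequencies of all terms in this field)
3. Sum of total term frequencies (the sum of the total term frequencies of each term in this field)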
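
To make these definitions concrete, the sketch below recomputes the term and field statistics by hand for the two sample text values indexed later in this article. The analyze function is only a rough stand-in for a whitespace tokenizer followed by a lowercase filter, not Elasticsearch's own analysis chain.

```python
from collections import Counter

# The two sample "text" values indexed later in this article.
docs = ["twitter test test test ", "Another twitter test ..."]


def analyze(text):
    """Rough stand-in for a whitespace tokenizer plus lowercase filter."""
    return text.lower().split()


# One term -> frequency counter per document.
per_doc = [Counter(analyze(d)) for d in docs]


def term_stats(term):
    """Term statistics for a single term."""
    return {
        "doc_freq": sum(1 for c in per_doc if term in c),  # documents containing the term
        "ttf": sum(c[term] for c in per_doc),              # occurrences across all documents
    }


# Field statistics over the whole field.
field_stats = {
    "doc_count": sum(1 for c in per_doc if c),         # documents that have the field
    "sum_doc_freq": sum(len(c) for c in per_doc),      # one count per (term, document) pair
    "sum_ttf": sum(sum(c.values()) for c in per_doc),  # total number of tokens
}

print(term_stats("test"))
print(field_stats)
```

The values computed this way agree with the doc_freq, ttf and field_statistics entries in the stored term vector response shown in the first example of this article.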

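Term filtering

With the filter parameter, the returned terms can also be filtered based on their tf-idf scores. This is useful for finding a good characteristic vector of a document. The feature works in a similar manner to the second phase of the More Like This query.
The following sub-parameters are supported:
max_num_terms: the maximum number of terms returned per field. Defaults to 25.
min_term_freq: ignore words with less than this frequency in the source document. Defaults to 1.
max_term_freq: ignore words with more than this frequency in the source document. Defaults to unbounded.
min_doc_freq: ignore terms that do not occur in at least this many documents. Defaults to 1.
max_doc_freq: ignore words that occur in more than this many documents. Defaults to unbounded.
min_word_length: the minimum word length below which words are ignored. Defaults to 0.
max_word_length: the maximum word length above which words are ignored. Defaults to unbounded (0).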
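
A request body using these sub-parameters could be put together as in the following sketch. The function name filtered_request is hypothetical; the check against FILTER_KEYS is a convenience of this sketch that catches misspelled names before the request is sent, not something the API requires of clients.

```python
import json

# Sub-parameters listed above; anything else is rejected by this sketch.
FILTER_KEYS = {
    "max_num_terms", "min_term_freq", "max_term_freq",
    "min_doc_freq", "max_doc_freq", "min_word_length", "max_word_length",
}


def filtered_request(doc, **filter_params):
    """Build a _termvectors body for a user-supplied document with a term filter."""
    unknown = set(filter_params) - FILTER_KEYS
    if unknown:
        raise ValueError(f"unknown filter parameter(s): {sorted(unknown)}")
    return {
        "doc": doc,
        "term_statistics": True,
        "field_statistics": True,
        "filter": filter_params,
    }


body = filtered_request({"fullname": "John Doe"}, max_num_terms=3, min_doc_freq=1)
print(json.dumps(body, indent=2))
```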

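Behaviour

Term and field statistics are not accurate. Deleted documents are not taken into account, and the information is retrieved only for the shard where the requested document resides. Term and field statistics are therefore useful only as relative measures, and the absolute numbers have no meaning in this context. By default, when term vectors are requested for an artificial document, a shard is selected at random to get the statistics from. Use routing only to hit a particular shard.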

Example: returning stored term vectors

First, create an index that stores term vectors, payloads and so on:

curl -XPUT "http://127.0.0.1:9200/twitter/?pretty" -H "Content-Type:application/json" -d'
{ "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store" : true,
        "analyzer" : "fulltext_analyzer"
       },
       "fullname": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "analyzer" : "fulltext_analyzer"
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}'

The response is:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "twitter"
}

Second, add some documents:

curl -XPUT "http://127.0.0.1:9200/twitter/_doc/1?pretty" -H "Content-Type:application/json" -d'
{
  "fullname" : "John Doe",
  "text" : "twitter test test test "
}'

curl -XPUT "http://127.0.0.1:9200/twitter/_doc/2?pretty" -H "Content-Type:application/json" -d'
{
  "fullname" : "Jane Doe",
  "text" : "Another twitter test ..."
}'

Finally, the following request returns all information and statistics for the field text in document 1:

curl -XGET "http://127.0.0.1:9200/twitter/_termvectors/1?pretty" -H "Content-Type:application/json" -d'
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'

The response is:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 8,
              "end_offset" : 12,
              "payload" : "d29yZA=="
            },
            {
              "position" : 2,
              "start_offset" : 13,
              "end_offset" : 17,
              "payload" : "d29yZA=="
            },
            {
              "position" : 3,
              "start_offset" : 18,
              "end_offset" : 22,
              "payload" : "d29yZA=="
            }
          ]
        },
        "twitter" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 7,
              "payload" : "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
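
The offsets and payloads in this response can be checked against the source text. The sketch below is a minimal illustration: it copies one term entry from the response by hand, slices the original text with the reported offsets, and base64-decodes the payload. The function name describe is hypothetical.

```python
import base64

# Hand-copied "twitter" entry from the response above.
twitter_term = {
    "doc_freq": 2, "ttf": 2, "term_freq": 1,
    "tokens": [{"position": 0, "start_offset": 0, "end_offset": 7,
                "payload": "d29yZA=="}],
}
# The "text" value of document 1.
text = "twitter test test test "


def describe(term_entry, source):
    """Return (position, surface form, decoded payload) for each token."""
    out = []
    for tok in term_entry["tokens"]:
        surface = source[tok["start_offset"]:tok["end_offset"]]
        payload = base64.b64decode(tok["payload"]).decode("utf-8")
        out.append((tok["position"], surface, payload))
    return out


print(describe(twitter_term, text))  # [(0, 'twitter', 'word')]
```

The decoded payload is the string "word", which is consistent with the type_as_payload filter in the index settings storing each token's type as its payload.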

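Example: generating term vectors on the fly

Term vectors that are not explicitly stored in the index are computed on the fly automatically. The following request returns all information and statistics for the fields in document 1, even though the terms have not been explicitly stored in the index. Note that for the field text, the terms are not regenerated.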

curl -XGET "http://127.0.0.1:9200/twitter2/_termvectors/1?pretty" -H "Content-Type:application/json" -d'
{
  "fields" : ["text", "name"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'

The response is:

{
  "_index" : "twitter2",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 26,
  "term_vectors" : {
    "text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 8,
              "end_offset" : 12,
              "payload" : "d29yZA=="
            },
            {
              "position" : 2,
              "start_offset" : 13,
              "end_offset" : 17,
              "payload" : "d29yZA=="
            },
            {
              "position" : 3,
              "start_offset" : 18,
              "end_offset" : 22,
              "payload" : "d29yZA=="
            }
          ]
        },
        "twitter" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 7,
              "payload" : "d29yZA=="
            }
          ]
        }
      }
    }
  }
}

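Example: artificial documents

Term vectors can also be generated for artificial documents, that is, documents not present in the index. If dynamic mapping is turned on (the default), document fields that are not in the original mapping are created dynamically.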

curl -XGET "http://127.0.0.1:9200/twitter/_termvectors?pretty" -H "Content-Type:application/json" -d'
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  }
}'

The response is:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_version" : 0,
  "found" : true,
  "took" : 11,
  "term_vectors" : {
    "fullname" : {
      "field_statistics" : {
        "sum_doc_freq" : 4,
        "doc_count" : 2,
        "sum_ttf" : 4
      },
      "terms" : {
        "doe" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 8
            }
          ]
        },
        "john" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 4
            }
          ]
        }
      }
    },
    "text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "test" : {
          "term_freq" : 3,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 8,
              "end_offset" : 12
            },
            {
              "position" : 2,
              "start_offset" : 13,
              "end_offset" : 17
            },
            {
              "position" : 3,
              "start_offset" : 18,
              "end_offset" : 22
            }
          ]
        },
        "twitter" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 7
            }
          ]
        }
      }
    }
  }
}
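
The structure of the terms section can be reproduced locally for the text value of this artificial document. The sketch below approximates the whitespace tokenizer and lowercase filter with a regular expression and groups tokens by term; it is an approximation for illustration only, and the function name whitespace_lowercase is hypothetical.

```python
import re


def whitespace_lowercase(text):
    """Yield (term, position, start_offset, end_offset) for each token."""
    for pos, m in enumerate(re.finditer(r"\S+", text)):
        yield m.group().lower(), pos, m.start(), m.end()


# Group tokens by term, mirroring the layout of the "terms" section.
terms = {}
for term, pos, start, end in whitespace_lowercase("twitter test test test"):
    entry = terms.setdefault(term, {"term_freq": 0, "tokens": []})
    entry["term_freq"] += 1
    entry["tokens"].append(
        {"position": pos, "start_offset": start, "end_offset": end})

for term in sorted(terms):
    print(term, terms[term])
```

The positions and offsets produced this way match the tokens listed for test and twitter in the response above.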

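Per-field analyzer

Additionally, the per_field_analyzer parameter can supply an analyzer different from the one defined for the field. This is useful for generating term vectors in any fashion, especially for artificial documents. When an analyzer is provided for a field that already stores term vectors, the term vectors are regenerated.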

curl -XGET "http://127.0.0.1:9200/twitter/_termvectors?pretty" -H "Content-Type:application/json" -d'
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}'

The response is:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_version" : 0,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "fullname" : {
      "field_statistics" : {
        "sum_doc_freq" : 4,
        "doc_count" : 2,
        "sum_ttf" : 4
      },
      "terms" : {
        "John Doe" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 8
            }
          ]
        }
      }
    }
  }
}
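
The effect of switching analyzers can be seen by comparing two simplified tokenizers on the same fullname value. Both functions below are hypothetical approximations written for this illustration, not Elasticsearch code: fulltext_like mimics a whitespace tokenizer with lowercasing, and keyword_like mimics the keyword analyzer, which emits the whole input as a single token.

```python
import re


def fulltext_like(text):
    """Approximates a whitespace tokenizer followed by a lowercase filter."""
    return [(m.group().lower(), pos, m.start(), m.end())
            for pos, m in enumerate(re.finditer(r"\S+", text))]


def keyword_like(text):
    """Approximates the keyword analyzer: the whole input is one token."""
    return [(text, 0, 0, len(text))]


value = "John Doe"
print(fulltext_like(value))  # two tokens: john (0-4) and doe (5-8)
print(keyword_like(value))   # one token: John Doe (0-8)
```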

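Example: term filtering

The returned terms can be filtered based on their tf-idf scores. In the example below, we obtain the three most "interesting" keywords from an artificial document whose fullname field holds the given text. On a large corpus, very common words such as stop words score too low to be returned. The index used here holds only two documents, neither of which contains any of these words, so every term gets the same idf and the score depends on term frequency alone. This is why the stop word "to" (term frequency 3) appears in the response below.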

curl -XGET "http://127.0.0.1:9200/twitter/_termvectors?pretty" -H "Content-Type:application/json" -d'
{
    "doc": {
      "fullname": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
    },
    "term_statistics" : true,
    "field_statistics" : true,
    "positions": false,
    "offsets": false,
    "filter" : {
      "max_num_terms" : 3,
      "min_term_freq" : 1,
      "min_doc_freq" : 1
    }
}'

The response is:

{
  "_index" : "twitter",
  "_type" : "_doc",
  "_version" : 0,
  "found" : true,
  "took" : 8,
  "term_vectors" : {
    "fullname" : {
      "field_statistics" : {
        "sum_doc_freq" : 4,
        "doc_count" : 2,
        "sum_ttf" : 4
      },
      "terms" : {
        "evil." : {
          "term_freq" : 1,
          "score" : 1.4054651
        },
        "fight" : {
          "term_freq" : 1,
          "score" : 1.4054651
        },
        "to" : {
          "term_freq" : 3,
          "score" : 4.2163954
        }
      }
    }
  }
}
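
The scores in this response can be reproduced with a classic tf-idf formula, score = term_freq * (1 + ln((num_docs + 1) / (doc_freq + 1))). This formula is an assumption inferred from the numbers above, not something stated in this article, and the sketch further assumes num_docs = 2 (the doc_count in the response) and doc_freq = 1 for terms that occur only in the artificial document.

```python
import math


def tf_idf(term_freq, doc_freq, num_docs):
    """Classic tf-idf: raw term frequency times a smoothed inverse document frequency."""
    idf = 1.0 + math.log((num_docs + 1) / (doc_freq + 1))
    return term_freq * idf


# Assumed inputs: two documents in the index, doc_freq of 1 for unseen terms.
for term, freq in [("evil.", 1), ("fight", 1), ("to", 3)]:
    print(term, round(tf_idf(freq, doc_freq=1, num_docs=2), 7))
```

Under these assumptions the results agree with the reported scores of 1.4054651 and 4.2163954 to within single-precision rounding.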
