Structured data
Data query methods
Index types
Forward index:
"Document 1" ID → [{term 1: count, positions}, {term 2: count, positions}, …]
"Document 2" ID → [{term 1: count, positions}, {term 2: count, positions}, …]
Inverted index:
"Keyword 1": [{"document 1" ID, count, positions}, {"document 2" ID, count, positions}, …]
"Keyword 2": [{"document 1" ID, count, positions}, {"document 2" ID, count, positions}, …]
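The two layouts can be sketched in plain Java: a toy in-memory model for illustration only, nothing like Lucene's actual on-disk format.

```java
import java.util.*;

public class InvertedIndexSketch {
    // Build term -> (doc id -> positions) from already-tokenized documents
    // (the list index is the document id).
    public static Map<String, Map<Integer, List<Integer>>> build(List<List<String>> docs) {
        Map<String, Map<Integer, List<Integer>>> inverted = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            List<String> tokens = docs.get(docId);
            for (int pos = 0; pos < tokens.size(); pos++) {
                inverted.computeIfAbsent(tokens.get(pos), t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(pos);
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, List<Integer>>> idx = build(List.of(
                List.of("apache", "lucene", "apache"),   // doc 0
                List.of("apache", "solr")));             // doc 1
        // "apache" occurs in doc 0 at positions 0 and 2, and in doc 1 at position 0
        System.out.println(idx.get("apache")); // {0=[0, 2], 1=[0]}
    }
}
```

The position lists are what phrase and proximity queries consult later; the count per document is just the length of each position list.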
Index creation steps: raw documents → collect document contents → build Document objects → Lucene analyzes (tokenizes) the documents → create the index
Document search steps: obtain the query input → create a Query → execute the query → render the results
Creating Document objects
A document object consists of fields (Field). Each Document can hold multiple Fields, different Documents can hold different Fields, and a single Document may even contain repeated Fields (same field name and same value). Every document has a unique ID.
The analysis process
Each field of the document is analyzed: words are extracted, letters lowercased, punctuation stripped, and stop words removed, producing the final tokens.
Original: Lucene is a Java full-text search engine
Analyzed: lucene java full text search engine
Each resulting word is a Term. The same word extracted from different fields yields different Terms: a Term consists of two parts, the field name and the word's text.
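The analysis step can be imitated in a few lines of plain Java; the stop-word set below is a tiny illustrative subset, not StandardAnalyzer's real list.

```java
import java.util.*;
import java.util.stream.*;

public class AnalyzeSketch {
    // A small illustrative stop-word set (StandardAnalyzer ships a longer one).
    private static final Set<String> STOP_WORDS = Set.of("is", "a", "the", "an", "of");

    public static List<String> analyze(String text) {
        return Arrays.stream(text.split("[^A-Za-z0-9]+"))   // split on punctuation/whitespace
                .filter(t -> !t.isEmpty())
                .map(String::toLowerCase)                    // lowercase
                .filter(t -> !STOP_WORDS.contains(t))        // drop stop words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Java full-text search engine"));
        // [lucene, java, full, text, search, engine]
    }
}
```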
public void testCreateIndex() throws Exception {
    // Create the analyzer; StandardAnalyzer handles English well but Chinese poorly
    Analyzer analyzer = new StandardAnalyzer();
    // Index directory; FSDirectory.open picks a suitable implementation for the
    // current platform, e.g. NIOFSDirectory or MMapDirectory
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Configure the IndexWriter
    IndexWriterConfig writerConfig = new IndexWriterConfig(analyzer);
    // Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, writerConfig);
    // Build the document objects
    List<Document> documents = new ArrayList<>();
    File files = new File("E:/searchsource");
    for (File file : files.listFiles()) {
        Document document = new Document();
        document.add(new TextField("fileName", file.getName(), Store.YES));
        document.add(new TextField("filePath", file.getPath(), Store.YES));
        // LongPoint indexes the value for range queries; StoredField keeps it retrievable
        document.add(new LongPoint("fileSize", file.length()));
        document.add(new StoredField("fileSize", file.length()));
        document.add(new TextField("fileContent", FileUtils.readFileToString(file), Store.NO));
        documents.add(document);
    }
    // Write the documents to the index
    indexWriter.addDocuments(documents);
    // Close the indexWriter
    indexWriter.close();
}
Note:
DocValues is a forward index that Lucene builds alongside the inverted index at index time: an ordered document → field value mapping. This forward index serves sorting, aggregation, grouping, highlighting, and so on. A forward index can be built over these value types:
- byte[]
- SortedSet<byte[]>
- long
- sorted set of long
A plain DocValues field allows only one field per field name (a scalar field, e.g. price: 80); the SortedSet variants allow several fields with the same name and different values (an array field, e.g. price: [100, 80]).
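Why a per-document forward index makes sorting cheap can be shown with a toy model: one array per field, indexed by document id, so ordering documents by a field value never touches the stored documents. This is a conceptual sketch only, not Lucene's DocValues implementation.

```java
import java.util.*;
import java.util.stream.*;

public class DocValuesSketch {
    // Return doc ids ordered by their value in the forward-index array
    // (array index = doc id, array value = field value).
    public static List<Integer> docIdsSortedBy(long[] values) {
        return IntStream.range(0, values.length).boxed()
                .sorted(Comparator.comparingLong(doc -> values[doc]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Forward index for a hypothetical "fileSize" field over three documents.
        long[] fileSize = {300L, 100L, 200L};
        System.out.println(docIdsSortedBy(fileSize)); // [1, 2, 0]
    }
}
```

Answering "which documents have fileSize between x and y" with this layout would require scanning every document; that is the inverted index's job, which is why Lucene keeps both structures.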
@Test
public void testTermSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Exact-term query
    Query query = new TermQuery(new Term("fileContent", "apache"));
    // Fetch the top 5 hits
    TopDocs topDocs = indexSearcher.search(query, 5);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testTermRangeSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Term range query (terms between 'b' and 'e', both ends inclusive); not usable for numeric fields
    Query query = new TermRangeQuery("fileName", new BytesRef("b"), new BytesRef("e"), true, true);
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testNumericalRangeSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Numeric range query; newRangeQuery is inclusive at both ends
    // For an exclusive bound, pass Math.addExact(lowerValue, 1) or Math.addExact(upperValue, -1)
    Query query = LongPoint.newRangeQuery("fileSize", 100L, 800L);
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testPrefixSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Prefix query: terms that start with "apa"
    Query query = new PrefixQuery(new Term("fileName", "apa"));
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testWildcardSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Wildcard query: * matches any number of characters, ? matches exactly one
    Query query = new WildcardQuery(new Term("fileName", "?a*"));
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testBooleanSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Occur.MUST      required, like AND
    // Occur.MUST_NOT  prohibited, like NOT
    // Occur.SHOULD    optional, like OR
    // Occur.FILTER    same as MUST but does not contribute to scoring; used for
    //                 filtering, replacing the pre-6.0 Filter API
    Builder queryBuilder = new BooleanQuery.Builder();
    Query fileNameQuery = new TermQuery(new Term("fileName", "apache"));
    Query fileContentQuery = new TermQuery(new Term("fileContent", "java"));
    queryBuilder.add(fileNameQuery, Occur.MUST);
    queryBuilder.add(fileContentQuery, Occur.SHOULD);
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(queryBuilder.build(), 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testPhraseSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // "apache" and "solr" within ten positions of each other (slop = 10)
    Query query = new PhraseQuery(10, "fileName", "apache", "solr");
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testFuzzySearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Allow up to two edits (insertions, deletions, substitutions) from "aahe",
    // so it can match e.g. "apache"
    Query query = new FuzzyQuery(new Term("fileName", "aahe"), 2);
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
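FuzzyQuery's "two characters different" allowance is Levenshtein edit distance: insertions, deletions, and substitutions each cost one edit. A standard dynamic-programming computation shows why aahe at distance 2 can match apache.

```java
public class EditDistanceSketch {
    // Classic dynamic-programming Levenshtein distance.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution (or match)
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("aahe", "apache")); // 2: insert 'p' and 'c'
    }
}
```

Note that Lucene computes this over automata rather than a DP table, and caps the distance at 2; the arithmetic is the same.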
@Test
public void testQueryParserSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Create the indexReader
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // QueryParser needs an analyzer
    QueryParser queryParser = new QueryParser("fileName", new StandardAnalyzer());
    Query query = queryParser.parse("fileName:apache AND fileContent:java");
    // Fetch the top 20 hits
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
Note: wildcard terms that begin with * or ? are only accepted after calling queryParser.setAllowLeadingWildcard(true); the default is false.
QueryParser syntax:
Symbol | Meaning | Example | Notes
---|---|---|---
+ | Like MUST: the clause must match | +fileName:apache AND fileName:solr | The first clause must match; the second may or may not (it is effectively ignored here, whether AND or OR is used)
# | Like FILTER: must match, not scored | #fileName:apache AND fileName:solr | Same as above: only the first clause is enforced
- | Like MUST_NOT: the clause must not match | -fileName:apache AND fileName:solr | The first clause must not match; the second clause is still required (whether AND or OR is used)
AND | Both sides must match | fileName:apache AND fileName:solr | Both clauses are required
OR | Either side may match | fileName:apache OR fileName:solr | A bare space behaves the same way
"phrase" | Must match the whole phrase | fileName:"apache lucene.txt" | Must match "apache lucene" exactly
*/? | Wildcard match | fileName:apa* | File names containing a term that starts with apa
~ | Proximity match | fileName:"apache solr"~10 | apache and solr within ten positions of each other
~ | Fuzzy match | fileName:apecha~0.4 | Fuzzy match; the default similarity is 0.5
[x TO y] | String range match | fileName:[a TO b] | Uses string comparison (first character, then the next, and so on); use {} for exclusive bounds
^ | Boost factor | fileName:apache^4 | Raises the score of documents matching fileName:apache
field:(+a +b) | Field grouping | fileName:(+apache +solr) | fileName must contain both apache and solr
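The [x TO y] row uses plain lexicographic string comparison, the same rule as Java's String.compareTo. A quick check of that rule, with example terms of my own choosing:

```java
public class TermRangeSketch {
    // A term is inside [lower TO upper] when it compares >= lower and <= upper.
    public static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange("apache", "a", "b")); // true:  "a" <= "apache" < "b"
        System.out.println(inRange("solr", "a", "b"));   // false: "solr" sorts after "b"
    }
}
```

This is why string ranges behave surprisingly on numbers: "100" sorts before "20", so numeric fields should use LongPoint.newRangeQuery instead.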
Note: a reader obtained from DirectoryReader.open(Directory directory) does not see changes (e.g. deletes) made through the IndexWriter after it was opened. DirectoryReader.openIfChanged(DirectoryReader oldReader) returns a new IndexReader when the index files have changed and null when they have not, which lets you detect and pick up changes.
@Test
public void testSortSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    QueryParser queryParser = new QueryParser("fileName", new StandardAnalyzer());
    Query query = queryParser.parse("*:* fileName:apache");
    // Sort by score; the second-to-last argument controls whether scores are
    // computed (when false the score comes back as NaN), the last whether the
    // max score is computed
    TopDocs topDocs = indexSearcher.search(query, 20, Sort.RELEVANCE, true, false);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("id:" + doc.get("id"));
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
Sort has two built-in instances: Sort.RELEVANCE (by score) and Sort.INDEXORDER (by document order).
A custom Sort:
// Compare the values as longs; the third argument reverses the order when true
// Sorting on a field requires the matching DocValues field to have been stored
// at index time; that is what makes sorting efficient
SortField sortField = new SortField("fileSize", Type.LONG, false);
Sort sort = new Sort(sortField);
TopDocs topDocs = indexSearcher.search(query, 20, sort, true, false);
Using a custom comparison rule:
public class MyCoustomFieldComparatorSource extends FieldComparatorSource {
    /**
     * Create the custom comparator:
     * fieldname the field name
     * numHits   the number of hits to track
     * reversed  whether to reverse the order
     */
    @Override
    public FieldComparator<?> newComparator(String fieldname, int numHits, int sortPos, boolean reversed) {
        return new MyCoustomFieldComparator(fieldname, numHits);
    }

    private class MyCoustomFieldComparator extends SimpleFieldComparator<String> {
        private String[] values;
        private String fieldName;
        private String top;
        private String bottom;
        private LeafReaderContext leafReaderContext;

        public MyCoustomFieldComparator(String fieldname, int numHits) {
            this.fieldName = fieldname;
            this.values = new String[numHits];
        }

        // Record the current LeafReaderContext
        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            this.leafReaderContext = context;
        }

        // Store a hit's value: slot is the slot index, doc the document id
        @Override
        public void copy(int slot, int doc) throws IOException {
            values[slot] = leafReaderContext.reader().document(doc).getField(fieldName).stringValue();
        }

        // Compare the values held in two slots
        @Override
        public int compare(int slot1, int slot2) {
            return Integer.compare(values[slot1].length(), values[slot2].length());
        }

        // Return the value held in a slot
        @Override
        public String value(int slot) {
            return values[slot];
        }

        // Record the weakest entry currently in the queue; slot is its index
        @Override
        public void setBottom(int slot) throws IOException {
            bottom = values[slot];
        }

        // Record the top value to compare against
        @Override
        public void setTopValue(String value) {
            this.top = value;
        }

        // Compare a candidate document against the bottom value
        @Override
        public int compareBottom(int doc) throws IOException {
            String currentFileName = leafReaderContext.reader().document(doc).getField(fieldName).stringValue();
            return Integer.compare(bottom.length(), currentFileName.length());
        }

        // Compare a candidate document against the top value
        @Override
        public int compareTop(int doc) throws IOException {
            String currentFileName = leafReaderContext.reader().document(doc).getField(fieldName).stringValue();
            return Integer.compare(top.length(), currentFileName.length());
        }
    }
}
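Stripped of the Lucene plumbing, the ordering that MyCoustomFieldComparator implements is simply comparison by string length:

```java
import java.util.*;

public class LengthOrderSketch {
    public static List<String> sortByLength(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        // Same rule as MyCoustomFieldComparator.compare: order by string length.
        sorted.sort(Comparator.comparingInt(String::length));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortByLength(List.of("apache-solr.txt", "a.txt", "lucene.txt")));
        // [a.txt, lucene.txt, apache-solr.txt]
    }
}
```

The extra bottom/top bookkeeping in the Lucene version exists only so the collector can reject documents early without filling a slot.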
@Test
public void testSortSearch2() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    QueryParser queryParser = new QueryParser("", new StandardAnalyzer());
    Query query = queryParser.parse("*:*");
    // Create the custom comparator source
    MyCoustomFieldComparatorSource myCoustomFieldComparatorSource = new MyCoustomFieldComparatorSource();
    // Create a SortField that uses the custom comparator
    SortField sortField = new SortField("fileName", myCoustomFieldComparatorSource);
    Sort sort = new Sort(sortField);
    TopDocs topDocs = indexSearcher.search(query, 20, sort, true, false);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("id:" + doc.get("id"));
        System.out.println("fileName:" + doc.get("fileName"));
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        System.out.println("fileContent:" + doc.get("fileContent"));
    }
    indexReader.close();
}
@Test
public void testhighlighterSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser queryParser = new QueryParser("fileName", analyzer);
    Query query = queryParser.parse("fileName:apache");
    // Highlighting setup
    QueryScorer queryScorer = new QueryScorer(query);
    // Simple fragmenter
    Fragmenter fragmenter = new SimpleFragmenter();
    // Tags wrapped around matched terms
    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
    // The highlighter itself
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter, queryScorer);
    // Set the fragmenter: long content is cut down to fragments containing the keywords
    highlighter.setTextFragmenter(fragmenter);
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("id:" + doc.get("id"));
        String fileName = doc.get("fileName");
        System.out.println("fileName:" + fileName);
        if (fileName != null) {
            // Get the field's TokenStream
            TokenStream tokenStream = analyzer.tokenStream("fileName", fileName);
            // Take the best-scoring fragment
            String highLightText = highlighter.getBestFragment(tokenStream, fileName);
            System.out.println(highLightText);
        }
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        String fileContent = doc.get("fileContent");
        if (fileContent != null) {
            // Get the field's TokenStream
            TokenStream tokenStream = analyzer.tokenStream("fileContent", fileContent);
            // Take the best-scoring fragment
            String highLightText = highlighter.getBestFragment(tokenStream, fileContent);
            System.out.println(highLightText);
        }
    }
    indexReader.close();
}
/**
 * FastVectorHighlighter requires the field to have been created with the extra
 * term-vector information stored, trading storage space for speed:
 * FieldType fileNameFieldType = new FieldType();
 * fileNameFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
 * fileNameFieldType.setTokenized(true);
 * fileNameFieldType.setStored(true);
 * fileNameFieldType.setStoreTermVectorOffsets(true);   // store character offsets
 * fileNameFieldType.setStoreTermVectorPositions(true); // store position information
 * fileNameFieldType.setStoreTermVectors(true);         // store term vectors
 * fileNameFieldType.freeze();                          // prevent further changes
 * Field fileNameField = new Field("fileName", file.getName(), fileNameFieldType);
 * document.add(fileNameField);
 */
@Test
public void testfastVectorHighlighterSearch() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    DirectoryReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser queryParser = new QueryParser("fileName", analyzer);
    Query query = queryParser.parse("fileName:apache");
    TopDocs topDocs = indexSearcher.search(query, 20);
    // Highlighting setup
    FragListBuilder fragListBuilder = new SimpleFragListBuilder();
    FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder(
            BaseFragmentsBuilder.COLORED_PRE_TAGS,
            BaseFragmentsBuilder.COLORED_POST_TAGS);
    // Arguments: phraseHighlight, fieldMatch, then the two builders
    FastVectorHighlighter fastVectorHighlighter = new FastVectorHighlighter(
            true, true, fragListBuilder, fragmentsBuilder);
    FieldQuery fieldquery = fastVectorHighlighter.getFieldQuery(query);
    // Total hit count
    System.out.println("totalHits:" + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Score
        System.out.println("score:" + scoreDoc.score);
        // Fetch the stored document
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println("id:" + doc.get("id"));
        String fileName = doc.get("fileName");
        System.out.println("fileName:" + fileName);
        if (fileName != null) {
            String highLightText = fastVectorHighlighter.getBestFragment(
                    fieldquery, indexSearcher.getIndexReader(),
                    scoreDoc.doc, "fileName", 100);
            System.out.println(highLightText);
        }
        System.out.println("filePath:" + doc.get("filePath"));
        System.out.println("fileSize:" + doc.get("fileSize"));
        String fileContent = doc.get("fileContent");
        if (fileContent != null) {
            String highLightText = fastVectorHighlighter.getBestFragment(
                    fieldquery, indexSearcher.getIndexReader(),
                    scoreDoc.doc, "fileContent", 100);
            System.out.println(highLightText);
        }
    }
    indexReader.close();
}
TopDocs topDocs = indexSearcher.search(query, 200);
// Take hits 11 through 15 (indices 10 to 14)
int startPos = 10;
int endPos = 15;
// Total hit count
System.out.println("totalHits:" + topDocs.totalHits);
for (int i = startPos; i < endPos; i++) {
    ScoreDoc scoreDoc = topDocs.scoreDocs[i];
    // Score
    System.out.println("score:" + scoreDoc.score);
    // Fetch the stored document
    Document doc = indexSearcher.doc(scoreDoc.doc);
    System.out.println("fileName:" + doc.get("fileName"));
    System.out.println("filePath:" + doc.get("filePath"));
    System.out.println("fileSize:" + doc.get("fileSize"));
    System.out.println("fileContent:" + doc.get("fileContent"));
}
ScoreDoc currentScoreDoc = null;
// Page size
int pageSize = 5;
// Page to fetch
int nextPage = 3;
if (nextPage != 1) {
    // Number of hits up to the end of the previous page
    int num = pageSize * (nextPage - 1);
    TopDocs td = indexSearcher.search(query, num);
    currentScoreDoc = td.scoreDocs[num - 1];
}
// Fetch pageSize more hits after the last hit of the previous page
TopDocs topDocs = indexSearcher.searchAfter(currentScoreDoc, query, pageSize);
System.out.println("totalHits:" + topDocs.totalHits);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    // Score
    System.out.println("score:" + scoreDoc.score);
    // Fetch the stored document
    Document doc = indexSearcher.doc(scoreDoc.doc);
    System.out.println("fileName:" + doc.get("fileName"));
    System.out.println("filePath:" + doc.get("filePath"));
    System.out.println("fileSize:" + doc.get("fileSize"));
    System.out.println("fileContent:" + doc.get("fileContent"));
}
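Both paging approaches come down to the same arithmetic: 1-based page n with pageSize hits per page covers hit indices pageSize*(n-1) through pageSize*n - 1, and searchAfter resumes from the hit just before that range. A small sketch of the bookkeeping:

```java
public class PagingSketch {
    // First hit index (0-based) of 1-based page n.
    public static int startIndex(int pageSize, int page) {
        return pageSize * (page - 1);
    }

    // Index of the cursor hit that searchAfter resumes from (-1 on page 1: no cursor).
    public static int cursorIndex(int pageSize, int page) {
        return startIndex(pageSize, page) - 1;
    }

    public static void main(String[] args) {
        System.out.println(startIndex(5, 3));  // 10: page 3 covers indices 10..14
        System.out.println(cursorIndex(5, 3)); // 9:  resume after the 10th hit
    }
}
```

The trade-off: the first approach collects all hits up to the page in one TopDocs, while searchAfter only keeps pageSize hits in memory per call, so it scales better to deep pages.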
@Test
public void testDeleteIndex() throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    // Index directory
    Directory directory = NIOFSDirectory.open(Paths.get("E:/luceneIndex"));
    // Configure the IndexWriter
    IndexWriterConfig writerConfig = new IndexWriterConfig(analyzer);
    // Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, writerConfig);
    // Delete every document in the index
    indexWriter.deleteAll();
    // Commit the deletion; this clears everything
    indexWriter.commit();
    indexWriter.close();
}
@Test
public void testRangeDeleteIndex() throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    // Index directory
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Configure the IndexWriter
    IndexWriterConfig writerConfig = new IndexWriterConfig(analyzer);
    // Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, writerConfig);
    Query query = LongPoint.newRangeQuery("fileSize", 1000L, 100000L);
    // Delete the matching documents
    indexWriter.deleteDocuments(query);
    // Commit; deleted documents are only marked deleted (kept in a "recycle bin")
    // until segments are merged
    indexWriter.commit();
    // While the indexWriter is still open, a reader can observe the total,
    // live, and deleted document counts
    Thread.sleep(100000L);
    indexWriter.close();
}
System.out.println("maxDoc:" + indexReader.maxDoc());                 // total, including deleted
System.out.println("numDocs:" + indexReader.numDocs());               // live documents
System.out.println("numDeletedDocs:" + indexReader.numDeletedDocs()); // deleted documents
Updating an index really means deleting the documents that match and adding the new document.
@Test
public void testModifyIndex() throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    // Index directory
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    // Configure the IndexWriter
    IndexWriterConfig writerConfig = new IndexWriterConfig(analyzer);
    // Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, writerConfig);
    // Term identifying the documents to replace
    Term termOld = new Term("fileName", "apache");
    // The replacement Document
    Document newDoc = new Document();
    newDoc.add(new TextField("fileName", "GOGOGO", Store.YES));
    newDoc.add(new TextField("fileContent", "GOGOGO WOWOWO", Store.YES));
    // Update: delete the documents containing termOld, then add newDoc
    indexWriter.updateDocument(termOld, newDoc);
    // Commit
    indexWriter.commit();
    indexWriter.close();
}
The main per-token information a TokenStream exposes (as attributes): the term text (CharTermAttribute), the start/end character offsets (OffsetAttribute), the position increment (PositionIncrementAttribute), and the token type (TypeAttribute).
@Test
public void testNearRealTimeSearching() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("E:/luceneIndex"));
    StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(standardAnalyzer);
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    // Create a SearcherManager for near-real-time search
    SearcherManager searcherManager = new SearcherManager(indexWriter, true, true, new SearcherFactory());
    // Hand the SearcherManager to a ControlledRealTimeReopenThread, which
    // reopens the index automatically
    ControlledRealTimeReopenThread<IndexSearcher> CRTReopenThread =
            new ControlledRealTimeReopenThread<IndexSearcher>(indexWriter, searcherManager, 5.0, 0.025);
    // Run as a daemon thread
    CRTReopenThread.setDaemon(true);
    CRTReopenThread.setName("lucene-background-refresh");
    CRTReopenThread.start();
    // Obtain an IndexSearcher from the SearcherManager
    // (pair every acquire() with searcherManager.release(indexSearcher) when done)
    IndexSearcher indexSearcher = searcherManager.acquire();
}
This lives on IndexWriter and should not be called manually; Lucene invokes it itself at appropriate times to optimize the index structure.
Reading the many different document formats calls for a matching utility library; Tika, for example, can extract information from all kinds of documents. Tika also ships a GUI for inspecting that information: the official site offers a tika-app-<version>.jar download.
Tika usage, option 1:
static private Tika tika = new Tika();
static public String fileToTxt(File file, Metadata metadata) throws IOException, TikaException {
    // Metadata carries the document's metadata
    return tika.parseToString(new FileInputStream(file), metadata);
}
Tika usage, option 2:
static private Parser parser;
static private ContentHandler contentHandler;
static private ParseContext parseContext;
static {
    // Auto-detecting parser
    parser = new AutoDetectParser();
    // ContentHandler; BodyContentHandler keeps only the body content of the parsed output
    contentHandler = new BodyContentHandler();
    // Parse context
    parseContext = new ParseContext();
    parseContext.set(Parser.class, parser);
}
public static String fileToTxt(File file, Metadata metadata) throws IOException, SAXException, TikaException {
    parser.parse(new FileInputStream(file), contentHandler, metadata, parseContext);
    return contentHandler.toString();
}