The Archive Base

Editorial Team
Asked: May 27, 2026

I’m currently indexing web pages using Lucene. The aim is to be able to quickly find which pages contain a certain expression (usually 1, 2, or 3 words), and which other words (or groups of 1 to 3 words) also appear on those pages.
This will be used to build, enrich, or alter a thesaurus (fixed vocabulary).

From the articles I found, it seems the problem comes down to finding n-grams (or shingles).

Lucene has a ShingleFilter, a ShingleMatrixFilter, and a ShingleAnalyzerWrapper, which seem related to this task.

From this presentation, I learned that Lucene can also search for terms separated by up to a given number of words (the “slop”). An example is provided here.

However, I don’t clearly understand the difference between these approaches. Are they fundamentally different, or is it a performance / index size choice that you have to make?

What is the difference between ShingleMatrixFilter and ShingleFilter?

Hope a Lucene guru will find this question and answer it 😉!


1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 4:04 pm

    The differences between using phrase queries and shingles mainly involve performance and scoring.

    With a phrase query (say “foo bar”), in the typical case where only single words are in the index, Lucene has to walk the inverted index for “foo” and for “bar” to find the documents that contain both terms, then walk the position lists within each of those documents to find the places where “foo” appears right before “bar”.

    This has some cost to both performance and scoring:

    1. Positions (.prx) must be indexed and searched. This is like an additional “dimension” of the inverted index, and it increases both indexing and search times.
    2. Because only individual terms appear in the inverted index, no real “phrase IDF” is computed (this might not affect you). Instead, it is approximated from the sum of the term IDFs.
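
To make the two-step walk described above concrete, here is a minimal sketch in plain Python. It is not Lucene's API or its actual implementation: the function names, the whitespace tokenizer, and the sample documents are all assumptions made for illustration. It first intersects the document lists of the terms, then checks positions inside each candidate document.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return sorted doc ids where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Step 1: keep only documents that contain every term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    # Step 2: inside each candidate, look for consecutive positions.
    hits = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        starts = index[terms[0]][doc_id]
        if any(all(s + i + 1 in later[i] for i in range(len(later))) for s in starts):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample documents.
phrase_docs = {1: "foo bar baz", 2: "bar foo qux", 3: "foo qux bar"}
phrase_index = build_positional_index(phrase_docs)
print(phrase_match(phrase_index, "foo bar"))
```

All three sample documents contain both words, so step 1 alone cannot tell them apart; only the position check in step 2 does, which is the extra work a phrase query pays for.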

    On the other hand, if you use shingles, you are also indexing word n-grams. In other words, if you are shingling up to size 2, you will also have terms like “foo bar” in the index. This means that this phrase query will be parsed as a simple TermQuery, without using any position lists. And since it’s now a “real term”, the phrase IDF will be exact, because we know exactly how many documents this “term” appears in.
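
As a rough illustration of the shingle approach, the sketch below (plain Python, not Lucene's ShingleFilter) emits unigrams plus word bigrams as index terms, so a two-word phrase becomes a single dictionary lookup with an exact document frequency. The helper names, the sample documents, and the smoothed IDF formula are assumptions chosen for the example, not the scoring Lucene necessarily applies.

```python
import math
from collections import defaultdict

def shingles(tokens, max_size=2):
    """Yield unigrams, then word n-grams up to max_size, joined with a space."""
    for n in range(1, max_size + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_shingle_index(docs, max_size=2):
    """Map each shingle to the set of documents containing it (no positions stored)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in shingles(text.lower().split(), max_size):
            index[term].add(doc_id)
    return index

def idf(index, term, num_docs):
    """Textbook-style smoothed IDF, used here only for illustration."""
    doc_freq = len(index.get(term, ()))
    return math.log(num_docs / (1 + doc_freq)) + 1

# Hypothetical sample documents.
shingle_docs = {1: "foo bar baz", 2: "bar foo qux", 3: "foo qux bar"}
shingle_index = build_shingle_index(shingle_docs)

# The phrase is now an ordinary term: one lookup, no position lists.
print(sorted(shingle_index["foo bar"]))
# Exact phrase IDF versus the sum-of-term-IDFs approximation.
print(idf(shingle_index, "foo bar", 3))
print(idf(shingle_index, "foo", 3) + idf(shingle_index, "bar", 3))
```

The price is visible in the index itself: every document contributes its bigrams as extra terms on top of its single words.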

    But using shingles has some costs as well:

    1. Increased term dictionary, term index, and postings list sizes, though this might be a fair tradeoff, especially if you disable positions entirely with Field.setIndexOptions.
    2. Some additional cost during the analysis phase of indexing, although ShingleFilter is nicely optimized and pretty fast.
    3. No obvious way to compute “sloppy phrase queries” or inexact phrase matches, although this can be approximated. For example, for the phrase “foo bar baz” with shingles of size 2, you will have two tokens, foo_bar and bar_baz, and you could implement the search with some of Lucene’s other queries (like BooleanQuery) for an inexact approximation.
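
One way to picture the approximation in point 3 is sketched below in plain Python. It stands in for a boolean query over bigram terms and is not Lucene's BooleanQuery: documents are ranked by how many of the phrase's bigrams they contain. The function names, the min_should_match parameter, and the sample documents are hypothetical.

```python
from collections import defaultdict

def bigrams(tokens):
    """Join adjacent tokens with an underscore, e.g. foo_bar."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def build_bigram_index(docs):
    """Map each bigram term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bigrams(text.lower().split()):
            index[term].add(doc_id)
    return index

def approximate_phrase(index, phrase, min_should_match=1):
    """Return (doc_id, matched_bigram_count) pairs, best matches first."""
    counts = defaultdict(int)
    for term in bigrams(phrase.lower().split()):
        for doc_id in index.get(term, ()):
            counts[doc_id] += 1
    hits = [(d, n) for d, n in counts.items() if n >= min_should_match]
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Hypothetical sample documents.
bigram_docs = {
    1: "foo bar baz",
    2: "foo bar qux baz",
    3: "qux bar baz",
    4: "baz bar foo",
}
bigram_index = build_bigram_index(bigram_docs)
print(approximate_phrase(bigram_index, "foo bar baz"))
```

Requiring every bigram (min_should_match=2 here) behaves like an exact three-word phrase in this example, while lower thresholds admit partial matches. This is only an approximation of slop: a document with “foo qux bar” would match a sloppy phrase query for “foo bar” but contains no foo_bar bigram.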

    In general, indexing word n-grams with things like Shingles or CommonGrams is just a (fairly expert) tradeoff, made to reduce the cost of positional queries or to improve phrase scoring.

    But there are real-world use cases for this. A good example is available here:
    http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
