glimpse

GLIMPSE, which stands for GLobal IMPlicit SEarch, provides indexing and query schemes for file systems. The novelty of glimpse is that it uses a very small index – in most cases 2-4% of the size of the text – and still allows very flexible full-text retrieval including Boolean queries, approximate matching (i.e., allowing misspelling), and even searching for regular expressions. In a sense, glimpse extends agrep to entire file systems, while preserving most of its functionality and simplicity. Query times are typically slower than with inverted indexes, but they are still fast enough for many applications. For example, it took 5 seconds of CPU time to find all 19 occurrences of Usenix AND Winter in a file system containing 69MB of text spanning 4300 files. Glimpse is particularly designed for personal information, such as one’s own file system. The main characteristic of personal information is that it is non-uniform and includes many types of documents. An information retrieval system for personal information should support many types of queries, flexible interaction, low overhead, and customization, All these are important features of glimpse.

http://dl.acm.org/citation.cfm?id=1267078

GLIMPSE uses and takes a great deal of inspiration from Agrep, which was also developed at the University of Arizona, but GLIMPSE uses a high level index whereas Agrep parses all the text each time.

http://en.wikipedia.org/wiki/GLIMPSE

  • Strengths
    1. Very small index.
    2. Approximate matching is supported.
    3. Fast index construction.
    4. No need to define document boundaries ahead of time. It can be done at query time.
    5. Easy to customize to user preferences.
    6. Easy to adapt to a parallel computer (different blocks can be searched by different processors).
    7. Easy to modify the index due to its small size. Therefore, dynamic texts can be supported.
    8. No need to extract stems. Subword queries are supported automatically (even subwords that appear in the middle of words).
    9. Queries with wildcards, classes of characters, and even regular expressions are supported.
  • Weaknesses
    1. Slower compared to inverted indexes for some queries. Not suitable for applications where speed is the predominant concern.
    2. Too slow, at this stage (but we’re working on it), for very large texts (more than 500MB).
    3. Boolean queries containing common words are slow.

http://webglimpse.net/pubs/glimpse.pdf
— The original Glimpse paper

Also, check out my blog about why I us glimpse as my personal search engine.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s