2008年3月31日 星期一

[Reading] Lecture 07 - Similarity Search in High Dimensions via Hashing

Finding the k-nearest neighbors from a database is useful for many applications, especially in a query system. However, if the searching algorithm is not well designed, it becomes a extremely time-consuming process when applied to a large database, and may make the applications impractical to be used.

Although there are some data structures such as k-d trees developed to speed up solving the k-nearest problem, evidence shows that it is no much better than using the brute-force method when the dimensions are high. Researchers use the slang "curse of dimensionality" to describe this annoying phenomenon.

The main idea of this paper is that obtaining the exact answer is not always necessary, an approximate similarity search is usually enough. In other words, the searching speed is usally more important than the accurary of the searching results.

Instead of using space partitioning (e.g. k-d trees), this paper uses locality-sensitive hashing (LSH) to accelerate the searching speed. Specifically, the authors use two levels of hashing - 1. An LSH function maps points to bucket. 2. A standard hash function that maps the contents of buckets into a hash table. The searching methods are primarily composed of two phases, a preprocessing phase and a query phase. The hash tables are built in the preprocessing phases, and later used in the query phase.

The author further analysize the ability of LSH by both mathematical and experimental ways. Of course, they show their method considerable speed-up the query process. An additional advatange of their method is that the running time is determined in advanced.

Comment:
I depreciate the way how they explain their technical thing. The symbols and the equations used in the paper are ambiguous (especially when they explain Locality-Sensitive Hashing). I remember that KT has presented a very similar topic before, she did a much better job than this paper does!


Reference :

Gionis, Indyk, and Motwani, "Similarity Search in High Dimensions via Hashing," VLDB: International Conference on Very Large Data Bases, Morgan Kaufmann Publishers, 1999.

沒有留言: