Monday, March 24, 2008

[Reading] Lecture 06 - Probabilistic latent semantic indexing

Before PLSA, LSI was a common way to uncover the relationship between words and documents. An appealing characteristic of LSI is that it automatically finds hidden relevance between two different words (or documents), or between a word and a document, using only statistical information. Much like PCA, formulating the problem as a linear-algebra optimization problem gives it a cleaner theoretical footing and makes it more convincing than heuristic approaches such as matching words across documents.
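To make the LSI part concrete, here is a minimal sketch of LSI as a truncated SVD of a term-document matrix. The toy counts and the choice of k = 2 latent dimensions are my own illustrative assumptions, not values from the paper.

import numpy as np

# rows = terms, columns = documents (hypothetical counts)
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                           # latent dimensions kept
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation of X

# documents projected into the k-dimensional latent semantic space
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
print(doc_vectors)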

However, LSI lacks a statistical foundation and is therefore hard to combine with other probabilistic models. To overcome this deficit, this paper introduces an aspect model that analyzes the problem on probabilistic principles. A latent class variable (similar to the hidden state in an HMM) models the implicit relationship between documents and words. The author notes that the class-conditional multinomial distributions over the vocabulary in the aspect model can be represented as points on the simplex of all possible multinomials; in this sense, PLSA can be thought of as dimensionality reduction into a probabilistic latent semantic space.
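As a rough sketch of the aspect model: the joint probability factorizes as P(d, w) = sum_z P(z) P(d|z) P(w|z), and the multinomials can be fitted with EM. The toy counts, number of classes K, and iteration count below are illustrative assumptions; the paper itself uses tempered EM rather than the plain EM shown here.

import numpy as np

rng = np.random.default_rng(0)
n = np.array([[2., 0., 1., 0.],          # n[d, w]: word counts per document (toy data)
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.]])
D, W = n.shape
K = 2                                     # number of latent classes z

p_z  = np.full(K, 1.0 / K)                                            # P(z)
p_dz = rng.random((K, D)); p_dz /= p_dz.sum(axis=1, keepdims=True)    # P(d|z)
p_wz = rng.random((K, W)); p_wz /= p_wz.sum(axis=1, keepdims=True)    # P(w|z)

for _ in range(100):
    # E-step: P(z|d,w) proportional to P(z) P(d|z) P(w|z)
    joint = p_z[:, None, None] * p_dz[:, :, None] * p_wz[:, None, :]  # shape (K, D, W)
    post = joint / joint.sum(axis=0, keepdims=True)
    # M-step: re-estimate the multinomials from expected counts n(d,w) P(z|d,w)
    weighted = n[None, :, :] * post
    p_wz = weighted.sum(axis=1); p_wz /= p_wz.sum(axis=1, keepdims=True)
    p_dz = weighted.sum(axis=2); p_dz /= p_dz.sum(axis=1, keepdims=True)
    p_z  = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()

print(p_wz)   # class-conditional word distributions P(w|z)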

With a solid probabilistic foundation, PLSA can take advantage of standard statistical methods for model fitting, overfitting control, and model combination. Experiments show that it achieves significant gains in precision over both standard term matching and LSI.
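The overfitting control is done in the paper with tempered EM, which dampens the E-step posteriors with an inverse-temperature parameter beta < 1 tuned on held-out data. Below is a rough sketch of how the E-step in the earlier code would change; the exact placement of beta and its fixed value are my assumptions and should be checked against the paper.

beta = 0.8   # illustrative fixed value; tempered EM lowers beta while held-out performance improves
joint = p_z[:, None, None] * (p_dz[:, :, None] * p_wz[:, None, :]) ** beta
post = joint / joint.sum(axis=0, keepdims=True)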

Reference:
T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50-57, ACM Press, 1999.
