Monday, March 31, 2008

[Reading] Lecture 07 - Algorithms for Fast Vector Quantization

As in the last paper, this paper studies the problem of finding the closest data point in a high-dimensional space. The authors introduce three search algorithms and evaluate them experimentally.

The first two algorithms are refinements of the well-known k-d tree data structure. The first uses incremental distance calculation to reduce the cost of updating the query-to-cell distance while traversing the k-d tree to O(1) per step (independent of the dimension), which is a tricky and very useful idea. The second uses a priority queue to determine the order in which cells are searched, which allows the query to terminate earlier.
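As a concrete illustration of the incremental distance idea, here is a minimal sketch (the function and variable names are mine, not the paper's): when the search descends into the child cell on the far side of a splitting plane, only the offset along the splitting dimension changes, so the squared query-to-cell distance can be updated in O(1) regardless of the dimension.

```python
import numpy as np

def descend_far_child(dist_sq, offsets, query, split_dim, split_val):
    """O(1) update of the squared distance from the query to a k-d tree cell.

    offsets[i] is the current per-coordinate offset from the query to the cell.
    Moving into the child on the far side of the splitting plane changes only
    the offset along split_dim; the other d-1 coordinates are untouched.
    """
    old = offsets[split_dim]
    new = query[split_dim] - split_val            # offset to the splitting plane
    offsets_child = offsets.copy()
    offsets_child[split_dim] = new
    return dist_sq - old * old + new * new, offsets_child

# Toy usage: an 8-dimensional query starting in the root cell (all offsets zero).
q = np.random.rand(8)
d2, off = descend_far_child(0.0, np.zeros(8), q, split_dim=3, split_val=0.2)
```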


The third algorithm is based on neighborhood graphs. Any two connected nodes in a neighborhood graph must satisfy some local criterion, and this structure can be used to accelerate the query. However, the preprocessing needed to construct the graph is comparatively expensive.
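To see how such a graph can answer queries, here is a minimal greedy-search sketch; it only illustrates the general graph-search idea and is not the exact procedure of the paper. In practice the starting node could be chosen randomly or from a coarser structure.

```python
import numpy as np

def greedy_graph_search(points, neighbors, query, start=0):
    """Greedy nearest-neighbor search on a neighborhood graph.

    points:    (n, d) array of data points
    neighbors: dict mapping each node index to the indices of its graph neighbors
    """
    current = start
    best = np.sum((points[current] - query) ** 2)
    improved = True
    while improved:
        improved = False
        for nb in neighbors[current]:
            d = np.sum((points[nb] - query) ** 2)
            if d < best:                      # move to any neighbor closer to the query
                best, current = d, nb
                improved = True
    return current, np.sqrt(best)
```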

Reference:
S. Arya and D. M. Mount, "Algorithms for fast vector quantization." Proceedings of DCC 93: Data Compression Conference, pp. 381-390, IEEE Press, 1993.

[Reading] Lecture 07 - Similarity Search in High Dimensions via Hashing

Finding the k-nearest neighbors in a database is useful for many applications, especially in query systems. However, if the search algorithm is not well designed, it becomes an extremely time-consuming process when applied to a large database, which may make the application impractical.

Although data structures such as k-d trees have been developed to speed up the k-nearest-neighbor problem, evidence shows that they are not much better than brute-force search when the dimensionality is high. Researchers use the term "curse of dimensionality" to describe this annoying phenomenon.

The main idea of this paper is that obtaining the exact answer is not always necessary; an approximate similarity search is usually enough. In other words, the search speed is usually more important than the accuracy of the search results.

Instead of using space partitioning (e.g., k-d trees), this paper uses locality-sensitive hashing (LSH) to accelerate the search. Specifically, the authors use two levels of hashing: 1. an LSH function maps points to buckets; 2. a standard hash function maps the contents of the buckets into a hash table. The search method is composed of two phases, a preprocessing phase and a query phase. The hash tables are built in the preprocessing phase and later used in the query phase.
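The sketch below illustrates this two-level, two-phase structure for binary vectors, using bit sampling as the LSH family for Hamming distance; the parameter values and function names are my own assumptions, not the paper's exact construction.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def build_lsh_tables(points, n_tables=10, n_bits=12, table_size=2**16):
    """Preprocessing phase: build several hash tables over binary (0/1) vectors."""
    dim = points.shape[1]
    tables = []
    for _ in range(n_tables):
        bits = rng.choice(dim, size=n_bits, replace=False)    # level 1: LSH (bit sampling)
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            key = hash(tuple(p[bits])) % table_size           # level 2: standard hash
            buckets[key].append(i)
        tables.append((bits, buckets))
    return tables

def query_lsh(tables, points, q, table_size=2**16):
    """Query phase: gather candidates from the matching buckets and rank them."""
    candidates = set()
    for bits, buckets in tables:
        key = hash(tuple(q[bits])) % table_size
        candidates.update(buckets.get(key, []))
    if not candidates:
        return None
    cand = np.array(sorted(candidates))
    dists = np.sum(points[cand] != q, axis=1)                 # Hamming distance
    return cand[np.argmin(dists)]

# Demo on random binary data (an assumption for the example).
data = rng.integers(0, 2, size=(1000, 64))
tables = build_lsh_tables(data)
print(query_lsh(tables, data, data[42]))                      # finds index 42
```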

The authors further analyze the ability of LSH both mathematically and experimentally. As expected, they show that their method considerably speeds up the query process. An additional advantage of their method is that the running time can be determined in advance.

Comment:
I do not appreciate the way they explain the technical material. The symbols and equations used in the paper are ambiguous (especially where they explain locality-sensitive hashing). I remember that KT presented a very similar topic before, and she did a much better job than this paper does!


Reference:

A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," VLDB: International Conference on Very Large Data Bases, Morgan Kaufmann Publishers, 1999.

Monday, March 24, 2008

[Reading] Lecture 06 - Algorithms and Applications for the approximate Nonnegative Matrix Factorization

NMF (Nonnegative Matrix Factorization) is another useful algorithm for factorizing a matrix A. More specifically, NMF tries to find two matrices W and H such that WH ~ A, where ~ means "roughly equal" here. Very often the data to be processed (A) is nonnegative, so to avoid contradicting its physical meaning, an important constraint imposed on NMF is that the low-rank factors (W and H) must consist of nonnegative values. In this paper, the cost function is defined as:

0.5 * ||A - WH||_F^2,  where ||.||_F denotes the Frobenius norm.

In fact, the cost function can be defined in other ways as well.

The paper introduces three approaches to solving the NMF problem: the Multiplicative Update algorithm, the Gradient Descent algorithm, and the Alternating Least Squares (ALS) algorithm. These algorithms face some difficult challenges, such as the existence of local minima and the lack of a unique solution. None of them guarantees that the answer is a global minimum of the cost function, but in many data mining applications a local minimum is enough to be useful. Each algorithm has its own disadvantages, such as slow speed or convergence problems. In practice, however, researchers often sacrifice convergence theory to speed up the process. Therefore, ALS is the favored algorithm among the three. After introducing the three algorithms, the paper goes into more detail on how to set the constraints or add penalty terms.
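A minimal sketch of the ALS idea, alternately solving an unconstrained least-squares problem for one factor and clipping negative entries to zero, is shown below; it is only an illustration of the basic scheme, not the implementation studied in the paper.

```python
import numpy as np

def nmf_als(A, k, n_iter=100, seed=0):
    """Basic Alternating Least Squares for NMF: minimize 0.5 * ||A - W H||_F^2
    with W >= 0, H >= 0 by alternating least squares plus projection to zero."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    for _ in range(n_iter):
        H = np.maximum(np.linalg.lstsq(W, A, rcond=None)[0], 0)        # fix W, solve for H
        W = np.maximum(np.linalg.lstsq(H.T, A.T, rcond=None)[0].T, 0)  # fix H, solve for W
    return W, H

# Usage: factor a random nonnegative 20x10 matrix with rank 3.
A = np.random.rand(20, 10)
W, H = nmf_als(A, k=3)
print(0.5 * np.linalg.norm(A - W @ H, 'fro') ** 2)
```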

Then, the paper gives two example applications of NMF, one is text mining and the other is spectral data analysis. In text mining, NMF uses a document-term matrix constructed with the weights of various terms from a set of documents. Then, the matrix is factored into a term-feature and a feature-document matrix similar to LSI. As a matter of fact, NMF is equivalent to PLSA when it is obtained by minimizing the Kullback–Leibler divergence.

Reference:
Michael W. Berry, Murray Browne, Amy Nicole Langville, V. Paul Pauca, and Robert J. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics & Data Analysis, 52(1), pp. 155-173, 2007.

[Reading] Lecture 06 - Probabilistic latent semantic indexing

Before PLSA, people used LSI to find relationships between words and documents. An appealing characteristic of LSI is that it automatically finds the hidden relevance between two different words (or documents), or between a word and a document, using statistical information. Just like PCA, formulating the problem as an optimization problem in linear algebra is more principled and convincing than heuristic approaches such as matching words across documents.

However, LSI lacks a statistical foundation and is hence harder to combine with other models. To overcome this deficit, this paper introduces an aspect model that analyzes the problem on a probabilistic basis. A latent class (similar to the hidden state in an HMM) is used to model the implicit relationship between documents and words. The author notes that the class-conditional multinomial distributions over the vocabulary in the aspect model can be represented as points in the space of all possible multinomials. That is, PLSA can be thought of as dimensionality reduction into a probabilistic latent semantic space.
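To make the aspect model concrete, the sketch below fits P(d, w) = sum_z P(z) P(d|z) P(w|z) with plain EM on a term-frequency matrix; Hofmann additionally uses tempered EM for overfitting control, which this minimal illustration omits.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Plain EM for the PLSA aspect model.  counts: (n_docs, n_words) matrix n(d, w)."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    Pz = np.full(n_topics, 1.0 / n_topics)                                  # P(z)
    Pd_z = rng.random((n_topics, D)); Pd_z /= Pd_z.sum(1, keepdims=True)    # P(d|z)
    Pw_z = rng.random((n_topics, W)); Pw_z /= Pw_z.sum(1, keepdims=True)    # P(w|z)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), shape (n_topics, D, W)
        joint = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]
        post = joint / (joint.sum(0, keepdims=True) + 1e-12)
        # M-step: re-estimate the multinomials from expected counts n(d,w) P(z|d,w)
        weighted = post * counts[None, :, :]
        Pw_z = weighted.sum(1); Pw_z /= Pw_z.sum(1, keepdims=True) + 1e-12
        Pd_z = weighted.sum(2); Pd_z /= Pd_z.sum(1, keepdims=True) + 1e-12
        Pz = weighted.sum((1, 2)); Pz /= Pz.sum()
    return Pz, Pd_z, Pw_z
```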

With its strong probabilistic foundation, PLSA can take advantage of standard statistical methods for model fitting, overfitting control, and model combination. Experiments show that it achieves significant gains in precision over both standard term matching and LSI.

Reference:
T. Hofmann, "Probabilistic latent semantic indexing," Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57, ACM Press, 1999.

Monday, March 17, 2008

[Reading] Lecture 05 - Shape matching and object recognition using shape contexts

This paper provides an approach to measure the similarity (distance) between two shapes. The authors use several points sampled from the contours to represent a shape. Each sample point has a descriptor that describes the coarse distribution of the rest of the shape, using a histogram of the angles and distances to the other sample points. With these point descriptors, finding the correspondence between two shapes becomes the problem of finding, for each sample point on one shape, the point on the other shape with the most similar shape context.
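The sketch below computes such a log-polar histogram for every sample point; the bin counts and the normalization by the mean pairwise distance are assumptions on my part rather than the paper's exact settings.

```python
import numpy as np

def shape_contexts(points, n_r=5, n_theta=12):
    """Shape-context histograms.  points: (n, 2) contour samples.
    Returns an (n, n_r * n_theta) array, one log-polar histogram per point."""
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]            # vectors to all other points
    dist = np.linalg.norm(diff, axis=2)
    angle = np.arctan2(diff[..., 1], diff[..., 0])
    mean_d = dist[dist > 0].mean()                            # normalize for scale invariance
    r_bin = np.digitize(np.log(dist / mean_d + 1e-12),
                        np.linspace(np.log(0.125), np.log(2.0), n_r - 1))
    t_bin = np.digitize(angle, np.linspace(-np.pi, np.pi, n_theta + 1)[1:-1])
    hists = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i != j:
                hists[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    return hists / (n - 1)
```

Matching can then proceed by comparing histograms, e.g. with a chi-square statistic, and assigning each point to the point on the other shape with the most similar histogram.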

After obtaining the correspondences between points on two shapes, an aligning transformation (warping) from one shape to the other is estimated. The magnitude of the aligning transform is viewed as a measure of shape similarity. The overall similarity between the two shapes is computed as the sum of the matching errors between corresponding points and the magnitude of the aligning transform.

K-medoids (a variant of K-means) is adopted to select the prototypes that should be stored, and a K-NN classifier is used for recognition. Through experiments on a 3D object database, they show that their algorithm is invariant to several image transformations.

This paper evaluates the distance between two shapes in a way quite different from previous work. By using the shape context descriptor, they do not require any key points (e.g., maxima of curvature or inflection points) or the grayscale values inside the silhouette to describe the shape.

Reference:
S. Belongie, J. Malik and J. Puzicha, "Shape Matching and Object Recognition Using Shape Contexts." IEEE Trans. Pattern Anal. Mach. Intell, 24(4), pp. 509-522, 2002.

[Reading] Lecture 05 - A Boundary-Fragment-Model for Object Detection

There are many different features that can be used to detect objects, such as texture, shape, and color. Since they can be used together to get superior performance, we cannot say that any one of them is the best. In this paper, the authors introduce a system that uses boundary fragments as the feature, to exploit the shape information more fully. The reason they do not use the entire object boundary as the feature is the limitation of edge detection: it is really hard to obtain a complete object boundary from a noisy image (or under occlusion).

Like most object detectors, the BFM (Boundary-Fragment-Model) system needs a training phase beforehand. The BFM system first detects edges in the training images with the Canny detector. After finding the edges, several fragments are chosen according to scores computed on the validation images. Since each fragment carries information about the object centroid, the system can estimate the object position from the matched fragments in the detection phase. The fragment candidates are used alone or combined together to form weak classifiers. From the weak classifiers, a strong classifier is obtained by the classic AdaBoost algorithm.
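As a reminder of how AdaBoost turns weak classifiers into a strong one, here is a generic discrete-AdaBoost sketch using decision stumps; the BFM paper builds its weak detectors from boundary fragments, so the stumps here are only a stand-in for illustration.

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=20):
    """Discrete AdaBoost with threshold stumps.  X: (n, d), y: labels in {-1, +1}.
    Returns a list of (dim, threshold, polarity, alpha)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for dim in range(d):                                   # pick the lowest-error stump
            for thr in np.unique(X[:, dim]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, dim] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, dim, thr, pol, pred)
        err, dim, thr, pol, pred = best
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)                  # weight of this weak classifier
        ensemble.append((dim, thr, pol, alpha))
        w *= np.exp(-alpha * y * pred)                         # emphasize misclassified samples
        w /= w.sum()
    return ensemble

def strong_classify(ensemble, x):
    s = sum(a * p * (1 if x[dim] > thr else -1) for dim, thr, p, a in ensemble)
    return 1 if s > 0 else -1
```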

In the detection phase, the strong classifier is used to decide whether an object is present. If the object exists, Mean Shift is applied to the Hough voting space to get the centroid of the object. Then the silhouette of the object is obtained by backprojection of each weak classifier.

In their experiments, they show that their performance is competitive with other state-of-the-art detectors. They also show their detector's ability to distinguish between different objects with similar contours.

At the end of the paper, they discuss invariance to scale, rotation, and viewpoint. However, I think this part is explained ambiguously, so I have no further comments on it.

Reference:
A. Opelt, A. Pinz, and A. Zisserman, "A Boundary-Fragment-Model for Object Detection," European Conference on Computer Vision, pp. II:575-588, 2006.

Monday, March 10, 2008

[Reading] Lecture 04 - Nonlinear Dimensionality Reduction by Locally Linear Embedding

The idea used in this paper is very simple. Imagine that the feature points are scattered on a manifold made of locally linear patches, so that each feature point can be described as a linear combination of its neighbors. Under this assumption, reconstruction weights for each feature point are first obtained by finding its k nearest neighbors and minimizing a reconstruction error. Given those weights, low-dimensional coordinates (of dimension d, smaller than the original dimension D) are then chosen to minimize the sum of the reconstruction costs over all feature points.
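The two steps translate almost directly into code; the sketch below is a minimal LLE implementation for illustration (the regularization constant for the local Gram matrix is my own choice).

```python
import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Locally Linear Embedding.  X: (n, D) data.  Returns an (n, n_components) embedding."""
    n = X.shape[0]
    # Step 1: reconstruction weights from the k nearest neighbors of each point.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[knn[i]] - X[i]                                  # neighbors centered on x_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(n_neighbors)          # regularize the local Gram matrix
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, knn[i]] = w / w.sum()                            # weights sum to one
    # Step 2: embedding coordinates from the bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:n_components + 1]                        # skip the constant eigenvector
```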

My idea:
I think a major trick in LLE is that a (rough) local description of each point, its reconstruction weights, is found first. In this way, some information (the local relationships, in LLE) is preserved when the embedding coordinates are computed. My point is: if we obtain the weights by other means (e.g., choosing neighbors that belong to the same class instead of using kNN), something other than the local relationships (e.g., in-class relationships) can be preserved after the dimension reduction.

Reference:
S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding." Science, Vol. 290, pp. 2323-2326, 2000.

[Reading] Lecture 04 - Eigenfaces for Recognition & Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection

Since the two papers address the same problem and use similar techniques, I discuss them together here.

About one and a half years ago, I worked on face recognition in a music video, trying to find out the relationships among the characters shown in the video. Because I was very bad at paper reading, I surfed the internet to collect information about how to do face recognition, and I found a very good website about it:
http://www.face-rec.org/algorithms/
Many face databases and related research can be found on this website.

If you click the link, you will find PCA at the top of the page. In fact, that was the reason why I used PCA for face recognition in my research at that time. I had not read paper [1] before, but I used exactly the same techniques (including how to reduce the complexity from solving PCA on an N^2 x N^2 matrix to an M x M matrix, where N is the image height or width and M is the number of images) to do the job. That is because the approach is so old and classical that people have turned it into tutorials.
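For reference, here is a minimal sketch of that small-matrix trick (the variable names are mine): the eigenvectors of the M x M matrix A^T A are mapped back to eigenfaces of the huge N^2 x N^2 covariance A A^T.

```python
import numpy as np

def eigenfaces(images):
    """images: (M, N*N) matrix, one flattened face per row.
    Returns the mean face and the eigenfaces sorted by decreasing eigenvalue."""
    mean = images.mean(axis=0)
    A = (images - mean).T                         # (N*N, M): columns are centered faces
    small = A.T @ A                               # M x M instead of N^2 x N^2
    vals, V = np.linalg.eigh(small)
    order = np.argsort(vals)[::-1]
    U = A @ V[:, order]                           # eigenface u = A v has the same eigenvalue
    U /= np.linalg.norm(U, axis=0, keepdims=True) + 1e-12
    return mean, U.T                              # each row of U.T is one eigenface
```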

The following steps describe my earlier experiment:
  1. Collect a face database and do (size, illuminance, contrast) normalization on the pictures. Run PCA on these normalized pictures to get the eigenfaces, and discard the k (I forget the exact value in my experiment) most significant principal components. The reason why we have to discard them has been discussed in [2].
  2. Use OpenCV to detect faces in the MV of "在世界中心呼喊愛情" downloaded from YouTube (so the video quality is very poor..., http://www.youtube.com/watch?v=bijIQ5vb39c).
  3. Calculate the eigenface components (projection onto the face space) of each detected face. The eigenvectors used here are the ones obtained in step 1.
  4. Choose one face by myself, and obtain the nearest neighbors of that face. See if the faces are from the same person.

Since the scene often changes in a music video, the experiment showed that the lighting condition dominates the results. However, when the lighting condition was roughly the same, PCA could more or less distinguish the people. If I had known about Fisherfaces at that time, I might have tried FLD to see whether it could solve the lighting problem.

Summary:

Paper [1]:

A face image is treated as a feature vector. However, the dimension of an image is too large to handle, so they do dimension reduction with PCA. An important point made in this paper is that the complexity of solving PCA can be reduced when the number of vectors is smaller than their dimension.

Paper [2]:

They argue that PCA preserves all the differences between feature vectors, which is inadequate for face recognition. FLD is more suitable because it diminishes the in-class differences when it does dimension reduction.

Comments:

PCA provides a systematic way to obtain the most distinguishable vectors (principal components) from a collection of feature vectors. However, it is very sensitive to "noise" because PCA does not know which differences can be ignored when it is applied to classification. FLD enlarges the differences between classes and diminishes the in-class differences when it does dimension reduction. PCA performs better for feature reconstruction; FLD is more suitable for classification.
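A small sketch of the FLD computation (within-class and between-class scatter matrices, then the leading eigenvectors of Sw^{-1} Sb) is shown below; note that the Fisherface paper first applies PCA so that Sw is non-singular, whereas this illustration simply adds a small regularizer, which is my own simplification.

```python
import numpy as np

def fld(X, labels, n_components, reg=1e-6):
    """Fisher Linear Discriminant.  X: (n, d) features, labels: (n,) class labels.
    Returns a (d, n_components) projection matrix."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)             # within-class scatter
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)           # between-class scatter
    # Directions maximizing w^T Sb w / w^T Sw w: eigenvectors of Sw^{-1} Sb.
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw + reg * np.eye(d)) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:n_components]].real
```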

In my opinion, we cannot say that FLD is better than PCA because their main goals are different. Besides, a prerequisite of FLD is that the data must be supervised, that is, each vector must be labeled beforehand. Conversely, PCA can be directly applied to unsupervised data.


Reference:
[1] M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, 3(1), pp. 71-86, 1991.
[2] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class-Specific Linear Projection," European Conference on Computer Vision, pp. I:43-58, 1996.

Monday, March 3, 2008

[Reading] Lecture 03 - Scale & Affine Invariant Interest Point Detectors

Unlike Lowe's paper, which addresses every component of a feature point matching system (from point detection and description to indexing and matching), this paper focuses mainly on interest point detection.


In his paper, Lowe claimed that none of the previous approaches were fully affine invariant, since they start with initial feature scales and locations selected in a non-affine-invariant manner, owing to the prohibitive cost of exploring the full affine space. Although this paper uses an affine-variant method to choose candidate positions at the very beginning, it then uses an affine-invariant procedure to refine the point position and the shape of its neighborhood. Therefore, the resulting detector comes very close to an affine-invariant detector even by Lowe's argument.


The authors first introduce a scale-invariant interest point detection method called the Harris-Laplace detector. This detector combines the Harris detector with the Laplacian-of-Gaussian (LoG) to find the best position for a candidate point in a 3D space composed of the image coordinates (x, y) and scale. In other words, this detector is similar to the interest point detector used in SIFT, since both are scale invariant. In addition to the iterative approach, the authors give a simplified Harris-Laplace detector to reduce the computational complexity.
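The simplified variant can be understood roughly as: detect Harris corners at several preset scales and keep only those where the scale-normalized LoG response attains a maximum over scale. The sketch below follows that recipe; the scales, thresholds, and window sizes are my own assumptions, and the full iterative refinement is omitted.

```python
import numpy as np
from scipy import ndimage

def harris_laplace(img, sigmas=(1.2, 1.7, 2.4, 3.4, 4.8), k=0.04, thresh=1e-4):
    """Simplified Harris-Laplace on a float grayscale image.  Returns (x, y, scale) tuples."""
    harris, logs = [], []
    for s in sigmas:
        Ix = ndimage.gaussian_filter(img, s, order=(0, 1))        # derivative along x
        Iy = ndimage.gaussian_filter(img, s, order=(1, 0))        # derivative along y
        # Second moment matrix entries, integrated over a Gaussian window.
        Sxx = ndimage.gaussian_filter(Ix * Ix, 2 * s)
        Syy = ndimage.gaussian_filter(Iy * Iy, 2 * s)
        Sxy = ndimage.gaussian_filter(Ix * Iy, 2 * s)
        det, tr = Sxx * Syy - Sxy ** 2, Sxx + Syy
        harris.append(det - k * tr ** 2)                          # Harris corner response
        logs.append((s ** 2) * np.abs(ndimage.gaussian_laplace(img, s)))  # normalized LoG
    points = []
    for i in range(1, len(sigmas) - 1):
        h = harris[i]
        is_corner = (h == ndimage.maximum_filter(h, size=3)) & (h > thresh)
        scale_max = (logs[i] > logs[i - 1]) & (logs[i] > logs[i + 1])
        ys, xs = np.nonzero(is_corner & scale_max)
        points += [(x, y, sigmas[i]) for x, y in zip(xs, ys)]
    return points
```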


Then the affine-invariant interest point detector is introduced. In essence, an affine transformation can be viewed as a different scale change in each direction. The second moment matrix is used to estimate the anisotropic shape of a local image structure. The authors give several examples comparing the results of scale-invariant and affine-invariant detectors. From those figures, it can easily be seen that the regions covered by each interest point are much less sensitive to viewpoint changes.

An experiment is given to test the repeatability of different detectors. Not surprisingly, their affine detector performs much better than the other detectors in the case of strong affine deformations. A picture matching application is given at the end of the paper.

Reference:

K. Mikolajczyk and C. Schmid, "Scale & Affine Invariant Interest Point Detectors," Int'l J. Computer Vision, vol. 60, no. 1, pp. 63-86, 2004.