Tuesday, June 10, 2008

[Reading] Lecture 16 - Support Vector Learning for Ordinal Regression

This paper proposes a machine learning method to solve the problem of ordinal regression. The task is similar to classification, but an ordering relationship exists among the different classes. The authors solve the problem by modifying an ordinary SVM framework. This paper was hard for me to read, since I am not familiar with the mathematical parts of SVM.

Reference:
Ralf Herbrich, Thore Graepel, and Klaus Obermayer, "Support Vector Learning for Ordinal Regression," May 20, 1999.

Monday, June 2, 2008

[Reading] Lecture 15 - The PageRank Citation Ranking: Bringing Order to the Web

This paper provides a method to evaluate the importance of a webpage according to its backlinks. Using backlinks as an indication of quality is not a new idea. In academia, people look at the number of citations, which can be viewed as the backlinks of a paper, to evaluate its importance. The quality of citations is somewhat controlled (you must be able to publish a paper before you can cite others), so it is reasonable to believe that a paper with only a few citations is unlikely to be very important. However, the web is different. Building a website is easy, and so is putting a forward link on a page; a link from Yahoo is more valuable than a link from a personal homepage. Therefore, we should not only count the backlinks; we should also consider the quality of the pages that provide them.

Once you understand the spirit of the method, it is easy to see what it does during the propagation: in each iteration, an important webpage contributes more to the pages it links to than a common webpage does. People in that era really liked to use propagation...

Propagation approaches can run into problems. For instance, a rank sink appears when a local loop in the model accumulates rank and never passes it back out. The authors solve this by adding a source of rank, which ensures that every page receives at least a small contribution instead of zero. In the random-walk view, the source of rank corresponds to a random surfer who occasionally jumps to a webpage that is not linked from the current one.
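A minimal power-iteration sketch of this idea (my own toy code, not the authors' exact formulation; the damping factor 0.85 and the three-page toy web are assumptions for illustration):

import numpy as np

def pagerank(adj, d=0.85, iters=50):
    """Power iteration; adj[i, j] = 1 if page i links to page j."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_r = np.zeros(n)
        for i in range(n):
            if out_deg[i] > 0:
                new_r[adj[i] > 0] += d * r[i] / out_deg[i]   # important pages contribute more
            else:
                new_r += d * r[i] / n                        # dangling page spreads its rank evenly
        new_r += (1 - d) / n                                 # the "source of rank" / random surfer jump
        r = new_r
    return r

# toy three-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)
print(pagerank(A))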

Indeed, PageRank provides a good way to evaluate importance according to backlinks. However, I think it still ignores the forward links. A webpage that provides high-quality forward links gains little (perhaps a bit, if there is a loop that includes the page and the pages it links to) from its contribution.

Reference:
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Technical Report SIDL-WP-1999-0120, November 1999.

Tuesday, May 27, 2008

[Reading] Lecture 14 - Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

This paper proposes an efficient algorithm for finding frequent itemsets. It relies on a data structure called the FP-tree. There are two advantages. First, the FP-tree represents the data very compactly without any information loss; no approximation is used, so the result is exactly the same as the correct answer. Second, mining is efficient: the database only has to be scanned twice, which is far better than Apriori-based methods that scan it again for every candidate length. A deficiency of this work is its memory consumption when the tree grows large.
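A minimal sketch of the tree-construction step as I understand it (my own illustration, not the paper's code; mining the tree is omitted): re-order each transaction by global item frequency and insert it into a prefix tree with counts.

from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # first scan: count item frequencies and keep only the frequent items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None)
    # second scan: insert each transaction with its items sorted by descending frequency
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: -freq[i])
        node = root
        for i in items:
            node = node.children.setdefault(i, Node(i))
            node.count += 1
    return root

tree = build_fp_tree([['a', 'b', 'c'], ['a', 'b'], ['a', 'c', 'd']], min_support=2)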

Reference:
"Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach," Han, Data Mining and Knowledge Discovery, 2004.

Tuesday, May 20, 2008

[Reading] Lecture 13 - On Spectral Clustering: Analysis and an algorithm

A big problem with spectral clustering is that solving the eigenvector problem is very slow when the Laplacian matrix is large. It is therefore a little wasteful to exploit only one (the second) eigenvector and abandon the others in order to generate a single normalized cut at a time. This paper presents a method that also uses the 3rd, 4th, ... eigenvectors. In their algorithm, the top eigenvectors are stacked as the columns of a matrix M, each row of M is normalized to unit length, and the rows are then clustered by k-means. I tried to understand the physical meaning of the row-normalization step, but in vain; maybe it is just a mathematical trick that is not easy to interpret physically.
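A minimal numpy/scikit-learn sketch of the algorithm as I understand it (my own code; the Gaussian kernel width sigma and the call to scikit-learn's KMeans are assumptions, not the paper's exact settings):

import numpy as np
from sklearn.cluster import KMeans

def ng_jordan_weiss(X, k, sigma=1.0):
    # affinity matrix with a Gaussian kernel, zero diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # normalized Laplacian L = D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # top-k eigenvectors as the columns of M, then normalize each row of M
    vals, vecs = np.linalg.eigh(L)
    M = vecs[:, -k:]
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(M)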

Reference:
A. Y. Ng, M. I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Advances in Neural Information Processing Systems (NIPS), 2001.

[Reading] Lecture 13 - Normalized Cuts and Image Segmentation

Spectral clustering is not easy to understand. Although I presented this topic, I dare not say I am familiar with it.

Usually it is easier to understand an algorithm from its physical meaning. In this paper, the clustering of feature points is first modeled on a graph G=(V,E), and clustering is interpreted as grouping the nodes into different groups. A simple way to evaluate the quality of a grouping is to sum the weights of the edges that cross between groups, Cut(A,B). A smaller Cut(A,B) is preferred, since it means the similarity between the groups is small. However, minimizing the cut usually produces clusters with very few members. To solve this problem, the authors propose a different measure of cut quality, the Normalized Cut, NCut(A,B).

The idea of NCut is to normalize the cut by each group's total connection to all of the nodes, so a grouping that isolates a tiny cluster is penalized. However, this makes the problem NP-hard. The authors show that the NCut value can be written in a matrix form involving the Laplacian matrix (D - W) and an indicator vector y determined by the clustering result, so minimizing NCut is equivalent to minimizing that matrix form. By itself this buys nothing: once we know y we already know the NCut value, so minimizing the matrix form directly is just brute force. The trick is to relax the constraint that y takes only discrete values; after the relaxation, the problem can be solved approximately as a generalized eigenvector problem.
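For my own record, the relaxed problem (as I understand it from the paper) is

\min_{y} \frac{y^{\top}(D - W)\, y}{y^{\top} D\, y} \quad \text{subject to } y^{\top} D \mathbf{1} = 0,

where W is the affinity matrix, D is the diagonal degree matrix with D_{ii} = \sum_j W_{ij}, and y is the relaxed (real-valued) cluster indicator vector. Setting the derivative to zero gives the generalized eigensystem (D - W)\, y = \lambda D\, y, and the eigenvector with the second smallest eigenvalue is taken as the (approximate) solution, since the smallest eigenvalue corresponds to the trivial constant vector.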

The paper mainly focuses on how to cluster the data into two groups and does not put much emphasis on how to cluster the data into several groups simultaneously.

The authors also compare the normalized cut algorithm with other algorithms. It is no surprise that their method outperforms the others under several different settings.

Reference:
Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. PAMI, 2000.

Monday, May 12, 2008

[Reading] Lecture 12 - A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models

This paper demonstrates several inference and learning algorithms on a simple computer-vision application: occlusion. The graphical models introduced in the paper include the Bayesian network, the Markov random field, and the factor graph. Since the occlusion application is simple, we can easily learn the pros and cons of each graphical model from the examples; for instance, Bayesian networks are good at expressing marginal independence, while Markov random fields are good at expressing certain conditional independencies.

The rest of the paper discusses many different inference algorithms, both exact and approximate. Since exact inference is often intractable, we are more interested in the characteristics of the approximate algorithms. For example, ICM lowers the energy very quickly at first, but converges at a higher final energy than the other algorithms.
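A tiny sketch of what ICM does (my own toy binary-denoising energy, not the paper's occlusion model): each variable is repeatedly set to the label that minimizes the energy with its neighbors held fixed.

import numpy as np

def icm_denoise(obs, beta=1.0, lam=2.0, sweeps=5):
    """Iterated Conditional Modes on a binary image with a Potts-style smoothness term."""
    x = obs.copy()
    H, W = x.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                best_label, best_energy = x[i, j], np.inf
                for label in (0, 1):
                    # data term + smoothness term over the 4-neighborhood
                    e = lam * (label != obs[i, j])
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W:
                            e += beta * (label != x[ni, nj])
                    if e < best_energy:
                        best_label, best_energy = label, e
                x[i, j] = best_label
    return x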

Reference:
B. J. Frey and N. Jojic, "A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models," IEEE Trans. Pattern Analysis and Machine Intelligence, 27(9), pp. 1392-1416, September 2005.

Friday, May 9, 2008

[Whisper] Humble webpage

My very humble webpage, which records my project results from some courses.
http://mpac.ee.ntu.edu.tw/~sutony
or
http://su.sutony.googlepages.com/sutony

Wednesday, May 7, 2008

[Reading] Lecture 11 - Learning Low-Level Vision

This paper models several computer-vision problems with the Markov random field (MRF) model. The application-dependent target solution is modeled as the hidden states of the MRF, and the input image is modeled as the observation. After the modeling, our goal is to solve the MRF, i.e., to obtain the most probable configuration of the hidden states.

The algorithm used in this paper to solve the MRF is belief propagation. Its physical meaning is easy to understand, and other papers have proved some of its properties, such as its convergence conditions.
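A minimal sketch of max-product message passing on a 1-D chain, just to record the mechanics for myself; the unary and pairwise tables below are toy assumptions, not the paper's learned potentials.

import numpy as np

def map_chain(unary, pairwise):
    """Max-product BP (Viterbi) on a chain; unary is (n, k), pairwise is (k, k)."""
    n, k = unary.shape
    msg = np.ones((n, k))                 # msg[i] = message passed from node i-1 into node i
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # for each state of node i, pick the best state of node i-1
        scores = (unary[i - 1] * msg[i - 1])[:, None] * pairwise
        msg[i] = scores.max(axis=0)
        back[i] = scores.argmax(axis=0)
    states = np.zeros(n, dtype=int)       # backtrack the most probable configuration
    states[-1] = np.argmax(unary[-1] * msg[-1])
    for i in range(n - 1, 0, -1):
        states[i - 1] = back[i, states[i]]
    return states

unary = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
pairwise = np.array([[0.8, 0.2], [0.2, 0.8]])   # favors neighboring nodes agreeing
print(map_chain(unary, pairwise))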

They give three application examples using MRFs: the first is super-resolution, the second is shading and reflectance estimation, and the last is motion estimation. The idea used in super-resolution is quite novel, but a little impractical: unless the database is huge, it is unreasonable to assume the needed high-resolution patch can be found in it.

The second application is similar to the first one, and so is its weakness: both are ill-posed problems whose results are easily biased by the database.

I don't think anyone will adopt their method for the third application; it is impractical to spend so much time estimating motion.

Reference:
William T. Freeman and Egon C. Pasztor, "Learning Low-Level Vision," ICCV, pp. 1182-1189, 1999.

Monday, May 5, 2008

[Reading] Lecture 11 - An Introduction to Graphical Models

This paper is easy to read. When I first heard the term "graphical model," I surfed the internet and found that this paper is the most suitable one for a beginner like me.

The theory behind graphical models is probability theory; in other words, any graphical model can be described in mathematical language. The paper starts with an example (Fig. 1) that explains how a graphical model represents a simple Bayesian model. After that, many directed and undirected graphical models are introduced, and from these examples we can learn how to translate a probability problem into a graph problem.

Since solving a graphical model is not the main topic of this paper, it only introduces a few techniques for simplifying the problem. The two techniques, "variable elimination" and "dynamic programming," are also useful in ordinary programming, not just for solving graphical models. For example, in our homework 2 we could use variable elimination to remove some redundant calculations.
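A tiny worked example of variable elimination on a chain A -> B -> C (my own toy tables): to get P(C) we sum out A and then B, instead of enumerating every (A, B, C) triple.

import numpy as np

p_a = np.array([0.6, 0.4])                    # P(A)
p_b_given_a = np.array([[0.7, 0.3],           # P(B | A), rows indexed by A
                        [0.2, 0.8]])
p_c_given_b = np.array([[0.9, 0.1],           # P(C | B), rows indexed by B
                        [0.5, 0.5]])

# P(C) = sum_B P(C | B) * sum_A P(A) * P(B | A)
p_b = p_a @ p_b_given_a                        # intermediate factor over B (A eliminated)
p_c = p_b @ p_c_given_b                        # B eliminated
print(p_c)                                     # sums to 1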

I could not understand the remaining sections of this paper well, especially "Learning." I guess it is a seldom-used technique. The application example is not in vogue anymore, maybe because the paper was published quite early. I think ending the paper with a simpler application example would have been better.

I think what I benefit from most in this paper is a better understanding of the relationships between different graphical models. To my surprise, the Kalman filter can be expressed as a graphical model too; I didn't know that before.

Reference:
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, "An introduction to variational methods for graphical models," Learning in Graphical Models, Kluwer Academic Publishers, 1999.

Tuesday, April 29, 2008

[Reading] Lecture 10 - Transductive Inference for Text Classification using Support Vector Machines

The paper does not try to modify the core technique of SVM; instead, it gives a way to extend SVM to the semi-supervised setting. An SVM classifies by dividing the feature space into regions according to the training data. If the feature space is high-dimensional and the number of training samples is small (usually smaller than the dimension of the feature space), you can expect the decision boundary to be biased toward (overfit to) the training data, because it does not take the test (unlabeled) data into account at all. This is the traditional SVM approach: inductive learning (IL).

Therefore, this paper incorporates information from the unlabeled data during training: transductive learning (TL). A TSVM tries to separate not only the training data but also the test data in the feature space. However, since the test data are unlabeled, their labels must be guessed and then updated during the training process, and this paper shows how that can be done.
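Just to fix the idea for myself, here is a much-simplified self-labeling loop in the same spirit (this is NOT Joachims' label-switching TSVM, only a sketch of the guess-and-retrain idea; LinearSVC and the confidence threshold are my own arbitrary choices, and labels are assumed to be 0/1):

import numpy as np
from sklearn.svm import LinearSVC

def transductive_like_svm(X_lab, y_lab, X_unlab, rounds=5, thresh=1.0):
    X, y = X_lab.copy(), y_lab.copy()
    clf = LinearSVC(C=1.0).fit(X, y)
    for _ in range(rounds):
        # guess labels for the unlabeled points and keep only the confident ones
        scores = clf.decision_function(X_unlab)
        confident = np.abs(scores) > thresh
        if not confident.any():
            break
        X = np.vstack([X_lab, X_unlab[confident]])
        y = np.concatenate([y_lab, (scores[confident] > 0).astype(int)])
        clf = LinearSVC(C=1.0).fit(X, y)       # retrain with the guessed labels included
    return clf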

Their experiments also give good insight into the difference between SVM and TSVM. The evidence shows that when the amount of training data is small compared to the amount of test data, TSVM greatly outperforms SVM. When the training and test sets are roughly the same size, their performance is similar, since the extra information coming from the test data is limited.

Reference:
T. Joachims, "Transductive Inference for Text Classification using Support Vector Machines," Proc. International Conference on Machine Learning (ICML), 1999.

[Reading] Lecture 10 - Combining Labeled and Unlabeled Data with Co-Training

This paper gives a way to do semi-supervised learning, focusing on how to combine different kinds of information in the learning procedure. Stated more mathematically, the problem is how to train a classifier from two feature sets given limited labeled data and a large number of unlabeled samples.

The main idea of co-training is to let one classifier (trained on feature set 1) teach the other classifier (trained on feature set 2), and vice versa. One benefit is that data on which the two classifiers agree is more likely to be labeled correctly, and can be used as labeled data in the next iteration. The other benefit is that each classifier can learn from the other something it could not have learned by itself, which makes up for the fact that the classifiers are otherwise trained completely separately.
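A rough sketch of the loop as I understand it (toy code; the GaussianNB learners and the number of points added per round are my own arbitrary choices, not Blum and Mitchell's setup):

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, labeled_idx, rounds=10, per_round=5):
    """X1, X2: the two feature views; only y[labeled_idx] is trusted."""
    pseudo = {int(i): int(y[i]) for i in labeled_idx}
    c1 = c2 = None
    for _ in range(rounds):
        idx = np.array(sorted(pseudo))
        lab = np.array([pseudo[i] for i in idx])
        c1 = GaussianNB().fit(X1[idx], lab)
        c2 = GaussianNB().fit(X2[idx], lab)
        unl = np.array([i for i in range(len(y)) if i not in pseudo])
        if len(unl) == 0:
            break
        # each view's classifier labels the unlabeled points it is most confident about
        for clf, X in ((c1, X1), (c2, X2)):
            proba = clf.predict_proba(X[unl])
            conf = proba.max(axis=1)
            for j in np.argsort(-conf)[:per_round]:
                pseudo[int(unl[j])] = int(clf.classes_[proba[j].argmax()])
    return c1, c2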

Comment:
After understanding the idea of this paper, I think there is a question that must be answered: when should we separate the two feature vectors (f1, f2) to train two classifiers, instead of simply concatenating them into a single feature vector? The authors do assume that f1 and f2 are conditionally independent given the label, but is that reasonable? And how can I tell whether f1 and f2 are independent?

Reference:
Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-training," COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1998.

Wednesday, April 23, 2008

[Reading] Lecture 09 - Improved Boosting Algorithms using Confidence-Rated Predictions

This paper proposes a more general AdaBoost algorithm than the one proposed by Freund and Schapire. For example, the weak hypotheses' outputs do not have to lie in {-1, +1}; they can range over all of R. The paper provides different ways to determine alpha, and shows that Freund and Schapire's version can be derived as a special case of their method. The paper then gives a criterion for choosing weak hypotheses, which was not an issue in the original version. A bound on the generalization error is also given, so minimizing the training error is not the only concern and goal. The last half of the paper is quite important too: the authors give ways to solve multiclass and multi-label classification problems with AdaBoost, whereas Freund and Schapire's algorithm can only handle the binary (two-class), single-label problem.
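To remind myself of the key quantities: each round, the weak hypothesis h_t and its weight \alpha_t are chosen to minimize the normalization factor

Z_t = \sum_i D_t(i)\, \exp\big(-\alpha_t\, y_i\, h_t(x_i)\big),

and when h_t outputs values in \{-1, +1\} with correlation r_t = \sum_i D_t(i)\, y_i\, h_t(x_i), the minimizing weight is

\alpha_t = \tfrac{1}{2} \ln \frac{1 + r_t}{1 - r_t},

which reduces to Freund and Schapire's \tfrac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} because r_t = 1 - 2\epsilon_t.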

This is a good paper for understanding AdaBoost more deeply. If I have to implement AdaBoost in the future (homework!?) and performance matters, I will refer to this paper for better approaches.

Reference:
Schapire and Singer, "Improved Boosting Algorithms Using Confidence-rated Predictions," Machine Learning, Vol. 37, 1999.

Monday, April 21, 2008

[Reading] Lecture 09 - Rapid Object Detection using a Boosted Cascade of Simple Features

The paper proposes an algorithm for training an object detection model that is very fast, with detection power competitive with the state of the art at the time. There are three main contributions:

1. Integral Image
The complexity of computing the integral image is low, O(image size), and more importantly, it can then be used to obtain any Haar-like feature in constant time, which speeds up the detection process; every rectangle feature, no matter its size or location, can be extracted equally fast. (A small sketch of this trick follows the list below.)
The integral image has another use: we can get the variance within any rectangle of the image in constant time if we keep two integral images, one accumulating the intensity values and the other accumulating the squared intensity values.

2. Training by AdaBoost
There are tons of Haar-like features in an image, but many of them are useless. By tying each weak classifier to a single Haar-like feature, AdaBoost can be used to select a small set of features and train the classifier simultaneously.
Since feature selection is done automatically, their work can easily be extended to objects other than faces.

3. Attentional Cascade
This part is designed to accelerate detection rather than improve accuracy. The idea is simple: the early stages sacrifice false-positive rate to achieve a very high detection rate, throwing out as many unlikely candidate regions as possible. Since most candidate regions are rejected early, the number of operations in the later stages is greatly reduced.
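Here is the small sketch promised in item 1: a minimal integral image (my own toy code). Any Haar-like feature is just a difference of a few such constant-time rectangle sums.

import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[0:y, 0:x]; padded with a zero row and column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] using four lookups (constant time)."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()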

I think this is a classic work in object detection. It can be used to detect objects other than faces without any prior knowledge, as long as you have enough training data. Of course, it is not guaranteed to perform well on every object: Haar-like features are appearance features, so if an object's appearance changes a lot (e.g., a t-shirt), the method proposed in the paper can easily fail.

Reference:
P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features." Proc. CVPR, Vol. 1, pp. 511-518, 2001.

Tuesday, April 8, 2008

[Reading] Lecture 08 - Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

This is a pioneering work that teaches the machine to auto-annotate image segments (blobs). We say auto-annotate because they do not need a precisely annotated database; all they need is images with words associated with them. The annotation task is then modeled as a machine translation job and solved with the EM algorithm. After giving the details of using EM to estimate the model, the paper provides several experimental results and discussion, but they are not entirely convincing to me.

Before this week, I had no idea how to classify/recognize/label an object using unsupervised methods. Now at least I know one or two such approaches. This is very interesting, since in many realistic cases only unlabeled data is available.

Reference:
P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth, "Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary," European Conference on Computer Vision, p. IV: 97 ff., 2002.

Monday, April 7, 2008

[Reading] Lecture 08 - Names and Faces in the News

This paper demonstrates an unsupervised face annotation algorithm.

As I mentioned in [Reading] Lecture 4, a major difference between PCA and LDA (I used the term FLD at that time) is that LDA exploits a labeled data set to do dimension reduction. This makes LDA perform better than PCA for classification, but at the cost of requiring supervision beforehand. However, this paper can still utilize LDA without manual labeling! The reason is that a lexicon of names can be extracted automatically from the news captions. Those names may not be accurate for every image, but they are already good enough for LDA to do dimension reduction (or, put differently, to obtain discriminative directions).

They do not restrict the dimension reduction method to LDA; kernel PCA is also recommended. However, their database is too large for ordinary kernel PCA, so they use the Nystrom approximation to make the job feasible.

After dimension reduction, clustering is done with a modified k-means method, which can be interpreted as a face-labeling process. Since the lexicon of names obtained at the beginning is noisy, they further prune and merge the clusters to improve the result.

The paper gives a "face recognition" algorithm which is much different to the previous works. It utilizes the captain information to automatically generate the supervised dataset (and also the test dataset). However, I think their system can only recognize faces that appear several times.

Reference:
T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. Learned-Miller, and D. A. Forsyth, "Names and Faces in the News," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p. II: 848-854, 2004.

Monday, March 31, 2008

[Reading] Lecture 07 - Algorithms for Fast Vector Quantization

Like the previous paper, this one studies how to find the closest data point in a high-dimensional space. The authors introduce three search algorithms and run experiments on them.

The first two algorithms refine the well-known kd-tree data structure. The first uses incremental distance calculation so that updating the distance while traversing the kd-tree costs only O(1) per node (independent of the dimension), which is a tricky and very useful idea. The second uses a priority queue to determine the order of the search, which lets the query terminate earlier.

The third algorithm is based on neighborhood graphs. Any two connected nodes in the neighborhood graph must satisfy some local criterion, and the graph can be used to accelerate the query. However, the preprocessing needed to build such a graph is relatively expensive.

Reference:
S. Arya and D. M. Mount, "Algorithms for fast vector quantization." Proceedings of DCC 93: Data Compression Conference, pp. 381-390, IEEE Press, 1993.

[Reading] Lecture 07 - Similarity Search in High Dimensions via Hashing

Finding the k nearest neighbors in a database is useful for many applications, especially query systems. However, if the search algorithm is not well designed, it becomes extremely time-consuming on a large database and may make the application impractical.

Although data structures such as k-d trees have been developed to speed up the k-nearest-neighbor problem, evidence shows they are not much better than brute force when the dimensionality is high. Researchers use the term "curse of dimensionality" to describe this annoying phenomenon.

The main idea of this paper is that obtaining the exact answer is not always necessary; an approximate similarity search is usually enough. In other words, search speed is often more important than the exactness of the search results.

Instead of space partitioning (e.g., k-d trees), this paper uses locality-sensitive hashing (LSH) to accelerate the search. Specifically, the authors use two levels of hashing: an LSH function maps points to buckets, and a standard hash function maps bucket contents into a hash table. The method has two phases, a preprocessing phase and a query phase: the hash tables are built during preprocessing and then used at query time.
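A tiny sketch of the general LSH idea (note: I use random-hyperplane hashes here, which is a different LSH family from the bit-sampling scheme in the paper; the numbers of tables and bits are arbitrary assumptions of mine):

import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    def __init__(self, dim, n_tables=8, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, x):
        # one bit per hyperplane: which side of it x falls on
        return [tuple((p @ x > 0).astype(int)) for p in self.planes]

    def index(self, data):
        self.data = data
        for i, x in enumerate(data):
            for t, key in zip(self.tables, self._keys(x)):
                t[key].append(i)

    def query(self, q):
        # union of the colliding buckets, then an exact scan of that short candidate list
        cand = {i for t, key in zip(self.tables, self._keys(q)) for i in t[key]}
        if not cand:
            return None
        cand = np.array(sorted(cand))
        d = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[d.argmin()]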

The authors further analyze LSH both mathematically and experimentally. Of course, they show that their method considerably speeds up the query process. An additional advantage is that the running time can be determined in advance.

Comment:
I do not like the way they explain the technical material. The symbols and equations used in the paper are ambiguous (especially in the explanation of locality-sensitive hashing). I remember that KT presented a very similar topic before, and she did a much better job than this paper does!

Reference:
Gionis, Indyk, and Motwani, "Similarity Search in High Dimensions via Hashing," VLDB: International Conference on Very Large Data Bases, Morgan Kaufmann Publishers, 1999.

Monday, March 24, 2008

[Reading] Lecture 06 - Algorithms and Applications for Approximate Nonnegative Matrix Factorization

NMF (nonnegative matrix factorization) is another useful algorithm for factorizing a matrix A. More specifically, NMF tries to find two matrices W and H such that WH ≈ A. Very often the data to be processed (A) is nonnegative, so to avoid contradicting the physical meaning, an important constraint imposed by NMF is that the low-rank factors W and H must consist of nonnegative values. In this paper, the cost function is defined as:

\tfrac{1}{2}\, \| A - WH \|_F^2.

In fact, the cost function can be defined in other ways, too (e.g., with a divergence instead of the Frobenius norm).

The paper introduces three approaches to solving the NMF problem: the multiplicative update algorithm, the gradient descent algorithm, and the alternating least squares (ALS) algorithm. These algorithms face some difficult challenges, such as the existence of local minima and the lack of a unique solution. None of them guarantees a global minimum of the cost function, but in many data mining applications a local minimum is useful enough. Each algorithm has its own disadvantages, such as slow speed or convergence problems, and in practice researchers often sacrifice convergence theory to speed up the process; that is why ALS is the favored algorithm among the three. After introducing the three algorithms, the paper goes into more detail on how to set constraints or add penalty terms.
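A minimal sketch of the basic ALS iteration (my own toy code, without the penalty terms the paper discusses): fix one factor, solve an ordinary least-squares problem for the other, and clip negative entries to zero.

import numpy as np

def nmf_als(A, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    for _ in range(iters):
        # solve W H ~= A for H with W fixed, then project onto H >= 0
        H = np.linalg.lstsq(W, A, rcond=None)[0].clip(min=0)
        # solve for W with H fixed (via the transposed problem), then project onto W >= 0
        W = np.linalg.lstsq(H.T, A.T, rcond=None)[0].T.clip(min=0)
    return W, H

A = np.random.default_rng(1).random((20, 12))
W, H = nmf_als(A, k=3)
print(np.linalg.norm(A - W @ H))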

The paper then gives two example applications of NMF: text mining and spectral data analysis. In text mining, NMF works on a document-term matrix built from the weights of the various terms in a set of documents; the matrix is factored into a term-feature matrix and a feature-document matrix, similar to LSI. As a matter of fact, NMF is equivalent to PLSA when the factorization is obtained by minimizing the Kullback-Leibler divergence.

Reference:
Michael W. Berry, Murray Browne, Amy Nicole Langville, V. Paul Pauca, and Robert J. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics & Data Analysis, 52(1), pp. 155-173, 2007.

[Reading] Lecture 06 - Probabilistic latent semantic indexing

Before PLSA, people used LSI to find relationships between words and documents. An appealing characteristic of LSI is that it automatically finds hidden relevance between two different words (or documents), or between a word and a document, using statistical information. Just like PCA, formulating the problem as a linear-algebra optimization is more principled and convincing than heuristic approaches such as matching words across documents.

However, LSI lacks a statistical foundation and is therefore harder to combine with other models. To overcome this deficit, the paper introduces an aspect model that analyzes the problem on probabilistic principles. A latent class (similar to the hidden state in an HMM) is used to model the implicit relationship between documents and words. The author notes that each class-conditional multinomial distribution over the vocabulary can be represented as a point on the simplex of all possible multinomials; that is, PLSA can be thought of as dimensionality reduction into a probabilistic latent semantic space.
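As a note to myself, the aspect model factorizes the joint word-document probability through the latent class z:

P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z) = \sum_{z} P(z)\, P(w \mid z)\, P(d \mid z),

and the parameters are fitted with EM; written this way, the analogy to LSI's low-rank factorization is easy to see.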

With a solid probabilistic foundation, PLSA can take advantage of standard statistical methods for model fitting, overfitting control, and model combination. Experiments show that it achieves significant gains in precision over both standard term matching and LSI.

Reference:
T. Hofmann, "Probabilistic latent semantic indexing," Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57, ACM Press, 1999.

Monday, March 17, 2008

[Reading] Lecture 05 - Shape matching and object recognition using shape contexts

This paper provides an approach to measuring the similarity (distance) between two shapes. The shape is represented by a set of points sampled from its contours. Each sample point has a descriptor, the shape context, that describes the coarse distribution of the rest of the shape as a histogram over the (log) distance and angle to the other sample points. With these descriptors, finding the correspondence between two shapes becomes the problem of finding, for each sample point, the point on the other shape with the most similar shape context.
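A rough sketch of computing such a descriptor for one sample point (my own simplification: plain log-polar binning, without the paper's normalization for scale and rotation):

import numpy as np

def shape_context(points, index, n_r=5, n_theta=12):
    """Log-polar histogram of where the other sample points lie, as seen from points[index]."""
    diff = np.delete(points, index, axis=0) - points[index]
    r = np.linalg.norm(diff, axis=1)
    theta = np.arctan2(diff[:, 1], diff[:, 0])
    r_edges = np.logspace(np.log10(r.min() + 1e-9), np.log10(r.max() + 1e-9), n_r + 1)
    r_bin = np.clip(np.digitize(r, r_edges) - 1, 0, n_r - 1)
    t_bin = np.clip(((theta + np.pi) / (2 * np.pi) * n_theta).astype(int), 0, n_theta - 1)
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)      # count points falling into each (log r, theta) bin
    return hist / hist.sum()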

After obtaining the correspondences between points on the two shapes, an aligning transformation (warping) from one shape to the other is estimated, and the magnitude of this transform is viewed as one measure of shape dissimilarity. The overall distance between the two shapes is computed as the sum of the matching errors between corresponding points plus the magnitude of the aligning transform.

K-medoids (a variant of k-means) is adopted to select the prototypes to store, and a K-NN classifier is used for recognition. Through experiments on a 3D object database, they show that their algorithm is invariant to several image transformations.

This paper evaluates the distance between two shapes in a way quite unlike previous work: thanks to the shape context descriptor, they do not require any keypoints (e.g., curvature maxima or inflection points) or the grayscale values inside the silhouette to describe the shape.

Reference:
S. Belongie, J. Malik and J. Puzicha, "Shape Matching and Object Recognition Using Shape Contexts." IEEE Trans. Pattern Anal. Mach. Intell, 24(4), pp. 509-522, 2002.

[Reading] Lecture 05 - A Boundary-Fragment-Model for Object Detection

There are many different features that can be used to detect objects, such as texture, shape, and color. Since they can be used together to get superior performance, we cannot say any one of them is the best. In this paper, the authors introduce a system that uses boundary fragments as the feature, to make better use of shape information. The reason they do not use the entire object boundary as the feature is the limitation of edge detection: it is really hard to obtain a complete object boundary from a noisy image (or under occlusion).

Like most object detectors, the BFM (Boundary-Fragment-Model) system needs a training phase. The system first detects edges in the training images with the Canny detector. After finding the edges, several fragments are chosen according to scores measured on validation images. Since each fragment carries information about the object centroid, the system can estimate the object position from the matched fragments in the detection phase. The fragment candidates are used alone or combined to form weak classifiers, and a strong classifier is obtained from them with the classic AdaBoost algorithm.

In the detection phase, the strong classifier decides whether an object is present. If it is, mean shift is applied to the Hough voting space to get the object's centroid, and the silhouette of the object is then obtained by backprojecting the contributing weak classifiers.

Their experiments show that the performance is competitive with other state-of-the-art detectors. They also show the detector's ability to distinguish between different objects with similar contours.

At the end of the paper, they discuss invariance to scale, rotation, and viewpoint. However, I think this part is explained ambiguously, so I have no further comments on it.

Reference:
A. Opelt, A. Pinz, and A. Zisserman, "A Boundary-Fragment-Model for Object Detection," European Conference on Computer Vision, p. II: 575-588, 2006.

Monday, March 10, 2008

[Reading] Lecture 04 - Nonlinear Dimensionality Reduction by Locally Linear Embedding

The idea used in this paper is very simple. Imagine that the feature points are scattered on a manifold made of locally linear patches, so that each feature point can be described as a linear combination of its neighbors. Under this assumption, the reconstruction weights of each feature point are first obtained by finding its k nearest neighbors and minimizing a reconstruction error. Given those weights, low-dimensional coordinates (of dimension d, much smaller than the original dimension D) are then chosen to minimize the sum of the reconstruction costs over all feature points.
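A quick way to see the two-step pipeline in practice is scikit-learn's implementation (the swiss-roll data, k = 12 neighbors, and the 2-D output are arbitrary choices of mine):

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05)
# step 1 (reconstruction weights from the k nearest neighbors) and
# step 2 (embedding coordinates) both happen inside fit_transform
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)
print(Y.shape)                      # (1000, 2)
print(lle.reconstruction_error_)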

My idea:
I think a major trick in LLE is that the (rough) low-dimensional descriptions, the reconstruction weights, are found first. That way, some information (the local relationships, in LLE's case) is preserved when the embedding coordinates are computed. My point is: if we obtained the weights by other means (e.g., choosing neighbors that belong to the same class instead of using kNN), then something other than the local relationships (e.g., in-class relationships) could be preserved after dimension reduction.

Reference:
S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding." Science, Vol. 290, pp. 2323-2326, 2000.

[Reading] Lecture 04 - Eigenfaces for Recognition & Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection

Since the two papers address the same problem and use similar techniques, I discuss them together here.

About one and a half years ago, I worked on face recognition in a music video, trying to find the relationships among the characters shown in the video. Because I was very bad at reading papers, I surfed the internet to collect information about how to do face recognition, and I found a very good website on the subject:
http://www.face-rec.org/algorithms/
Many face databases and related research can be found on this website.

If you click the link, you will find PCA at the top of the page. In fact, that is why I used PCA for face recognition in my research at that time. I had not read paper [1] before, but I used exactly the same techniques (including the trick that reduces the complexity from solving PCA on an N^2 x N^2 matrix to an M x M matrix, where N is the image height or width and M is the number of images). That is because the approach is so old and classical that people have turned it into tutorials.

The following steps describe my earlier experiment:
  1. Collect a face database and normalize the pictures (size, illuminance, contrast). Run PCA on the normalized pictures to get the eigenfaces, and discard the k (I forget the exact value in my experiment) most significant principal components; the reason for discarding them is discussed in [2].
  2. Use OpenCV to detect faces from the MV of "在世界中心呼喊愛情" downloaded from YouTube (so the video quality is very poor..., http://www.youtube.com/watch?v=bijIQ5vb39c).
  3. Calculate the eigenface components of each detected face (project it onto the face space). The eigenvectors used here are the ones obtained in step 1.
  4. Choose one face by hand and obtain its nearest neighbors, then see whether those faces are from the same person.

Since the scene changes often in a music video, the experiment showed that the lighting conditions dominated the results. However, when the lighting was roughly the same, PCA could more or less distinguish the people. If I had known about Fisherfaces at that time, I might have tried FLD to see whether it could handle the lighting problem.

Summary:

Paper [1]:

A face image is treated as a feature vector. However, the dimensionality of an image is too large to handle directly, so they do dimension reduction with PCA. An important point in this paper is that the complexity of solving PCA can be reduced when the number of vectors is smaller than their dimensionality.
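A small sketch of that trick as I remember using it (my own code with made-up sizes): instead of eigen-decomposing the huge N^2 x N^2 covariance matrix, decompose the small M x M matrix X^T X and map its eigenvectors back.

import numpy as np

M, N = 20, 32                         # 20 images of size 32 x 32
X = np.random.rand(N * N, M)          # each column is one vectorized face
X = X - X.mean(axis=1, keepdims=True)

# eigenvectors of the small M x M matrix X^T X ...
vals, v = np.linalg.eigh(X.T @ X)
# ... give the eigenfaces of the huge covariance X X^T via u = X v (then normalize)
U = X @ v
U /= np.linalg.norm(U, axis=0)

# sanity check on the most significant component
i = -1
print(np.allclose((X @ X.T) @ U[:, i], vals[i] * U[:, i]))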

Paper [2]:

They argue that PCA preserves all the variation between feature vectors, which is inadequate for face recognition. FLD is more suitable because it diminishes the within-class differences when it does dimension reduction.

Comments:

PCA provides a systematic way to obtain the most expressive directions (principal components) from a collection of feature vectors. However, it is very sensitive to "noise," because PCA does not know which differences can be ignored when it is used for classification. FLD enlarges the between-class differences and diminishes the within-class differences when it does dimension reduction. PCA performs better for feature reconstruction; FLD is more suitable for classification.

In my opinion, we cannot say that FLD is better than PCA, because their main goals are different. Besides, a prerequisite of FLD is that the data must be labeled beforehand, whereas PCA can be applied directly to unlabeled data.


Reference:
[1] M. Turk and A. Pentland, "Eigenfaces for Recognition." Journal of Cognitive Neuro Science, 3(1), pp. 71-86, 1991.
[2] D. J. Kriegman and J. P. Hespanha and P. N. Belhumeur, "Eigenfaces vs. Fisherfaces: Recognition Using Class-Specific Linear Projection." European Conference on Computer Vision, p. I:43-58, 1996.

Monday, March 3, 2008

[Reading] Lecture 03 - Scale & Affine Invariant Interest Point Detectors

Unlike Lowe's paper, which addresses every component of a feature-point matching system (from point detection and description to indexing and matching), this paper focuses mainly on interest point detection.


In his paper, Lowe claims that none of the previous approaches are fully affine invariant, since they start with initial feature scales and locations selected in a non-affine-invariant manner because exploring the full affine space is prohibitively expensive. Although this paper also uses a non-affine-invariant method to choose candidate positions at the very beginning, it then uses an affine-invariant procedure to refine each point's position and the shape of its neighborhood. Therefore, the resulting detector comes very close to being affine invariant even by Lowe's argument.


The authors first introduce a scale-invariant interest point detector called the Harris-Laplace detector. It combines the Harris detector (for localization in the image plane) with the Laplacian-of-Gaussian (for scale selection) to find the best position for a candidate point in a 3D space composed of the image coordinates (x, y) and scale. In other words, this detector is similar to the interest point detector used in SIFT, since both are scale invariant. In addition to the iterative version, the authors give a simplified Harris-Laplace detector to reduce the computational complexity.


Then the affine-invariant interest point detector is introduced. In essence, an affine transformation can be viewed as a different scale change in each direction, and the second moment matrix is used to estimate the anisotropic shape of the local image structure. The authors give several examples comparing the results of scale-invariant and affine-invariant detectors; from those figures it is easy to see that the regions covered by each interest point vary much less with viewpoint.

An experiment is given to test the repeatability of the different detectors. Not surprisingly, their affine detector performs much better than the others under strong affine deformations. An image matching application is given at the end of the paper.

Reference:

K. Mikolajczyk and C. Schmid, "Scale & Affine Invariant Interest Point Detectors," Int'l J. Computer Vision, vol. 60, no. 1, pp. 63-86, 2004.

Friday, February 29, 2008

[Reading] Lecture 03 - Distinctive Image Features from Scale-Invariant Keypoints

SIFT is a powerful method for extracting distinctive feature points that can be used not only in CBIR but also in object recognition, image matching, stereo vision, and so on. Since different cameras can capture the same scene from different positions, at different times, with different viewing angles, or with different camera settings, a good feature describing a point should be invariant to viewpoint and illumination changes. Although SIFT does NOT handle every kind of transformation caused by viewpoint change, its computational efficiency and stable matching power make it so popular that many researchers use it as a standard tool.

In "Related research" of this paper, the author gives a good review of the area of feature point matching. From the history, we can more clearly understand what is the main concern in each component of SIFT. The author also explains why SIFT do not try to cover wider affine invariance. He claims that none of previous approaches are yet fully affine invariant, as they start with initial featurescales and locations selected in a non-affine-invariant manner. Lossing some affine invariance makes SIFT more robust against changes in 3D viewpoint for non-planar surfaces, more efficient feature extraction, and the ability to identify larger numbers of features.

Since SIFT is composed of so many components, I won't give too much technical detail below; the focus will be on what problem each component tries to solve.

1. Detection of scale-space extrema

SIFT detects feature points with a cascade filtering approach for efficiency. In this step, SIFT looks for candidate locations that are invariant to scale (and, in fact, also to some simple transformations and illumination changes). The candidates are the local maxima/minima of the difference-of-Gaussian (DoG) function computed across scales of the image. The DoG is a close approximation of the LoG, but it significantly accelerates the computation. Since some parameters of the DoG pyramid need to be determined, the author chooses them according to the repeatability of the candidate locations under synthesized image transformations. Some tricky implementation details are given to improve efficiency.
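A rough sketch of the DoG extremum search (my own simplified code with arbitrary parameters: a single octave, no subpixel refinement, no edge rejection):

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(img, sigma0=1.6, k=2 ** 0.5, n_scales=5, thresh=0.03):
    """Return (y, x, scale index) of local extrema in a single-octave DoG stack."""
    blurred = [gaussian_filter(img.astype(float), sigma0 * k ** i) for i in range(n_scales)]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(n_scales - 1)])
    keypoints = []
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]   # 3 x 3 x 3 neighborhood
                v = dog[s, y, x]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    keypoints.append((y, x, s))
    return keypoints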

2. Accurate keypoint localization

After the candidate positions are found, the location of each extremum is refined by setting the derivative of a Taylor expansion around the candidate to zero. More candidates are then rejected if they have low contrast or lie along an edge; a point is judged to be on an edge by checking the ratio of the two principal curvatures (yes, if the ratio is too large).

3. Orientation assignment

By now the keypoint positions have been found and will not be rejected in the following steps, and the methods used to find them are invariant to scale, rotation, and illumination change. However, we also need a point descriptor with the same invariances. In this step, the author determines one or more dominant orientations for each point according to a histogram of local gradient directions. More than one dominant orientation may be kept, since image noise can swap the first and second (or third?) strongest directions. The dominant orientation is used in the next step to achieve rotation invariance.

4. The local image descriptor

The location, scale, and orientation of each keypoint have been assigned in the previous steps. In this step, the main goal is to find a descriptor that reduces the effects of illumination variation and viewpoint change. By using gradients instead of raw pixel values, the descriptor becomes invariant to additive illumination changes; the effect of different image contrast is removed by normalizing the descriptor vector, and the effect of non-linear illumination changes is reduced by clipping the (normalized) gradient magnitudes to be no larger than 0.2 and renormalizing. Again, the author tested several parameter settings and chose the one with good performance and low complexity.

Application

The author gives an object recognition application using SIFT. He suggests using the ratio of the distance to the closest neighbor over the distance to the second-closest neighbor to obtain reliable matches, and then recommends an approximate nearest-neighbor indexing method (best-bin-first) to find the nearest neighbor efficiently with high probability. Since methods such as RANSAC or Least Median of Squares perform poorly when outliers dominate, the Hough transform is used to identify clusters of consistent matches. Finally, the author shows how to solve for the affine parameters.
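The ratio test itself is easy to write down; a minimal brute-force sketch (0.8 is the threshold suggested in the paper):

import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a against desc_b, keeping only confident matches."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches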

My conclusion

This is not the first work to address scale-invariant, illumination-invariant, or viewpoint-invariant features. The main contribution of this paper is a very complete procedure for computing the SIFT feature, integrating and improving upon the merits of many previous works. To make SIFT more practical, the author gives several tricks to reduce the computational cost. There are also some drawbacks: the application example somewhat deviates from the main subject and takes up too many (seven) pages.

Reference:

D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

Monday, February 25, 2008

[Reading] Lecture 02 - Image Retrieval: Ideas, Influences, and Trends of the New Age

I used to do research on video retrieval when I was in Eric's small group, so I thought this paper would not be too hard for me to read. In fact, there are so many things in it that I had never heard of before. Besides, I only understand how some of the developing techniques can be used in CBIR, not why they were brought into this area in the first place. I'm such a frog at the bottom of a well...

Although this paper focuses only on the history of CBIR after 2000, we can still see how the mainstream of the area has moved over the last decades. In the very beginning, the research topics in CBIR were very limited: people usually took only two steps, feature extraction and similarity calculation, to retrieve what they wanted. After several years, more and more new directions such as signature extraction, clustering (metric learning), classification (SVM), and relevance feedback started thriving. The authors ran some experiments with Google Scholar's search tool to demonstrate this trend.

The authors then discuss several aspects of designing a CBIR system. Because CBIR applications are interactive, they usually think from the user's point of view. Constraints such as search speed are added to ensure the practicality of a CBIR system; those constraints make the system even harder to design, since we cannot solve the problem by brute force.

There is always a trade-off between performance and complexity. Take feature/signature extraction for example: to obtain a thorough description of an image we should also account for spatial information (including shape), but getting concise and meaningful spatial information often requires heavier computation (e.g., segmentation). The same trade-off shows up in similarity calculation. The Mallows distance performs well but is defined through a minimization problem, and since computing it is so expensive, some people would rather use integrated region matching (IRM) as their metric, which can be computed significantly faster and is not much inferior to the Mallows distance.

Another important issue that deserves mention is evaluation strategy. It is hard to judge whether a CBIR result is good or bad because the judgment is very subjective, not to mention comparing two different CBIR systems. Therefore, researchers have started to agree on certain evaluation datasets, benchmarks, and forums.

Overall, this paper gives a comprehensive survey and also points out some possible future directions. However, we won't get much from it by merely skimming: the authors try to cover so much that they sacrifice some of the details. Of course, if we really want to understand a certain subject well, reading the references they list would be a good start.

Reference:
R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys, 39(65), 2007.

Thursday, February 21, 2008

[Reading] Lecture 02 - How to Give a Good Research Talk

First, we should understand that this article focuses on the situation where the audience may not know much about your work and you are giving a technical talk. Knowing your audience is important: it determines what kind of material should go into your slides. To make sure everyone gets something from your talk, give enough clear information instead of too many details. Giving examples is a very good way to explain an idea.

The authors also tell us how to prepare slides, but I think some of the advice is out of date. For example, I don't think preparing slides by typing is a bad idea; handwriting does not work well for everyone.

I do like the authors' suggestion for re-orienting the audience. By saying something like "This is what I have discussed so far, and now I'm going on to cover these areas," we get another chance to recapture the attention of people who are already lost. I have used a similar technique in my own talks before.

Another important concept: avoid the temptation to conceal the problems you know about in your work; get your audience to help you do your research! I think most people would try to hide their weaknesses instead, so I was a little surprised by this philosophy.

Reference:
Simon L Peyton Jones, John Hughes, and John Launchbury, "How to give a good research talk," SIGPLAN Notices 28(11) (Nov 1993).
http://research.microsoft.com/~simonpj/papers/giving-a-talk/giving-a-talk-html.html

[Reading] Lecture 02 - How to Read a Paper

A three-pass approach is recommended in this article. Here I interpret each pass in my own way.

1. First pass: understand whether the paper is what I need.
The most important parts of a paper are the title, abstract, introduction, section headings, and conclusion. From these parts we can usually tell whether the paper is worth reading.
In fact, when we write a paper, we should also craft these parts very carefully to attract the reviewers' attention.

2. Second pass: grasp the content of the paper.
Be able to summarize the main thrust of the paper; it is not necessary to know every detail.

3. Third pass: absorb but also question every detail.
In this pass we are already sure the paper is what we need, so pay more attention to the details and, at the same time, challenge the work to come up with more novel ideas.

Reference:
Keshav, "How to Read a Paper," ACM SIGCOMM Computer Communication Review 2007
http://www.sigcomm.org/ccr/drupal/files/p83-keshavA.pdf