Tuesday, April 29, 2008

[Reading] Lecture 10 - Transductive Inference for Text Classification using Support Vector Machines

The paper does not try to modify the core technique of SVM; instead, it gives a way to extend SVM to the semi-supervised setting. An SVM does classification by dividing the feature space into regions according to the training data. If the feature space is high-dimensional and the number of training examples is small (often smaller than the dimension of the feature space), we can expect the learned separator to be biased toward (overfit to) the training data, because training ignores the test (unlabeled) data entirely. This is the traditional SVM approach: 'inductive learning (IL)'.

Therefore, this paper incorporates information from the unlabeled data during model training: transductive learning (TL). TSVM tries to separate not only the training data but also the test data in the feature space. However, since the test data are unlabeled, their labels must be guessed and updated during training. This paper shows how this can be done.
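A minimal sketch of the guess-and-update idea (not Joachims's actual solver, which optimizes a combined margin objective): train on the labeled set, guess labels for the unlabeled set, then greedily swap a positive/negative pair of guessed labels whenever the swap lowers the hinge loss on the unlabeled data. The regularized least-squares fit stands in for a real SVM trainer.

```python
import numpy as np

def hinge_loss(w, X, y):
    # Sum of hinge losses max(0, 1 - y * <w, x>) over the set.
    return np.maximum(0.0, 1.0 - y * (X @ w)).sum()

def fit_linear(X, y, reg=1e-3):
    # Regularized least-squares fit, a stand-in for the SVM solver.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

def tsvm_sketch(X_l, y_l, X_u, iters=10):
    w = fit_linear(X_l, y_l)
    y_u = np.sign(X_u @ w)          # initial guesses for unlabeled labels
    y_u[y_u == 0] = 1.0
    for _ in range(iters):
        # Retrain on labeled data plus current guesses.
        w = fit_linear(np.vstack([X_l, X_u]), np.concatenate([y_l, y_u]))
        improved = False
        base = hinge_loss(w, X_u, y_u)
        # Joachims-style switching: flip one +/- pair of guessed labels
        # if doing so lowers the hinge loss on the unlabeled set.
        for i in np.where(y_u > 0)[0]:
            for j in np.where(y_u < 0)[0]:
                y_try = y_u.copy()
                y_try[i], y_try[j] = -1.0, 1.0
                if hinge_loss(w, X_u, y_try) < base:
                    y_u, improved = y_try, True
                    break
            if improved:
                break
        if not improved:
            break
    return w, y_u
```

With only one labeled example per class and a pool of unlabeled points, the retraining step lets the unlabeled data shape the separator, which is the point of the transductive setting.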

Their experiments also give good side information about the difference between SVM and TSVM. The evidence shows that when the number of training examples is rather small (compared to the number of test examples), TSVM greatly outperforms SVM. If the numbers of training and test examples are roughly the same, their performances are roughly the same as well, since the test data do not add much information.

Reference:
T. Joachims, "Transductive Inference for Text Classification using Support Vector Machines," Proc. International Conference on Machine Learning (ICML), 1999.

[Reading] Lecture 10 - Combining Labeled and Unlabeled Data with Co-Training

This paper gives a way to do semi-supervised learning, and it focuses on the problem of how to combine different kinds of information in the learning procedure. Interpreted more mathematically, the same problem is equivalent to how to train a classifier from two feature sets with limited labeled data and a large number of unlabeled samples.

The main idea of co-training is to let one classifier (trained on feature set 1) teach the other classifier (trained on feature set 2) and vice versa. One benefit is that data to which the two classifiers agree on assigning the same label is more likely to be correct, and can be used as labeled data in the next iteration. The other benefit is that each classifier can learn from the other something it could not learn by itself, which makes up for the deficiency that the classifiers were trained completely separately.
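A toy sketch of that loop, with a nearest-centroid learner standing in for each view's classifier (an illustrative stand-in; the paper itself uses naive Bayes over page text and hyperlink words). In each round, each view labels the unlabeled examples it is most confident about and adds them to the shared labeled pool for the other view to learn from:

```python
import numpy as np

def fit_centroids(X, y):
    # Nearest-centroid learner: a simple stand-in for each view's classifier.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = sorted(model)
    dist = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    # Return predicted labels and a confidence (negative distance).
    return np.array(classes)[dist.argmin(axis=0)], -dist.min(axis=0)

def co_train(X1, X2, y, labeled, rounds=5, per_round=2):
    # y holds known labels where `labeled` is True; X1/X2 are the two
    # views, assumed conditionally independent as in the paper.
    y, labeled = y.copy(), labeled.copy()
    for _ in range(rounds):
        for X in (X1, X2):                      # each view teaches the other
            pool = np.where(~labeled)[0]
            if len(pool) == 0:
                return y, labeled
            model = fit_centroids(X[labeled], y[labeled])
            pred, conf = predict(model, X[pool])
            take = np.argsort(conf)[-per_round:]  # most confident guesses
            y[pool[take]] = pred[take]
            labeled[pool[take]] = True
    return y, labeled
```

The "agreement" effect the paper relies on shows up here indirectly: each view only promotes its highest-confidence guesses, so the pseudo-labels entering the pool tend to be correct.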

Comment:
After I understood the idea of this paper, I think there is a question that must be answered: when should we separate the two feature vectors (f1, f2) to train two classifiers? Why don't we just put them together to form a single feature vector? The authors did make the assumption that f1 and f2 are (conditionally) independent; is that reasonable? And how can I know whether f1 and f2 are independent?

Reference:
Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-training," COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1998.

Wednesday, April 23, 2008

[Reading] Lecture 09 - Improved Boosting Algorithms using Confidence-Rated Predictions

This paper proposes a more general AdaBoost algorithm than the one proposed by Freund and Schapire. For example, the weak hypotheses do not have to range within [-1, +1]; instead, they can range over all of R. The paper provides different ways to determine alpha, and shows that Freund and Schapire's version can be derived as a special case of their method. The paper then gives a criterion for finding weak hypotheses, which was not an issue in the original version. A generalization error bound is also given, which removes the restriction that minimizing training error is the only concern and goal. The last half of the paper is quite important too: the authors give ways to solve multiclass and multi-label classification problems with AdaBoost. Freund and Schapire's original AdaBoost can only solve a binary (two-class), single-label problem.
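For a weak hypothesis with outputs in [-1, +1], one of Schapire and Singer's closed-form choices is alpha = ½ ln((1+r)/(1-r)), where r is the weighted correlation between labels and predictions. A one-round sketch:

```python
import numpy as np

def boost_round(D, y, h_vals):
    # One round of confidence-rated AdaBoost (Schapire & Singer).
    # h_vals: weak-hypothesis outputs in [-1, +1]; y: labels in {-1, +1};
    # D: example weights summing to 1. Assumes |r| < 1.
    r = np.sum(D * y * h_vals)                 # edge of the weak hypothesis
    alpha = 0.5 * np.log((1 + r) / (1 - r))    # closed-form choice of alpha
    D = D * np.exp(-alpha * y * h_vals)        # reweight the examples
    return alpha, D / D.sum()

# For a binary hypothesis with weighted error eps, r = 1 - 2*eps, so alpha
# reduces to Freund and Schapire's 0.5 * ln((1 - eps) / eps).
```

The reduction in the trailing comment is the "special case" the paper points out: plugging r = 1 - 2·eps into (1+r)/(1-r) gives (1-eps)/eps.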

This is a good paper for understanding AdaBoost more deeply. If I have to implement AdaBoost in the future (homework!?) and performance matters, I will refer to this paper to look for better approaches.

Reference:
Schapire and Singer, "Improved Boosting Algorithms Using Confidence-rated Predictions," Machine Learning, Vol. 37, 1999.

Monday, April 21, 2008

[Reading] Lecture 09 - Rapid Object Detection using a Boosted Cascade of Simple Features

The paper proposes an algorithm to train an object detection model that is very fast, with detection power competitive with the state of the art at the time. There are three main contributions in their work:

1. Integral Image
The complexity of computing the integral image is low: O(image size). More importantly, it can then be used to obtain any Haar-like feature in constant time, which speeds up the detection process; large-scale features can be extracted just as fast as small ones.
The integral image has another use: we can get the variance within any rectangle of the image in constant time if we keep two integral images, one accumulating the intensity values and the other accumulating the squared intensity values.
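A sketch of both uses: the integral image via two cumulative sums, a four-lookup rectangle sum, and the constant-time rectangle variance from a second integral image of squared intensities:

```python
import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r+1, :c+1], built with two cumulative sums.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    # Sum over the rectangle (r0, c0)..(r1, c1), inclusive, using at most
    # four lookups into the integral image.
    s = ii[r1, c1]
    if r0 > 0:
        s -= ii[r0 - 1, c1]
    if c0 > 0:
        s -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        s += ii[r0 - 1, c0 - 1]
    return s

def box_variance(img, r0, c0, r1, c1):
    # Constant-time variance from two integral images, as described above:
    # one over intensities, one over squared intensities.
    ii = integral_image(img.astype(float))
    ii2 = integral_image(img.astype(float) ** 2)
    n = (r1 - r0 + 1) * (c1 - c0 + 1)
    mean = box_sum(ii, r0, c0, r1, c1) / n
    return box_sum(ii2, r0, c0, r1, c1) / n - mean ** 2
```

Any Haar-like feature is then just a signed combination of a few such box sums, which is why feature evaluation is constant time regardless of feature size.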

2. Training by Adaboost
There are tons of Haar-like features in an image, but many of them are useless. By treating each weak classifier as an indicator of the usefulness of one Haar-like feature, AdaBoost can be used to select a set of features and train the classifier simultaneously.
Since feature selection is done automatically, their work can easily be extended to objects other than faces.
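A toy sketch of this selection process with threshold stumps over generic feature columns (standing in for actual Haar responses): each round, the stump with the lowest weighted error names the most useful remaining feature, and the example weights are updated before the next pick.

```python
import numpy as np

def best_stump(X, y, D):
    # Pick the (feature, threshold, polarity) stump with the lowest
    # weighted error: AdaBoost's per-round "feature selection".
    best = (None, None, 1, np.inf)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - t) > 0, 1, -1)
                err = D[pred != y].sum()
                if err < best[3]:
                    best = (f, t, pol, err)
    return best

def adaboost_select(X, y, rounds=3):
    # Returns the chosen (feature index, alpha) per round.
    D = np.full(len(y), 1.0 / len(y))
    chosen = []
    for _ in range(rounds):
        f, t, pol, err = best_stump(X, y, D)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, f] - t) > 0, 1, -1)
        D = D * np.exp(-alpha * y * pred)      # upweight the mistakes
        D /= D.sum()
        chosen.append((f, alpha))
    return chosen
```

The real detector works the same way at much larger scale: tens of thousands of candidate Haar features, with each boosting round committing to exactly one of them.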

3. Attentional Cascade
This part is designed to accelerate detection rather than improve accuracy. The idea is simple: sacrifice the false positive rate to achieve a very high detection rate in the early stages, throwing out as many unlikely candidate regions as possible. Since most candidate regions are removed at the beginning, the number of operations in the later stages decreases.
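The control flow of the cascade is tiny; a sketch where each stage is a (score function, threshold) pair ordered cheap-to-expensive:

```python
def cascade_detect(window, stages):
    # Evaluate a candidate window through the attentional cascade.
    # Reject at the first stage whose score falls below its threshold,
    # so most windows exit after the cheapest stages.
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False   # rejected early: later stages never run
    return True            # survived every stage: report a detection
```

Each stage's threshold is tuned for near-perfect detection rate at the price of many false positives, since a window must survive every stage but can be rejected by any single one.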

I think this is a very classic work in object detection. It can be used to detect objects other than faces without any prior knowledge, as long as you have enough training data. Of course, it is not guaranteed to perform well on all objects. Haar-like features are appearance features; if an object's appearance changes often (e.g., a t-shirt), the method proposed in the paper will easily fail.

Reference:
P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," Proc. CVPR, Vol. 1, pp. 511-518, 2001.

Tuesday, April 8, 2008

[Reading] Lecture 08 - Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

This is a pioneering work that makes the machine learn to auto-annotate image segments (blobs). We say auto-annotate here because they don't need a precisely annotated database; all they need is images with words associated with them. The annotation task is then modeled as a machine translation job, which can be solved by the EM algorithm. After giving the details of using EM to estimate the model, the paper provides several experimental results and a discussion, but they are not quite convincing to me.
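An IBM-Model-1-flavored sketch of that lexicon estimation (the blob and word identifiers here are illustrative integer indices, not the paper's actual vocabularies): the E-step distributes each caption word over the image's blobs in proportion to the current translation probabilities, and the M-step renormalizes the accumulated counts.

```python
import numpy as np

def em_lexicon(pairs, n_blobs, n_words, iters=20):
    # pairs: one (blob_ids, word_ids) tuple per image; learns p(word | blob).
    t = np.full((n_blobs, n_words), 1.0 / n_words)  # uniform initialization
    for _ in range(iters):
        counts = np.zeros_like(t)
        for blobs, words in pairs:
            for w in words:
                # E-step: spread word w over this image's blobs in
                # proportion to the current translation probabilities.
                p = t[blobs, w]
                counts[blobs, w] += p / p.sum()
        # M-step: renormalize each blob's counts into probabilities.
        t = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    return t
```

Even an ambiguous image (two blobs, two words, no alignment given) gets disambiguated over iterations, because unambiguous images sharpen the table that the ambiguous ones are scored against.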

Before this week, I had no idea how to classify/recognize/label an object with unsupervised methods. Now at least I know one or two such approaches. This is very interesting, since in many realistic cases only unsupervised data is available.

Reference:
P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth, "Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary," European Conference on Computer Vision, p. IV: 97 ff., 2002.

Monday, April 7, 2008

[Reading] Lecture 08 - Names and Faces in the News

This paper demonstrates an unsupervised face annotation algorithm.

As I mentioned in [Reading] lecture 4, a major difference between PCA and LDA (I used the term FLD at that time) is that LDA exploits a supervised data set to do dimensionality reduction. This makes LDA perform better than PCA for classification, but at the cost of supervision beforehand. However, this paper can still utilize LDA without labeling beforehand! The reason is that a lexicon of names can be extracted automatically from the news captions. Those names may not be accurate for all images, but they are already enough for LDA to do dimensionality reduction (or, we could say, to obtain discriminative properties).
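For the two-class case, the LDA (Fisher) direction can be sketched in a few lines; the labels here play exactly the role the noisy caption names play in the paper:

```python
import numpy as np

def fisher_lda_direction(X, y):
    # Two-class Fisher discriminant: w = Sw^{-1} (mu1 - mu0), the direction
    # maximizing between-class scatter over within-class scatter.
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)  # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)
```

Unlike PCA, which would chase the axis of largest overall variance, this direction favors the axis that actually separates the classes, even if its variance is small.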

They didn't restrict the dimensionality reduction method to LDA; kernel PCA is also recommended. However, their database is too large for kernel PCA, so they use the Nyström approximation to make the job tractable.
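The Nyström trick approximates the full n × n kernel matrix from kernel evaluations against m sampled landmark points, K ≈ C W⁺ Cᵀ. A sketch (the RBF kernel and landmark sampling here are illustrative, not the paper's exact setup):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Gaussian (RBF) kernel between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom(X, m, kernel, rng):
    # Nystrom approximation: sample m landmarks, then K ~= C @ pinv(W) @ C.T,
    # where C is kernel values against the landmarks (n x m) and W is the
    # landmark-landmark block (m x m). Cost drops from O(n^2) kernel
    # evaluations to O(n * m).
    idx = rng.choice(len(X), size=m, replace=False)
    C = kernel(X, X[idx])
    W = kernel(X[idx], X[idx])
    return C @ np.linalg.pinv(W) @ C.T
```

When m equals n the approximation recovers the full kernel matrix exactly; the savings come from choosing m much smaller than n, as in the paper's large face database.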

After dimensionality reduction, clustering is done with a modified k-means method, which can be interpreted as a face-labeling process. Since the lexicon of names obtained at the beginning is noisy, they further prune and merge the clusters to improve the result.
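A rough sketch of that cleanup: ordinary k-means followed by two heuristics, dropping tiny clusters (likely caption noise) and merging centroids that land close together. The `min_size` and `merge_dist` knobs and the fixed initialization are my illustrative choices, not the paper's procedure.

```python
import numpy as np

def kmeans_prune_merge(X, k, init, min_size=2, merge_dist=3.0, iters=20):
    # Plain k-means from the given initial centers, then prune and merge.
    centers = init.astype(float).copy()
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = dist.argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    # Prune: drop clusters with too few members.
    keep = [c for c in range(k) if (assign == c).sum() >= min_size]
    # Merge: fold together centroids closer than merge_dist.
    merged = []
    for c in centers[keep]:
        for i, m in enumerate(merged):
            if np.linalg.norm(c - m) < merge_dist:
                merged[i] = (m + c) / 2
                break
        else:
            merged.append(c)
    return np.array(merged)
```

The intent matches the paper's pipeline: a singleton cluster built around a mislabeled caption name should disappear rather than survive as a spurious identity.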

The paper gives a "face recognition" algorithm that is very different from previous works: it utilizes the caption information to automatically generate the supervised dataset (and also the test dataset). However, I think their system can only recognize faces that appear several times.

Reference:
T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. Learned-Miller, and D. A. Forsyth, "Names and Faces in the News," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. II: 848-854, 2004.