Friday, February 29, 2008

[Reading] Lecture 03 - Distinctive Image Features from Scale-Invariant Keypoints

SIFT is a powerful method for extracting distinctive feature points that can be used not only in CBIR, but also in object recognition, image matching, stereo vision, and so on. Since different cameras can capture the same scene from different positions, at different times, from different viewing angles, or with different camera settings, a good feature describing a point should be invariant to viewing angle and illumination changes. Although SIFT does NOT handle every kind of transformation caused by viewpoint changes, its computational efficiency and stable matching power have made it so popular that many researchers use it as a standard tool.

In "Related research" of this paper, the author gives a good review of the area of feature point matching. From the history, we can more clearly understand what is the main concern in each component of SIFT. The author also explains why SIFT do not try to cover wider affine invariance. He claims that none of previous approaches are yet fully affine invariant, as they start with initial featurescales and locations selected in a non-affine-invariant manner. Lossing some affine invariance makes SIFT more robust against changes in 3D viewpoint for non-planar surfaces, more efficient feature extraction, and the ability to identify larger numbers of features.

Since SIFT is composed of so many components, I won't go into too much technical detail in the following. The focus will instead be on what problem each component is trying to solve.

1. Detection of scale-space extrema

SIFT detects feature points using a cascade filtering approach for efficiency. In this step, SIFT tries to find candidate locations that are stable across scale space (in fact, also stable under some simple transformations and illumination changes). The candidates are the local maxima/minima of the difference-of-Gaussian (DoG) function convolved with the image. The DoG function is a close approximation of the LoG function, but it can significantly accelerate the computation. Since the DoG function has several parameters that need to be determined, the author chooses them according to the repeatability of the candidate locations under synthesized image transformations. Some tricky implementation details are given to improve the computation efficiency; a rough sketch of the core idea follows.
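To make the idea concrete, here is a minimal sketch of DoG extremum detection in Python (my own illustration, not Lowe's code): it builds a single stack of Gaussian-blurred images, subtracts adjacent levels, and keeps points that are extrema in their 3x3x3 scale-space neighborhood. It omits the octave structure and the subpixel refinement of the paper; sigma0, k, and threshold are assumed values chosen for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, sigma0=1.6, k=2 ** (1 / 3), num_scales=5, threshold=0.03):
    """Find candidate keypoints as 3x3x3 extrema of a DoG stack.
    Illustrative parameters; omits octaves and subpixel refinement."""
    image = image.astype(np.float32) / 255.0
    # Gaussian-blurred images at geometrically spaced scales sigma0 * k^i
    blurred = [gaussian_filter(image, sigma0 * k ** i) for i in range(num_scales)]
    # Difference-of-Gaussian: subtract adjacent blur levels
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(num_scales - 1)])
    # A candidate is a local max or min within its scale-space neighborhood
    is_max = (dog == maximum_filter(dog, size=3)) & (dog > threshold)
    is_min = (dog == minimum_filter(dog, size=3)) & (dog < -threshold)
    s, y, x = np.nonzero(is_max | is_min)
    return list(zip(x, y, s))  # (column, row, scale index) of each candidate
```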

2. Accurate keypoint localization

After the candidate positions are found, the position of each extremum is refined by setting the derivative of the Taylor expansion around the candidate position to zero. More candidates are then rejected if they have low contrast or lie along an edge. The author decides whether a point is on an edge by checking the ratio of the two principal curvatures (yes, if the ratio is too large).
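The edge test does not need the eigenvalues explicitly: the paper compares Tr(H)^2 / Det(H) of the 2x2 Hessian against (r+1)^2 / r, with r = 10. A small sketch (the 3x3 patch argument is my own framing of the inputs):

```python
import numpy as np

def is_edge_like(dog_patch, r=10.0):
    """Edge rejection test from the paper: reject if the ratio of the
    two principal curvatures exceeds r (Lowe uses r = 10).
    dog_patch: 3x3 DoG neighborhood centered on the candidate point."""
    # Second derivatives by finite differences at the center pixel
    dxx = dog_patch[1, 2] - 2 * dog_patch[1, 1] + dog_patch[1, 0]
    dyy = dog_patch[2, 1] - 2 * dog_patch[1, 1] + dog_patch[0, 1]
    dxy = (dog_patch[2, 2] - dog_patch[2, 0]
           - dog_patch[0, 2] + dog_patch[0, 0]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:  # curvatures of opposite sign: reject outright
        return True
    # Tr(H)^2 / Det(H) grows with the curvature ratio, so this is
    # equivalent to testing the ratio itself without eigenvalues
    return trace ** 2 / det >= (r + 1) ** 2 / r
```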

3. Orientation assignment

So far, the candidate positions have been found and will not be rejected in the following steps. The methods used to find these positions are invariant to scale, rotation, and changes in illumination. However, we also need a point descriptor that is invariant to scale, rotation, and changes in illumination. In this step, the author determines one or more dominant directions for each point according to a histogram of local gradient orientations. The author allows more than one dominant direction, since the second (or third?) dominant direction may become the first dominant direction in the presence of image noise. The dominant direction will be used in the next step to achieve invariance to rotation.
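A rough sketch of the idea in Python (my own simplification: the paper builds a 36-bin histogram with Gaussian-weighted contributions and interpolates each local peak, whereas this version just takes every bin within 80% of the strongest):

```python
import numpy as np

def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
    """Return dominant gradient directions (degrees) for a keypoint patch.
    36 bins and the 80% peak rule follow the paper; Gaussian weighting
    and peak interpolation are omitted for brevity."""
    dy, dx = np.gradient(patch.astype(np.float32))  # central differences
    magnitude = np.hypot(dx, dy)
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    # Orientation histogram weighted by gradient magnitude
    hist, _ = np.histogram(angle, bins=num_bins, range=(0.0, 360.0),
                           weights=magnitude)
    # Every bin within peak_ratio of the highest peak yields an orientation
    peaks = np.nonzero(hist >= peak_ratio * hist.max())[0]
    return (peaks + 0.5) * (360.0 / num_bins)  # bin centers
```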

4. The local image descriptor

The location, scale, and orientation of each keypoint have already been assigned by the previous steps. In this step, the main goal is to find a descriptor that reduces the effects of illumination and viewpoint variation. By building on gradients instead of raw pixel values, the descriptor is invariant to additive illumination changes. The effect of different image contrast is removed by normalizing the descriptor vector to unit length, and the effect of non-linear illumination changes is reduced by clamping each entry of the normalized vector to at most 0.2 and renormalizing. Again, the author tested the performance under several parameter settings and chose the one with good performance and low complexity.
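The normalization step is simple enough to write out directly (a minimal sketch; the 0.2 threshold is the paper's value):

```python
import numpy as np

def normalize_descriptor(vec, clamp=0.2):
    """Illumination normalization of a SIFT descriptor vector:
    unit-normalize, clamp entries at 0.2 to suppress non-linear
    illumination effects, then renormalize."""
    vec = vec / max(np.linalg.norm(vec), 1e-12)  # contrast invariance
    vec = np.minimum(vec, clamp)                 # limit any single gradient
    return vec / max(np.linalg.norm(vec), 1e-12)
```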

Application

The author demonstrates an object recognition application built on SIFT. He suggests using the ratio of the distance to the closest neighbor over the distance to the second-closest neighbor to achieve reliable matching, and then recommends an approximate nearest-neighbor indexing method (best-bin-first search) to find the nearest neighbor efficiently with high probability. Since methods such as RANSAC or Least Median of Squares perform poorly when outliers dominate, the Hough transform is used to identify clusters of consistent matches first. At the end, the author gives a method to solve for the affine parameters.
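The ratio test itself is easy to sketch (brute-force search shown for clarity; the paper rejects matches whose distance ratio exceeds 0.8 and uses the approximate k-d tree instead of exhaustive search):

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Match descriptors from image A to image B using Lowe's ratio test:
    accept only if the nearest neighbor is clearly closer than the
    second-nearest. desc_a, desc_b: (N, 128) arrays of descriptors."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        nn1, nn2 = np.partition(dists, 1)[:2]  # two smallest distances
        if nn1 < max_ratio * nn2:
            matches.append((i, int(np.argmin(dists))))
    return matches
```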

My conclusion

This is not the first work to address scale-invariant, illumination-invariant, or viewpoint-invariant features. The main contribution of this paper is that the author presents a very complete procedure for extracting SIFT features, integrating and improving on the merits of many previous works. To make SIFT more practical, the author gives several tricks to reduce the computation cost. However, there are also some drawbacks in this paper: the application example somewhat deviates from the main subject and takes too many (7) pages.

Reference:

D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
