In "Related research" of this paper, the author gives a good review of the area of feature point matching. From the history, we can more clearly understand what is the main concern in each component of SIFT. The author also explains why SIFT do not try to cover wider affine invariance. He claims that none of previous approaches are yet fully affine invariant, as they start with initial featurescales and locations selected in a non-affine-invariant manner. Lossing some affine invariance makes SIFT more robust against changes in 3D viewpoint for non-planar surfaces, more efficient feature extraction, and the ability to identify larger numbers of features.
Since SIFT is composed of so many components, I won't give too much technical detail in the following. More focuses will be put on the what problems each component is trying to solve.
1. Detection of scale-space extrema
SIFT detects feature points using cascade filtering for efficiency. In this step, SIFT tries to find out the candidate locations that is invariant to the scale-space (In fact, also invariant to some simple transformations and illuminace changes). The candidate locations are found by finding the local maximum / minimum in the difference-of-Gaussian (DOG) function convolved with the image. The DoG function is a close approximation of the LoG function but the DoG can significantly accelerate the computation process. Since there are some parameters of the DOG function need to be determined, the author determines the parameters according to the repeatability of the candidate locations after doing some synthesized image transformtion. Some tricky implementation details are given to improve the computation efficiency.
2. Accurate keypoint localization
After the candidate positions are found, the position of the extremum is refined by solving the derivative of the Taylor expansion around the candidate position. Some more candidate positions are rejected if they are low contrast or along an edge. The author determines if a point is on an edge by checking the ratio of the two largest pricipal curvature (yes if the ratio is too large).
3. Orientation assignment
So far, the candidate positions are already found and would not be rejected in the following steps. Indeed, the methods used to find these positions are invariant to scale, rotation, and change in illuminace. However, we also need a point descriptor that is invariant to scale, rotation, and change in illuminace, too. In this step, the author determines one or more dominant directions of a point according to the gradient histogram. The author gives the flexibility to choose more than one dominant direction, since the second (third?) dominant direction may become the first dominant direction when there is image noise. The dominant direction will be used in the next step to achieve invariation to rotation.
4. The local image descriptor
The location, scale, orientation of each keypoint are already assigned by the previous steps. In this step, the main goal is to find a proper descriptor to reduce the effects caused by illumination variation and viewpoint. By adopting the gradient instead of the pixel value, the descriptor is invariant to illuminance. The effects caused by different image contrast is removed by normalizing the gradient histogram. And the effects caused by non-linear illumination changes is reduced by restricting the magnitudes of gradient (after normalization) to be less than 0.2. Again, the author tested the performance using several parameters, and chosen the one with good performance and low complexity.
Application
The author gives an object recognition application using SIFT. He suggests to use the ratio of the distance of the closest neighbor to the distance of the second-closest neighbor to achieve reliable matching. And then he recommends a nearest neighbor indexing method to more efficiently find the nearest neighbor with high probability. Since methods such as RANSAC or Least Median of Squares perform poorly when outliers dominate, Hough transformation is used to help identify clusters. At the end, the author gives the methods to obtain the solution for affine parameter.
My conclusion
This is not the first work addressing on scale-invariant, illuminant-invariant, or viewpoint-invariant features. The main contribute of this paper is that the author introduces a very complete procedure to get SIFT feature, by integrating and improving the merits of many previous works. To make SIFT more practicable, the author gives several tricks to reduce the computation cost. However, there is also some drawbacks in this paper - the application example somehow deviates from the main subject, and takes too many (7) pages.
Reference:
D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,”Int’l J. Computer Vision, vol. 2, no. 60, pp. 91-110, 2004.