The traditional SPM approach based on bag-of-features (BoF) requires nonlinear classifiers to achieve good image classification performance. This paper presents a simple but effective coding scheme called Locality-constrained Linear Coding (LLC) in place of VQ coding in traditional SPM. LLC utilizes the locality constraint to project each descriptor into its local coordinate system, and the projected coordinates are integrated by max pooling to generate the final representation. With a linear classifier, the proposed approach performs remarkably better than the traditional nonlinear SPM, achieving state-of-the-art performance on several benchmarks. Compared with the sparse coding strategy, the objective function used by LLC has an analytical solution. In addition, the paper proposes a fast approximated LLC method that first performs a K-nearest-neighbor search and then solves a small constrained least-squares fitting problem, with computational complexity of O(M + K^2). Hence even with very large codebooks, our system can still process multiple frames per second. This efficiency significantly adds to the practical value of LLC in real applications.
Our locality-constrained linear coding (LLC) algorithm follows the standard descriptor coding + spatial pooling strategy to extract features from the image:
- Local descriptor extraction from a dense grid on the image. We use dHoG (similar to SIFT) descriptors in this work.
- Nonlinear descriptor coding by LLC to transform each local descriptor into a sparse code.
- Multiscale spatial pyramid max pooling over the sparse codes to get the final feature representation.
The advantages of this method are that it is simple, fast, and scalable, and, with linear classifiers, competitive with ScSPM. The following table compares our algorithm with SPM and ScSPM from the literature.
| | SPM | ScSPM | LLC (ours) |
|---|---|---|---|
| Local descriptors | SIFT descriptors extracted from a dense grid on the image | SIFT descriptors extracted from a dense grid on the image | SIFT descriptors extracted from a dense grid on the image |
| Descriptor coding | Vector quantization | Sparse coding | LLC |
| Code pooling | Multiscale spatial average pooling (histogram) | Multiscale spatial max pooling | Multiscale spatial max pooling |
| Classifier | Nonlinear (chi-square or intersection kernel) | Linear | Linear |
Experimental results are reported on three widely used datasets: Caltech 101, Caltech 256, and PASCAL VOC 2007. Only a single descriptor, the dHoG feature (similar to SIFT, but faster), is used throughout the experiments. In our setup, the dHoG features were extracted from patches densely sampled every 8 pixels on the image, at three scales: 16x16, 25x25, and 31x31 pixels. The dimension of each dHoG descriptor is 128. Please refer to the paper for more details.
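The dense multi-scale sampling grid described above can be sketched as follows. This is only an illustration of where patch centers fall (step of 8 pixels, three patch sizes); `dense_grid` is a hypothetical helper, and the actual dHoG descriptor computation is not shown.

```python
def dense_grid(h, w, step=8, sizes=(16, 25, 31)):
    """Return (y, x, size) patch centers for dense multi-scale sampling.

    h, w  : image height and width in pixels
    step  : grid stride (8 px in the paper's setup)
    sizes : patch side lengths (16, 25, 31 in the paper's setup)
    Only patches fully inside the image are kept.
    """
    locs = []
    for s in sizes:
        half = s // 2
        for y in range(half, h - half + 1, step):
            for x in range(half, w - half + 1, step):
                locs.append((y, x, s))
    return locs
```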
Caltech 101 Dataset
We train a codebook with 2048 bases, and use 4x4, 2x2, and 1x1 sub-regions for spatial max pooling. The images are resized to be no larger than 300x300, with the aspect ratio preserved. For classification, we use linear SVMs trained one-vs-all.
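The multiscale spatial max pooling over the 4x4, 2x2, and 1x1 sub-regions can be sketched as below. This is a minimal NumPy illustration under the stated setup (21 sub-regions in total, concatenated and L2-normalized); `spm_max_pool` and the normalization choice are assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def spm_max_pool(codes, ys, xs, h, w, levels=(4, 2, 1)):
    """Multiscale spatial max pooling of LLC codes.

    codes  : (N, M) LLC codes for N descriptors
    ys, xs : (N,) integer descriptor center coordinates
    h, w   : image height and width
    levels : grid sizes; (4, 2, 1) gives 16 + 4 + 1 = 21 sub-regions
    Returns a (21 * M,) L2-normalized feature vector.
    """
    feats = []
    for g in levels:
        # Assign each descriptor to a cell of the g x g grid
        ry = np.minimum((ys * g) // h, g - 1)
        rx = np.minimum((xs * g) // w, g - 1)
        for i in range(g):
            for j in range(g):
                mask = (ry == i) & (rx == j)
                if mask.any():
                    feats.append(codes[mask].max(axis=0))  # max pooling
                else:
                    feats.append(np.zeros(codes.shape[1]))
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-12)
```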
Caltech 256 Dataset
The experiment settings are the same as for Caltech 101, except that we use a larger codebook containing 4096 bases. In our experiments, using a larger codebook consistently improved performance.
PASCAL VOC 2007 Dataset
The results are reported using the standard Average Precision (AP) measure of this challenge. Note that our system uses only a single descriptor and a linear SVM.
|Best of PASCAL'07|54.9|45.8|77.5|64.0|85.9|36.3|44.7|50.9|79.2|53.2|59.4|
Email me if you have any questions, or if you find any bugs in the code.
References
- Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. CVPR 2009.
- Kai Yu, Tong Zhang, and Yihong Gong. Nonlinear learning using local coordinate coding. NIPS 2009.
- S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. CVPR 2006.