Murat Kunt
A Position Statement for Panel 1: Image/Video feature extraction and segmentation
The 1998 International Workshop on Very Low Bitrate Video Coding

It goes without saying that feature extraction and segmentation are, and will remain, the major components of modern image-processing systems. Yet the problems at hand are not fully solved. There is no optimal solution for feature selection or segmentation: existing ones are at best application-dependent, when they are not fully ad hoc or of the cut-and-try type.
Despite this situation, some maturity has been gained over the last decade, but efforts should continue, since the road ahead still seems quite long.
Cross-fertilization with pattern recognition and computer vision has given enough experience to list general requirements on features: discriminability, variability, independence, coherence, number, etc. We still need rules on how to find them. Since features are combined with each other in some heuristic way, we have learned that coherency is necessary to avoid adding (or multiplying) apples to peaches. The "non-stationary" nature of the data has led to global, semi-global, semi-local or local feature types. A rather successful example in this respect is the so-called Gabor features (Gaussian-weighted sines and cosines), which provide a local subband decomposition. The output energies of each band thus become features. They are quite representative in some applications, but not in all. One can certainly find similar features with other types of filters, but looking for them seems to me a fine-tuning approach without knowing where to tune. Equally important is the handling of motion features and their tracking in a coherent way.

On top of these conceptual problems comes the everlasting trade-off between computational complexity and the quality of the features. Even though the limits are pushed back continuously by higher-speed processors, the trade-off is still present. Assuming that global past experience allows us to define a set of N features to be extracted in all cases (having raw video data in mind), we need to find the relevant subset of these features for later use. Today's supervised and unsupervised feature-selection techniques will eventually converge to fully automatic extraction, provided that the method used can track and record its training. For the near future, supervised selection may still be necessary whenever computational complexity is an issue. This basic dimensionality-reduction problem needs to be solved properly for any subsequent operation to produce decent results.
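As a concrete illustration of the Gabor features mentioned above, the sketch below builds a small bank of even (cosine) Gabor kernels at a few frequencies and orientations and takes the output energy of each filtered band as a feature. All parameter values here (kernel size, frequencies, orientations, sigma) are arbitrary choices for illustration, not a recommended design.

```python
import numpy as np

def gabor_kernel(size, freq, theta, sigma):
    """Even (cosine) Gabor kernel: Gaussian-weighted cosine at a given
    spatial frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinate
    gauss = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return gauss * np.cos(2.0 * np.pi * freq * xr)

def gabor_energies(image, freqs=(0.1, 0.25),
                   thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4),
                   size=15, sigma=3.0):
    """One energy feature per (frequency, orientation) band,
    computed by FFT-based convolution."""
    H, W = image.shape
    shape = (H + size, W + size)                    # zero-padded conv size
    F = np.fft.rfft2(image, s=shape)
    feats = []
    for f in freqs:
        for t in thetas:
            K = np.fft.rfft2(gabor_kernel(size, f, t, sigma), s=shape)
            resp = np.fft.irfft2(F * K, s=shape)
            feats.append(float(np.mean(resp**2)))   # band output energy
    return np.array(feats)

# A vertical-stripe texture (frequency 0.25 along x) responds most
# strongly in the matched band (freq=0.25, theta=0).
img = np.cos(2 * np.pi * 0.25 * np.arange(64))[None, :] * np.ones((64, 1))
print(gabor_energies(img))
```

The matched band dominates the feature vector, which is exactly the discriminability such features are used for; whether these eight energies are "representative" is, as noted above, application-dependent.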
Hints borrowed from human behaviour (vision-research results) give good insight into the tracks to follow. Assuming we have the selected features, we still need to find out how to combine them. Is a weighted sum a viable solution? Here too, supervised and unsupervised combinations will develop in parallel. Spatial and temporal data (texture, structure and motion) must be combined in an adaptive and harmonious way.
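To make the "apples to peaches" point concrete: if heterogeneous feature maps are to be combined by a weighted sum at all, they at least need to be brought to a common scale first. The sketch below is my own illustrative choice, not a method proposed in the text: each per-pixel feature map is z-scored before the weighted sum, and the weights are arbitrary.

```python
import numpy as np

def combine_features(features, weights):
    """Weighted sum of per-pixel feature maps, z-scored first so that
    heterogeneous features (texture, structure, motion) share one scale."""
    features = np.asarray(features, dtype=float)    # (n_feats, H, W)
    mu = features.mean(axis=(1, 2), keepdims=True)
    sd = features.std(axis=(1, 2), keepdims=True) + 1e-12
    z = (features - mu) / sd                        # zero mean, unit variance
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                 # normalized weights
    return np.tensordot(w, z, axes=1)               # sum over feature axis

rng = np.random.default_rng(0)
texture = rng.random((32, 32))
motion = 100.0 * rng.random((32, 32))   # wildly different scale
combined = combine_features([texture, motion], [0.7, 0.3])
print(combined.shape)
```

Without the normalization step, the motion map's larger numeric range would silently dominate the sum regardless of the chosen weights.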

It is well known (but often overlooked) that a composite system cannot be any better than the weakest of its components. This is also true for segmentation that relies on features: a given segmentation technique produces different results for different feature sets. MPEG-4 assumes that segmented video is available. Its real-time use looks problematic to me, since so far segmentation has been done by hand! Even though there are some sophisticated unsupervised spatio-temporal segmentation algorithms, are they enough for MPEG-4? Since we are still far away from semantically meaningful segments, the way ahead is to look for methods that can combine the subsegments of a real object into the object itself. For example, we can extract almost all the subsegments of Ms. Akiyo's face and chest more or less without supervision, but we need to connect them together to make the person a usable video object. As in the feature-selection and combination problem, this cannot be done in a fully automatic way without tracing and storing the training of the method.
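One crude way to move from subsegments toward objects, in the spirit of the Akiyo example, is to merge adjacent regions whose feature statistics are similar. The sketch below is a hypothetical illustration (the label map, features and threshold are made up): it fuses 4-adjacent subsegments whose mean feature vectors are close, using a union-find structure — a purely data-driven step, with none of the semantic knowledge argued for above.

```python
import numpy as np

def merge_subsegments(labels, features, threshold):
    """Greedily merge 4-adjacent subsegments whose mean feature vectors
    are closer than `threshold`; returns a relabelled map."""
    ids = [int(i) for i in np.unique(labels)]
    means = {i: features[labels == i].mean(axis=0) for i in ids}
    counts = {i: int((labels == i).sum()) for i in ids}
    parent = {i: i for i in ids}                    # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]           # path compression
            i = parent[i]
        return i

    # collect 4-neighbour adjacencies between distinct labels
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]),
                 (labels[:-1, :], labels[1:, :])):
        diff = a != b
        pairs.update(zip(a[diff].astype(int).tolist(),
                         b[diff].astype(int).tolist()))

    for i, j in sorted(pairs):
        ri, rj = find(i), find(j)
        if ri != rj and np.linalg.norm(means[ri] - means[rj]) < threshold:
            total = counts[ri] + counts[rj]         # merge rj into ri,
            means[ri] = (counts[ri] * means[ri]     # keeping the running
                         + counts[rj] * means[rj]) / total  # mean exact
            counts[ri] = total
            parent[rj] = ri

    return np.vectorize(find)(labels)

# Toy example: regions 0 and 2 have nearly equal mean features and fuse.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 1, 1],
                   [2, 2, 1, 1]])
feats = np.zeros((4, 4, 1))
feats[labels == 0] = 1.0
feats[labels == 2] = 1.1
feats[labels == 1] = 9.0
merged = merge_subsegments(labels, feats, threshold=0.5)
print(len(np.unique(merged)))
```

Such a merge criterion can glue together the pieces of a face, but it can just as easily glue a face to a similarly textured background — which is precisely why the training of the method has to be traced and stored.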

In both cases (features and segmentation), as we move away from the raw data through dimensionality reduction, we weaken the objective nature of the information fully present in the pixels and introduce an additional, subjective (experience-dependent) component. We need to learn how to handle this component, which will certainly not facilitate final objective quality assessment.