Authors
Abstract
Single visual concept detection in videos is a hard task, especially for infrequent concepts or for those that are difficult to model. The problem becomes even harder in the case of concept pairs. Two main directions may tackle it: 1) combine the predictions of the corresponding individual detectors, as is widely done in information retrieval, or 2) build supervised learners for the pairs of concepts by generating annotations based on the co-occurrences of the two individual concepts. Each of these approaches has advantages and drawbacks. The major problem with the second method is the need for a set of annotated samples, especially for the positive class. If some concepts are infrequent, this scarcity increases even more for pairs formed by their combinations. On the other hand, two concepts may both be frequent yet rarely co-occur in the same document. Some studies suggested overcoming this problem by harvesting samples from the web, but this solution is expensive in terms of time and money. In this work, we compare the two approaches without using any external resources. Our evaluation was carried out in the context of the concept pair detection subtask of the TRECVID 2013 semantic indexing (SIN) task, and the results showed that in the case of videos, if no external information resources are used, approaches that combine the two concept detectors can be more efficient than learning-based methods, in contrast to what was shown previously in the case of still images. The described methods outperform the best official
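The first direction mentioned above (combining the predictions of the two individual detectors) can be sketched as a simple late-fusion step. The sketch below assumes each detector outputs a probability-like score in [0, 1] per shot; the function name and the particular fusion rules (product, min, mean) are illustrative assumptions, not the paper's exact combination formulas.

```python
# Hedged sketch: late fusion of two individual concept detector scores
# into a single score for the concept pair. Assumes probability-like
# inputs in [0, 1]; the fusion rules shown are common choices, not
# necessarily the ones evaluated in the paper.

def fuse_pair_scores(score_a: float, score_b: float, method: str = "product") -> float:
    """Combine two single-concept scores into one pair score."""
    if method == "product":
        # High only when both concepts are likely present.
        return score_a * score_b
    if method == "min":
        # Pair score limited by the weaker detector.
        return min(score_a, score_b)
    if method == "mean":
        # Softer combination; tolerates one weak score.
        return 0.5 * (score_a + score_b)
    raise ValueError(f"unknown fusion method: {method}")

# Example: a shot scored 0.8 for one concept and 0.6 for the other
# (hypothetical values).
pair_score = fuse_pair_scores(0.8, 0.6, method="product")
```

The product rule is the strictest of the three: a near-zero score from either detector drives the pair score to near zero, which matches the intuition that both concepts must be present in the shot.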