Video Recognition
This page showcases my research in video recognition, focusing on efficient video understanding through motion vectors, CLIP integration, and motion-guided cropping. My work reduces the computational cost of video analysis while maintaining high accuracy.
Research Overview
Video recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive. My research focuses on three key approaches:
- Motion Vector Integration: Leveraging compressed-domain motion information for efficient video understanding
- CLIP-Video Fusion: Combining large-scale vision-language models with temporal information
- Zero-Shot Detection: Training-free methods for video object detection and action recognition
Publications
1. MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
Authors: Binhua Huang, Nan Wang, Arjun Parakash, Soumyabrata Dev
Venue: arXiv preprint, 2025
arXiv: 2509.17084
Code: GitHub Repository
Abstract: Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition.
Key Contributions:
- Two-stream late fusion framework combining CLIP with motion vectors
- Frozen backbone approach ensuring extreme efficiency
- 89.2% Top-1 accuracy on the UCF101 dataset
- Significant performance improvement over zero-shot (65.0%) and MV-only (66.5%) baselines
Technical Details:
- Combines features from a frozen CLIP image encoder with those from a lightweight supervised network trained on raw motion vectors
- Only a tiny Multi-Layer Perceptron (MLP) head is trained during fusion (see the sketch below)
- Provides a highly efficient baseline for video understanding
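To make the fusion step concrete, below is a minimal PyTorch sketch of the two-stream late fusion head. The embedding dimensions, batch size, and stand-in tensors are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Tiny MLP head that fuses a frozen CLIP image embedding with a
    motion-vector embedding. Only this head is trained during fusion."""
    def __init__(self, clip_dim=512, mv_dim=256, num_classes=101):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + mv_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, clip_feat, mv_feat):
        # Late fusion: concatenate the two stream embeddings, then classify.
        return self.mlp(torch.cat([clip_feat, mv_feat], dim=-1))

# Stand-ins for the two frozen streams (in practice: a CLIP image encoder
# and a lightweight MV network, both kept frozen at fusion time).
clip_feat = torch.randn(8, 512)  # [batch, clip_dim]
mv_feat = torch.randn(8, 256)    # [batch, mv_dim]

head = LateFusionHead()
logits = head(clip_feat, mv_feat)  # [8, 101]; UCF101 has 101 classes
```

Because both encoders stay frozen, gradients flow only into the small MLP, which is what keeps the fusion stage cheap.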
2. MVP: Motion Vector Propagation for Zero-Shot Video Object Detection
Authors: Binhua Huang, Ni Wang, Wendong Yao, Soumyabrata Dev
Venue: arXiv preprint, 2025
arXiv: 2509.18388
Code: GitHub Repository
Abstract: Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch.
Key Contributions:
- Training-free pipeline for zero-shot video object detection
- Motion vector propagation from keyframes to intermediate frames
- 3x3 grid aggregation for translation and scale updates
- mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316 on ILSVRC2015-VID
- Outperforms tracker-based propagation methods
Technical Details:
- Invokes OWLv2 only on fixed-interval keyframes
- Uses compressed-domain motion vectors for propagation
- Area-growth check and single-class switch mechanisms
- No labels, no fine-tuning required
- Detector-agnostic: the pipeline can pair with other open-vocabulary detectors (a propagation sketch follows this list)
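To illustrate the propagation step, here is a minimal NumPy sketch of the 3x3 grid aggregation as I read it from the abstract: motion vectors inside a box are pooled per grid cell to estimate a translation and a uniform scale, which update the box between keyframe detections. The scale estimate and all function and variable names are illustrative assumptions, not the released code.

```python
import numpy as np

def propagate_box(box, mv_field):
    """Propagate one detection box a frame forward using compressed-domain
    motion vectors via 3x3 grid aggregation (translation + uniform scale).

    box: (x1, y1, x2, y2) in pixels.
    mv_field: [H, W, 2] array of per-pixel (dx, dy) motion vectors.
    """
    x1, y1, x2, y2 = box
    H, W = mv_field.shape[:2]
    xs = np.linspace(max(x1, 0), min(x2, W), 4).astype(int)
    ys = np.linspace(max(y1, 0), min(y2, H), 4).astype(int)

    # Mean motion per cell of a 3x3 grid laid over the box.
    grid = np.full((3, 3, 2), np.nan)
    for r in range(3):
        for c in range(3):
            cell = mv_field[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]
            if cell.size:
                grid[r, c] = cell.reshape(-1, 2).mean(axis=0)

    # Translation: average motion over all valid cells.
    dx, dy = np.nanmean(grid.reshape(-1, 2), axis=0)

    # Uniform scale (illustrative): how much the outer cells spread apart
    # relative to the box size (right vs. left columns, bottom vs. top rows).
    w, h = x2 - x1, y2 - y1
    sx = 1.0 + (np.nanmean(grid[:, 2, 0]) - np.nanmean(grid[:, 0, 0])) / max(w, 1)
    sy = 1.0 + (np.nanmean(grid[2, :, 1]) - np.nanmean(grid[0, :, 1])) / max(h, 1)
    s = (sx + sy) / 2.0

    # Shift the center, scale the extent uniformly.
    cx, cy = (x1 + x2) / 2 + dx, (y1 + y2) / 2 + dy
    w2, h2 = w * s / 2, h * s / 2
    return (cx - w2, cy - h2, cx + w2, cy + h2)

# Usage sketch: detect with OWLv2 on keyframes, propagate in between.
mv = np.zeros((240, 320, 2)); mv[..., 0] = 2.0  # uniform 2 px/frame motion
print(propagate_box((50, 60, 120, 140), mv))    # box shifts right by ~2 px
```

In the full pipeline this update would run per frame between fixed-interval keyframes; the abstract's area-growth check and single-class switch sit on top of this basic update.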
3. MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition
Authors: Binhua Huang, Wendong Yao, Shaowu Chen, Guoxin Wang, Qingyuan Wang, Soumyabrata Dev
Venue: arXiv preprint, 2025
arXiv: 2509.18473
Code: GitHub Repository
Abstract: This paper presents MoCrop, a motion-aware adaptive cropping module for compressed-domain efficient video action recognition. MoCrop leverages motion vectors in H.264 videos to locate motion-dense regions and applies a single clip-level crop to all I-frames during inference. The module is training-free, adds no parameters, and can be plugged into different backbone networks.
Key Contributions:
- Training-free, motion-guided cropping for video action recognition
- Uses motion vectors to localize motion-dense regions
- A single clip-level crop applied to all I-frames
- Adds no parameters; plug-and-play design
- Significant efficiency improvements in video processing
Technical Details:
- Leverages H.264 motion vectors for motion analysis
- Localizes motion-dense regions from aggregated motion vectors
- Applies a single crop across all I-frames (see the sketch below)
- Compatible with various backbone networks
- No training required, zero parameter overhead
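As a rough illustration of the cropping idea, the NumPy sketch below pools clip-level motion magnitudes into a saliency map and picks the densest window using an integral image. The crop size, search stride, and helper names are my assumptions, not the released module.

```python
import numpy as np

def mocrop_window(mv_mag, crop_h, crop_w):
    """Find the crop window with the highest motion density.

    mv_mag: [H, W] motion-vector magnitude map aggregated over the clip.
    Returns (top, left) of a crop_h x crop_w window.
    """
    H, W = mv_mag.shape
    # Integral image: score any candidate window in O(1).
    integ = np.zeros((H + 1, W + 1))
    integ[1:, 1:] = mv_mag.cumsum(axis=0).cumsum(axis=1)

    best, best_score = (0, 0), -np.inf
    for top in range(0, H - crop_h + 1, 8):       # stride 8 is illustrative
        for left in range(0, W - crop_w + 1, 8):
            s = (integ[top + crop_h, left + crop_w] - integ[top, left + crop_w]
                 - integ[top + crop_h, left] + integ[top, left])
            if s > best_score:
                best, best_score = (top, left), s
    return best

# Usage sketch: one crop per clip, reused for every I-frame (training-free).
mv_mag = np.abs(np.random.randn(240, 320))  # stand-in for decoded MV magnitudes
top, left = mocrop_window(mv_mag, 168, 224)
# frame[top:top+168, left:left+224] would then be applied to each I-frame
```

Because the window is chosen once per clip and only slicing is applied afterwards, the module adds no parameters and requires no training, consistent with the plug-and-play claim above.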
Research Impact
These works contribute to efficient video understanding through:
- Computational Efficiency: Significant reduction in computational costs while maintaining accuracy
- Training-Free Methods: Zero-shot and training-free approaches for practical deployment
- Motion Vector Utilization: Effective use of compressed-domain information
- Real-World Applications: Practical solutions for video analysis tasks
Technical Innovations
Motion Vector Integration
- Direct utilization of compressed video streams
- Efficient temporal information extraction
- Reduced computational overhead
CLIP-Video Fusion
- Large-scale vision-language model integration
- Zero-shot capabilities for video understanding
- Frozen backbone efficiency
Zero-Shot Detection
- Training-free object detection pipelines
- Motion vector propagation mechanisms
- Open-vocabulary compatibility
Future Directions
- Integration of motion vector methods with advanced vision transformers
- Cross-domain application of cropping techniques
- Real-time optimization for streaming video analysis
- Multi-modal fusion approaches
This page summarizes my contributions to video recognition research, focusing on efficient methods that leverage compressed-domain information and large-scale vision-language models.