Motion-caption

Datasets with more detailed action descriptions

Introduction

With the widespread application of video understanding tasks, generating accurate and comprehensive video captions has become a key objective for large-scale video understanding models. However, existing models often fail to sufficiently capture dynamic video information, resulting in captions that lack detailed descriptions of actions and motion changes. This limitation hinders model performance in real-world scenarios such as automatic video summarization and video surveillance. This paper proposes a novel data processing and model fine-tuning approach to address this issue. We first preprocess existing datasets to ensure that the captions contain accurate visual information and more detailed action descriptions. Additionally, we integrate multi-source data, combining human-annotated action data with virtual action data generated in Unity3D. Through effective data fusion and a staged training strategy, we fine-tune existing large-scale video understanding models to generate captions with richer dynamic information. Experimental results show significant improvements across multiple datasets, particularly in capturing and generating action-related descriptions, compared to the original models. The proposed method offers a new way to capture dynamic information in video understanding models and provides a more practical solution for real-world applications.
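
The sketch below is a rough illustration of the multi-source fusion and staged-training idea described above, not the released pipeline: it mixes human-annotated samples with Unity3D-generated ones at a fixed ratio and runs two fine-tuning stages. All names (CaptionSample, fuse_sources, train_stage, the file paths, ratios, and learning rates) are hypothetical placeholders.

```python
# Minimal sketch of multi-source data fusion + staged fine-tuning.
# Paths, ratios, and the trainer hook are assumptions, not the authors' code.
import random
from dataclasses import dataclass

@dataclass
class CaptionSample:
    video_path: str
    caption: str
    source: str  # "human" (human-annotated) or "unity3d" (synthetic)

def fuse_sources(human, synthetic, synthetic_ratio=0.3):
    """Mix human-annotated and Unity3D-generated samples at a fixed ratio."""
    n_synth = int(len(human) * synthetic_ratio / (1.0 - synthetic_ratio))
    fused = human + random.sample(synthetic, min(n_synth, len(synthetic)))
    random.shuffle(fused)
    return fused

def train_stage(model, data, epochs, lr):
    """Placeholder for one fine-tuning stage of a video captioning model."""
    for _ in range(epochs):
        for sample in data:
            pass  # forward pass on sample.video_path, caption loss, optimizer step

if __name__ == "__main__":
    human_data = [CaptionSample(f"videos/real_{i}.mp4", "a person picks up a cup", "human")
                  for i in range(100)]
    unity_data = [CaptionSample(f"videos/unity_{i}.mp4", "the avatar jumps over a box", "unity3d")
                  for i in range(100)]

    fused = fuse_sources(human_data, unity_data, synthetic_ratio=0.3)

    model = None  # stand-in for a pretrained large video understanding model
    # Staged training: adapt on the fused action data, then refine on human captions only.
    train_stage(model, fused, epochs=1, lr=1e-5)
    train_stage(model, human_data, epochs=1, lr=5e-6)
```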

Datasets

Here are examples of the four sub-datasets.

Statistics

Figure: composition of the Motion-caption data.

Exact Quantity

Figure: exact quantity of samples from each data source.

Empirical Results

Caption Comparison

Figure: caption comparison between the original and fine-tuned models.

As shown in the figure, the comparison reveals that the fine-tuned model generates more detailed and comprehensive descriptions of object movements and interactions.

Citation

@misc{wang2024lvbench,
      title={Capturing Motion: Fine-Tuning Video Captioning Models with Dynamic Action Data},
      author={Zhiqin Fang and Lefan Wang and Zhuoyi Yang and Jiayan Teng and Yean Cheng and Shiyu Huang and Yuxiao Dong and Zhaofeng He and Jie Tang},
      year={2024},
      eprint={2406.08035},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}