- Published on
Total Recall
- Authors
- Name
- Vincent Hu
TotalRecall: A Software Framework for Capturing Events of Interest from Long Surveillance Videos
Introduction

The Problem
Surveillance videos generate an overwhelming amount of data, making manual review time-consuming and inefficient. Traditional monitoring methods lack the ability to automatically identify key moments, leading to missed events and inefficient security operations. TotalRecall addresses this challenge by leveraging AI models to automate the detection and captioning of significant events.
Architecture

- Frontend: Responsible for UI display and visualization of detected events, allowing users to navigate and review critical moments efficiently.
- Backend: Performs event detection and captioning using state-of-the-art AI models, including Florence-2 and LLaVa-Next-Video, ensuring accurate and efficient processing.
Key Features
- Event Detection: Identifies key objects and actions in video clips, answering fundamental questions like “What’s in the video?” and “What happened?”.
Frame Difference Analysis: Uses a baseline method to detect changes between frames and segment events. The model visualizes results using an progress bar with energy, inspired by Youtube Video. An energy bar represents how significant an event is, based on the magnitude of detected motion or the rarity of an object appearing in the scene.
- Multi-Model Processing:
- Florence-2: Provides a unified representation for various vision tasks, improving event recognition.
- LLaVa-Next-Video: A fine-tuned model with strong zero-shot video understanding capabilities, capable of processing long videos efficiently.
GIS Integration: Incorporates geospatial information for location-based event analysis, making it particularly useful for surveillance applications requiring geographic context.
Efficient Video Summarization: Converts long surveillance videos into structured event logs with descriptions, icons, and categorized event types.
Workflow
- Video clips are processed through a frame processor to identify significant changes.
- Detected objects and events are labeled, and a customized prompt is used to generate event descriptions.
- AI models refine the event summaries, generating structured outputs with event type, title, and description.
- The frontend displays event logs for user interaction, visualization, and review.
Results

Limitations
Despite its success, TotalRecall faces some challenges:
- Processing Speed: AI-driven analysis can be computationally intensive.
- Visualization: Enhancing user-friendly representations of results remains an ongoing goal.
- Accuracy: Event detection performance depends on training data and model refinement.
Conclusion
TotalRecall represents a significant step forward in automated video surveillance analysis. By combining deep learning with GIS integration, it offers a scalable and efficient solution for detecting, captioning, and summarizing events in long surveillance videos. As the framework continues to evolve, improvements in speed, visualization, and accuracy will further enhance its capabilities.