We built VidMod to answer a single question: What happens when you give an AI model the ability to not just watch a video, but to edit it?
In the world of media, compliance is a bottleneck. Every minute of content uploaded to platforms like YouTube, TikTok, or Netflix requires rigorous checking for brand safety, profanity, and regulatory compliance. Today, this is done by armies of human moderators or fragile, single-purpose computer vision models.
VidMod is different. It is an autonomous agent built on the Gemini 3.0 family. It doesn't just flag violations; it orchestrates a complex, multi-step remediation process to fix them. It acts as a Director, Editor, and Compliance Officer all in one, powered by the reasoning capabilities of Gemini 3 Pro and the speed of Gemini 3 Flash.
VidMod operates on a cloud-native, event-driven architecture designed for high-throughput video processing.
```mermaid
graph TD
    User[User / React Frontend] -->|1. Request Signed URL| API[FastAPI Backend]
    API -->|2. Generate URL| GCS[Google Cloud Storage]
    User -->|3. Direct Upload| GCS
    User -->|4. Trigger Process| API
    subgraph "The Gemini Loop"
        API -->|5. Analyze Video| Gemini[Gemini 3.0 Pro Preview]
        Gemini -->|6. JSON Findings| API
        API -->|7. User Review| User
        User -->|8. Approve Action| API
        API -->|9. Generate Mask| SAM3[SAM3 Model]
        API -->|10. Inpaint/Replace| Runway[Runway Gen-4]
        API -->|11. Dubbing| Eleven[ElevenLabs]
        API -->|12. Stitch Video| FFmpeg[Local FFmpeg]
    end
    FFmpeg -->|13. Final Video| GCS
    User -->|14. Download| GCS
```
VidMod is not a wrapper. It is a system architected around Gemini 3's unique capabilities. This project would have been impossible with previous generation models.
Traditional video AI relies on "sampling": extracting one frame per second and sending it to an image model. This discards motion, temporal context, and cause-and-effect.
VidMod leverages Gemini 3 Pro's native video understanding. We upload the entire video file to the model's 1M+ token context window. This allows Gemini to:
- Understand Action: Distinguish between "holding a bottle" vs. "drinking from a bottle."
- Track Context: Know that a scene is a "historical reenactment" (allowed) vs. "glamorization of violence" (violation).
- Temporal Consistency: Recognize that an object appearing at 0:05 is the same one at 0:15, even if the angle changes.
Gemini 3 doesn't just see; it hears. By processing video and audio together, Gemini 3 Flash can detect profanity with millisecond precision and, crucially, understand the intent behind the words. A friendly "shut up" is ignored; an aggressive slur is flagged for replacement.
We don't hard-code rules. We inject complex, real-world policy documents (e.g., "broadcast standards for daytime TV" or "platform usage guidelines") directly into Gemini's context. The model acts as a reasoning engine, citing specific policy clauses for every decision it makes.
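Steps 5 and 6 of the Gemini Loop reduce to a single multimodal call. Below is a minimal sketch using the google-genai Python SDK; the model id, file paths, policy document, and findings schema are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch of the analysis step (Gemini Loop, steps 5-6).
# Assumes the google-genai SDK; model id and paths are illustrative.
import time
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Upload the whole video -- audio track included, no frame sampling.
video = client.files.upload(file="uploads/episode_012.mp4")
while video.state.name == "PROCESSING":      # large files take a moment
    time.sleep(5)
    video = client.files.get(name=video.name)

policy = open("policies/daytime_broadcast.md").read()  # hypothetical policy doc

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # illustrative model id
    contents=[
        video,
        "You are a compliance officer. Review this video against the policy "
        "below. Return a JSON array of findings, each with timestamp_start, "
        "timestamp_end, violation_type, policy_clause, suggested_action.\n\n"
        + policy,
    ],
    config={"response_mime_type": "application/json"},
)
print(response.text)  # JSON findings handed back to the FastAPI layer
```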
We built a robust, event-driven architecture to turn Gemini's reasoning into reality. The system supports five distinct remediation actions, each powered by a specialized engineering pipeline.
For privacy protection (faces, license plates), precision is paramount. A bounding box is not enough for moving video.
- Implementation: When Gemini flags an object, we trigger SAM3 (Segment Anything Model 3).
- Mechanism: We feed the frame and the object prompt to SAM3, which generates a high-fidelity segmentation mask. This mask is tracked across frames to ensure the blur stays locked to the subject, even as they move or turn.
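To make the mechanism concrete, here is a minimal OpenCV sketch of mask-locked blurring. The `get_mask` helper is a hypothetical stand-in for the SAM3 segmentation-and-tracking step; paths and kernel size are illustrative.

```python
# Sketch of mask-locked blurring with OpenCV. get_mask() is a hypothetical
# stand-in for SAM3 segmentation + tracking; paths are illustrative.
import cv2
import numpy as np

def blur_masked(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Blur only the pixels covered by a boolean segmentation mask."""
    blurred = cv2.GaussianBlur(frame, (51, 51), 0)
    return np.where(mask[..., None], blurred, frame)  # broadcast over channels

cap = cv2.VideoCapture("uploads/input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("outputs/blurred.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = get_mask(idx)  # hypothetical: per-frame bool mask from SAM3 tracking
    out.write(blur_masked(frame, mask))
    idx += 1

cap.release()
out.release()
```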
Similar to blur, but used for explicit content or stylistic censorship.
- Implementation: We utilize the same SAM3-driven masking pipeline but apply a mosaic filter via FFmpeg.
- Mechanism: The block size of the pixelation is dynamically calculated based on the resolution of the video to ensure illegibility without destroying the surrounding visual context.
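For reference, FFmpeg's classic downscale-then-nearest-neighbor-upscale trick produces exactly this effect. The sketch below pixelates the full frame for brevity (the real pipeline confines the effect to the SAM3 mask region); the divisor heuristic and paths are assumptions.

```python
# Sketch of resolution-aware pixelation via FFmpeg's downscale/upscale trick.
import subprocess

def pixelate(src: str, dst: str, width: int, height: int) -> None:
    # Scale the block size with resolution: ~40 blocks across the frame
    # keeps content illegible without wrecking the surrounding context.
    block = max(8, width // 40)
    vf = (
        f"scale=iw/{block}:ih/{block},"          # shrink: averages each block
        f"scale={width}:{height}:flags=neighbor"  # grow back: hard block edges
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf, "-c:a", "copy", dst],
        check=True,
    )

pixelate("uploads/input.mp4", "outputs/pixelated.mp4", 1920, 1080)
```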
This is where VidMod acts as a Creative Autopilot. Instead of masking an object, we replace it entirely.
- Implementation: We use Generative Video-to-Video technology (Runway Gen-4).
- Mechanism:
- Prompt Engineering: Gemini 3 Pro analyzes the scene's lighting, depth, and style, then writes a detailed prompt.
- Reference Generation: To ensure the replacement object (e.g., a specific brand-safe soda can) looks perfect, we use Gemini 3 Pro's native image generation to create a photorealistic reference image.
- Synthesis: We dispatch the prompt, the reference image, and the video clip to the generation engine, which synthesizes a replacement that respects the original object's motion and physics.
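Since the exact Gen-4 API surface isn't documented here, the sketch below captures only the control flow of those three steps; all three helpers are hypothetical stubs standing in for the project's Gemini calls and its core/runway_engine.py wrapper.

```python
# Control-flow sketch of object replacement. All helpers are hypothetical.

def write_replacement_prompt(clip_path: str, finding: dict) -> str:
    """Hypothetical: Gemini 3 Pro analyzes lighting/depth/style and
    writes a detailed generation prompt for the replacement object."""
    raise NotImplementedError

def generate_reference_image(prompt: str) -> bytes:
    """Hypothetical: Gemini 3 Pro native image generation renders a
    photorealistic reference (e.g., a brand-safe soda can)."""
    raise NotImplementedError

def run_video_to_video(clip_path: str, prompt: str, reference: bytes) -> str:
    """Hypothetical wrapper around the Runway Gen-4 video-to-video call."""
    raise NotImplementedError

def replace_object(clip_path: str, finding: dict) -> str:
    prompt = write_replacement_prompt(clip_path, finding)    # 1. Prompt
    reference = generate_reference_image(prompt)             # 2. Reference
    return run_video_to_video(clip_path, prompt, reference)  # 3. Synthesis
```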
Replacing a spoken word without re-recording the actor is one of the hardest problems in AV engineering.
- Implementation: Custom Audio Separator pipeline using Demucs.
- Mechanism:
- Separation: We split the audio into four stems: Vocals, Drums, Bass, and Other.
- Voice Cloning: We clone the speaker's voice using a clean sample found elsewhere in the video.
- Synthesis: We generate the "safe" word (chosen by Gemini) in the actor's voice via ElevenLabs.
- Muting & Recombination: We mute only the specific timestamp in the Vocal stem and mix the new word with the persistent background stems. This ensures background music doesn't "dip" when the word is changed.
The classic censor.
- Implementation: Precision FFmpeg audio filtering.
- Mechanism: We generate a 1000Hz sine wave matching the exact duration of the profanity, zero out the original audio, and overlay the beep with zero leakage.
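A sketch of that filter chain, driven from Python via subprocess. Timestamps and paths are illustrative; `sine`, `volume=enable='between(...)'`, `adelay`, and `amix` are standard FFmpeg filters.

```python
# Sketch of the bleep: mute the window, synthesize a tone, mix it in.
import subprocess

def bleep(src: str, dst: str, start: float, end: float) -> None:
    duration = end - start
    delay_ms = int(start * 1000)
    filter_complex = (
        # Zero out the original audio only inside the profanity window...
        f"[0:a]volume=enable='between(t,{start},{end})':volume=0[muted];"
        # ...generate a 1000 Hz tone of exactly the right length...
        f"sine=frequency=1000:duration={duration}[tone];"
        # ...shift it to the window and mix it over the muted track.
        f"[tone]adelay={delay_ms}|{delay_ms}[beep];"
        f"[muted][beep]amix=inputs=2:duration=first:normalize=0[aout]"
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-filter_complex", filter_complex,
         "-map", "0:v", "-map", "[aout]", "-c:v", "copy", dst],
        check=True,
    )

bleep("uploads/input.mp4", "outputs/bleeped.mp4", start=12.3, end=12.78)
```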
VidMod is designed for long-running, complex tasks.
- State Persistence: The pipeline maintains a job state machine that survives server restarts.
- Self-Healing: Automatic retries with exponential backoff for external APIs.
- Result Verification: Gemini records "Thought Signatures" that explain why it chose a specific action (e.g., "Replacing 'beer' with 'soda' because policy 4.2 prohibits alcohol in kid-rated content").
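As an illustration of the self-healing behavior, the retry logic can be as small as a helper like this; the parameters and the `runway_client` in the usage note are illustrative, not the project's actual code.

```python
# Minimal sketch of retry-with-exponential-backoff for external API calls.
import random
import time

def with_backoff(fn, *, retries: int = 5, base_delay: float = 1.0):
    """Call fn(); on failure, retry with exponentially growing waits."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the job state
            # Waits of 1s, 2s, 4s, ... plus jitter to avoid thundering herds.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# Usage (hypothetical client): result = with_backoff(lambda: runway_client.generate(job))
```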
VidMod proves that Gemini 3 is more than a chatbot—it's an infrastructure layer for the next generation of media tools. We are moving from a world where humans edit content to a world where humans direct agents to edit content for them.
Looking Ahead: Gemini Veo

While we currently orchestrate external video models, the future of VidMod lies with Gemini Veo. As Veo's native video-to-video inpainting capabilities mature, we will replace our entire external generation stack with native Gemini video generation. This will unify the entire pipeline, from Analysis (Pro) to Audio (Flash) to Generation (Veo), under a single, multimodal model family.
| Endpoint | Method | Description |
|---|---|---|
| `/api/upload-url` | `GET` | Generates a specific GCS Signed URL for large file uploads. |
| `/api/process-upload` | `POST` | Triggers the analysis pipeline after client-side upload completes. |
| `/api/analyze-video/{job_id}` | `POST` | Starts the Gemini 3.0 analysis job (detects violations). |
| `/api/apply-edit/{job_id}` | `POST` | Applies a specific edit (Blur, Pixelate, Replace) to a finding. |
| `/api/preview/{job_id}/frame/{idx}` | `GET` | Retrieves a specific video frame for UI preview. |
| `/api/download/{job_id}` | `GET` | Downloads the final processed video. |
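Under the hood, `/api/upload-url` boils down to minting a V4 signed PUT URL with the google-cloud-storage client, so the browser uploads straight to GCS and bypasses the backend. A sketch, with an illustrative bucket name, expiry, and content type; signing requires the Service Account Token Creator role mentioned in the prerequisites.

```python
# Sketch of signed-URL minting for direct browser-to-GCS uploads.
from datetime import timedelta
from google.cloud import storage

def make_upload_url(object_name: str) -> str:
    client = storage.Client()
    blob = client.bucket("vidmod-uploads").blob(object_name)  # illustrative bucket
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=15),  # URL is useless after this window
        method="PUT",
        content_type="video/mp4",
    )
```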
VidMod is cloud-native and ready for production.
The backend is stateless and containerized.
```bash
# Deploy to Cloud Run
gcloud run deploy vidmod-backend \
  --source . \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 8080 \
  --env-vars-file env.yaml \
  --project vidmod-2025 \
  --min-instances 1 \
  --session-affinity
```

Note: Session affinity is enabled to ensure job state consistency during multi-step processing.
The frontend is a static React SPA deployed to the edge.
```bash
cd frontend
npm run build
firebase deploy --only hosting
```

Prerequisites:

- Python 3.10+
- Node.js 18+ & npm
- FFmpeg: Must be available in the system `PATH`.
- Google Cloud Account: Storage Bucket & Service Account (Token Creator role).
```bash
git clone https://github.com/gana36/VidMod.git
cd VidMod

# Create & Activate Virtual Env
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Mac/Linux

# Install Dependencies
pip install -r requirements.txt

# Configure Environment
cp .env.example .env
# ⚠️ Populate your .env with API keys!
```

```bash
cd frontend
npm install
npm run dev
```

Open http://localhost:5173 to start the VidMod agent.
```
VidMod/
├── app/                          # FastAPI Backend
│   ├── main.py                   # Entry Point
│   ├── routers/                  # API Endpoints
│   └── services/                 # Business Logic
├── core/                         # Core AI Engines
│   ├── gemini_video_analyzer.py  # Gemini 3.0 Analysis
│   ├── runway_engine.py          # Gen-4 Wrapper
│   └── gcs_uploader.py           # Cloud Storage
├── frontend/                     # React Frontend
│   └── src/components/           # UI Components
├── uploads/                      # Local Temp Storage
└── requirements.txt              # Python Dependencies
```
400 Bad Request (Can not finalize upload)
This occurs when the client-side upload is interrupted. It usually means the network connection dropped between your browser and Google Cloud Storage. Fix: Refresh the page and try uploading again.
500 Analysis Failed (Cloud Run)
If you see this in production, it likely means the server tried to write to a read-only directory. Fix: Ensure `env.yaml` sets `UPLOAD_DIR`, `FRAMES_DIR`, and `OUTPUT_DIR` to `/tmp/...`. Use the provided deployment command, which includes this fix.
Asset duration must be at least 1 seconds (Runway Error)
VidMod has Smart Clipping built-in. If your selected violation is less than 1 second, the backend automatically expands the clip range to meet the API requirement.
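For illustration, the kind of expansion logic Smart Clipping implies can be as simple as padding the range symmetrically until it reaches the minimum; this sketch is an assumption, not the repo's actual implementation, and the constant is illustrative.

```python
# Sketch: pad a sub-second finding until it meets a 1 s minimum clip length.
MIN_CLIP_S = 1.0

def expand_clip(start: float, end: float, video_len: float) -> tuple[float, float]:
    deficit = MIN_CLIP_S - (end - start)
    if deficit <= 0:
        return start, end          # already long enough
    start = max(0.0, start - deficit / 2)
    end = min(video_len, start + MIN_CLIP_S)
    # If we ran into the end of the video, push the start back instead.
    start = max(0.0, end - MIN_CLIP_S)
    return start, end
```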
VidMod is open for contributions.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
MIT License © 2026 VidMod