
hemanth-sunkireddy/MultiModal-VideoQA


Automatic multimodal question and answering for video lectures

Presentation: Canva

Demo: Youtube

Project description

This work synthesizes, from a set of video lectures, a video that answers the question raised by a student. It involves the following objectives:

  1. Select a set of video lectures that contains SRT subtitle files.
  2. Study and implement a voice activity detection (VAD) algorithm.
  3. Extract the speech segments from the VAD output.
  4. Identify the spoken content of each segment in text form using ASR.
  5. Obtain sentence-specific time stamps.
  6. Create an answer summary.
  7. Identify the video parts corresponding to the answer summary.
  8. Stitch the summary video segments together to obtain a natural-looking video (see the stitching sketch after this list).
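The final stitching step can be pictured with moviepy (already listed in the prerequisites below). The segment list and file names here are purely illustrative placeholders, not the repo's actual code:

    # Sketch of objective 8: stitch selected lecture segments into one answer video.
    # The (video, start, end) triples stand in for the output of objectives 5-7.
    from moviepy.editor import VideoFileClip, concatenate_videoclips

    segments = [
        ("Data/lecture1.mp4", 125.0, 151.5),
        ("Data/lecture3.mp4", 842.2, 870.0),
    ]

    clips = [VideoFileClip(path).subclip(start, end) for path, start, end in segments]
    answer = concatenate_videoclips(clips, method="compose")  # "compose" tolerates differing resolutions
    answer.write_videofile("answer.mp4")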

Guide

Chiranjeevi Yarra (Spoken Language Forensics & Informatics (SLFI) group - LTRC)

Running the Frontend

  • The frontend is a Flask server that renders HTML pages. To run it:
  1. Navigate to the frontend directory:
    cd frontend
  2. Start the server:
    python3 main.py
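For orientation only, a minimal Flask app of the shape frontend/main.py takes could look like the sketch below; the actual routes and templates in the repo may differ:

    # Hypothetical minimal frontend - the real frontend/main.py may differ.
    from flask import Flask, render_template, request

    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        question = ""
        if request.method == "POST":
            question = request.form.get("question", "")
            # ...forward the question to the backend and fetch the answer video...
        return render_template("index.html", question=question)

    if __name__ == "__main__":
        app.run(debug=True)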

Video to Audio Conversion and Dividing Audio into Audio Chunks

Prerequisites

  1. FFMPEG: pip3 install ffmpeg-python
  2. PyTorch: pip3 install torch torchvision
  3. Transformers: pip3 install transformers
  4. Sentence Transformers: pip3 install -U sentence-transformers
  5. Faiss: pip3 install faiss-cpu
  6. Silero VAD: pip3 install silero-vad
  7. SoundFile: pip3 install soundfile
  8. Sox: pip3 install sox
  9. Streamlit: pip3 install streamlit
  10. pysrt: pip3 install pysrt
  11. moviepy: pip3 install moviepy==1.0.3
  • Note: ffmpeg must also be installed on the system. Install it with apt install ffmpeg (Linux) or brew install ffmpeg (macOS).

Steps

  • Videos should be placed in the Data/ folder.
  1. Run the following notebook to complete the processing up to audio chunk generation:

pipeline-qwen.ipynb

This notebook will:

  • Convert Video → Audio
  • Perform Voice Activity Detection (VAD)
  • Generate Audio Chunks
  • This generates audio chunk (.wav) files for each lecture; a rough sketch of these steps is shown below.
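The sketch below approximates the notebook's conversion and chunking flow using ffmpeg-python, Silero VAD, and SoundFile. File names and the output directory are placeholders, and the notebook's actual parameters may differ:

    # Rough sketch of pipeline-qwen.ipynb up to chunk generation (paths are placeholders).
    import ffmpeg
    import soundfile as sf
    from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

    SR = 16000

    # 1. Video -> mono 16 kHz audio
    ffmpeg.input("Data/lecture1.mp4").output("lecture1.wav", ac=1, ar=SR).run(overwrite_output=True)

    # 2. Voice activity detection with Silero VAD
    model = load_silero_vad()
    wav = read_audio("lecture1.wav", sampling_rate=SR)
    speech = get_speech_timestamps(wav, model, sampling_rate=SR)

    # 3. One .wav chunk per detected speech segment (start/end are sample indices)
    for i, seg in enumerate(speech):
        sf.write(f"chunks/lecture1_{i:04d}.wav", wav[seg["start"]:seg["end"]].numpy(), SR)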
  2. Convert the audio chunks into SRT files using the Whisper model and MFA (Montreal Forced Aligner).
  3. Encode these SRT files with the Qwen model by running the Encoding.py file; a sketch of this step follows the list.
  4. Use the generated .index files with the backend repo.
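As a rough picture of how the .index files can be produced from SRTs with sentence-transformers and FAISS, see the sketch below. The embedding model name is a placeholder; the actual Qwen model and settings should be taken from Encoding.py:

    # Illustrative encoding step - Encoding.py may use a different (Qwen-based) embedding model.
    import faiss
    import pysrt
    from sentence_transformers import SentenceTransformer

    subs = pysrt.open("lecture1.srt")
    sentences = [sub.text.replace("\n", " ") for sub in subs]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
    embeddings = model.encode(sentences, normalize_embeddings=True)

    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
    index.add(embeddings)
    faiss.write_index(index, "lecture1.index")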

Question Classifier Output

[Image: question classifier output]

Finding Related Sentences for Question Output

[Image: related sentences output]

Voice Activity Detector Output

[Image: voice activity detector output]
