Presentation: Canva
Demo: YouTube
This work synthesizes a video from a set of video lectures that answers a question raised by a student. It involves the following objectives:
- Select a set of video lectures that contain SRTs.
- Study and implement a voice activity detection (VAD) algorithm.
- Extract the speech segments from the VAD output.
- Identify the spoken content in text form using ASR for each segment.
- Obtain sentence-specific time stamps.
- Create an answer summary.
- Identify the video parts corresponding to the answer summary.
- Stitch the summary video segments into a natural-looking video (see the stitching sketch below).
Chiranjeevi Yarra (Spoken Language Forensics & Informatics (SLFI) group - LTRC)
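As a rough illustration of the final stitching objective, the following is a minimal sketch using moviepy (one of the dependencies listed below); the segment list and file names are hypothetical and not taken from this repository.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Hypothetical (video path, start second, end second) segments selected for the answer
segments = [
    ("Data/lecture1.mp4", 120.0, 150.5),
    ("Data/lecture2.mp4", 45.0, 80.0),
]

# Cut each segment and concatenate the clips into one answer video
clips = [VideoFileClip(path).subclip(start, end) for path, start, end in segments]
answer_video = concatenate_videoclips(clips, method="compose")
answer_video.write_videofile("answer.mp4")
```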
- We use Flask and HTML for the frontend server. To run it:
- Navigate to the frontend directory:
  ```bash
  cd frontend
  ```
- Run:
  ```bash
  python3 main.py
  ```
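For orientation, here is a minimal sketch of what a Flask entry point like `frontend/main.py` might look like; the route, template name, and form field are assumptions, not the repository's actual code.

```python
# Hypothetical sketch of a Flask frontend entry point (not the repo's actual main.py)
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    answer_video = None
    if request.method == "POST":
        question = request.form.get("question", "")
        # The real app would pass the question to the backend pipeline here
        # and receive a path to the stitched answer video.
    return render_template("index.html", answer_video=answer_video)

if __name__ == "__main__":
    app.run(debug=True)
```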
- FFMPEG: `pip3 install ffmpeg-python`
- PyTorch: `pip3 install torch torchvision`
- Transformers: `pip3 install transformers`
- Sentence Transformers: `pip3 install -U sentence-transformers`
- Faiss: `pip3 install faiss-cpu`
- Silero VAD: `pip3 install silero-vad`
- SoundFile: `pip3 install soundfile`
- Sox: `pip3 install sox`
- Streamlit: `pip3 install streamlit`
- pysrt: `pip3 install pysrt`
- moviepy: `pip3 install moviepy==1.0.3`
- Note: `ffmpeg` is also required on the system, so please install it through `apt install ffmpeg` (Linux) or `brew install ffmpeg` (Mac).
- Videos should be placed in the `Data/` folder.
- Run the following notebook to complete the processing up to audio chunk generation: `pipeline-qwen.ipynb`
  This notebook will:
  - Convert video → audio
  - Perform Voice Activity Detection (VAD)
  - Generate audio chunks
- This generates audio chunk (`.wav`) files for each lecture (a sketch of these steps follows below).
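A minimal sketch of these three steps using ffmpeg-python, silero-vad, and soundfile (all listed in the requirements above); the file paths and chunk naming are placeholders, and the actual notebook may differ.

```python
import os

import ffmpeg
import soundfile as sf
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

VIDEO = "Data/lecture1.mp4"   # placeholder input video
AUDIO = "Data/lecture1.wav"
CHUNK_DIR = "Data/chunks"
os.makedirs(CHUNK_DIR, exist_ok=True)

# 1) Video -> 16 kHz mono audio
ffmpeg.input(VIDEO).output(AUDIO, ac=1, ar=16000).overwrite_output().run(quiet=True)

# 2) Voice activity detection with Silero VAD (timestamps in seconds)
model = load_silero_vad()
wav = read_audio(AUDIO)
speech_segments = get_speech_timestamps(wav, model, return_seconds=True)

# 3) Write one .wav chunk per detected speech segment
audio, sr = sf.read(AUDIO)
for i, seg in enumerate(speech_segments):
    chunk = audio[int(seg["start"] * sr): int(seg["end"] * sr)]
    sf.write(os.path.join(CHUNK_DIR, f"lecture1_{i:04d}.wav"), chunk, sr)
```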
- Convert the audio chunks into SRT files using the Whisper model and MFA.
- Encode these SRT files with the QWEN model by running the `Encoding.py` file.
- The generated `.index` files are then used in the backend repo (sketches of both steps follow).
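A minimal sketch of transcribing one audio chunk with a Whisper model via the `transformers` pipeline; the model name and chunk path are placeholders, and the MFA alignment step is not shown.

```python
from transformers import pipeline

# Placeholder Whisper checkpoint; the repo may use a different size
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("Data/chunks/lecture1_0000.wav", return_timestamps=True)
print(result["text"])
print(result["chunks"])  # segment-level timestamps that can be used to build the SRT
```

And a minimal sketch of the encoding/indexing step using pysrt, sentence-transformers, and FAISS; the embedding model name, paths, and index type are assumptions and may differ from the actual `Encoding.py`.

```python
import os

import faiss
import pysrt
from sentence_transformers import SentenceTransformer

os.makedirs("Data/index", exist_ok=True)

# Read sentence-level text from the SRT produced in the previous step
subs = pysrt.open("Data/srt/lecture1.srt")           # placeholder SRT path
sentences = [sub.text.replace("\n", " ") for sub in subs]

# Placeholder Qwen-based embedding model; swap in the checkpoint used by Encoding.py
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(sentences, normalize_embeddings=True)

# Inner-product index over normalized embeddings (equivalent to cosine similarity)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "Data/index/lecture1.index")
```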


