Official repository for "HASap: Hierarchical Acoustic-Semantic Annotation Pipeline for Scripted Speech Data" (Accepted to ICASSP 2026).
The workflow comprises three stages: (1) Speech Preprocessing; (2) Hierarchical Annotation; and (3) Context-Aware Resegmentation. Example texts originate in Chinese and are shown here in English translation.
To obtain clean, speaker-consistent speech segments suitable for reliable annotation, we preprocess the raw audio with a pipeline inspired by Emilia-Pipe. It consists of the following steps (an illustrative sketch follows the list):
- Audio Standardization: Convert all audio files to a unified format.
- Source Separation: Remove background music and non-speech components, retaining vocal signals.
- Speaker Diarization: Detect speaker boundaries and ensure each segment contains speech from a single speaker.
- VAD-based Segmentation: Apply voice activity detection to split audio into fine-grained speech segments.
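
The sketch below strings these four steps together. It is a minimal illustration, not the repository's actual implementation: the specific tools shown (ffmpeg, pyannote, Silero VAD) are stand-ins for the Emilia-Pipe-style components, and the source-separation step is reduced to a comment.

```python
# Illustrative preprocessing sketch; tool choices are assumptions,
# not the exact models used by this repository.
import subprocess
import torch
from pyannote.audio import Pipeline

SR = 16000

def standardize(src: str, dst: str) -> str:
    # Audio Standardization: convert to mono WAV at a unified rate via ffmpeg.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(SR), dst], check=True)
    return dst

def preprocess(raw_path: str) -> list[dict]:
    wav_path = standardize(raw_path, "std.wav")

    # Source Separation would run here (e.g., a UVR/Demucs-style model)
    # to keep only the vocal stem; omitted for brevity.

    # Speaker Diarization: one speaker label per turn.
    # (pyannote pretrained pipelines may require a Hugging Face token.)
    diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    turns = diarizer(wav_path)

    # VAD-based Segmentation: split each turn into fine-grained speech chunks.
    vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, *_ = utils
    audio = read_audio(wav_path, sampling_rate=SR)

    segments = []
    for turn, _, speaker in turns.itertracks(yield_label=True):
        chunk = audio[int(turn.start * SR): int(turn.end * SR)]
        for ts in get_speech_timestamps(chunk, vad_model, sampling_rate=SR):
            segments.append({
                "speaker": speaker,
                "start": turn.start + ts["start"] / SR,
                "end": turn.start + ts["end"] / SR,
            })
    return segments
```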
We use ASR models such as Whisper and Paraformer to generate transcripts with timestamps.
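
For example, with the open-source openai-whisper package (Paraformer via FunASR would be used analogously; the model size here is an assumption):

```python
# Minimal transcription sketch with segment-level timestamps.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("segment.wav", word_timestamps=True)
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")
```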
All prompts used in this stage are defined in `prompts/prompts.py`; a usage sketch follows the list below.
- Global Semantics Extraction: We use `META_PROMPT_ZH` and `META_PROMPT_EN` to extract global information, including the story synopsis, role profiles, and scene boundaries with descriptions.
- Role Mapping: `ROLE_MAPPING_PROMPT_ZH` and `ROLE_MAPPING_PROMPT_EN` are used to assign character roles to individual utterances.
- Acoustic Attribute Extraction: We apply `PROMPT_EMOTION`, `PROMPT_TONE`, and `PROMPT_ACOUSTICS` to extract key acoustic attributes using Qwen2.5-Omni.
- Style Description: `STYLE_PROMPT_ZH` and `STYLE_PROMPT_EN` are used to generate a style description reconciled with transcript, scene, and role information via an LLM.
- Local Prosody Emphasis: Local emphasis patterns are captured using the Wavelet Prosody Toolkit.
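
As an illustration of how these prompts might be wired to an LLM, the sketch below runs the global-semantics step. Only the prompt name comes from `prompts/prompts.py`; the `chat` helper is a hypothetical placeholder, not an API of this repository.

```python
# Hypothetical sketch of applying a stage-2 prompt; `chat` stands in
# for whatever LLM client the pipeline uses.
from prompts.prompts import META_PROMPT_EN

def chat(system_prompt: str, user_content: str) -> str:
    # Placeholder: call your LLM of choice (e.g., an OpenAI-compatible
    # endpoint) with the prompt as system message and the transcript
    # as user content.
    raise NotImplementedError

def extract_global_semantics(transcript: str) -> str:
    # Returns the LLM's synopsis, role profiles, and scene boundaries
    # for the full transcript, as instructed by META_PROMPT_EN.
    return chat(META_PROMPT_EN, transcript)
```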
We apply context-aware resegmentation to create role-consistent and semantically coherent speech segments for controllable TTS training. Within each scene, consecutive segments from the same role are merged (up to 30 seconds). If merged segments exhibit different delivery styles, they are unified into a single descriptor via trend-based summarization.
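
A simplified sketch of the merging rule is shown below. The segment schema and `summarize_styles` are illustrative assumptions; in particular, the trend-based summarization is reduced to a trivial placeholder.

```python
# Context-aware resegmentation sketch: within a scene, merge consecutive
# segments from the same role until a 30-second budget is reached.
MAX_DUR = 30.0

def summarize_styles(styles: list[str]) -> str:
    # Placeholder for trend-based summarization of differing styles.
    return styles[0] if len(set(styles)) == 1 else " -> ".join(styles)

def resegment(scene_segments: list[dict]) -> list[dict]:
    merged: list[dict] = []
    for seg in scene_segments:  # assumed sorted by start time within a scene
        prev = merged[-1] if merged else None
        if (prev is not None
                and seg["role"] == prev["role"]
                and seg["end"] - prev["start"] <= MAX_DUR):
            prev["end"] = seg["end"]
            prev["styles"].append(seg["style"])
        else:
            merged.append({**seg, "styles": [seg["style"]]})
    for seg in merged:
        seg["style"] = summarize_styles(seg.pop("styles"))
    return merged
```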
All experiments are conducted on the StoryTTS dataset, using CosyVoice 2 as the TTS backbone.
Annotations generated by the HASap pipeline are provided in the `annotations` directory.
