Official repository for "HASap: Hierarchical Acoustic-Semantic Annotation Pipeline for Scripted Speech Data" (Accepted to ICASSP 2026).
The workflow comprises three stages: (1) Speech Preprocessing; (2) Hierarchical Annotation; and (3) Context-Aware Resegmentation. Example texts originate in Chinese and are shown here in English translation.
To obtain clean, speaker-consistent speech segments suitable for reliable annotation, we preprocess the raw audio with a pipeline inspired by Emilia-Pipe. It consists of the following steps (an illustrative sketch follows the list):
- Audio Standardization: Convert all audio files to a unified format.
- Source Separation: Remove background music and non-speech components, retaining vocal signals.
- Speaker Diarization: Detect speaker boundaries and ensure each segment contains speech from a single speaker.
- VAD-based Segmentation: Apply voice activity detection to split audio into fine-grained speech segments.
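
The sketch below strings these four steps together. It is a minimal illustration, not the repository's actual implementation: the specific tools shown (ffmpeg, pyannote, Silero VAD) are stand-ins for the Emilia-Pipe-style components, and the source-separation step is reduced to a comment.

```python
# Illustrative preprocessing sketch; tool choices are assumptions,
# not the exact models used by this repository.
import subprocess
import torch
from pyannote.audio import Pipeline

SR = 16000

def standardize(src: str, dst: str) -> str:
    # Audio Standardization: convert to mono WAV at a unified rate via ffmpeg.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(SR), dst], check=True)
    return dst

def preprocess(raw_path: str) -> list[dict]:
    wav_path = standardize(raw_path, "std.wav")

    # Source Separation would run here (e.g., a UVR/Demucs-style model)
    # to keep only the vocal stem; omitted for brevity.

    # Speaker Diarization: one speaker label per turn.
    # (pyannote pretrained pipelines may require a Hugging Face token.)
    diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    turns = diarizer(wav_path)

    # VAD-based Segmentation: split each turn into fine-grained speech chunks.
    vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, *_ = utils
    audio = read_audio(wav_path, sampling_rate=SR)

    segments = []
    for turn, _, speaker in turns.itertracks(yield_label=True):
        chunk = audio[int(turn.start * SR): int(turn.end * SR)]
        for ts in get_speech_timestamps(chunk, vad_model, sampling_rate=SR):
            segments.append({
                "speaker": speaker,
                "start": turn.start + ts["start"] / SR,
                "end": turn.start + ts["end"] / SR,
            })
    return segments
```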
We use ASR models such as Whisper and Paraformer to generate transcripts with timestamps.
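
For example, with the open-source openai-whisper package (Paraformer via FunASR would be used analogously; the model size here is an assumption):

```python
# Minimal transcription sketch with segment-level timestamps.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("segment.wav", word_timestamps=True)
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")
```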
All prompts used in this stage are defined in `prompts/prompts.py`; a usage sketch follows the list below.
- Global Semantics Extraction: We use `META_PROMPT_ZH` and `META_PROMPT_EN` to extract global information, including the story synopsis, role profiles, and scene boundaries with descriptions.
- Role Mapping: `ROLE_MAPPING_PROMPT_ZH` and `ROLE_MAPPING_PROMPT_EN` are used to assign character roles to individual utterances.
- Acoustic Attribute Extraction: We apply `PROMPT_EMOTION`, `PROMPT_TONE`, and `PROMPT_ACOUSTICS` to extract key acoustic attributes using Qwen2.5-Omni.
- Style Description: `STYLE_PROMPT_ZH` and `STYLE_PROMPT_EN` are used to generate a style description reconciled with transcript, scene, and role information via an LLM.
- Local Prosody Emphasis: Local emphasis patterns are captured using the Wavelet Prosody Toolkit.
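
As an illustration of how these prompts might be wired to an LLM, the sketch below runs the global-semantics step. Only the prompt name comes from `prompts/prompts.py`; the `chat` helper is a hypothetical placeholder, not an API of this repository.

```python
# Hypothetical sketch of applying a stage-2 prompt; `chat` stands in
# for whatever LLM client the pipeline uses.
from prompts.prompts import META_PROMPT_EN

def chat(system_prompt: str, user_content: str) -> str:
    # Placeholder: call your LLM of choice (e.g., an OpenAI-compatible
    # endpoint) with the prompt as system message and the transcript
    # as user content.
    raise NotImplementedError

def extract_global_semantics(transcript: str) -> str:
    # Returns the LLM's synopsis, role profiles, and scene boundaries
    # for the full transcript, as instructed by META_PROMPT_EN.
    return chat(META_PROMPT_EN, transcript)
```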
We apply context-aware resegmentation to create role-consistent and semantically coherent speech segments for controllable TTS training. Within each scene, consecutive segments from the same role are merged (up to 30 seconds). If merged segments exhibit different delivery styles, they are unified into a single descriptor via trend-based summarization.
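
A simplified sketch of the merging rule is shown below. The segment schema and `summarize_styles` are illustrative assumptions; in particular, the trend-based summarization is reduced to a trivial placeholder.

```python
# Context-aware resegmentation sketch: within a scene, merge consecutive
# segments from the same role until a 30-second budget is reached.
MAX_DUR = 30.0

def summarize_styles(styles: list[str]) -> str:
    # Placeholder for trend-based summarization of differing styles.
    return styles[0] if len(set(styles)) == 1 else " -> ".join(styles)

def resegment(scene_segments: list[dict]) -> list[dict]:
    merged: list[dict] = []
    for seg in scene_segments:  # assumed sorted by start time within a scene
        prev = merged[-1] if merged else None
        if (prev is not None
                and seg["role"] == prev["role"]
                and seg["end"] - prev["start"] <= MAX_DUR):
            prev["end"] = seg["end"]
            prev["styles"].append(seg["style"])
        else:
            merged.append({**seg, "styles": [seg["style"]]})
    for seg in merged:
        seg["style"] = summarize_styles(seg.pop("styles"))
    return merged
```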
All experiments are conducted on the StoryTTS dataset, using CosyVoice 2 as the TTS backbone.
Annotations generated by the HASap pipeline are provided in the `annotations` directory.
