HASap: Hierarchical Acoustic-Semantic Annotation Pipeline for Scripted Speech Data

Demo Page

Official repository for "HASap: Hierarchical Acoustic-Semantic Annotation Pipeline for Scripted Speech Data" (Accepted to ICASSP 2026).

Pipeline Overview

The workflow comprises three stages: (1) Speech Preprocessing; (2) Hierarchical Annotation; and (3) Context-Aware Resegmentation. The example texts are originally in Chinese and are shown here in English translation.

Speech Preprocessing

To obtain clean, speaker-consistent speech segments suitable for reliable annotation, we preprocess the raw audio with a pipeline inspired by Emilia-Pipe, consisting of the following steps (a minimal sketch follows the list):

  • Audio Standardization: Convert all audio files to a unified format.
  • Source Separation: Remove background music and non-speech components, retaining vocal signals.
  • Speaker Diarization: Detect speaker boundaries and ensure each segment contains speech from a single speaker.
  • VAD-based Segmentation: Apply voice activity detection to split audio into fine-grained speech segments.
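
The repository does not pin specific tools for these steps beyond the Emilia-Pipe inspiration, so the sketch below is only illustrative: it assumes ffmpeg for audio standardization and Silero VAD for the VAD-based segmentation step, with paths and parameters chosen arbitrarily.

```python
# Illustrative sketch only: ffmpeg and Silero VAD are assumptions,
# not necessarily the tools used by the official pipeline.
import subprocess
import torch

def standardize(src: str, dst: str = "standardized.wav") -> str:
    """Convert an arbitrary input file to 16 kHz mono WAV."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst], check=True)
    return dst

def vad_segments(wav_path: str) -> list[dict]:
    """Return speech regions as [{'start': ..., 'end': ...}, ...] in seconds."""
    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, _, _ = utils
    wav = read_audio(wav_path, sampling_rate=16000)
    return get_speech_timestamps(wav, model, sampling_rate=16000, return_seconds=True)

if __name__ == "__main__":
    print(vad_segments(standardize("raw_episode.mp3"))[:3])
```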

We use ASR models such as Whisper and Paraformer to generate transcripts with timestamps.
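
As an illustration of this step, here is a minimal sketch using the openai-whisper package to produce segment-level timestamps; the model size and input path are assumptions, and Paraformer would follow its own API.

```python
# Minimal sketch: timestamped transcription with openai-whisper.
# Model size and file path are illustrative assumptions.
import whisper

model = whisper.load_model("base")
result = model.transcribe("segment_0001.wav", word_timestamps=True)

# Each segment carries start/end times (seconds) and the recognized text.
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}s - {seg['end']:7.2f}s] {seg['text'].strip()}")
```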

Hierarchical Annotation

All prompts used in this stage are defined in prompts/prompts.py.

  • Global Semantics Extraction
    We use META_PROMPT_ZH and META_PROMPT_EN to extract global information, including the story synopsis, role profiles, and scene boundaries with descriptions (a usage sketch follows this list).

  • Role Mapping
    ROLE_MAPPING_PROMPT_ZH and ROLE_MAPPING_PROMPT_EN are used to assign character roles to individual utterances.

  • Acoustic Attribute Extraction
    We apply PROMPT_EMOTION, PROMPT_TONE, and PROMPT_ACOUSTICS to extract key acoustic attributes using Qwen2.5-Omni.

  • Style Description
    STYLE_PROMPT_ZH and STYLE_PROMPT_EN are used to generate, via an LLM, a style description that is reconciled with the transcript, scene, and role information.

  • Local Prosody Emphasis
    Local emphasis patterns are captured using the Wavelet Prosody Toolkit.
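
As an illustration of how these prompts could be wired to an LLM, the sketch below runs the global-semantics step through an OpenAI-compatible chat API; the client, model name, and helper function are assumptions, and only the prompt constant comes from prompts/prompts.py.

```python
# Sketch of the global-semantics step, assuming an OpenAI-compatible chat API.
# Only META_PROMPT_ZH comes from this repository; the client and model are assumptions.
from openai import OpenAI

from prompts.prompts import META_PROMPT_ZH

client = OpenAI()

def extract_global_semantics(script_text: str, model: str = "gpt-4o") -> str:
    """Ask the LLM for the story synopsis, role profiles, and scene boundaries."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": META_PROMPT_ZH},
            {"role": "user", "content": script_text},
        ],
    )
    return response.choices[0].message.content
```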

Context-Aware Resegmentation

We apply context-aware resegmentation to create role-consistent and semantically coherent speech segments for controllable TTS training. Within each scene, consecutive segments from the same role are merged (up to 30 seconds). If merged segments exhibit different delivery styles, they are unified into a single descriptor via trend-based summarization.
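
A minimal sketch of the merging rule is given below; the segment fields and helper names are illustrative assumptions, and the trend-based style summarization is omitted.

```python
# Sketch of the merging rule: within a scene, consecutive segments from the same
# role are merged as long as the merged span stays within 30 seconds.
# Field and function names are illustrative assumptions.
from dataclasses import dataclass

MAX_MERGED_SECONDS = 30.0

@dataclass
class Segment:
    scene_id: int
    role: str
    start: float  # seconds
    end: float    # seconds
    text: str

def merge_segments(segments: list[Segment]) -> list[Segment]:
    merged: list[Segment] = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.scene_id == seg.scene_id
            and prev.role == seg.role
            and (seg.end - prev.start) <= MAX_MERGED_SECONDS
        ):
            # Extend the previous segment instead of starting a new one.
            prev.end = seg.end
            prev.text = f"{prev.text} {seg.text}"
        else:
            merged.append(Segment(**vars(seg)))
    return merged
```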

Experimental Setup

All experiments are conducted on the StoryTTS dataset, using CosyVoice 2 as the TTS backbone.

Annotations

Annotations generated by the HASap pipeline are provided in the annotations directory.
