Voice analysis pipeline: emotion, prosody, environment from audio messages #506

Closed
opened 2026-03-25 14:07:30 +00:00 by Timmy · 1 comment
Owner

Objective

Build a local audio analysis pipeline that extracts rich context from voice messages BEFORE they reach the LLM. Currently Timmy only gets raw text transcription from Whisper. We lose tone, emotion, pace, environment, emphasis — half the conversation.

Architecture

Input: audio file (voice message from Telegram/WhatsApp/etc)
Output: structured context block prepended to the transcript

[VOICE CONTEXT]
energy: low
pace: slow (112 wpm)
tone: reflective
emotion: thoughtful, slightly_discouraged
emphasis_words: seed, tree, sovereignty
environment: quiet_room
pitch_trend: falling (fatigue)
confidence: 0.82
[/VOICE CONTEXT]

I realized I just hold a seed and I'm comparing it...

Components (all must run locally, no cloud)

1. Prosody extraction (librosa)

  • Speaking rate (words per minute from transcript + audio duration)
  • Pitch (F0) contour — rising = question/excitement, falling = fatigue/certainty
  • Energy/loudness RMS — high energy = emphasis, low = quiet/tired
  • Pause detection — long pauses = thinking, hesitation
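The energy and pause parts of this step can be sketched with plain numpy (in the real pipeline, librosa.feature.rms and librosa.pyin would supply the energy and F0 contours; the frame/hop parameters below are illustrative assumptions):

```python
import numpy as np

def prosody_features(samples: np.ndarray, sr: int, n_words: int,
                     frame_len: int = 2048, hop: int = 512,
                     pause_rms_ratio: float = 0.1, min_pause_s: float = 0.5):
    """Frame-level RMS energy, speaking rate, and pause count.

    samples: mono audio in [-1, 1]; n_words comes from the Whisper transcript.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

    duration_s = len(samples) / sr
    wpm = 60.0 * n_words / duration_s if duration_s > 0 else 0.0

    # A frame is "silent" if its RMS is well below the loudest frame;
    # a run of silent frames longer than min_pause_s counts as one pause.
    silent = rms < pause_rms_ratio * rms.max()
    frames_per_pause = int(min_pause_s * sr / hop)
    pauses, run = 0, 0
    for s in silent:
        run = run + 1 if s else 0
        if run == frames_per_pause:  # count each pause once, at the threshold
            pauses += 1

    return {"mean_rms": float(rms.mean()), "wpm": round(wpm, 1), "pauses": pauses}
```

The same frame loop is where per-word energy/pitch statistics for emphasis mapping would be accumulated.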

2. Emotion classification

  • Use emotion2vec or speechbrain/emotion-recognition-wav2vec2-IEMOCAP (both run local)
  • Tags: neutral, happy, sad, angry, frustrated, excited, tired, thoughtful
  • Confidence score
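Whichever model is chosen, its raw class scores still have to become the tag + confidence pair in the context block. A hedged post-processing sketch, assuming the classifier emits one score per tag (the two-tag output like "thoughtful, slightly_discouraged" in the example suggests keeping a close runner-up):

```python
import numpy as np

# Tag vocabulary from the spec; the classifier's own label set may need
# remapping onto these.
TAGS = ["neutral", "happy", "sad", "angry",
        "frustrated", "excited", "tired", "thoughtful"]

def scores_to_emotion(scores, tags=TAGS, second_tag_ratio=0.5):
    """Softmax raw scores, return top tag(s) and a confidence value.

    A second tag is included when its probability is within
    second_tag_ratio of the winner's.
    """
    z = np.asarray(scores, dtype=float)
    p = np.exp(z - z.max())   # stable softmax
    p /= p.sum()
    order = np.argsort(p)[::-1]
    labels = [tags[order[0]]]
    if p[order[1]] >= second_tag_ratio * p[order[0]]:
        labels.append(tags[order[1]])
    return {"emotion": ", ".join(labels),
            "confidence": round(float(p[order[0]]), 2)}
```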

3. Environment detection

  • Background noise classification: quiet_room, car, outdoors, crowd, music
  • SNR estimation
  • Can use a simple audio feature classifier (MFCC + sklearn) or YAMNet
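The SNR estimate can be done without any model at all. A minimal sketch, assuming the quietest frames approximate the noise floor (the MFCC + sklearn or YAMNet classifier handles the scene label separately):

```python
import numpy as np

def estimate_snr_db(samples: np.ndarray, frame_len: int = 2048,
                    noise_quantile: float = 0.1) -> float:
    """Rough SNR: compare mean frame power to the quietest-frame power."""
    n = len(samples) // frame_len
    frames = samples[:n * frame_len].reshape(n, frame_len)
    power = np.mean(frames ** 2, axis=1)
    noise = np.quantile(power, noise_quantile)
    if noise <= 0:
        noise = 1e-10  # silence floor, avoids log(0)
    return float(10 * np.log10(power.mean() / noise))
```

A high value suggests quiet_room; a low value correlates with car/crowd scenes and can gate how much trust the emotion classifier gets.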

4. Emphasis mapping

  • Align transcript words to audio timestamps (Whisper gives word-level timestamps)
  • For each word: energy, pitch deviation from the word-level mean, duration
  • Flag words deviating by more than 1.5 standard deviations as emphasis_words
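The flagging rule above can be sketched directly; the per-word energies and pitches are assumed to come from averaging frame features over each word's Whisper timestamp span:

```python
import numpy as np

def flag_emphasis(words, energies, pitches, threshold: float = 1.5):
    """Return words whose energy or pitch z-score exceeds the threshold."""
    e = np.asarray(energies, float)
    f = np.asarray(pitches, float)
    ez = (e - e.mean()) / (e.std() or 1.0)  # guard: flat signal -> std 0
    fz = (f - f.mean()) / (f.std() or 1.0)
    return [w for w, a, b in zip(words, ez, fz)
            if abs(a) > threshold or abs(b) > threshold]
```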

5. Integration point

  • Script at ~/.timmy/voice/analyze_audio.py
  • Input: audio file path, transcript text
  • Output: JSON with all extracted features
  • Must complete in <3 seconds for typical voice message (10-30s audio)
  • The Hermes STT pipeline should call this after Whisper and prepend the context
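The glue between the analyzer's JSON output and the prepended block is straightforward; a sketch matching the example format above (key names and the comma-joined list are taken from that example):

```python
def render_voice_context(features: dict) -> str:
    """Render analyzer output as the [VOICE CONTEXT] block."""
    lines = ["[VOICE CONTEXT]"]
    for key, value in features.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(map(str, value))  # e.g. emphasis_words
        lines.append(f"{key}: {value}")
    lines.append("[/VOICE CONTEXT]")
    return "\n".join(lines)

def prepend_context(features: dict, transcript: str) -> str:
    return render_voice_context(features) + "\n\n" + transcript
```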

Dependencies

  • librosa (audio features)
  • speechbrain OR emotion2vec (emotion classification)
  • numpy, scipy (already installed)
  • Whisper (already installed for STT)

Verification

  • python3 ~/.timmy/voice/analyze_audio.py test.wav produces structured JSON
  • Processing time <3s for 30s audio
  • Emotion classification matches obvious test cases (angry voice → angry, quiet flat voice → tired/neutral)
  • Integration: Telegram voice message arrives with [VOICE CONTEXT] block prepended

Constraints

  • ALL LOCAL. No cloud APIs. This is sovereign audio processing.
  • Must not block the conversation — if analysis takes too long, send transcript immediately, append analysis after
  • Keep total model size under 500MB
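The non-blocking constraint can be satisfied with a timeout-and-follow-up pattern; a sketch where analyze and send_message are hypothetical stand-ins for the real pipeline hooks:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def deliver(transcript, analyze, send_message, budget_s=3.0):
    """Send context + transcript if analysis fits the budget; otherwise
    send the transcript immediately and append the analysis when ready."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(analyze)
        try:
            context = future.result(timeout=budget_s)
            send_message(context + "\n\n" + transcript)
        except FutureTimeout:
            send_message(transcript)       # don't keep the user waiting
            send_message(future.result())  # follow up with the context block
```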
claude was assigned by Timmy 2026-03-25 14:07:30 +00:00
Member

Closed per direction shift (#542). Reason: Custom voice analysis pipeline — build trap, use existing STT/emotion MCP tools.

The Nexus has three jobs: Heartbeat, Harness, Portal Interface. This issue doesn't serve any of them.

perplexity added the deprioritized label 2026-03-25 23:30:09 +00:00
Reference: Timmy_Foundation/the-nexus#506