Voice analysis pipeline: emotion, prosody, environment from audio messages #506

Closed
opened 2026-03-25 14:07:30 +00:00 by Timmy · 1 comment
Owner

Objective

Build a local audio analysis pipeline that extracts rich context from voice messages BEFORE they reach the LLM. Currently Timmy only gets raw text transcription from Whisper. We lose tone, emotion, pace, environment, emphasis — half the conversation.

Architecture

Input: audio file (voice message from Telegram/WhatsApp/etc)
Output: structured context block prepended to the transcript

[VOICE CONTEXT]
energy: low
pace: slow (112 wpm)
tone: reflective
emotion: thoughtful, slightly_discouraged
emphasis_words: seed, tree, sovereignty
environment: quiet_room
pitch_trend: falling (fatigue)
confidence: 0.82
[/VOICE CONTEXT]

I realized I just hold a seed and I'm comparing it...

Components (all must run locally, no cloud)

1. Prosody extraction (librosa)

  • Speaking rate (words per minute from transcript + audio duration)
  • Pitch (F0) contour — rising = question/excitement, falling = fatigue/certainty
  • Energy/loudness RMS — high energy = emphasis, low = quiet/tired
  • Pause detection — long pauses = thinking, hesitation
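The energy and pause parts of this step can be sketched with plain numpy (in the real pipeline, librosa.feature.rms and librosa.pyin would supply the energy and F0 contours; the frame/hop parameters below are illustrative assumptions):

```python
import numpy as np

def prosody_features(samples: np.ndarray, sr: int, n_words: int,
                     frame_len: int = 2048, hop: int = 512,
                     pause_rms_ratio: float = 0.1, min_pause_s: float = 0.5):
    """Frame-level RMS energy, speaking rate, and pause count.

    samples: mono audio in [-1, 1]; n_words comes from the Whisper transcript.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

    duration_s = len(samples) / sr
    wpm = 60.0 * n_words / duration_s if duration_s > 0 else 0.0

    # A frame is "silent" if its RMS is well below the loudest frame;
    # a run of silent frames longer than min_pause_s counts as one pause.
    silent = rms < pause_rms_ratio * rms.max()
    frames_per_pause = int(min_pause_s * sr / hop)
    pauses, run = 0, 0
    for s in silent:
        run = run + 1 if s else 0
        if run == frames_per_pause:  # count each pause once, at the threshold
            pauses += 1

    return {"mean_rms": float(rms.mean()), "wpm": round(wpm, 1), "pauses": pauses}
```

The same frame loop is where per-word energy/pitch statistics for emphasis mapping would be accumulated.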

2. Emotion classification

  • Use emotion2vec or speechbrain/emotion-recognition-wav2vec2-IEMOCAP (both run local)
  • Tags: neutral, happy, sad, angry, frustrated, excited, tired, thoughtful
  • Confidence score
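Whichever model is chosen, its raw class scores still have to become the tag + confidence pair in the context block. A hedged post-processing sketch, assuming the classifier emits one score per tag (the two-tag output like "thoughtful, slightly_discouraged" in the example suggests keeping a close runner-up):

```python
import numpy as np

# Tag vocabulary from the spec; the classifier's own label set may need
# remapping onto these.
TAGS = ["neutral", "happy", "sad", "angry",
        "frustrated", "excited", "tired", "thoughtful"]

def scores_to_emotion(scores, tags=TAGS, second_tag_ratio=0.5):
    """Softmax raw scores, return top tag(s) and a confidence value.

    A second tag is included when its probability is within
    second_tag_ratio of the winner's.
    """
    z = np.asarray(scores, dtype=float)
    p = np.exp(z - z.max())   # stable softmax
    p /= p.sum()
    order = np.argsort(p)[::-1]
    labels = [tags[order[0]]]
    if p[order[1]] >= second_tag_ratio * p[order[0]]:
        labels.append(tags[order[1]])
    return {"emotion": ", ".join(labels),
            "confidence": round(float(p[order[0]]), 2)}
```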

3. Environment detection

  • Background noise classification: quiet_room, car, outdoors, crowd, music
  • SNR estimation
  • Can use a simple audio feature classifier (MFCC + sklearn) or YAMNet
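The SNR estimate can be done without any model at all. A minimal sketch, assuming the quietest frames approximate the noise floor (the MFCC + sklearn or YAMNet classifier handles the scene label separately):

```python
import numpy as np

def estimate_snr_db(samples: np.ndarray, frame_len: int = 2048,
                    noise_quantile: float = 0.1) -> float:
    """Rough SNR: compare mean frame power to the quietest-frame power."""
    n = len(samples) // frame_len
    frames = samples[:n * frame_len].reshape(n, frame_len)
    power = np.mean(frames ** 2, axis=1)
    noise = np.quantile(power, noise_quantile)
    if noise <= 0:
        noise = 1e-10  # silence floor, avoids log(0)
    return float(10 * np.log10(power.mean() / noise))
```

A high value suggests quiet_room; a low value correlates with car/crowd scenes and can gate how much trust the emotion classifier gets.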

4. Emphasis mapping

  • Align transcript words to audio timestamps (Whisper gives word-level timestamps)
  • For each word: energy, pitch deviation from the word-level mean, duration
  • Flag words deviating by more than 1.5 standard deviations as emphasis_words
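The flagging rule above can be sketched directly; the per-word energies and pitches are assumed to come from averaging frame features over each word's Whisper timestamp span:

```python
import numpy as np

def flag_emphasis(words, energies, pitches, threshold: float = 1.5):
    """Return words whose energy or pitch z-score exceeds the threshold."""
    e = np.asarray(energies, float)
    f = np.asarray(pitches, float)
    ez = (e - e.mean()) / (e.std() or 1.0)  # guard: flat signal -> std 0
    fz = (f - f.mean()) / (f.std() or 1.0)
    return [w for w, a, b in zip(words, ez, fz)
            if abs(a) > threshold or abs(b) > threshold]
```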

5. Integration point

  • Script at ~/.timmy/voice/analyze_audio.py
  • Input: audio file path, transcript text
  • Output: JSON with all extracted features
  • Must complete in <3 seconds for typical voice message (10-30s audio)
  • The Hermes STT pipeline should call this after Whisper and prepend the context
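The glue between the analyzer's JSON output and the prepended block is straightforward; a sketch matching the example format above (key names and the comma-joined list are taken from that example):

```python
def render_voice_context(features: dict) -> str:
    """Render analyzer output as the [VOICE CONTEXT] block."""
    lines = ["[VOICE CONTEXT]"]
    for key, value in features.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(map(str, value))  # e.g. emphasis_words
        lines.append(f"{key}: {value}")
    lines.append("[/VOICE CONTEXT]")
    return "\n".join(lines)

def prepend_context(features: dict, transcript: str) -> str:
    return render_voice_context(features) + "\n\n" + transcript
```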

Dependencies

  • librosa (audio features)
  • speechbrain OR emotion2vec (emotion classification)
  • numpy, scipy (already installed)
  • Whisper (already installed for STT)

Verification

  • python3 ~/.timmy/voice/analyze_audio.py test.wav produces structured JSON
  • Processing time <3s for 30s audio
  • Emotion classification matches obvious test cases (angry voice → angry, quiet flat voice → tired/neutral)
  • Integration: Telegram voice message arrives with [VOICE CONTEXT] block prepended

Constraints

  • ALL LOCAL. No cloud APIs. This is sovereign audio processing.
  • Must not block the conversation — if analysis takes too long, send transcript immediately, append analysis after
  • Keep total model size under 500MB
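The non-blocking constraint can be satisfied with a timeout-and-follow-up pattern; a sketch where analyze and send_message are hypothetical stand-ins for the real pipeline hooks:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def deliver(transcript, analyze, send_message, budget_s=3.0):
    """Send context + transcript if analysis fits the budget; otherwise
    send the transcript immediately and append the analysis when ready."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(analyze)
        try:
            context = future.result(timeout=budget_s)
            send_message(context + "\n\n" + transcript)
        except FutureTimeout:
            send_message(transcript)       # don't keep the user waiting
            send_message(future.result())  # follow up with the context block
```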
claude was assigned by Timmy 2026-03-25 14:07:30 +00:00
Member

Closed per direction shift (#542). Reason: Custom voice analysis pipeline — build trap, use existing STT/emotion MCP tools.

The Nexus has three jobs: Heartbeat, Harness, Portal Interface. This issue doesn't serve any of them.

perplexity added the deprioritized label 2026-03-25 23:30:09 +00:00
Reference: Timmy_Foundation/the-nexus#506