Neural Audio Intelligence
Decoding human emotion from raw voice — one waveform at a time.
01 — Signal Analysis
Before any prediction, the model listens. It processes raw audio at 22,050 Hz, trims silence, and pads to a fixed 3-second window — isolating the emotional signature embedded in pitch, rhythm, and amplitude.
02 — Feature Extraction
The mel spectrogram transforms audio into a 128-band frequency image over time. The CNN branch reads this like a photograph — detecting texture, rhythm, and tonal contours that correlate with specific emotional states.
03 — Architecture
A CNN branch reads the mel spectrogram; an MLP branch reads 645 hand-crafted acoustic features. The two streams fuse to classify emotion in real time.
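The dual-branch design can be sketched in PyTorch. The layer sizes and the eight-class output below are illustrative assumptions; only the two inputs (a 128-band mel spectrogram and 645 acoustic features) come from this page:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch: CNN on the mel spectrogram, MLP on 645 acoustic features,
    concatenated into one classification head."""

    def __init__(self, n_classes: int = 8):   # class count is an assumption
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # -> (batch, 32)
        )
        self.mlp = nn.Sequential(
            nn.Linear(645, 64), nn.ReLU(),                # -> (batch, 64)
        )
        self.head = nn.Linear(32 + 64, n_classes)         # fused classifier

    def forward(self, mel: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.cnn(mel), self.mlp(feats)], dim=1)
        return self.head(fused)

# One forward pass on random inputs: a batch of 2 spectrograms + feature vectors.
logits = FusionNet()(torch.randn(2, 1, 128, 130), torch.randn(2, 645))
```

Fusing after each branch has summarized its input lets the spectrogram "image" and the scalar features contribute on equal footing.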
04 — Emotion Classes
05 — Live Inference
Upload an audio clip or record 3 seconds of your voice. The model processes it in real time and returns an emotion prediction with confidence scores.
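The per-emotion confidence scores are typically a softmax over the model's raw outputs. A minimal sketch; the label set and logit values here are hypothetical, since the page does not list its classes:

```python
import numpy as np

def confidences(logits: np.ndarray, labels: list[str]) -> dict[str, float]:
    """Softmax over raw model outputs -> per-emotion confidence scores."""
    z = np.exp(logits - logits.max())    # subtract max for numerical stability
    probs = z / z.sum()
    return dict(zip(labels, probs.round(3)))

# Hypothetical labels and logits for illustration.
scores = confidences(np.array([2.1, 0.3, -1.0]), ["happy", "neutral", "sad"])
```

The scores sum to 1, so the UI can show them directly as percentages.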
Drop your audio file here
supports .wav · .mp3 · .flac · .ogg
Analysing waveform...
Results will appear here after you upload or record audio.