Neural Audio Intelligence
Decoding human emotion from raw voice — one waveform at a time.
01 — Signal Analysis
Before any prediction, the model listens. It processes raw audio at 22,050 Hz, trims silence, and pads to a fixed 3-second window — isolating the emotional signature embedded in pitch, rhythm, and amplitude.
02 — Feature Extraction
The mel spectrogram transforms audio into a 128-band frequency image over time. The CNN branch reads this like a photograph — detecting texture, rhythm, and tonal contours that correlate with specific emotional states.
03 — Architecture
A CNN branch reads the mel spectrogram; an MLP branch reads 645 hand-crafted acoustic features. The two streams fuse to classify emotion in real time.
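The dual-branch design can be sketched in PyTorch. The layer sizes and the eight-class output below are illustrative assumptions; only the two inputs (a 128-band mel spectrogram and 645 acoustic features) come from this page:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch: CNN on the mel spectrogram, MLP on 645 acoustic features,
    concatenated into one classification head."""

    def __init__(self, n_classes: int = 8):   # class count is an assumption
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # -> (batch, 32)
        )
        self.mlp = nn.Sequential(
            nn.Linear(645, 64), nn.ReLU(),                # -> (batch, 64)
        )
        self.head = nn.Linear(32 + 64, n_classes)         # fused classifier

    def forward(self, mel: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.cnn(mel), self.mlp(feats)], dim=1)
        return self.head(fused)

# One forward pass on random inputs: a batch of 2 spectrograms + feature vectors.
logits = FusionNet()(torch.randn(2, 1, 128, 130), torch.randn(2, 645))
```

Fusing after each branch has summarized its input lets the spectrogram "image" and the scalar features contribute on equal footing.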
04 — Emotion Classes
05 — Live Inference
Upload an audio clip or record 3 seconds of your voice. The model processes it in real time and returns an emotion prediction with confidence scores.
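The per-emotion confidence scores are typically a softmax over the model's raw outputs. A minimal sketch; the label set and logit values here are hypothetical, since the page does not list its classes:

```python
import numpy as np

def confidences(logits: np.ndarray, labels: list[str]) -> dict[str, float]:
    """Softmax over raw model outputs -> per-emotion confidence scores."""
    z = np.exp(logits - logits.max())    # subtract max for numerical stability
    probs = z / z.sum()
    return dict(zip(labels, probs.round(3)))

# Hypothetical labels and logits for illustration.
scores = confidences(np.array([2.1, 0.3, -1.0]), ["happy", "neutral", "sad"])
```

The scores sum to 1, so the UI can show them directly as percentages.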
Drop your audio file here
supports .wav · .mp3 · .flac · .ogg
Analysing waveform...
Results will appear here after you upload or record audio.