Natural speech re-synthesis from direct cortical recordings utilizing pre-trained acoustic and linguistic speech generators

Abstract: Reconstructing speech from neural recordings is crucial for understanding speech coding and for developing brain-computer interfaces (BCIs) and neuroprosthetics. However, existing methods trade off acoustic richness (pitch, prosody) against linguistic intelligibility (words, phonemes). To overcome this limitation, we propose a dual-path framework that concurrently decodes acoustic and linguistic representations. The acoustic pathway uses a long short-term memory (LSTM) decoder and a high-fidelity generative adversarial network (HiFi-GAN) to reconstruct spectrotemporal features. The linguistic pathway employs a transformer adaptor and a text-to-speech (TTS) generator to decode word tokens. The two pathways are merged via voice cloning to preserve the speaker's natural voice. Using only 20 minutes of electrocorticography (ECoG) data, our approach synthesizes highly intelligible speech (mean opinion score = 3.956 ± 0.173, word error rate = 18.9% ± 3.3%). Our dual-path framework thus reconstructs natural and intelligible speech from ECoG, resolving the acoustic-linguistic trade-off.
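To make the dual-path data flow concrete, here is a minimal shape-level sketch in numpy. All dimensions are illustrative assumptions, not values from the paper, and simple linear projections stand in for the trained LSTM decoder and transformer adaptor; the HiFi-GAN vocoder, TTS generator, and voice-cloning stage are indicated only as comments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed, not from the paper): 128 ECoG channels,
# 200 time frames, 80 mel bins, 50-word vocabulary.
N_CH, T, N_MEL, VOCAB = 128, 200, 80, 50

ecog = rng.standard_normal((T, N_CH))       # neural features over time

# --- Acoustic pathway: frame-by-frame spectrogram decoding ----------
# The paper uses an LSTM decoder; a single linear projection stands in
# for it here. The resulting mel-spectrogram would be passed to a
# pre-trained HiFi-GAN to obtain a waveform.
W_acoustic = 0.01 * rng.standard_normal((N_CH, N_MEL))
mel = ecog @ W_acoustic                      # (T, N_MEL) mel-spectrogram

# --- Linguistic pathway: word-token decoding ------------------------
# The paper uses a transformer adaptor; mean pooling plus a linear
# classifier stands in for it here. The decoded token sequence would
# drive a TTS generator.
W_ling = 0.01 * rng.standard_normal((N_CH, VOCAB))
token_logits = ecog.mean(axis=0) @ W_ling    # (VOCAB,) word scores
word_id = int(np.argmax(token_logits))       # decoded word token

# Finally, the two synthesized waveforms are merged via voice cloning
# to keep the speaker's natural voice (not sketched here).
print(mel.shape, word_id)
```

The point of the sketch is the split itself: the acoustic branch keeps the full time axis (one mel frame per neural frame), while the linguistic branch collapses time into discrete word tokens before re-synthesis.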
Speech Demo

Here we present examples of natural (ground-truth) and reconstructed speech.

For each of Sample 1, Sample 2, and Sample 3, audio clips are provided for the following conditions (audio players are available on the demo page):

- Ground Truth
- Final Results
- MLP regression
- Acoustic pathway output
- Linguistic pathway output
- Voice cloning without fine-tuning