
Global Data for Voice Agent Meetup Launches | Debut in Singapore


Posted 10 hours ago

As we enter 2026, native voice interaction for Voice Agents is experiencing a full-scale breakout.

OpenAI is reportedly planning to launch its first intelligent hardware product centered on voice interaction, aiming to deliver a more natural and fluid human–machine conversational experience.

Apple is preparing a major overhaul of Siri, fundamentally restructuring both its interaction paradigm and system architecture to enable a more intelligent and efficient voice assistant.

Meanwhile, Meta has recently open-sourced Omnilingual ASR, a large-scale multilingual speech model with extremely broad language coverage, providing foundational support for both research and real-world deployment of multilingual voice AI.

Together, these developments signal a clear trend: after the explosive growth of LLMs, voice-enabled multimodal systems are moving rapidly from experimentation into everyday applications. Human–computer interaction is shifting from click-first interfaces to voice-first experiences. A new era of deep human–machine coupling is approaching.

Connecting the Voice Ecosystem Through Open Source: MagicHub.com Sets Sail Again

Magic Data officially launched its open data community, MagicHub.com, in 2021 with the goal of fostering open research and exploration in voice-centric multimodal AI. Through open collaboration, the community aims to accelerate innovation and adoption across the Voice AI landscape.

With 2026 shaping up to be a pivotal year for Voice Agents, MagicHub.com is embarking on a renewed global journey. To create more opportunities for in-person exchange and deep technical discussion, we initiated the Global Data for Voice Agent Meetup series. Through these events, we hope to connect developers, researchers, practitioners, and enthusiasts who are passionate about Voice Agents, and together push forward their real-world adoption and impact.

Our first stop on this global tour is Singapore, a country where Eastern and Western cultures intersect deeply.

A Fully Booked Cross-Border Gathering: A Cocktail Night That Sparked Meaningful Connections

The Singapore meetup took place at the Dorsett Changi, adjacent to the Singapore Expo. Coinciding with the AAAI conference, the event was initially planned as a small gathering of 20–30 voice AI practitioners based in Singapore. However, interest far exceeded expectations.

Due to overwhelming demand, we expanded capacity to 40 attendees, striving to ensure a comfortable and engaging environment for every participant. For future events, we plan to secure larger venues and additional seating to welcome even more members of the voice technology community. We look forward to meeting more passionate practitioners, exchanging insights, and diving deeper into cutting-edge discussions.

To help participants feel at home while abroad, we prepared a generous selection of drinks and curated cold dishes, using food and conversation as a medium to convey warmth and collegiality within the global voice AI community.

Soft jazz music filled the room as conversations flowed freely. In this relaxed atmosphere, participants naturally opened up—discussing research, industry trends, and hands-on experience. Ideas collided over clinking glasses, and shared understanding emerged through genuine dialogue.

This evening was not just about speech models and algorithms, but also about connection, resonance, and community.

Lightning Talks: High-Impact Insights from the Front Lines of Research

The highlight of the evening was the lightning talk session, where several speakers shared their latest research and practical explorations in speech technology, sparking strong engagement from the audience.

1. Step-Audio-EditX: Making Audio Editing as Flexible as Image Editing

Hu Guoqiang, a graduate student at Nanyang Technological University and research intern at StepFun, opened the session by presenting the latest progress on Step-Audio-EditX, an audio editing model trained with LLM-based reinforcement learning.

The core objective of Step-Audio-EditX is to make audio editing more flexible and fine-grained. Traditionally, audio processing has struggled to support iterative, compositional edits—such as simultaneously modifying prosody, injecting emotion, and preserving speaker identity. These operations often require repeated manual steps with limited controllability. Step-Audio-EditX is designed specifically to address these challenges.

Rather than relying on complex representation-level disentanglement, the model adopts a simpler large-margin, synthetic-data-driven approach.

  1. By contrasting emotional or stylistic reference audio with neutral speech, the team generates synthetic data and constructs training triplets (a minimal data-assembly sketch follows this list).
  2. The model is then trained through supervised fine-tuning, reward model optimization, and reinforcement learning from human feedback (RLHF), enabling precise alignment with diverse editing instructions.
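As a rough illustration of that first step, the sketch below shows one way instruction/source/target triplets might be assembled by pairing neutral and styled recordings of the same utterance. The file-naming convention, instruction template, and data classes are assumptions made for illustration, not the actual Step-Audio-EditX pipeline.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class EditTriplet:
    """One training example: an editing instruction plus source/target audio."""
    instruction: str   # e.g. "Rewrite the speech in a cheerful style."
    source_wav: Path   # neutral rendition of the utterance
    target_wav: Path   # emotional/stylistic rendition of the same utterance

def build_triplets(neutral_dir: Path, styled_dir: Path, style_label: str) -> list[EditTriplet]:
    """Pair neutral and styled recordings of the same utterance ID into triplets.

    Assumes files are named <utterance_id>.wav in both directories; this naming
    scheme is an illustrative convention, not the Step-Audio-EditX data layout.
    """
    triplets = []
    for styled_path in sorted(styled_dir.glob("*.wav")):
        neutral_path = neutral_dir / styled_path.name
        if neutral_path.exists():
            triplets.append(EditTriplet(
                instruction=f"Rewrite the speech in a {style_label} style.",
                source_wav=neutral_path,
                target_wav=styled_path,
            ))
    return triplets
```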

In practice, Step-Audio-EditX demonstrates strong capabilities:

  • Zero-shot voice cloning, reproducing target speaker timbre without prior training.
  • Flexible control over 14 emotion categories and 32 speaking styles.
  • Multilingual support covering Mandarin, Cantonese, and English.
  • Integration of paralinguistic elements such as laughter and sighs for more natural expressiveness.

After multiple rounds of iterative editing, the model shows significant gains in emotional and stylistic accuracy, with paralinguistic rendering comparable to that of proprietary systems.

The model also supports practical utilities such as speech rate adjustment, background noise removal, and automatic trimming of excessive silence. These features make it well-suited for content creation workflows (e.g., podcasts, dubbing, media production) as well as applications like virtual avatar voice customization and pronunciation training for language learners—delivering an image-editing-like experience for audio.
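For reference, the utilities mentioned above correspond to operations that are traditionally performed with explicit DSP calls. The snippet below shows what speech rate adjustment and silence trimming look like with a conventional toolkit (librosa); Step-Audio-EditX instead exposes such edits through natural-language instructions. The file names here are placeholders.

```python
import librosa
import soundfile as sf

# Load a mono clip at its native sampling rate.
y, sr = librosa.load("input.wav", sr=None, mono=True)

# Speed the speech up by 10% without changing pitch.
y_fast = librosa.effects.time_stretch(y, rate=1.1)

# Trim leading/trailing silence below a -40 dB threshold.
y_trimmed, _ = librosa.effects.trim(y_fast, top_db=40)

sf.write("output.wav", y_trimmed, sr)
```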

Hu also outlined the team’s roadmap:

  • Short-term: expand emotion, style, and paralinguistic editing capabilities.
  • Mid-term: support 20+ languages, dialects, and multilingual scenarios.
  • Long-term: enable free-form emotional and stylistic editing directly via text descriptions.

The Step-Audio-EditX project has been fully open-sourced. Code and benchmarks are available at:
https://github.com/stepfun-ai/Step-Audio-EditX

2. Low-Hallucination Speech Enhancement: Phoneme Priors Tackle a Key Industry Pain Point

Next, Rong Xiaobin, a PhD student from Nanjing University, presented his lab’s latest research on speech enhancement.

Speech enhancement aims to extract clean speech from signals corrupted by noise and reverberation, and plays a critical role in communications, human–computer interaction, and downstream ASR systems.

While deep learning-based approaches achieve strong performance in standard conditions, they still struggle under low signal-to-noise ratios. Discriminative methods can suppress noise effectively but often distort speech components, degrading perceptual quality and intelligibility and, in turn, harming ASR accuracy. Generative approaches can produce high-quality enhanced speech, but frequently suffer from hallucination, introducing speaker-identity drift or semantic inconsistencies.

To address hallucination in generative speech enhancement, the team proposed leveraging phoneme priors extracted from the self-supervised speech foundation model WavLM. These priors constrain the phoneme representation space during enhancement, ensuring that generated phoneme sequences remain linguistically “well-formed.”
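A minimal sketch of one way such a prior can be imposed is shown below, assuming the enhancement network is a PyTorch module and using a frozen WavLM checkpoint from Hugging Face transformers as the feature extractor. The L1 feature-matching loss and the choice of checkpoint are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F
from transformers import WavLMModel

# Frozen self-supervised model that supplies the phoneme-level prior.
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
wavlm.eval()
for p in wavlm.parameters():
    p.requires_grad_(False)

def phoneme_prior_loss(enhanced_wav: torch.Tensor, clean_wav: torch.Tensor) -> torch.Tensor:
    """Penalize drift between WavLM representations of enhanced and clean speech.

    Both tensors are (batch, samples) waveforms at 16 kHz. Matching frame-level
    WavLM features is one illustrative way to keep the generated phoneme
    sequence linguistically consistent; the paper's exact objective may differ.
    """
    with torch.no_grad():
        target_feats = wavlm(clean_wav).last_hidden_state   # (B, T, D), no gradients
    enhanced_feats = wavlm(enhanced_wav).last_hidden_state  # gradients flow back to the enhancer
    return F.l1_loss(enhanced_feats, target_feats)
```

In training, a term like this would typically be added to the main enhancement loss with a small weight, so the phoneme prior regularizes the generator rather than dominating the objective.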

Ablation studies confirmed the critical role of phoneme priors in suppressing semantic hallucinations. Further analysis revealed an important insight: the effectiveness of the phoneme prior does not primarily stem from large-scale pretraining data, but from the masked prediction pretraining paradigm itself.

This finding suggests that even under limited data conditions, effective phoneme priors can be constructed to achieve low-hallucination speech enhancement—demonstrating strong practical value for real-world systems.
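To make the masked prediction pretraining paradigm concrete, the toy objective below hides random frames and asks the model to predict their discrete pseudo-labels from context, in the spirit of HuBERT/WavLM-style pretraining. The GRU encoder, mask ratio, and target vocabulary are simplifications chosen for brevity, not the actual WavLM configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedPredictionSketch(nn.Module):
    """Toy masked-prediction objective: hide random frames, predict their discrete targets."""

    def __init__(self, feat_dim=80, hidden=256, num_targets=100, mask_prob=0.3):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # stand-in for a Transformer
        self.head = nn.Linear(hidden, num_targets)
        self.mask_embedding = nn.Parameter(torch.zeros(feat_dim))
        self.mask_prob = mask_prob

    def forward(self, frames: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim); targets: (B, T) discrete pseudo-labels (e.g. k-means cluster IDs)
        mask = torch.rand(frames.shape[:2], device=frames.device) < self.mask_prob
        masked = torch.where(mask.unsqueeze(-1), self.mask_embedding, frames)
        hidden, _ = self.encoder(masked)
        logits = self.head(hidden)
        # The loss is computed only on masked positions, which forces the model
        # to infer phonetic content from the surrounding context.
        return F.cross_entropy(logits[mask], targets[mask])
```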

Multidimensional Exploration: Deployment and Extension of Voice Technology

In addition to algorithmic advances, several academic experts offered broader perspectives on the application and extension of speech technology:

  • Prof. Lei Shi (Communication University of China) discussed cross-media semantic learning and its applications in speech-related tasks.
  • Prof. Li Liu (Shandong Normal University) explored cross-modal retrieval and generation techniques and their potential in voice interaction scenarios.
  • Prof. Jia Li (Hefei University of Technology) presented real-world deployments of emotionally interactive robots in smart eldercare settings, highlighting the human-centered value of voice technology.

Additional speakers also shared their research insights, keeping the exchange dynamic and thought-provoking throughout the evening.

The success of the Singapore stop marked a strong opening for Magic Data's Global Data for Voice Agent Meetup series. Moving forward, we will continue traveling across regions to build bridges within the voice technology community, enabling practitioners worldwide to exchange ideas and share outcomes.

We look forward to meeting more like-minded collaborators on this journey, jointly accelerating the deployment of Voice Agent technologies, and welcoming an era where voice interaction becomes the dominant interface in deeply coupled human–machine systems.
