"Hello, Little Melon, please play a variation of Mozart's Little Star"
"Hi, Mia, please set the air conditioner temperature to 23 degrees Celsius"
"Xiao Ai, please turn on the sweeping robot"
"Small degree, please turn on the water heater"
In the past, household chores were delegated to servants; today, smart home devices have taken over that role, and they put everyone in charge of their own home.
Smart home scenarios cover smart curtains, refrigerators, washing machines, robot vacuums, range hoods, video surveillance, speakers, air conditioners, and other devices whose manual controls can be replaced by voice control. A smart home takes the residence as its platform and builds a home ecosystem on Internet of Things technology, software systems, and cloud computing, providing personalized services by collecting and analyzing user behavior data. Smart homes are now entering the era of whole-house intelligence, and conversational AI is how users experience a smart home's voice interaction capabilities.

Conversational speech technology has always been the core technology of the smart home field. In conversational smart home devices, intelligent speech technology mainly supports the interaction between people and devices, and it consists of voiceprint recognition (SR), automatic speech recognition (ASR), and speech synthesis (TTS). In the workflow, the user's voice first passes through SR and ASR, which convert it into a speaker identity and a text transcript; the transcript is then passed to NLU (Natural Language Understanding) to infer the user's intent. The accuracy of speech recognition is therefore crucial, since it bounds the quality of the downstream language understanding. At the same time, speech synthesis shapes how users receive information, so clear, coherent, and natural-sounding synthesis is equally important.
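The SR → ASR → NLU → TTS workflow described above can be sketched as a chain of stages. The sketch below is purely illustrative: every function name and body is a hypothetical placeholder, not the API of any real speech library.

```python
# Illustrative sketch of the SR -> ASR -> NLU -> TTS pipeline.
# All function bodies are hypothetical placeholders, not real models.

def recognize_speaker(audio: bytes) -> str:
    """SR (voiceprint recognition): map the voice to a known family member."""
    return "user_42"  # placeholder identity

def transcribe(audio: bytes) -> str:
    """ASR: convert the waveform into a text transcript."""
    return "set the air conditioner to 23 degrees"  # placeholder transcript

def understand(text: str) -> dict:
    """NLU: extract the intent and its slots from the transcript."""
    return {"intent": "set_temperature", "device": "air_conditioner", "value": 23}

def synthesize(reply: str) -> bytes:
    """TTS: render the system's reply back into speech."""
    return reply.encode("utf-8")  # placeholder waveform

def handle_utterance(audio: bytes) -> bytes:
    speaker = recognize_speaker(audio)
    text = transcribe(audio)
    intent = understand(text)
    reply = f"OK {speaker}, setting {intent['device']} to {intent['value']}."
    return synthesize(reply)
```

The point of the structure is the one made in the text: if `transcribe` produces a wrong transcript, `understand` has no way to recover the user's real intent, so ASR accuracy bounds everything downstream.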
The challenges of conversational speech processing on smart home devices mainly lie in complex home-environment noise, the diversity of dialects and household members' ages, and the difficulty of far-field speech recognition.
Complex Home Environment Noise
A household often has more than two people, so there are many acoustic scenes in which several people chat at once. Waking a smart home device to perform a task then requires the device to pick out the instruction from many overlapping sounds and automatically filter out the other voices and noise. This is the classic "cocktail party problem" in speech recognition: current technology can already recognize a single speaker's words with high accuracy, but when two or more people speak at once, the recognition rate drops sharply. In addition, noises peculiar to the home, such as doors opening, reverberation, air conditioning hum, and pets, further reduce the accuracy of both speaker identification and speech recognition.
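One classic (and deliberately crude) way to suppress stationary background noise like air conditioning hum is spectral subtraction: estimate the noise's average magnitude spectrum and subtract it from each frame of the noisy signal. The NumPy sketch below is illustrative only; the frame size is an arbitrary assumption, and real smart home devices use far stronger, model-based denoisers.

```python
import numpy as np

def spectral_subtraction(noisy, noise_estimate, frame=256):
    """Crude spectral subtraction: subtract the average noise magnitude
    spectrum from each frame, keeping the noisy phase. Illustrative only."""
    # Average magnitude spectrum of the noise-only reference.
    n_usable = len(noise_estimate) // frame * frame
    noise_frames = noise_estimate[:n_usable].reshape(-1, frame)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

    out = np.zeros(len(noisy) // frame * frame)
    for i in range(len(out) // frame):
        seg = noisy[i * frame : (i + 1) * frame]
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # subtract, floor at 0
        out[i * frame : (i + 1) * frame] = np.fft.irfft(
            mag * np.exp(1j * np.angle(spec)), frame
        )
    return out
```

Spectral subtraction helps with steady noise but does nothing for the cocktail party problem itself, since a competing voice has the same spectral character as the target speech; that is why overlapping speakers remain so hard.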
Diversity of Dialects and User Age
Smart home products are promoted to national and even global user markets. Letting every user interact naturally with the device requires solving the speaker- and speech-recognition errors caused by dialects. A household also typically includes children, adults, and the elderly, so the age span is large, and age affects the voice: children and the elderly may not articulate as clearly, which creates further difficulties for the speech recognition of smart home devices.
Far-Field Speech Recognition
Smart home devices sit indoors, and room size or a distant placement introduces serious reverberation and extra noise into the audio the device picks up. Reverberation gives the signal received by the microphone a long decaying tail. The human ear dereverberates automatically, so in a real room speech sounds full to a listener, but reverberation is fatal for speech recognition. One likely reason is that the room impulse response is long, typically 400 ms to 1000 ms, while a speech recognition frame is only about 50 ms; even a neural network model with memory can only bridge so much of that gap, so the recognition rate drops under reverberation.
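The mismatch between the impulse response length and the frame length can be seen directly by modeling the reverberant microphone signal as the dry speech convolved with a room impulse response (RIR). The RIR below is a synthetic stand-in (exponentially decaying noise with an assumed decay constant), not a measured room:

```python
import numpy as np

def apply_reverb(dry, rir):
    """The microphone picks up the dry signal convolved with the room
    impulse response -- the standard model of reverberation."""
    return np.convolve(dry, rir)

sr = 16000
# Hypothetical RIR: decaying white noise with a ~500 ms tail, in the
# 400-1000 ms range mentioned above. One 50 ms frame is only 800 samples.
t = np.arange(int(0.5 * sr)) / sr
rng = np.random.default_rng(0)
rir = rng.standard_normal(len(t)) * np.exp(-t / 0.1)
rir /= np.max(np.abs(rir))

dry = np.zeros(sr)   # one second of silence...
dry[1000] = 1.0      # ...with a single click in it
wet = apply_reverb(dry, rir)

# How much of the click's energy arrives AFTER the 50 ms frame it began in?
tail = wet[1000 + 800:]
frac = np.sum(tail**2) / np.sum(wet**2)
print(f"fraction of energy outside the 50 ms frame: {frac:.2f}")
```

A large fraction of the energy lands outside the frame that contained the original click, which is exactly the smeared context a frame-based recognizer struggles with.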
The problems above are largely due to insufficient data diversity and incomplete scenario coverage. On the one hand, algorithmic enhancement can denoise and dereverberate far-field speech to strengthen far-field pickup. On the other hand, training data should cover a more comprehensive range of scene noise, greater diversity of voice timbre, and more dialect accents. Comprehensive data coverage is the key to solving the problem.
Far-Field Pickup Enhancement
For far-field speech, speech enhancement first removes reverberation, and beamforming combines the channels of a multi-microphone array into a single signal, reducing non-stationary noise interference and room reverberation to improve recognition performance. However, coupling the front-end enhancement algorithm with the back-end recognition algorithm is difficult: optimizing the two modules separately leads to a local optimum, so joint optimization is the trend. Enhancement research still depends on home-environment dialogue data, which is where MagicData's high-quality conversational AI voice data comes in.
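The simplest beamformer of the kind mentioned above is delay-and-sum: each microphone channel is time-shifted so the target speaker's signal lines up across channels, then the channels are averaged. The target adds coherently while uncorrelated noise averages down. The sketch below assumes the per-microphone delays are already known and are whole samples, which sidesteps the direction-of-arrival estimation a real array must do:

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Delay-and-sum beamforming with known integer-sample delays:
    align each channel toward the target, then average. The target
    adds coherently; uncorrelated noise power drops by ~1/num_mics."""
    n = min(len(x) - d for x, d in zip(mic_signals, delays))
    aligned = [x[d : d + n] for x, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)

# Toy demonstration: one 440 Hz source reaching three mics at different
# delays, each mic adding its own independent noise.
rng = np.random.default_rng(0)
src = np.sin(2 * np.pi * 440 * np.arange(2000) / 16000)
delays = [0, 3, 7]
mics = [
    np.concatenate([np.zeros(d), src]) + 0.3 * rng.standard_normal(len(src) + d)
    for d in delays
]
out = delay_and_sum(mics, delays)
```

Real arrays use finer-grained (fractional-delay or frequency-domain) alignment and adaptive weights, but the coherent-sum principle is the same.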
Multiple Scene Data Coverage
Given the complexity and variability of the home environment, data covering more languages and dialects, children's and elderly voices, and real home noise is the key to improving the recognition performance of smart home devices. Recording, labeling, and processing data of this diversity demands substantial manpower, material, and funding, while researchers' time is better spent on algorithms. Handing the collection and annotation of such data to a professional AI data company may be the better choice.
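Alongside recording real scenes, coverage of noisy home conditions is commonly widened synthetically by mixing recorded noise into clean speech at controlled signal-to-noise ratios. A minimal NumPy sketch of that augmentation step (the SNR values chosen would depend on the target deployment):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale a noise recording and add it to clean speech so the result
    has the requested signal-to-noise ratio in dB -- a standard data
    augmentation step for training noise-robust recognizers."""
    if len(noise) < len(speech):  # loop short noise clips to cover the speech
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    # snr_db = 10*log10(p_speech / p_scaled_noise)  =>  solve for the scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Running each clean utterance through several noise types (doors, appliances, pets) at several SNRs multiplies the effective coverage of one recording session, though it still cannot replace genuinely diverse speakers and dialects.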
At present, Magic Data offers a wide variety of large-scale conversational speech datasets, including multilingual, multi-dialect, and children's speech. Examples include:
- MDT-ASR-E075 Indian English Scripted Speech Corpus—Daily Use Sentence
- MDT-ASR-E027 Turkish English Scripted Speech Corpus—Keyword Spotting
- MDT-ASR-G026 Noise Dataset
Deep learning is driven by big data: models are trained on large datasets and generalize the knowledge they extract to similar data. For smart homes to serve users well, the key is the support of conversational data from home scenarios.