Open Dataset | Chinese-Japanese Parallel Corpus for E-Commerce

Posted 2 years ago


Virtual agents are currently among the most popular AI applications. They are widely used in interactive voice response (IVR) systems, entertainment, and social media. So let’s talk about virtual agents today.

Cross-border e-commerce has become an important marketing channel for a growing number of companies in China and beyond. That raises the question of how to communicate with clients from different countries, and in particular how to make virtual agents capable of communicating with clients in their native languages. Automatic machine translation is a natural fit here.

In this regard, we need to consider whether a translation generated by an AI virtual agent conveys the content completely and accurately to the other party. This is a matter of accurate and elegant translation. But how do we judge whether a translation is good, or good but in need of improvement? In practice, the judgment depends on many subjective factors. For example, people from different backgrounds may understand the same translation very differently. This subjectivity can be inferred with a technology called the knowledge graph.

Another interesting point is that translation depends not only on content but also heavily on context. For example, “how are you?” in English can be translated into “你怎么样啊?不错啊?” in Chinese. Yet Chinese listeners may take this translation to mean different things in different contexts, since each interprets it from a subjective standpoint. The answer to the question is another case. You can say “I’m OK”, which makes perfect sense here. But the same answer also fits other questions, where it means the speaker is fine with the matter under discussion. Context therefore really matters to AI virtual agents, especially since AI may not recognize implied context as well as humans do.

We should also mention the importance of translation in conversational settings. Conversations are more complex translation tasks, and factors such as domain and scenario all affect how utterances are translated. For this reason, MagicHub has made a Chinese-Japanese parallel corpus openly available to the public. Developers and researchers interested in building virtual agents for real cross-border e-commerce settings are welcome to experiment with it. We believe a corpus like this can help build more natural conversational virtual agents that communicate better with clients of different ethnicities. We invited professionals to clean and process the data so that more elegant machine translation models can be trained for AI virtual agents. We hope the MagicHub parallel text corpus helps solve more automatic translation tasks. Feel free to contact us for more information.
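For readers who have not worked with a parallel corpus before, here is a minimal sketch of loading aligned sentence pairs. This post does not describe the corpus file format, so the tab-separated layout (Chinese, then Japanese, one pair per line), the sample sentences, and the names `SentencePair` and `read_pairs` are all assumptions made for illustration, not the actual MagicHub format or API.

```python
import csv
import io
from typing import List, NamedTuple, TextIO


class SentencePair(NamedTuple):
    zh: str  # Chinese source sentence
    ja: str  # Japanese target sentence


def read_pairs(handle: TextIO) -> List[SentencePair]:
    """Read tab-separated zh/ja pairs, skipping blank or malformed rows.

    Assumes one pair per line: Chinese, a tab, then Japanese.
    """
    pairs = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # blank line or wrong number of columns
        zh, ja = row[0].strip(), row[1].strip()
        if zh and ja:
            pairs.append(SentencePair(zh, ja))
    return pairs


# Hypothetical sample standing in for a real corpus file.
SAMPLE = (
    "这件衣服有其他颜色吗?\tこの服は他の色がありますか?\n"
    "\n"
    "什么时候发货?\tいつ発送されますか?\n"
    "只有一列\n"
)

pairs = read_pairs(io.StringIO(SAMPLE))
for pair in pairs:
    print(pair.zh, "->", pair.ja)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8", newline="")`, adjusted to whatever layout the downloaded corpus actually uses.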

Related Datasets

NLP_Chinese-Japanese Parallel Corpus - E-Commerce

Datasets Download Rank

ASR-RAMC-BigCCSC: A Chinese Conversational Speech Corpus
Multi-Modal Driver Behaviors Dataset for DMS
ASR-SCKwsptSC: A Scripted Chinese Keyword Spotting Speech Corpus
ASR-SCCantDuSC: A Scripted Chinese Cantonese (Canton) Daily-use Speech Corpus
ASR-SCCantCabSC: A Scripted Chinese Cantonese (Canton) Cabin Speech Corpus
ASR-EgArbCSC: An Egyptian Arabic Conversational Speech Corpus
ASR-CCantCSC: A Chinese Cantonese (Canton) Conversational Speech Corpus
ASR-SpCSC: A Spanish Conversational Speech Corpus
ASR-CStrMAcstCSC: A Chinese Strong Mandarin Accent Conversational Speech Corpus