Natural Sciences - Research article (Open Access)

Surveying word boundary factor in Chinese - Vietnamese statistical machine translation

Phuoc Thanh Tran 1, *
Dien Dinh 2
  1. Ton Duc Thang University
  2. University of Science, VNU-HCM
Correspondence to: Phuoc Thanh Tran, Ton Duc Thang University. Email: pvphuc@vnuhcm.edu.vn.
Volume & Issue: Vol. 18 No. 2 (2015) | Page No.: 70-78 | DOI: 10.32508/stdj.v18i2.1133
Published: 2015-06-30


Copyright The Author(s) 2023. This article is published with open access by Vietnam National University, Ho Chi Minh city, Vietnam. This article is distributed under the terms of the Creative Commons Attribution License (CC-BY 4.0) which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited. 

Abstract

In isolating languages such as Chinese and Vietnamese, word boundaries are not marked by spaces, and a word can consist of one or more syllables ("spelling words"). Whether to segment words before training and translation is therefore a question that needs to be considered. In this paper, we survey the effect of the word boundary factor on the translation results of Chinese-Vietnamese statistical machine translation (SMT). The experimental results will serve as a basis for improving word segmentation in future research, with the aim of increasing machine translation performance. We conducted two experiments, word segmentation (WS) and word un-segmentation (WUS), on corpora of 8,000 and 12,000 sentence pairs. Based on the experimental results, we found that the WS corpus and the WUS corpus each have their own advantages and drawbacks. We propose integrating the advantages of the two methods in SMT.
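
To make the WS/WUS distinction concrete, the following is a minimal sketch of the two views of the same Chinese input. It is an illustration only, not the preprocessing used in the paper: the function names, the toy lexicon, and the choice of greedy forward maximum matching as the segmenter are all assumptions introduced here.

```python
def unsegmented_tokens(sentence: str) -> list[str]:
    """WUS view (assumed): every non-space character is its own token."""
    return [ch for ch in sentence if not ch.isspace()]


def segmented_tokens(sentence: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """WS view (assumed): greedy forward maximum matching against a lexicon."""
    text = "".join(sentence.split())  # drop any whitespace first
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches,
        # so the loop is guaranteed to advance.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would use a full dictionary
# or a trained segmenter.
toy_lexicon = {"统计", "机器", "翻译"}
sentence = "统计机器翻译"  # "statistical machine translation"

wus = unsegmented_tokens(sentence)
ws = segmented_tokens(sentence, toy_lexicon)
```

Under the WUS view, the SMT system aligns six single-character tokens; under the WS view, it aligns three word tokens. The trade-off surveyed in the abstract is between these two granularities.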
