Abstract
In isolating languages such as Chinese and Vietnamese, words are not separated by spaces, and a word can consist of one or more syllables. Whether to segment words before training and translation is a question that needs to be considered. In this paper, we survey the effect of word boundaries on the translation results of Chinese-Vietnamese statistical machine translation (SMT). The experimental results will serve as a basis for improving word segmentation in future research, with the aim of increasing machine translation performance. We conducted two experiments, word segmentation (WS) and word un-segmentation (WUS), on corpora of 8,000 and 12,000 sentence pairs. Based on the experimental results, we found that the WS and WUS corpora each have their own advantages and drawbacks. We propose integrating the advantages of the two methods in SMT.
Issue: Vol 18 No 2 (2015)
Page No.: 70-78
Published: Jun 30, 2015
Section: Natural Sciences - Research article
DOI: https://doi.org/10.32508/stdj.v18i2.1133