Surveying word boundary factor in Chinese - Vietnamese statistical machine translation

Phuoc Thanh Tran; Dien Dinh

doi:10.32508/stdj.v18i2.1133

Natural Sciences - Research article

Phuoc Thanh Tran ¹,*

Dien Dinh ²

¹ Ton Duc Thang University
² University of Science, VNU-HCM

Correspondence to: Phuoc Thanh Tran, Ton Duc Thang University. Email: pvphuc@vnuhcm.edu.vn.

Volume & Issue: Vol. 18 No. 2 (2015) | Page No.: 70-78 | DOI: 10.32508/stdj.v18i2.1133

Published: 2015-06-30

Abstract

In isolating languages such as Chinese and Vietnamese, words are not delimited by spaces, and a word can consist of one or more syllables. Whether to segment words before training and translation is therefore a question that needs to be considered. In this paper, we survey the effect of the word boundary factor on the translation results of Chinese-Vietnamese statistical machine translation (SMT). The experimental results will serve as a basis for improving word segmentation in future research, with the aim of increasing machine translation performance. We conducted two experiments, one with word segmentation (WS) and one without word segmentation (WUS), on corpora of 8,000 and 12,000 sentence pairs. Based on the experimental results, we found that both the WS corpus and the WUS corpus have their own advantages and drawbacks. We propose integrating the advantages of the two methods in SMT.

VNUHCM Journal of Science and Technology Development
