Improving Vietnamese Fake News Detection based on Contextual Language Model and Handcrafted Features

Introduction: In recent years, the rise of social networks in Vietnam has resulted in an abundance of information. However, it has also made it easier for people to spread fake news, which has done a great disservice to society. It is therefore crucial to verify the reliability of news. This paper presents a hybrid approach that uses a pretrained language model called vELECTRA along with handcrafted features to identify reliable information on Vietnamese social network sites. Methods: We employed two primary approaches: 1) fine-tuning the model on textual data only, and 2) combining additional meta-data with the text to create an input representation for the model. Results: Our approach performs slightly better than other fine-tuned BERT methods and achieves state-of-the-art results on the ReINTEL dataset published by VLSP in 2020, with an AUC score of 0.9575. We used transfer learning and deep learning approaches to detect fake news in the Vietnamese language using meta features. Conclusion: Based on the results and analysis, it can be inferred that the number of reactions a post receives and the timing of the event described in the post are indicative of the news' credibility. Furthermore, we found that BERT can encode numerical values that have been converted into text.


INTRODUCTION
In today's world, the internet carries an overwhelming amount of information, and social networking sites such as Facebook and Twitter are surging in popularity and becoming more accessible to the general public. However, some individuals exploit these platforms by spreading unreliable information for personal gain, such as earning more revenue from website clicks or influencing the community's political agenda 1 . Because fake news is sensational by nature, its provocative intent is often difficult to detect, and if social media users do not read it carefully, it can spread rapidly, with dire consequences for its victims, such as tarnished reputations. Major global events are closely associated with fake news, and since the outbreak of the COVID-19 pandemic in 2020, fake news related to the pandemic has been rampant. People living in countries with travel restrictions amid COVID-19 tend to believe rumors circulated within their communities without considering their validity. In their fear of the outbreak, readers tend to trust unconfirmed information about pandemic prevention on social media, which can be reckless 2 . This misinformation can lead to serious misunderstandings about any global event. Therefore, determining the reliability of news has gained remarkable attention as a way to stem the wave of false information. In this study, we focus on classifying trustworthy news written in Vietnamese, which is a low-resource language. We attempt to find the best solution for predicting whether a news item is reliable or unreliable by implementing Transformer-based architectures and conducting several experiments on fine-tuning strategies to improve the model's performance.

RELATED WORKS
Fake news detection can be seen as a text classification task in which the input is a news article labeled either fake or real. Various machine learning and deep learning approaches have been attempted to detect fake news. Agarwal 10 applied PhoBERT embeddings to extract document content and used TF-IDF to encode them into vector representations fused with vectors of meta-data. These representations were used as inputs into tree-based methods, which are Random Forest, CatBoost and LightGBM, to predict label probabilities.

METHODOLOGY
In this section, we present our approach to this problem. We chose SOTA techniques, including the BERT model and the variants described below. We propose a new approach that describes the meta-data provided in the dataset in words to enrich the text information. We noticed that numeric features can contribute many insights to our model and that BERT can encode numbers in medium ranges reasonably well. Therefore, we describe their values as text and combine this description with the original text. Finally, we feed the combined text into our model and apply advanced fine-tuning techniques.

Data processing
Fake news detection can encompass not only the contents of news but also related information. In the ReINTEL dataset, we applied preprocessing techniques to ensure the cleanliness of the data.
• Extract timestamp features: We extracted timestamp features into days, months, years, hours and weekdays to enrich features for the dataset. These features were found to be helpful in the prediction accuracy of the models through data analysis.
• Remove absurd data points: During the data collection process, there might be several outliers. As described in Table 1, the maximum value is greater than one billion, which is implausible.
• Fill missing data: After removing implausible data points, a significant number of missing values were encountered, as shown in Table 2.
We filled missing values with different strategies: the mean and the median value for the three interaction features, namely the number of likes, the number of comments and the number of shares. Furthermore, for a post with a missing DateTime value, we copied the DateTime feature values from the post with the most similar text content.
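The timestamp extraction and the mean/median filling described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that timestamps are Unix seconds interpreted in UTC, that implausible counts are treated as missing rather than dropped, and that the `max_plausible` threshold is a hypothetical cut-off chosen only for the example.

```python
from datetime import datetime, timezone
from statistics import mean, median

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def timestamp_features(unix_seconds):
    """Split a Unix timestamp (assumed UTC) into calendar features."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return {
        "day": dt.day,
        "month": dt.month,
        "year": dt.year,
        "hour": dt.hour,
        "weekday": WEEKDAYS[dt.weekday()],  # Monday is index 0
    }


def impute_counts(values, strategy="median", max_plausible=1_000_000):
    """Fill missing (None) or implausible counts with the mean or median.

    max_plausible is a hypothetical threshold; values above it are
    treated as outliers and replaced like missing values.
    """
    def is_valid(v):
        return v is not None and 0 <= v <= max_plausible

    valid = [v for v in values if is_valid(v)]
    if not valid:
        raise ValueError("no plausible values to estimate the fill value from")
    fill = mean(valid) if strategy == "mean" else median(valid)
    return [v if is_valid(v) else fill for v in values]
```

For a skewed feature such as the number of likes, the median fill is less sensitive to the remaining large values than the mean fill, which is one reason to compare both strategies.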
For the post message, we must pre-process it more carefully than the rest of the features because it is the most crucial feature for the model to understand the content. The more thoroughly we process the text, the more meaningful insights our models can extract, which improves their performance. Thus, we applied the following preprocessing techniques to the text content before loading it into the model:
• HTML tags, redundant characters and stop words were removed.
• PhoBERT 11 requires word segmentation, so we use VnCoreNLP 12 to segment the input.
• We converted emojis and icons into the respective text describing them. For example, "((" is converted into the token "negative".
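The text cleaning steps listed above can be sketched roughly as follows. The stop-word set and the second emoticon entry are hypothetical placeholders, and word segmentation with VnCoreNLP is omitted because it requires an external tool, so this is an illustration of the idea rather than the pipeline actually used.

```python
import re

# Only the "((" -> "negative" pair comes from the text; "))" is a
# hypothetical extra entry added for illustration.
EMOTICON_TO_TEXT = {"((": "negative", "))": "positive"}

# Hypothetical, tiny stop-word sample; a real list would be much longer.
STOP_WORDS = {"là", "và", "thì"}


def clean_post(text):
    """Remove HTML tags, map emoticons to words, drop redundant characters and stop words."""
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML tags
    for emoticon, word in EMOTICON_TO_TEXT.items():
        text = text.replace(emoticon, f" {word} ")  # emoticon -> descriptive token
    text = re.sub(r"([!?.,])\1+", r"\1", text)      # collapse repeated punctuation
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

For example, under these assumptions `clean_post("<b>Đây là tin giả!!!</b> ((")` returns `"Đây tin giả! negative"`.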

Data Analysis
To detect fake news, the content of news is a crucial element in determining its veracity. Therefore, we visualized the news content using word clouds to analyze and compare the differences between real and fake news. Based on Figure 1, we identified keywords related to the COVID-19 pandemic and noted some distinctions between reliable and unreliable news. Specifically, we observed the following:
• Real news content tends to contain many URLs and hashtags, indicating a greater number of reliable and accredited sources.
• Fake news content contains fewer credible sources but more keywords and negative icons. Additionally, since fake news is often fraught with a political purpose, we observed that the words 'USA' and 'China' appeared frequently in the word cloud. These two countries are global superpowers with opposing interests, making them the subject of many baseless speculations.
In addition to the news content, we plot the distribution of the number of reactions for fake and real news in Figure 2, after applying the preprocessing steps to these handcrafted features.
As noted in the previous section, fake news often has highly sensational content, which attracts many people to comment on it and persuades them to share it, because readers are overwhelmed emotionally and do not analyze the content. In contrast, readers tend to press the "Like" button on a post only after reading it carefully and agreeing with its content. Qazvinian et al.  13 also found that common readers tend to express their biases and feelings toward fake news more than toward reliable news. Furthermore, we see that unreliable news often appears from

Pretrained Models
Due to the rapid development of deep learning, a multitude of pretrained language models are now available for Vietnamese, categorized as multilingual and monolingual. Because this dataset is a collection of Vietnamese news and we found that monolingual models outperform multilingual models on this dataset 8 , we decided to use monolingual models for this task: ViBERT, vELECTRA 14 and PhoBERT 11 . They are briefly introduced below:
PhoBERT a : This model is pretrained with a masked language modeling task on 20 GB of Vietnamese data, inherited from the original BERT 15 . For pretraining, PhoBERT follows RoBERTa 16 , which removes the Next Sentence Prediction pretraining task of the original BERT for better performance.
a https://huggingface.co/vinai/phobert-base
Then, the generator produces output tokens at the positions of the masked-out tokens. The discriminator tries to determine whether those tokens have been replaced or come from the original sentence. The experimental results indicated that this pretraining task outperformed the MLM task of BERT.
ViBERT e : This model is pretrained on a 10 GB Vietnamese corpus from preprocessed online newspapers.

Fine-tuning approach
In this section, we present our fine-tuning approaches with a pretrained language model. We recognized that many teams in the 2020 VLSP competition used transfer learning methods, but most of them fine-tuned on textual features only 17 and ignored the other features, although determining whether news is fake should be based not only on the news content but also on several other components. Therefore, we decided to fine-tune the models on text that incorporates the timestamp and reaction features described in words. For example, we wrote sample sentences such as "This post was written on Friday, April 3rd at 8 PM. There are 19447 likes, 32 comments and 50 shares from this post" (in Vietnamese) to describe those features, and then merged them with the original sentences to make new ones. We illustrate this process in Figure 3. Our implementation is based on fine-tuning ideas proposed by 15   fully connected layer which contains a Softmax function to output the probability of each label. In the classification part, we applied several advanced fine-tuning techniques: a warm-up learning rate and gradual unfreezing of layers in each training epoch. According to previous works, lower layers in BERT architectures contain surface features and general information, and the information gradually becomes more granular and captures more complex semantic features in higher layers 19,20 . Therefore, concatenating multiple layers together ensures that our information is both general and granular. In this case, we decided to concatenate the last four layers because this helps the model obtain the best classification performance 18 .
Another factor that we need to consider is the length of the text input. The input of BERT-based architectures has a maximum length of 512 tokens, and we could only set the maximum length to 256 tokens due to computational constraints. Therefore, we applied the "head tail" truncation method, which keeps the first 64 tokens and the last 192 tokens, because the key information usually comes from the beginning and the end of the text.
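To make the idea of describing meta-data in words concrete, the sketch below builds the sample sentence quoted above from a timestamp and three reaction counts. It is an illustrative assumption, not the authors' implementation: the function names are hypothetical, the sentence is produced in English rather than Vietnamese, the year 2020 in the usage note is assumed, and appending the description after the post text (rather than before it) is also an assumption.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 3 -> '3rd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def hour_12(hour):
    """Convert a 24-hour value into a '8 PM'-style string."""
    return f"{hour % 12 or 12} {'AM' if hour < 12 else 'PM'}"


def verbalize_metadata(posted_at, likes, comments, shares):
    """Describe timestamp and reaction counts as plain text."""
    return (
        f"This post was written on {WEEKDAYS[posted_at.weekday()]}, "
        f"{MONTHS[posted_at.month - 1]} {ordinal(posted_at.day)} "
        f"at {hour_12(posted_at.hour)}. "
        f"There are {likes} likes, {comments} comments "
        f"and {shares} shares from this post"
    )


def build_model_input(text, posted_at, likes, comments, shares):
    """Merge the original post text with the verbalized meta-data."""
    return f"{text} {verbalize_metadata(posted_at, likes, comments, shares)}"
```

Calling `verbalize_metadata(datetime(2020, 4, 3, 20), 19447, 32, 50)` reproduces the sample sentence from the text; the merged string can then be tokenized like any other input.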
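The "head tail" truncation can be sketched as below, assuming the input is already a list of tokens or token ids. The function name is hypothetical, and the sketch does not reserve positions for special tokens such as [CLS] and [SEP], which a real pipeline would need to budget for.

```python
def head_tail_truncate(tokens, head=64, tail=192):
    """Keep the first `head` and the last `tail` tokens of a long sequence.

    Sequences no longer than head + tail are returned unchanged.
    """
    tokens = list(tokens)
    if len(tokens) <= head + tail:
        return tokens
    # Slicing from an explicit start index also behaves correctly when tail == 0.
    return tokens[:head] + tokens[len(tokens) - tail:]
```

With the default values, a 1000-token post is reduced to 256 tokens: positions 0 to 63 followed by positions 808 to 999.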
We also intended to combine the image feature with the other features, including text, number of likes, comments and shares, into the model. However, the huge shortage of this information (

Training and testing procedures
In the training procedure, we fine-tune the model by feeding the encoded [CLS] token of the input, using the concatenated representations from the last four hidden layers, to a fully connected layer on top of BERT. During training, we use the binary cross entropy with logits loss (BCEWithLogits) to measure the difference between the model's output logits and the actual labels. In the gradual unfreezing method, we first freeze all pretrained layers and then unfreeze and fine-tune them after a few epochs. This ensures that the pretrained weights are not updated during the first epoch, because the low-level layers have already learned general features; this helps avoid overfitting.
In the testing phase, the saved model outputs the predicted probability of each label. Because the test data is unlabeled, we submit these output probabilities to the competition's system, as required by its submission rules.
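A rough PyTorch sketch of this classification head is given below. It is written under several assumptions: the class and helper names are hypothetical, random tensors stand in for the encoder's per-layer hidden states, the [CLS] vector is taken from position 0 of each of the last four layers, and a single output logit is used with BCEWithLogitsLoss. It is meant to illustrate the technique, not to reproduce the authors' model.

```python
import torch
from torch import nn


class ConcatLastFourHead(nn.Module):
    """Fully connected layer over the [CLS] vectors of the last four layers."""

    def __init__(self, hidden_size):
        super().__init__()
        self.fc = nn.Linear(4 * hidden_size, 1)

    def forward(self, hidden_states):
        # hidden_states: sequence of (batch, seq_len, hidden) tensors, one per layer.
        cls_vectors = [layer[:, 0, :] for layer in hidden_states[-4:]]
        features = torch.cat(cls_vectors, dim=-1)   # (batch, 4 * hidden)
        return self.fc(features).squeeze(-1)        # (batch,) logits


def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module (used for gradual unfreezing)."""
    for param in module.parameters():
        param.requires_grad_(trainable)


# Demo with random tensors standing in for a 12-layer encoder plus embeddings.
torch.manual_seed(0)
batch, seq_len, hidden, n_states = 2, 8, 16, 13
hidden_states = tuple(torch.randn(batch, seq_len, hidden) for _ in range(n_states))

head = ConcatLastFourHead(hidden)
logits = head(hidden_states)
labels = torch.tensor([1.0, 0.0])
loss = nn.BCEWithLogitsLoss()(logits, labels)

# Stand-in for the pretrained encoder: frozen at first, unfrozen in later epochs.
encoder = nn.Linear(hidden, hidden)
set_trainable(encoder, False)
```

With the Hugging Face transformers library, a tuple of per-layer hidden states like the one simulated here is returned when a model is called with `output_hidden_states=True`. In a training loop, `set_trainable(encoder, True)` would be called once the chosen number of warm-up epochs has passed.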

EXPERIMENTS

Experimental configuration
We conducted experiments on Google Colab Pro with a Tesla K80 GPU, 13 GB of RAM, and an Intel Xeon CPU @ 2.2 GHz. We use the PyTorch f library in the Python programming language. We choose a batch size of 32 to save computational cost and AdamW as the optimizer. The learning rate is selected in the range [1e-5, 1e-4]. All of our results are reported on the private test set.
f https://pytorch.org/

Evaluation Metrics
We use the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) for evaluation. This metric measures classification performance across all decision thresholds. The ROC curve plots the true positive rate against the false positive rate at each threshold, and the AUC measures separability: it indicates how well the model can distinguish between the classes.
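As an illustration of what the AUC measures, the sketch below computes it directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name is hypothetical, label 1 is assumed to mark the positive class, and the quadratic pairwise loop is chosen for clarity rather than speed; library implementations are preferable in practice.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75. A score of 0.5 corresponds to random ranking and 1.0 to perfect separation.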

EXPERIMENTAL RESULTS
Our experimental results are reported as AUC scores in Tables 3, 4, 5 and 6, which we describe in more detail below. Table 3 shows the results of applying traditional machine learning models to the meta-data only, to investigate the effectiveness of these features. With the Linear SVM model, we achieved an AUC score of approximately 0.7, which is a good performance for numerical features alone in fake news detection. As shown in Table 4, we trained the models under two settings: on text only and on text combined with meta-data. The results show that the performance of all pretrained models increases significantly when the words describing all meta-data are merged into the input sequence. In particular, vELECTRA achieved the best performance. To make this clearer, we also conducted experiments combining the text with each feature individually to investigate which feature contributes most to the model's performance. Furthermore, because the meta-data values differ depending on the filling value (mean or median), we also observed how the classification performance changes with each filling method. Many of these experimental results surpassed those of the top teams. As shown in Table 5, the model's performance always improved, regardless of which features were merged with the original text and which filling method was used.
Therefore, we have achieved state-of-the-art results on this dataset. We have summarized all previous results in Table 6, taken from the contest's organizers 17 (⋆).

DISCUSSION
Because many missing values appear in this dataset, failing to preprocess them carefully would reduce our model's predictive performance. In the preprocessing steps, we removed the clear outliers, which made the distribution of the data less skewed. Therefore, the mean is a suitable value for imputing the missing data. Poulos & Valle (2018) also noted that mean imputation is important and preferred for quantitative features in supervised learning 21 .
Following the experimental results in Table 3, although the methods applied to meta-data are not as effective as those using textual features, the meta-data carries some predictive information, because their AUC scores are above 0.5, the score of random guessing. Therefore, concatenating these features with the text helps the new text contain more meaningful information. The results shown in Tables 4 and 5 support this: the model always performs better when we incorporate information about the distribution of reactions with the news content. Additionally, we can capture the patterns that (1) fake news usually has more shares or even comments and (2) real news is likely to receive more likes than fake news. These patterns appear not only in this dataset but also in many other datasets, such as the PolitiFact and GossipCop datasets 14 . This is an interesting direction for studying users' behaviors on social media, which can help exploit more information to detect disinformation more effectively.

CONCLUSION
To recapitulate, we have applied effective methods and fine-tuning strategies to detect untrustworthy news: fine-tuning the ViBERT, PhoBERT and vELECTRA models on text combined with numeric features described in words. Our best result belongs to vELECTRA, which achieved an AUC score of 0.9575 on this dataset. Through many experiments, we observed that in addition to the text data, we can leverage other features for classification because they may contain key information that helps the model predict more accurately.
Because identifying the reliability of news is difficult even for humans, it should be determined not only from the news content but also from related knowledge about users' reactions. Thus, in the future, we propose to create a more elaborate dataset that contains the sources of news. Sources and authors with a high percentage of fake news will be penalized strongly by the model. In contrast, news that comes from accredited sources and authors will be more likely to be considered trustworthy. Moreover, we would like the dataset to contain more images, since "a picture is worth a thousand words" and images are key pieces of evidence for verifying news. In addition, fake news exists not only in Vietnamese but also in other languages, and multilingual methods are worth considering for a future with less fake news. Therefore, we plan to improve performance with multimodal and multilingual approaches.