Genome-wide analyses of mitochondrial DNA barcodes of Labeo chrysophekadion in Lower Mekong River basin
- The University of Da Nang-University of Science and Technology
- Nha Trang University
- Nha Trang University, 02 Nguyen Dinh Chieu, Khanh Hoa, Vietnam
- University of Science and Technology, The University of Danang, 54 Nguyen Luong Bang, Da Nang, Vietnam
Abstract
Background: Labeo chrysophekadion is an economically important cyprinid that migrates short distances seasonally between the mainstream and floodplains of the Mekong River. However, wild populations are threatened by fishing pressure and environmental impacts. Genome-wide studies have been widely applied in molecular ecology to inform fisheries management. This study aimed to assemble and annotate the complete mitogenome from whole-genome restriction site-associated DNA sequencing data, and to use the assembled mitogenome to identify aligned mitogenome segments and investigate the population genetics of L. chrysophekadion in the lower Mekong Basin (LMB).
Methods: A total of 255 individuals were collected in the lower Mekong Basin. There were six sites on the Mekong mainstem (Paksan, Pakse-Lao PDR, Ubon Ratchathani-Thailand, Kratié-Cambodia, Dong Thap, An Giang-Vietnam), one site at the confluence of the Mekong and 3S Rivers (Stung Treng-Cambodia), and two sites on tributaries in the LMB: the Khan River, Luang Prabang, Lao PDR, and the Chi River, Roi Et, Thailand. High-quality sequence reads were identified and used to assemble, annotate, and visualize the mitogenome using the Mitoz toolkit. Following the RADbarcoder pipeline, aligned DNA segments were identified and used to estimate the genetic diversity and haplotype network.
Results: The complete mitogenome of L. chrysophekadion was 16,600 base pairs in length and exhibited a high identity of 99.8% to a previously published genome (accession number AP011199) derived from an individual fish in Kandal, Cambodia. This genome comprised 13 protein-coding genes, 22 transfer RNA genes, 2 ribosomal RNA genes, and a control region. When mapping individual sequence reads to the mitogenome, 757 bp were identified as the aligned mitogenome segment data. A total of 49 haplotypes from 247 individuals were detected, with a haplotype diversity of 0.849 (±0.014) and nucleotide diversity of 0.005 (±0.0008). High connectivity was detected among the sample populations, with three dominant common haplotypes shared among 8–9 populations. Additionally, nine haplotypes were shared by at least two populations, while 35 private haplotypes were distributed among all examined populations. This finding provides an overview of population structure, serving as a scientific basis for the conservation and management of aquatic resources.
Conclusion: This study provides information on the mitogenomic characteristics and spatial genetics of L. chrysophekadion in the LMB.
Introduction
The highly biodiverse Mekong River has been increasingly threatened by climate change and anthropogenic activities 1, 2, 3. This river is characterized by fast-flowing rapids, waterfalls, confluences, tributaries, catchments, seasonal floodplains and deep pools4. Along with seasonal flow changes and dynamic hydrological features, 80% of fish species in this area are migratory fish, of which numerous species spawn in the upper reaches5.
Known as the world's second most productive inland fishery, the Mekong River supplies up to 80% of the protein needs for nearly 65 million people living in the lower Mekong Basin (LMB; 6, 7). Although not as widely known as catfish species, the black sharkminnow Labeo chrysophekadion (Bleeker, 1849) (Cypriniformes: Cyprinidae) is an important commercial food fish in Asia. This species is also used in the aquarium trade, and it shows potential for aquaculture development8, 9. Throughout its distribution range from the Mekong and Chao Phraya Basins to the Malay-Indonesian Archipelago8, size variations were recorded: 40 cm in Vietnam, 70 cm in Cambodia and Lao PDR, and up to 90 cm in Thailand10.
Currently, the migration pattern of L. chrysophekadion is still debated: it is unclear whether the species is a short- or long-distance migrant, commonly known as “gray” or “white” fish, respectively 11, 12, 13. Although this species has been reported to be migratory9, information about its migration behavior remains fragmented. Generally, mature adults begin their upstream migration at the end of the rainy season and early dry season (March to August), and spawning occurs at the onset of the rainy season (e.g., June to July in southern Lao PDR) 8. The fry and adults migrate back to floodplains for feeding and return to mainstems from October to December. The migratory routes may vary due to different habitats, such as between permanent and seasonal water bodies of floodplains, waterfalls (e.g., Khone Falls) to the Mekong Delta, and/or subcatchments and small streams 9, 10.
Knowledge gaps exist for most Mekong fish species, including the black sharkminnow, and biological, ecological, and genetic studies are needed. With advancements in next-generation sequencing and bioinformatic tools, we are gradually replacing single-molecule markers with whole-genome sequencing14. Mitochondrial DNA is a subset obtained from whole-genome sequencing and a crucial molecular marker used in evolutionary genetics, molecular ecology, species identification, and conservation biology. This marker is characterized by high mutation rates, the absence of recombination, maternal inheritance, and a rapid evolutionary rate 15. While the complete mitochondrial genome (mitogenome) offers a wealth of information about genetic diversity and evolutionary processes, DNA barcoding provides a valuable tool for identifying fish species in the absence of sufficient morphological data.
Assembly of the complete mitogenome from RAD-seq data is now possible 16, 17, 18. Furthermore, Bird (2021) developed a pipeline to identify aligned mitogenome segments (AMS) from RAD-seq data, which can be applied to investigate the gene flow and population structure of aquatic organisms19, 20, 21.
The lack of information on the genetic diversity and migration routes of exploited species, such as L. chrysophekadion, hinders the development of sustainable management policies, which may exacerbate resource depletion. Currently, there is only one complete mitogenome available, from an individual collected in Kandal, Cambodia22. Multiple distinct populations of L. chrysophekadion are believed to exist in the LMB based on fisheries observation data 8. Moreover, according to Mashyaka and Duong (2021), no significant differences were found between sampling sites from Paksan (Lao PDR) to the Vietnamese Mekong Delta, despite an estimated distance of 1,200 km, as determined by intersimple sequence repeat markers12.
This study aimed to 1) assemble and annotate mitogenomes from the LMB using restriction site-associated DNA sequencing (ezRAD) and 2) identify AMSs to examine the genetic diversity and population structure of L. chrysophekadion in the LMB. These findings provide valuable information for comprehensive research and effective tools for managing single and multiple species in the Mekong River.

Sampling sites (blue dots) of Labeo chrysophekadion in the lower Mekong Basin. The red words represent existing, under construction, or planned hydropower dams (62).

Haplotype network of Labeo chrysophekadion mitogenomes in the lower Mekong Basin. Yellow circles represent eight consensus sequences and one mitogenome retrieved from GenBank. Red dots represent median vectors. The numbers in parentheses indicate the number of base step mutations distinguishing the haplotypes.

Circular map of the mitochondrial genome of Labeo chrysophekadion. Genes encoded on the H-strand and L-strand are shown inside and outside the circular map, respectively. The GC and AT contents are plotted in the dark and light regions in the inner gray circle, respectively.

Graphic view of the AMS data using the NCBI Multiple Sequence Alignment Viewer. Blue signifies noncoding regions, while green and red represent coding regions and their corresponding amino acid sequences, respectively.

TCS haplotype network of Labeo chrysophekadion in the lower Mekong Basin using AMS data. The color represents the current sampling site and previous study site (REF). Each haplotype is represented by a circle in which the circle size is proportional to the haplotype frequency. Mutations between haplotypes are indicated by lines representing mutations from the common haplotype.
Materials and Methods
Fish Sampling
A total of 255 individuals of L. chrysophekadion were field-identified8, 23, 24 and collected at nine locations along the LMB. The sampling strategy included six mainstem sites spread across four countries (Paksan and Pakse, Lao PDR; Ubon Ratchathani, Thailand; Kratié, Cambodia; Dong Thap and An Giang, Vietnam); one site at the confluence of the Mekong and 3S Rivers (Stung Treng, Cambodia); and two LMB tributary sites: the Khan River (Luang Prabang, Lao PDR) and the Chi River (Roi Et, Thailand).
Information on the sampling sites for L. chrysophekadion

| LMB location | Country | Sampling sites (Code) | Geographic coordinates | No. of individuals |
|---|---|---|---|---|
| Mainstem | Lao PDR | Paksan (PA) | 18°23'40.5"N, 103°39'09.1"E | 32 |
| Mainstem | Lao PDR | Pakse (PE) | 15°07'30.0"N, 105°48'47.8"E | 32 |
| Mainstem | Thailand | Ubon Ratchathani (UB) | 15°18'48.2"N, 105°29'52.6"E | 28 |
| Mainstem | Cambodia | Kratié (KT) | 12°49'31"N, 106°01'71.5"E | 27 |
| Mainstem | Vietnam | An Giang (AG) | 10°41'07.1"N, 105°11'59.1"E | 24 |
| Mainstem | Vietnam | Dong Thap (DT) | 10°46'59.5"N, 105°20'49.6"E | 32 |
| Mekong and 3S confluence | Cambodia | Stung Treng (ST) | 13°31'46.7"N, 105°57'06.6"E | 28 |
| Tributary | Lao PDR | Luang Prabang (LP) | 19°53'39.2"N, 102°08'28.6"E | 22 |
| Tributary | Thailand | Roi Et (RE) | 15°57'39.8"N, 103°59'31.5"E | 30 |
| Total | | | | 255 |
Muscle tissue (~ 50 mg) from each individual was preserved in 95% molecular grade ethanol and transported to the Molecular Biology Laboratory at Nha Trang University, Vietnam, for further analysis.
Genomic library preparation and sequencing
DNA extraction was performed from preserved tissue samples using the Wizard® SV Genomic DNA Purification System kit (Promega, USA) following the manufacturer's instructions. A minor modification was made in the elution step; the extracted DNA was eluted three separate times, with 100 µL of AE buffer used each time instead of 250 µL of nuclease-free water. Subsequently, all elutions were subjected to electrophoresis on a 1% agarose gel and quantified using a Qubit 2.0 fluorometer with the dsDNA High Sensitivity kit (Invitrogen).
Selected DNA (100 ng, ≥ 3 ng/µL) from 255 individuals was used for ezRAD library preparation16, 17. The process involved randomly fragmenting the genomic DNA, performing end repair, size selection, A-tailing, ligating with Illumina adapters, and PCR amplification. All libraries were then sent to the Genomics Core Laboratory (Texas A&M University, Corpus Christi, USA) for paired-end 150 bp sequencing on the Illumina HiSeq 4000 platform.
Mitogenome assembly and annotation
The data were analyzed on a server with the following configuration: Intel(R) Xeon(R) Gold 6168 CPU @ 2.40 GHz, 80 CPUs, and 187 GB of RAM. The operating system was Ubuntu 21.10 (x86_64, 64-bit).
The quality of the raw paired-end reads (FASTQ) of the obtained libraries was analyzed and visualized using FastQC25 and MultiQC26, respectively. The mitochondrial genomes were assembled with the MitoZ v3.4 toolkit27. Trimmomatic v0.36 28 was used to remove adapters, restriction site sequences, bases with a Phred quality score below 30, and any reads shorter than 50 base pairs. The quality-filtered reads were then assembled using the de Bruijn graph (DBG) algorithm implemented in Megahit v1.2.9 (quick mode, default k-mer length of 71) 29. The output files (FASTA) contained assembled contigs and/or scaffolds of both the mitochondrial and nuclear genomes. The FindMitoScaf module was then applied in three steps: (1) the assembled sequences were searched against the profile Hidden Markov Model; (2) all sequences falling outside the database were removed; and (3) confidence scores were calculated and ranked for protein-coding genes. GeneWise v2.2 30, MiTFi v1.0 31, and Infernal v1.1.1 32 were used to annotate protein-coding genes (PCGs), transfer RNA (tRNA) genes, and ribosomal RNA (rRNA) genes, respectively.
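The quality-filtering thresholds described above (Phred score of at least 30, read length of at least 50 bp) can be illustrated with a minimal Python sketch. This is not Trimmomatic's algorithm: it assumes Phred+33 encoded quality strings and simply strips low-quality bases from both read ends before applying the length cutoff. The function names are hypothetical and chosen for illustration only.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string, min_quality=30):
    """Strip leading and trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality_string)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], quality_string[start:end]


def filter_reads(reads, min_quality=30, min_length=50):
    """Yield trimmed (sequence, quality) pairs that are at least min_length long."""
    for sequence, quality_string in reads:
        trimmed_seq, trimmed_qual = trim_read(sequence, quality_string, min_quality)
        if len(trimmed_seq) >= min_length:
            yield trimmed_seq, trimmed_qual
```

In this toy form, a 60 bp read with five low-quality leading bases is kept at 55 bp, whereas a read left with only 40 high-quality bases is discarded.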
Consensus genome sequences in FASTA format were aligned to the mitogenome from GenBank (AP011199 33) using pagan2 34. Based on sequence length and percentage mapping, a haplotype network of eight consensus mitogenomes from LMB and the previously published genome was created using POPART v1.7 35. Additionally, Kimura’s two-parameter genetic distance was calculated using BioEdit 7.0.5.3 36 to determine the genetic differences between the LMB consensus sequences.
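For readers unfamiliar with Kimura's two-parameter (K2P) model, the distance between two aligned sequences is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The sketch below is an illustrative Python implementation under the assumption that sites with gaps or ambiguous bases are skipped pairwise; it is not BioEdit's code, and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: skip any site where either base is not A, C, G, or T.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too divergent
    # for the model (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For example, two 10 bp sequences differing by a single transition give P = 0.1 and Q = 0, so d = -(1/2) ln(0.8) ≈ 0.1116.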
The selected complete mitogenome was rearranged using BWA v0.7.17 37 and SAMtools v1.15.1 38. A circular map of the mitogenome was generated using Circos 39. The nucleotide composition of the mitogenome was determined using MEGA X 40. Finally, the complete mitogenome was submitted to GenBank using BankIt (https://submit.ncbi.nlm.nih.gov/about/bankit/).
AMS identification and population genetics
Mitogenome data processing was implemented using the RADbarcoder pipeline 22. All reads from each individual that passed quality trimming were mapped to the current mitogenome of L. chrysophekadion using BWA v0.7.12 37 with the MEM algorithm 34. The unmapped reads were filtered using the ‘stats’ function in SAMtools38. The ‘bam2GENO’ function was used to convert the resulting BAM files to consensus genome sequences in FASTA format, which were then aligned to two mitogenomes (AP011199 33 and OR637878 from the current study) using pagan2 41. The ‘fltrGENOSITES’ function was applied to remove sites with missing, ambiguous, or indel base calls and individual sequences with low percentage sequence mapping (< 50%). The position of the collected AMS (FASTA file) on the mitochondrial genome was determined using the Basic Local Alignment Search Tool (BLAST, http://blast.ncbi.nlm.nih.gov/) and viewed with the Multiple Sequence Alignment Viewer v1.25.0. Then, the FASTA file of the AMS data was converted to a NEXUS file for further analysis using Seaview42.
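The site- and individual-level filtering described above can be sketched in Python as follows. This is not the RADbarcoder ‘fltrGENOSITES’ code: the order of operations (individuals first, then sites) and the use of the fraction of called A/C/G/T bases as a proxy for percentage sequence mapping are assumptions made for illustration, and the function names are hypothetical.

```python
VALID_BASES = set("ACGT")


def called_fraction(sequence):
    """Fraction of positions in an aligned sequence that are unambiguous bases."""
    if not sequence:
        return 0.0
    return sum(base in VALID_BASES for base in sequence.upper()) / len(sequence)


def filter_alignment(alignment, min_called=0.5):
    """Drop poorly covered individuals, then drop incomplete alignment columns.

    alignment: dict mapping individual name -> aligned sequence (equal lengths).
    """
    # Step 1 (assumed order): remove individuals below the 50% threshold.
    kept = {
        name: seq.upper()
        for name, seq in alignment.items()
        if called_fraction(seq) >= min_called
    }
    if not kept:
        return {}
    length = len(next(iter(kept.values())))
    # Step 2: keep only columns where every remaining individual has A/C/G/T,
    # which removes missing (N), ambiguous, and indel (-) sites.
    good_sites = [
        i for i in range(length)
        if all(seq[i] in VALID_BASES for seq in kept.values())
    ]
    return {
        name: "".join(seq[i] for i in good_sites)
        for name, seq in kept.items()
    }
```

The retained columns form a gap-free alignment of equal-length sequences, which is the kind of input that haplotype-based analyses expect.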
To visualize the relationships between individuals and populations of L. chrysophekadion in the LMB, a haplotype network was constructed based on the AMS data using the TCS algorithm43 implemented in POPART v1.735. Genetic diversity indices, including the number of haplotypes (H), number of polymorphic sites (S), haplotype diversity (Hd), and nucleotide diversity (π), were calculated using DnaSP v5 18. The genetic differences (F) between all pairs of sites were also computed using ARLEQUIN v3.544.
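The two main diversity indices can be made concrete with a short Python sketch. It uses the standard definitions: Hd = n/(n-1) × (1 - Σ p_i²), where p_i is the frequency of haplotype i among n sequences, and π as the mean number of pairwise differences per site. This is an illustration only, not DnaSP's implementation; it assumes equal-length, gap-free sequences, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(sequences):
    """Haplotype diversity: Hd = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(sequences)
    if n < 2:
        raise ValueError("at least two sequences are required")
    counts = Counter(sequences)
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def nucleotide_diversity(sequences):
    """Nucleotide diversity: mean pairwise differences per site."""
    n = len(sequences)
    if n < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    total_differences = sum(
        sum(a != b for a, b in zip(seq1, seq2))
        for seq1, seq2 in combinations(sequences, 2)
    )
    n_pairs = n * (n - 1) / 2
    return total_differences / (n_pairs * length)
```

For the toy sample `["AAAA", "AAAA", "AAAT", "AATT"]`, the haplotype frequencies are 0.5, 0.25, and 0.25, so Hd = (4/3) × (1 - 0.375) ≈ 0.833, and the six pairwise comparisons differ at 7 positions in total, so π = 7/(6 × 4) ≈ 0.292.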
Results
Mitogenome structure and composition
In this study, a total of 255 ezRAD libraries of L. chrysophekadion from nine sampling sites across the LMB were sequenced. A total of 1,062,049,264 raw sequence reads (151 bp paired-end), ranging from 1,094 to 37,265,254 reads per individual, were obtained. After adapter trimming and quality filtering, 1,026,721,048 high-quality reads (912–36,260,038 per individual) were retained and used to assemble the mitogenome. With the MitoZ toolkit, 0.06–0.12% of the high-quality reads were successfully mapped to the mitogenome. Due to low percentage sequence mapping (< 50%), eight consensus mitogenomes were removed from the dataset. The remaining 247 consensus mitogenomes varied in length from 8,510 to 16,600 bp, with 20–8,069 bp of gaps and 0–152 missing nucleotides (N). Among these, 46 mitogenomes were longer than 16,000 bp (16,011–16,600 bp) and displayed 96.3–99.8% identity to a previously published genome (AP011199) 33.
Based on its high similarity (99.8%) to the previously published mitogenome, the complete mitogenome of L. chrysophekadion from Ubon Ratchathani, Thailand, was chosen for annotation. This mitogenome is 16,600 bp long, with a GC content of 42.9%. It contains 37 typical mitochondrial genes, comprising 13 protein-coding genes (PCGs), 22 transfer RNA (tRNA) genes, and 2 ribosomal RNA (rRNA) genes, together with a noncoding control region (D-loop). Most of the mitochondrial genes are encoded on the heavy strand (H-strand), while one PCG and eight tRNA genes are encoded on the L-strand (Figure 3).
As shown in
AMS identification and population genetics
Following the RADbarcoder pipeline, the mapping of high-quality reads from 247 individuals to the current mitogenome and alignment to both the current and previous mitogenomes resulted in the identification of 757 bp of aligned mitogenome segment data.
Summary of read count, consensus sequence length, and number of individuals after various steps of mitogenome assembly and RADbarcoder pipeline

| Parameters | No. of reads/Length | No. of individuals |
|---|---|---|
| Raw reads (reads) | 1,094 – 37,265,254 | 255 |
| High-quality reads after trimming (reads) | 912 – 36,260,038 | 255 |
| High-quality reads per individual successfully mapped to mitogenome (%) | 0.06 – 0.12 | 253 |
| Consensus sequence length per individual (bp) | 8,510 – 16,600 | 247 |
| Consensus sequences mapped to reference mitogenome (%) | 51.2 – 99.8 | 247 |
| AMS dataset collected (bp) | 757 | 247 |
A total of 49 haplotypes (19.8%) were identified from 247 individuals of L. chrysophekadion. The number of haplotypes per site ranged from 6/22 (LP, 27.3%) to 12/30 (RE, 40%). The overall haplotype diversity (Hd) was high (mean±SD = 0.849±0.014), ranging from 0.71±0.071 (LP) to 0.871±0.046 (UB). The overall nucleotide diversity (π) was low (0.005±0.0008), with 113 polymorphic sites (S) in total. Site-level values ranged from π = 0.002±0.001 with S = 9–10 (ST and DT) to π = 0.016±0.002 with S = 45 (LP).
Summary statistics of genetic variation in L. chrysophekadion

| Sampling sites (Code) | Nse | H | S | Hd (mean±SD) | π (mean±SD) |
|---|---|---|---|---|---|
| Paksan (PA) | 31 | 10 | 37 | 0.738±0.073 | 0.005±0.002 |
| Pakse (PE) | 30 | 10 | 30 | 0.805±0.058 | 0.005±0.002 |
| Ubon Ratchathani (UB) | 26 | 11 | 19 | 0.871±0.046 | 0.004±0.002 |
| Kratié (KT) | 27 | 10 | 55 | 0.835±0.049 | 0.009±0.003 |
| An Giang (AG) | 24 | 10 | 16 | 0.822±0.061 | 0.004±0.002 |
| Dong Thap (DT) | 29 | 10 | 10 | 0.842±0.049 | 0.002±0.001 |
| Stung Treng (ST) | 28 | 9 | 9 | 0.833±0.051 | 0.002±0.001 |
| Luang Prabang (LP) | 22 | 6 | 45 | 0.71±0.071 | 0.016±0.002 |
| Roi Et (RE) | 30 | 12 | 54 | 0.807±0.06 | 0.007±0.003 |
| Total | 247 | 49 | 113 | 0.849±0.014 | 0.005±0.0008 |
The haplotype network of L. chrysophekadion revealed high connectivity among the nine defined populations in the LMB. Three dominant common haplotypes (H7, H10, and H1; Figure 5) were detected, each shared among 8 or 9 populations. Additionally, haplotype H14 was shared by 5 populations (PA, UB, ST, AG, and DT); haplotype H8 was found in PA, ST, KT, and DT; and haplotype H26 was found in RE, UB, ST, and DT. Furthermore, six haplotypes (H16, H20, H29, H36, and H37) were shared by at least 2 populations. The Luang Prabang (LP) population shared only one common haplotype (H1) and was characterized by several private haplotypes (e.g., H3, carried by 7 individuals) spanning mutation steps. A high number of unique haplotypes (i.e., those found at a single location) were detected (35 out of 49) and were distributed among all examined populations (Figure 5).
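The distinction between shared and private (unique) haplotypes used above can be expressed as a small Python sketch. The input format, a list of (site code, haplotype label) pairs with one pair per individual, and the function name are assumptions made for illustration; the example data are invented and do not reproduce the study's results.

```python
from collections import defaultdict


def classify_haplotypes(samples):
    """Split haplotypes into shared (found at >= 2 sites) and private (1 site).

    samples: iterable of (site_code, haplotype_id) pairs, one per individual.
    Returns (shared, private), where shared maps haplotype -> sorted site list
    and private maps haplotype -> its single site.
    """
    sites_by_haplotype = defaultdict(set)
    for site, haplotype in samples:
        sites_by_haplotype[haplotype].add(site)
    shared = {
        hap: sorted(sites)
        for hap, sites in sites_by_haplotype.items()
        if len(sites) >= 2
    }
    private = {
        hap: next(iter(sites))
        for hap, sites in sites_by_haplotype.items()
        if len(sites) == 1
    }
    return shared, private


# Invented toy data, not the study's haplotype assignments.
toy_samples = [("PA", "Ha"), ("UB", "Ha"), ("LP", "Ha"), ("LP", "Hb"), ("LP", "Hb")]
toy_shared, toy_private = classify_haplotypes(toy_samples)
```

In the toy data, `Ha` occurs at three sites and is therefore shared, while `Hb` occurs only at LP and is private even though two individuals carry it.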
Discussion
The Mekong River is one of the most biodiverse and productive rivers in the world, supporting more than 1000 fish species and sustaining the livelihoods of millions of people45, 4. As in other regions of the world, fisheries resources in the Mekong River have experienced declines due to factors such as exploitation, environmental pollution, urban development, habitat fragmentation, and climate change46, 1. Therefore, understanding the population structure and connectivity of fishes is crucial for mitigating the detrimental effects of these threats and implementing effective multispecies management strategies17.
In recent years, rapid advances in high-throughput sequencing technologies and available bioinformatic tools have facilitated the successful assembly and annotation of growing numbers of mitogenomes, including those of Mekong fish species 47, 48, 49, 50. Despite the importance and wide distribution range of L. chrysophekadion, only one mitogenome has been available, generated from an individual fish in Kandal, Cambodia33. In this study, based on RAD-seq data, an additional mitogenome (16,600 bp) from an individual collected in Ubon Ratchathani, Thailand, was assembled and submitted to GenBank. This genetic information will enrich our understanding of this fish species in the LMB and help fill existing knowledge gaps.
Among the 54 available mitogenomes of species, the sequence length varied from 16,602 bp () to 16,766 bp ()33. Generally, the minor length variations between closely related species are caused by changes in tandem repeats within the control region, the lengths of intergenic regions, and gene overlaps51. In comparison to a previous mitogenome sequence, our study failed to identify two nucleotides (C and A) that were found at the last position of the D-loop. In this case, the cause may be sequence disturbance, rendering it unreliable for detecting these two nucleotides. The circular map also showed that the gene order (13 PCGs, 2 rRNAs, 22 tRNAs and a D-loop) was consistent with that of previously published mitogenomes across various fish species47, 48, 49, 52.
Over the past few decades, numerous mitochondrial DNA datasets (later combined with nuclear markers) have been generated across diverse sets of organisms12, 19, 21, 47, 48, 49. Recently, due to genome-wide analyses, genetic analyses have expanded to include nuclear genomes53. AMS is a new approach that utilizes the power of whole-genome sequencing to subset the mitogenome, allowing comparison with the large dataset of mitogenomes that have been produced. The 757 bp AMS, distributed across most coding and noncoding regions, allowed a complete analysis of genetic diversity and population connectivity and comparison with analyses from numerous mitogenome studies. Overall, high haplotype (Hd) and low nucleotide (π) diversity of L. chrysophekadion were observed based on the categories of genetic diversity suggested by Grant and Bowen (1998) 54. In comparison to another study conducted in the LMB, one migratory catfish, (Siluriformes: Pangasiidae), exhibited similar levels of genetic diversity, with Hd = 0.941 and π = 0.0083 when analyzing the D-loop region. However, lower genetic diversity was observed for the gene, with Hd = 0.381 and π = 0.00063 55.
Low genetic diversity was detected in populations related to the Khan River (LP), confluence site (ST), and delta sites (AG and DT). Interestingly, the RE population, which was separated from the mainstem by the Pak Mun dam, showed high genetic diversity. The haplotype network showed high connectivity among all populations, except for Luang Prabang. L. chrysophekadion is known to migrate upstream in the Mekong River for spawning and downstream for feeding 9; however, its migration distance has not been documented. Hydropower dams can act as barriers that fragment habitats, block access to spawning grounds, and modify habitats both upstream and downstream. These alterations can result in reduced genetic diversity and increased genetic differences among isolated riverine fish populations56. An explanation for the low genetic diversity in the LP population may be the development of hydroelectric dams on the Mekong mainstem and tributaries in Lao PDR (Nam Dong, Nam Khan 2, Nam Khan 3, and Nam Ko), which hinders the population from completing its migration routes.
Thirteen dams have been built in the Sesan and Srepok Rivers in Vietnam (without fish passages), and approximately 7 dams have been proposed to be built in the Sekong River (Lao PDR) (Figure 1). According to the knowledge of fish migration routes in the LMB, the Sesan catchment basin is an important spawning site, as well as a refuge during the dry season 57. A previous study reported a diminishing effective population size, elevated relatedness, and inbreeding of one catfish species, , sampled upstream of dams (Dak Lak) on the Srepok River 58. A low effective population size (Ne) was also recorded in populations in the 3S tributary of two fish species, and .
The Mekong Delta is one of the regions that suffers from severe consequences from climate change and human activities. Research has shown that an important food fish species, , is at risk of difficulty recovering from environmental changes17. A similar situation could occur with , a fish species of high economic value that is currently facing overexploitation.
Based on microsatellite markers, Mashyaka and Duong (2019) reported the population connectivity of L. chrysophekadion between the Mekong Delta (Can Tho, An Giang, and Dong Thap) and Lao PDR (Paksan) and suggested a long migratory distance for this species (approximately 1,200 km) 12. Biological information and studies on the genetic diversity and population structure of Mekong fishes are still limited12, 17, 58, 59, 60, 61, 62, and additional studies are needed to reliably infer the migratory patterns of L. chrysophekadion. Given the ecological diversity and hydrological conditions, multispecies management is essential for the Mekong River. Therefore, more in-depth research on molecular ecology and species biology is needed to develop effective conservation strategies.
Conclusions
Our study assembled and analyzed the mitogenome sequence of L. chrysophekadion, which is 16,600 bp in length. This mitogenome is composed of 37 genes, including 13 PCGs, 22 tRNA genes, and 2 rRNA genes, together with a D-loop region. An AMS dataset consisting of 757 bp was identified from 247 individuals collected from nine sample sites across the LMB. High connectivity was detected among the sample populations, except for Luang Prabang. This finding provides an overview of the population structure, serving as a scientific basis for the conservation and management of aquatic resources.
List of abbreviations used
3S: Sekong, Sesan and Srepok
H: number of haplotypes
Hd: haplotype diversity
LMB: lower Mekong Basin
Ne: Effective population size
Nse: number of individuals used in analyses
PCGs: protein-coding genes
RAD: Restriction site-associated DNA
rRNA: ribosomal RNA
S: number of polymorphic sites
tRNA: transfer RNA
π: nucleotide diversity
Competing interests
The authors declare that they have no competing interests.
Acknowledgments
We would like to thank the Mekong project partners who helped us collect tissues from local fish markets. We also express our gratitude to Prof. Kent E. Carpenter of Old Dominion University (USA) for his invaluable review and linguistic correction of the finished manuscript.
Funding
This project was funded by the United States Agency for International Development supported Partnerships for Enhanced Research Project 6-435 under USAID Cooperative Agreement AID-OAA-A-11-00012. PhD Truong Thi Oanh was funded by Vingroup Joint Stock Company and supported by the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.TS.34, VINIF.2021.TS091, and VINIF.2022.TS.091.