Deception and Continuous Training Approach for Web Attack Detection using Cyber Traps and MLOps

With the growth and expansion of the internet, web attacks have become more powerful and pose a significant threat in the cyber world. In response to this, this paper presents a deceptive approach for gathering malicious behavior to understand the strategies used by web attackers. The harmful requests collected through cyber traps or honeypots are analyzed and used to train machine learning (ML) models for web attack detection. Additionally, we implement an ML operations (MLOps) pipeline to automate the continuous training and deployment of these ML models in defensive systems. This pipeline trains the production model with newly collected data by using predefined triggers. Our experiments on two datasets, including Fwaf and our own, demonstrate that a proactive and continuous approach to tracking adversary behavior can effectively detect zero-day attacks, such as CVE-2022-26134 in web application servers.


INTRODUCTION
Daily life has seen a significant increase in the volume and speed of data since the advancement of information technology. Furthermore, the Internet has transformed traditional routines of daily life, and web applications have become the most widely used applications on the Internet. Web applications serve many different areas and play an important role in people's daily routines, especially as more individuals move their applications and personal information to the cloud. Because of the widespread use of web applications and the significant amount of personal data stored on servers, they have become attractive targets for attacks. A recent report on cybersecurity incidents 1 showed that 75% of attacks are detected in the application layer, with web servers being the main targets of hackers. There are two reasons why adversaries are inclined to break into web servers. First, because large amounts of private data are stored in server databases, attackers can profit significantly by selling those data. Second, the ability to inject malicious code into server source files allows attackers to manipulate users who browse or download these documents. Safeguarding web applications from intrusions is therefore essential. Typically, attacks are detected primarily based on recognizable characteristics. This method is useful for detecting known attack types, but it requires human involvement in collecting and analyzing attack data samples for identification.
Consequently, signature-based web attack detection (WAD) is no longer adequate for detecting attacks with novel exploits. Meanwhile, with the rapid advancement of ML, these algorithms have demonstrated their effectiveness in numerous fields, including web attack detection, and some research has achieved the desired results when incorporating ML into web protection systems. To automatically detect denial-of-service (DoS) attacks, Francisco et al. 2 proposed an ML model that makes inferences based on signatures previously extracted from samples of network traffic. On four modern benchmark datasets, this model achieved an online detection rate (DR) of attacks above 96%. In the same domain of WAD, Liang et al. 3 proposed a deep learning-based approach to classify anomalous requests. Recurrent neural networks (RNNs) were trained in an unsupervised manner, using only normal requests, to learn the patterns of normal traffic. A neural network classifier that takes the outputs of the RNNs as input was then trained in a supervised manner to differentiate between anomalous and normal requests. Similarly, Yunyi et al. 4 suggested a WAD that applied a long short-term memory (LSTM) network to analyze the malicious intentions hidden in user actions; their experiments on the CSIC 2010 dataset achieved an accuracy of 99.87%. However, ML approaches for detecting many attack types require administrators or data scientists to understand and select appropriate representative data for training the ML model. These approaches either use datasets published on the internet or create their own dataset to train the ML model. They are therefore limited in recognizing rare attacks that differ substantially from the attack patterns in existing datasets, and administrators must continuously collect new attack patterns to enrich their datasets. However, updating data with new attack patterns is difficult and consumes considerable time and effort.
In this context, an effective method for creating a training dataset was introduced by Nikola et al. 5 to improve the performance of WAD without producing many false positives. They combined hostile requests collected from cyber traps with good requests from a typical website. By design, a deceptive system exposes an apparently vulnerable service that entices attackers to exploit it, which makes it easy to collect and update datasets with new indicators and thereby overcome the weaknesses of the ML system. Furthermore, to avoid forgetting existing knowledge when training the ML model with new data, three incremental learning approaches were also implemented for WAD and obtained good results during testing. To evaluate the use of deception in the domain of web applications, Xiao et al. 6 implemented a web deception framework that allows deception to be introduced into any web application. In their experiments, the authors showed that over 36% of the attackers who were able to exploit a vulnerability did not set off any of their traps. Their research has shown that while deception is a useful complement to other detection techniques, it is insufficient as a stand-alone protection mechanism. Meanwhile, ML initiatives have produced new difficulties that do not exist in conventional software development; one of these is keeping production deployments current by tracking input data, data versions, tuning parameters, and so on. MLOps has been defined as comprising three parts: ML, Development and Operations (DevOps), and data engineering, with DevOps having the largest influence on MLOps development 7 . To map how MLOps is currently understood and how it compares with and differs from related techniques such as DevOps, the article 8 used meta-analysis, document analysis, and triangulation.
These related studies 8-11 effectively conceptualize MLOps and demonstrate that it lies at the nexus of software engineering, data engineering, DevOps, and ML. With this characteristic, MLOps is a promising means of continuing to harden the robustness of WAD. Moreover, S. Garg et al. 12 discussed open research problems and provided a detailed representation of automation with DevOps in ML-based applications, called MLOps. Their pipeline seeks to achieve the advantages of both contexts by keeping the DevOps pipeline's trademark simplicity while incorporating new circular phases for updating ML. The aim is a self-maintaining ML-based development subsystem that can advance concurrently with software development. The authors of Garg et al. 13 also described the three tiers of MLOps and open-source platforms that help users monitor the workflow of the system, visualize operations, and trace errors when building and operating the ML model. Unlike the above approaches, our work focuses on web attack detection and enables a continuous training strategy based on cyber traps and MLOps. The main contributions of this article are summarized as follows:
• First, an ML-based WAD using a stacking classifier is proposed to harness the capabilities of detecting web attacks with better performance than any single model in the ensemble.
• Second, we integrate an MLOps pipeline to formalize a strategy for automatically retraining the ML models frequently.
• Finally, a honeypot system is deployed to collect attack data to diversify the training dataset. In addition, a Serving Website is built to take on the role of deploying the ML-based WAD to users.
The remainder of this paper is structured as follows.
In Section 2, the methodology and detailed workflow of our proposed system are described. The implementation and experiments are presented in Section 3. Finally, we conclude the paper in Section 4.

DESIGN
This section describes our approach to creating an ML-based WAD that uses cyber traps to proactively collect training datasets for the ML process. We then integrate an MLOps pipeline into our detection system to enhance automation in the deployment and maintenance of ML models.

A. Overall Model
In this part, we introduce a WAD model consisting of Kubeflow, Data Storage, Honeypot, and Serving Website components, as shown in Figure 1. Each of these parts performs a different task and executes on a separate host. The workflow of the MLOps and deception model for WAD can be summarized as follows:
• First, when the workflow executes, the collect-logs step in Kubeflow sends a data flow signal to the Trigger.
• Next, a signal is sent to the honeypot to save all the collected data during the trap's working period on data storage.
• Data saved in Data Storage are preprocessed in the Preprocess block of Kubeflow to prepare for the training step.
• Next, the pipeline flow signal is sent to the train/retrain block.
• Finally, the new ML model is deployed to Data Storage, and a serving flow signal is sent to the Serving Website to update it to the new model.

B. ML-based Web Attack Detection
In this study, our WAD model is built based on the ensemble method to reduce the uncertainty in the generalization performance of using a single algorithm 14 . In the ensemble model, we use 3 well-known ML algorithms in classification, including support vector machine (SVM), logistic regression (LR), and K-nearest neighbors (KNN). In addition, we also built a convolutional neural network (CNN) model to compare the performance of the ensemble method with that of a neural network.

Ensemble Learning (Stacking Classifier) model
Stacking, also known as a voting classifier, is an ensemble learning method. It improves performance by combining different ML classifiers for classification. Majority voting, a popular voting technique, is used in this approach; it has three scenarios: unanimous voting, simple majority voting, and plurality voting. Hard voting usually refers to plurality voting. We assume that the decision of the $t^{th}$ classifier is $d_{t,c} \in \{0, 1\}$, with $t = 1, \ldots, T$ and $c = 1, \ldots, C$, where $T$ is the number of classifiers and $C$ is the number of classes. If the $t^{th}$ classifier chooses class $c$, then $d_{t,c} = 1$, and $d_{t,c} = 0$ otherwise. Plurality voting then selects the class $c^*$ that receives the most votes:

$\sum_{t=1}^{T} d_{t,c^*} = \max_{c=1,\ldots,C} \sum_{t=1}^{T} d_{t,c}$
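As an illustration, the hard-voting (plurality) ensemble described above can be sketched with scikit-learn's VotingClassifier. The toy data and hyperparameters below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the hard-voting ensemble combining SVM, LR, and KNN.
# The synthetic data stands in for the TF-IDF request features.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# voting="hard" implements plurality voting over the base classifiers' labels.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",
)
ensemble.fit(X, y)
preds = ensemble.predict(X[:5])  # one predicted class per request
```

With `voting="hard"`, each base classifier casts one vote per sample and the class with the most votes wins, exactly as in the plurality-voting rule above.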

CNN model
In this study, the architecture of the CNN model contains 2 convolutional blocks, as shown in Figure 2. The input $x$ contains the features indicating whether a request is good or bad. $x$ is passed through the 2 convolutional blocks, denoted $\mathrm{ConvBlock}_1$ and $\mathrm{ConvBlock}_2$, respectively, as in (6):

$h = \mathrm{ConvBlock}_2(\mathrm{ConvBlock}_1(x))$ (6)

Then, the output of the 2nd convolutional block is flattened, as in (7):

$f = \mathrm{Flatten}(h)$ (7)

This result $f$ goes through a fully connected layer, as in (8):

$c = \mathrm{FC}(f)$ (8)

Finally, the Softmax layer transforms the result $c$ into values between 0 and 1 as the probability $\hat{y}$ (9):

$\hat{y} = \mathrm{Softmax}(c)$ (9)

The cross-entropy function is used to optimize the loss between the output vector $\hat{y}$ and the vector $y$ containing the actual label values.
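A Keras sketch of this two-block architecture is given below, assuming 1-D convolutions over a fixed-width feature vector. The 500-feature input width, filter counts, kernel sizes, and pooling are our assumptions for illustration; only the overall structure (two conv blocks, flatten, fully connected layer, two-class softmax) follows the text.

```python
# Sketch of the CNN of Eqs. (6)-(9): two conv blocks -> flatten -> FC -> softmax.
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(filters):
    # One ConvBlock_i: convolution followed by pooling (sizes are assumptions).
    return [layers.Conv1D(filters, 3, padding="same", activation="relu"),
            layers.MaxPooling1D(2)]

model = models.Sequential(
    [layers.Input(shape=(500, 1))]          # x: assumed 500 request features
    + conv_block(32)                        # ConvBlock_1, Eq. (6)
    + conv_block(64)                        # ConvBlock_2, Eq. (6)
    + [layers.Flatten(),                    # f = Flatten(h), Eq. (7)
       layers.Dense(64, activation="relu"),  # fully connected layer, Eq. (8)
       layers.Dense(2, activation="softmax")]  # class probabilities, Eq. (9)
)
# Two-class cross-entropy loss, as described in the text.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

The two-unit softmax head with categorical cross-entropy is the two-class analogue of the binary cross-entropy loss reported in the experimental settings.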

C. MLOps for Web Attack Detection
In this study, we propose a methodology for a WAD-based MLOps pipeline that defines automatic training rules for continuously updating the ML models. The pipeline workflow is shown in Figure 3.

The Log Collection Process
The log collection process in Figure 4 is explained as follows. In its daily operation, the honeypot is exposed on the internet to collect logs. After an activation signal from Kubeflow, the honeypot turns on a firewall that allows only administrator access, and each step of log processing executes sequentially:
• Collect good requests: simulates web page interactions as a normal user to collect request data labeled as normal.
• Collect bad requests: collects the malicious exploit requests launched against the honeypot.
• Parse and push log: processes the collected logs and sends them to Data Storage.
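The parse step can be sketched as below. This is a hypothetical illustration: the combined log format and the `parse_requests` helper are our assumptions, not the paper's code; pushing to Data Storage is omitted.

```python
# Hypothetical sketch of "parse and push log": extract request paths from
# Nginx combined-format access-log lines before pushing them to Data Storage.
import re

LOG_RE = re.compile(r'"(?:GET|POST|PUT|DELETE) (?P<path>\S+) HTTP/[\d.]+"')

def parse_requests(log_lines):
    """Return the request path of every parseable access-log line."""
    parsed = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if m:
            parsed.append(m.group("path"))
    return parsed

sample = ['1.2.3.4 - - [10/Oct/2023] "GET /index.html HTTP/1.1" 200 512',
          '5.6.7.8 - - [10/Oct/2023] "GET /etc/passwd%00 HTTP/1.1" 404 0']
print(parse_requests(sample))  # ['/index.html', '/etc/passwd%00']
```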

Data Acquisition and Processing
Data acquisition is responsible for collecting enough data to train the ML model. The tasks for data acquisition and processing can be summarized as follows:
• Data Extraction and Analysis: analyzing the data to understand its schema and integrating relevant data to maximize model performance.
• Data Preparation: cleaning the data and splitting the dataset into training, validation, and test sets. For instance, outliers detected by the local outlier factor algorithm are removed from the dataset, and NULL values are transformed to zero. This step results in formatted data for training the ML models.
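The data-preparation step can be sketched with scikit-learn's LocalOutlierFactor; the toy feature values below are illustrative only.

```python
# Minimal sketch of data preparation: NULL (NaN) values become zero and
# local-outlier-factor outliers are dropped from the dataset.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05], [8.0, -5.0]])
X = np.nan_to_num(X, nan=0.0)    # NULL values transformed to zero

lof = LocalOutlierFactor(n_neighbors=3)
mask = lof.fit_predict(X) == 1   # fit_predict: 1 = inlier, -1 = outlier
X_clean = X[mask]                # keep inliers only
```

Here the far-away point `[8.0, -5.0]` is flagged as an outlier and removed, leaving four cleaned samples for training.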
When there is not enough data to train or retrain the ML model, two main approaches allow the issue to be overcome:
• Data Augmentation: this technique aims to balance non-IID (nonindependent and identically distributed) data. More specifically, when the difference ratio between the attack and normal labels is more than 20%, the conditional generative adversarial network (CGAN) generates data with the minority label to keep the difference within 20%.
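The 20% trigger can be sketched as below. Interpreting the "difference ratio" as |majority - minority| / total is our assumption, and the CGAN itself is omitted; the sketch only computes when augmentation fires and how many minority samples would be synthesized.

```python
# Sketch of the augmentation trigger under the assumed 20% difference rule.
import math

def augmentation_deficit(n_minority, n_majority, max_diff=0.20):
    """Number of minority-label samples a CGAN would need to synthesize so
    that (majority - minority) / total stays within max_diff; 0 if balanced."""
    total = n_minority + n_majority
    if (n_majority - n_minority) / total <= max_diff:
        return 0
    # Solve (M - m - k) / (M + m + k) <= d for the smallest integer k.
    d = max_diff
    k = ((1 - d) * n_majority - (1 + d) * n_minority) / (1 + d)
    return math.ceil(k)

print(augmentation_deficit(100, 300))  # 100 synthetic minority samples
```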

Training, Testing, and Deploying the ML Model
The process of training ML models is an iterative process in which data scientists work with several algorithms, data features, and hyperparameters. The output of this step is a set of metrics for evaluating the quality of the model. Once the best ML model has been chosen, it is saved and deployed to the serving website. Our primary objective is to ensure that we monitor all testing experiments, provide the reusability of code, and uphold the ability to continue updating the robustness of ML-based WAD.

Building the Honeypot System and Service Website

Honeypot System Overview
In this study, we apply a web honeypot to simulate specific web services and attract the particular types of attacks aimed at specific web technologies, as in Figure 5. The honeypot runs potentially attractive Java-based web applications behind an Nginx server, and the system records all requests to these web applications. At the same time, the honeypot is always listening for trigger signals from Kubeflow to perform the tasks of collecting datasets, processing data, and pushing data to Data Storage.

Trigger Processing
A trigger is a flag that tells the system when a recurring run configuration spawns a new run. The following types of run triggers are available:
• Access Management: allows clients to access the web application based on their IP address.

Deploying ML Models to the Serving Website:
We are constructing a serving website, designed as in Figure 8, which continuously listens for updated machine learning models. The serving website receives files in both .txt and .csv formats containing good and bad requests, and then employs the machine learning model to classify these requests. To detect new attacks while avoiding false negatives, we selected a detection threshold of 0.5 for identifying bad requests: the predicted label of the detection model is 1 (indicating a bad request) if the predicted value is greater than 0.5, and 0 otherwise.
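The serving-side decision rule above reduces to a one-line threshold check, sketched here with illustrative probabilities:

```python
# Minimal sketch of the serving website's decision rule: a predicted
# probability above the 0.5 threshold is labeled 1 (bad request), else 0.
def classify(probabilities, threshold=0.5):
    return [1 if p > threshold else 0 for p in probabilities]

print(classify([0.93, 0.12, 0.51]))  # -> [1, 0, 1]
```

Note that a prediction of exactly 0.5 is labeled 0, since the rule requires the value to be strictly greater than the threshold.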

IMPLEMENTATION AND EXPERIMENT
In this section, we carry out a performance evaluation of the MLOps pipeline in a variety of scenarios. First, we describe the data resources and partitions. Then, we present the experimental settings, including the environment, baselines, and performance metrics. Finally, we run a series of experiments to compare the MLOps pipeline's performance across various ML models.

A. Dataset
In our experiment, we evaluated the proposed model on 2 datasets using the same feature extraction step. More specifically, we used the public Fwaf dataset to evaluate the effectiveness of the model, and we also simulated the model in a real scenario with the dataset collected by our honeypot. We separated each dataset into 3 parts: 80% for training, 10% for testing, and 10% for validation, as shown in Table 1 and Table 2. The validation set is used to evaluate the performance of the model after each epoch of the training and retraining process, as shown in Figure 3. The testing set re-evaluates the model after training, as shown in the test-model block of Figure 3.
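An 80/10/10 partition like the one above can be produced with two calls to scikit-learn's train_test_split; the toy arrays are illustrative only.

```python
# Sketch of the 80% / 10% / 10% train / validation / test split.
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in X]

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)            # 80% for training
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)  # split the 20% in half

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```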

Feature extraction
In the WAD scenario, we use queries or requests to detect attacks. We extract features from all records using the TF-IDF (Term Frequency - Inverse Document Frequency) numerical statistical method. As in Figure 9, we count the frequency of each word in every request (TF) and the number of queries/requests in which the word occurs, from which the inverse document frequency (IDF) is derived. These words form the features of the dataset, and each feature value is the product of TF and IDF.

Simulating and collecting attack datasets in Cyber Traps (Our dataset)
As in Figure 5, to attract realistic attacks, we built a honeypot based on 3 web applications:
• Spring Boot: a web application built on the well-known Java-based Spring Boot framework.
• Confluence: a web-based corporate wiki written in the Java programming language and first released in 2004.
• Liferay: a Java-based web application platform that provides a toolset for developing customizable portals and websites.
Furthermore, the access logs of the honeypot are recorded and aggregated by Nginx. All the bad requests collected belong to the XSS, SQLi, Path Traversal, RCE, and Command Injection attack classes. They are then processed and labeled with the Bad and Good classes as in Section II.D.2.
The created dataset consists of 26,020 good records and 32,757 bad records. Furthermore, to prove the effectiveness of the MLOps pipeline in retraining the model with new data, we separated our dataset into 2 phases based on the data collection timeline of honeypots. Each phase that represents the new data update period is also split into 3 sets, including training, validating and testing sets. Detailed data are presented specifically in Table 1.

Fwaf -Web-Application Firewall dataset
Our experiments are also evaluated on the Fwaf dataset. In this dataset, the collected records contain benign traffic and the most up-to-date common attacks from real-world scenarios, labeled with two classes: Bad and Good. Approximately 1,350,000 records were captured and processed into CSV format. Furthermore, with the ratio of good to bad labels less than 2.5%, as shown in Table 2, the Fwaf dataset is clearly a non-IID dataset.

Experimental settings
We implemented the MLOps pipeline with the TensorFlow framework. The designed CNN model and the 3 ML algorithms are implemented using Keras and Scikit-learn (SKlearn). Our experiments are conducted on a Kali 2022.1 virtual machine (VM) with a 4-core AMD Ryzen 5 3500U CPU at 3.7 GHz and 8 GB of RAM. In the training process with the CNN, we train the model for 2 epochs with a batch size of 128. The loss function is binary cross-entropy, and the Adam optimizer is used with a learning rate of 0.001. In addition, the ML-based attack detectors based on KNN, SVM, and LR are described in Table 3.

C. Performance Metrics
In this part, 4 common metrics are used to evaluate the performance of the ML-based detection model, including accuracy, precision, recall, and F1-score.
• Accuracy: the proportion of samples that the model predicts correctly out of all samples in the dataset, as in Eq. (10):

$\mathrm{Accuracy} = \frac{n_{true}}{n_{total}}$ (10)

• Precision: the proportion of samples identified as attacks ($attack_{true} + attack_{false}$) that are indeed attacks ($attack_{true}$), as in Eq. (11):

$\mathrm{Precision} = \frac{attack_{true}}{attack_{true} + attack_{false}}$ (11)

• Recall: the proportion of all attack samples ($attack_{true} + normal_{false}$) correctly identified as attacks ($attack_{true}$), as in Eq. (12):

$\mathrm{Recall} = \frac{attack_{true}}{attack_{true} + normal_{false}}$ (12)

• F1-score: the harmonic mean of precision and recall, as in Eq. (13):

$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (13)
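The four metrics above can be computed directly from raw counts, as in this sketch; the counts are illustrative, not experimental results.

```python
# Sketch of Eqs. (10)-(13) computed from confusion-matrix counts.
def metrics(attack_true, attack_false, normal_true, normal_false):
    n_total = attack_true + attack_false + normal_true + normal_false
    accuracy = (attack_true + normal_true) / n_total          # Eq. (10)
    precision = attack_true / (attack_true + attack_false)    # Eq. (11)
    recall = attack_true / (attack_true + normal_false)       # Eq. (12)
    f1 = 2 * precision * recall / (precision + recall)        # Eq. (13)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(90, 10, 95, 5)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```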

Evaluation of MLOps Pipeline with data augmentation scenario:
In this part, to evaluate the effectiveness of the MLOps pipeline in the data augmentation scenario, we completed a total of 2 experiments. In the first experiment, we train the ML models with non-IID data; in the second, we train them with augmented data. More specifically, in this scenario, we evaluate the ML models on the non-IID Fwaf dataset, with each experiment run under the same conditions for the 5 ML models. The numerical results in Table 4 show the performance of the ML models in terms of accuracy, precision, recall, and F1-score. All five ML models perform well on the augmented dataset, with accuracy and F1-score above 99%. However, the performance of the 5 models decreased significantly when they were trained on the dataset without augmentation, with F1-scores dropping to 91.892% and lower. Although the CNN model is the most effective model on the augmented dataset, with all 4 metrics at approximately 99.982%, it is not efficient on the dataset without augmentation, with a precision and F1-score of approximately 70.117% and 80.607%, respectively. In addition, when training with augmented data, the differences in performance between the CNN model and the ensemble are negligible, at less than 0.003% across the 4 metrics. The ensemble model has therefore proven effective and stable in both experiments, with all 4 metrics greater than 91%.

Evaluation of MLOps Pipeline with Incremental Learning scenario:
In addition, to determine the effectiveness of these ML models in the first training, we also train and test the models on the phase 1 dataset. As the results in Table 6 show, the performance of the 5 models when trained and tested on phase 1 of our dataset is very high, with values over 97% on all 4 metrics.
In the experiment without IL, when testing on the large amount of new testing data in phase 2, the numerical results in Table 5 show that the effectiveness of these models drops significantly, by at least 8% across the 4 metrics. The results in Table 5 also illustrate the effectiveness of IL in retraining the model with new data: when testing on the new data in phase 2, the performance of all the retrained models reaches more than 97% on all metrics. Furthermore, the results in Table 5 and Table 6 show that the ensemble model remains stable and effective in the context of WAD even when the model is neither retrained nor regularly updated, with accuracy and F1-score always greater than 90%. Moreover, to demonstrate the predictive effectiveness of the system in a realistic scenario, we summarize some experimental results for the ensemble model. According to the comparison of the Real and Predicted (Pred) labels in Table 7, all the abnormal requests are classified correctly. In particular, the 11th record is one of Confluence's latest exploits, identified as CVE-2022-26134, and it was detected by our system.

CONCLUSION
To effectively detect unknown web attacks, ML models must be trained using a large, up-to-date dataset. As such, deceptive tactics such as installing honeypots and cyber traps within the network can prove to be a valuable strategy for attracting attackers to exploit vulnerable servers. The logs and requests obtained from these deceptive resources can then be used to train the ML-based web attack detection model. The experimental results on two datasets demonstrate the feasibility and effectiveness of the proposed method for constructing high-performance machine learning models for detecting web attacks. In particular, our ensemble model also proves more efficient and stable than SVM, KNN, LR and CNN and is perfectly suitable for deployment on the Serving website. Furthermore, by utilizing triggers defined in the MLOps pipeline, these models can be regularly updated with new indicators identified in the cyber trap system, enabling them to recognize harmful requests directed toward web servers. Our experiments involving the setup of a web honeypot, an MLOps pipeline, and evaluations on two datasets demonstrate the benefits of proactive, deceptive, and ongoing approaches in the realm of web attack detection.