ISBN:
(Print) 9798350324136
The development of 5G networks and beyond has led to an explosion of data generation. It is therefore crucial to have an intrusion detection system (IDS) to detect and prevent malicious packets from entering the network. This paper therefore presents an IDS based on a Feature Selection approach which applies Recursive Feature Elimination with a Random Forest Classifier and 10-fold Cross Validation to classify malicious and benign traffic on the publicly available UNSW-NB15 dataset. Most existing Feature Selection approaches on this dataset are directed at enhancing the performance of a limited number of algorithms. Our proposed Feature Selection approach was tested on six well-known supervised machine learning (ML) algorithms performing binary classification: Artificial Neural Network (ANN), Random Forest (RF), Decision Tree (DT), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Logistic Regression (LR). In addition, we performed hyperparameter tuning to obtain the best possible parameters for each ML algorithm. Unlike hyperparameter tuning in most studies, we perform both Manual Search and Grid Search. The performance of the selected ML algorithms is evaluated based on Accuracy, Recall, Precision, and F1 score. The results from our experiments indicate that the most robust algorithm is ANN, whereas the weakest-performing algorithm is LR. RF is the second-best performing algorithm; however, its runtime is much lower than that of ANN. In particular, ANN excels with (testing accuracy, F1 score) of (88.62%, 96.473%), RF with (87.40%, 89.60%), DT with (87.266%, 89.414%), KNN with (87.11%, 88.7%), SVM with (81.835%, 86.959%), and LR with (81.835%, 85.632%). In addition, over-fitting problems are eliminated by our proposed Feature Selection and Hyperparameter tuning. Compared with existing works using the same ML algorithms on the UNSW-NB15 dataset, our proposed Feature Selection approach achieved better results in most cases and was more stable among different
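As a rough illustration of the pipeline this abstract describes, the following is a minimal sketch of Recursive Feature Elimination driven by a Random Forest, evaluated with 10-fold cross-validation. The file name, column layout, and feature count are assumptions for illustration, not taken from the paper:

```python
# Hypothetical sketch: RFE with a Random Forest estimator and 10-fold CV
# for binary traffic classification on UNSW-NB15.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed CSV layout: numeric features plus a binary "label" column.
df = pd.read_csv("UNSW_NB15_training-set.csv")
X = df.drop(columns=["label"]).select_dtypes("number")
y = df["label"]

# Recursive Feature Elimination ranked by Random Forest importances,
# followed by the same classifier on the reduced feature set.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("rfe", RFE(RandomForestClassifier(n_estimators=100, random_state=42),
                n_features_to_select=20)),   # target size is an assumption
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
print(f"10-fold CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```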
Computer network security and integrity are severely impacted by network attacks. The ability to predict and prevent these attacks is crucial for maintaining a secure network environment. Supervised ML (Machine Learni...
Supervised machine learning algorithms are powerful classification techniques commonly used to build prediction models that help diagnose disease early. However, some challenges like overfitting and underfitting need to be overcome while building the model. This paper introduces hybrid classifiers using an ensembled model with a majority voting technique to improve prediction accuracy. Furthermore, a proposed preprocessing technique and feature selection based on a genetic algorithm are suggested to enhance prediction performance and reduce overall time consumption. In addition, the 10-fold cross-validation technique is used to overcome the overfitting problem. Experiments were performed on a dataset of cardiovascular patients from the UCI Machine Learning Repository. Through a comparative analytical approach, the study results indicated that the proposed ensemble classifier model achieved a classification accuracy of 98.18%, higher than the rest of the relevant developments in the study.
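A minimal sketch of the majority-voting ensemble with 10-fold cross-validation described above. The paper's genetic-algorithm feature selection is stood in for here by SelectKBest for brevity, and the base classifiers, file name, and column names are assumptions:

```python
# Hedged sketch: hard (majority) voting ensemble with a simple
# feature-selection stand-in and 10-fold cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")            # assumed UCI heart-disease export
X, y = df.drop(columns=["target"]), df["target"]

# Majority voting over three heterogeneous base classifiers.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
    ],
    voting="hard",
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),  # stand-in for GA selection
    ("vote", ensemble),
])

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.4f}")
```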
Today, cancer has become a common disease that can afflict the life of one in every three people. Breast cancer is also one of the cancer types for which early diagnosis and detection are especially important. The earlier breast cancer is detected, the higher the chances of the patient being treated. Therefore, many early detection or prediction methods are being investigated and used in the fight against breast cancer. In this paper, the aim was to predict and detect breast cancer early with non-invasive and painless methods that use data mining algorithms. All the data mining classification algorithms in Weka were run and compared against a data set obtained from the measurements of an antenna, consisting of frequency bandwidth, dielectric constant of the antenna's substrate, electric field, and tumor information for breast cancer detection and prediction. Results indicate that the Bagging, IBk, Random Committee, Random Forest, and SimpleCART algorithms were the most successful, with over 90% detection accuracy. This comparative study of several classification algorithms for breast cancer diagnosis, using a data set from the measurements of an antenna with a 10-fold cross-validation method, provided a perspective on the relative predictive ability of the data mining methods. From the data obtained in this study, it can be said that if a patient has a breast cancer tumor, detection of the tumor is possible.
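The study itself ran Weka (Java) classifiers; as a hedged sketch, the comparison loop can be approximated with scikit-learn analogues of the named algorithms (IBk ≈ k-NN, SimpleCART ≈ a CART decision tree, Random Committee ≈ an ensemble of randomized trees such as ExtraTrees). The antenna data file and its column names are assumptions:

```python
# Hypothetical re-creation of the 10-fold CV comparison with sklearn
# analogues of the Weka classifiers named in the abstract.
import pandas as pd
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("antenna_measurements.csv")   # hypothetical file name
X, y = df.drop(columns=["tumor"]), df["tumor"]

candidates = {
    "Bagging": BaggingClassifier(random_state=1),
    "IBk (k-NN)": KNeighborsClassifier(),
    "RandomCommittee (ExtraTrees)": ExtraTreesClassifier(random_state=1),
    "RandomForest": RandomForestClassifier(random_state=1),
    "SimpleCART (DecisionTree)": DecisionTreeClassifier(random_state=1),
}

for name, clf in candidates.items():
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name:30s} 10-fold accuracy: {acc:.3f}")
```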
As the amount of available digital documents keeps growing rapidly, extracting useful information from them has become a major challenge. Data mining, natural language processing, and machine learning are powerful techniques that can be used together to deal with this problem. Depending on the task at hand, there are many different approaches that can be used. The methods available are continuously improved, but not all of them have been tested and compared in a set of coherent problems using supervised machine learning algorithms. For example, what happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss by just being able to process more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs involving training data size, learning time, and quality obtained. For this, we propose a performance trade-off framework and apply it to three important tasks: Named Entity Recognition, Sentiment Analysis, and Document Classification. These problems were also chosen because they have different levels of object granularity: words, paragraphs, and documents. For each problem, we selected several supervised machine learning algorithms and evaluated their trade-offs on large publicly available data sets (news, reviews, patents). To explore these trade-offs, we use different data subsets of increasing size, ranging from 50 MB to several GB. For the last two tasks, we also consider similar algorithms with two different data sets and two evaluation techniques, to study their impact on the resulting trade-offs. We find that the results do not change significantly and that most of the time the best algorithms are the ones with the fastest processing time. However, we also show that the results for small data (say less than 1
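The trade-off measurement the abstract describes amounts to training the same model on growing data subsets while recording training time against held-out quality. A minimal sketch, assuming a hypothetical labelled text corpus loaded by `load_corpus()` and illustrative subset sizes:

```python
# Sketch of a quality-vs-time trade-off loop: fit on increasing training
# subsets, record wall-clock training time and held-out macro-F1.
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

texts, labels = load_corpus()   # hypothetical loader for a labelled corpus
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2,
                                          random_state=0)

for n in (10_000, 50_000, 100_000, len(X_tr)):  # growing training sizes
    vec = TfidfVectorizer(max_features=50_000)
    Xn = vec.fit_transform(X_tr[:n])
    clf = LogisticRegression(max_iter=1000)
    t0 = time.perf_counter()
    clf.fit(Xn, y_tr[:n])
    elapsed = time.perf_counter() - t0
    f1 = f1_score(y_te, clf.predict(vec.transform(X_te)), average="macro")
    print(f"n={n:>7d}  train_time={elapsed:6.1f}s  macro-F1={f1:.3f}")
```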
This study examined the biological, social, and clinical risk factors for mortality in coronavirus disease of 2019 (COVID-19) hospitalised patients. The population of the study is prone to COVID-19, thus understanding the most common traits and comorbidities of people who were affected is crucial in reducing its consequences. In this study, four supervised machine learning algorithms were implemented and compared to predict the mortality rate based on the explanatory variables across the five districts of Limpopo Province in South Africa. The data was obtained from the Limpopo Department of Health. Predictions about the chances of dying from COVID-19 disease were made using logistic regression, random forest, support vector machine, and decision tree algorithms on a dataset of 20,592 records with twenty-one attributes. Due to the imbalanced nature of the data, Random Over-Sampling Examples (ROSE) was employed to balance the data for more accurate classification. The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. We used 70% of the data for training, while 30% was selected for testing the predictive algorithms. A technique called Step Akaike's Information Criterion (StepAIC) was deployed to remove insignificant variables from the full logistic regression model. According to the findings of the study, among the four algorithms tested, random forest had the highest recall rate for predicting mortality, at roughly 79 percent, compared to the other three algorithms. Accordingly, we conclude that the random forest algorithm is appropriate for predicting the chances of patients dying from COVID-19 based on the attributes of the five districts of Limpopo Province. In terms of the features and their importance, a function called Variable Importance (VarImp) was used to check which of the attributes have predictive power on the outcome variable (discharged status). The findings revealed that
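ROSE, StepAIC, and VarImp are R-language tools; as a hedged Python sketch of the same workflow, imbalanced-learn's RandomOverSampler stands in for ROSE and Random Forest's impurity-based importances stand in for VarImp. The file name and column names are assumptions:

```python
# Hypothetical sketch: 70/30 split, oversample the training fold only,
# then report recall and a VarImp-style feature ranking.
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("limpopo_covid19.csv")          # hypothetical file name
X = df.drop(columns=["discharged_status"])
y = df["discharged_status"]                      # assumed binary outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)
X_bal, y_bal = RandomOverSampler(random_state=42).fit_resample(X_tr, y_tr)

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_bal, y_bal)
print("Recall:", recall_score(y_te, rf.predict(X_te)))

# Analogue of R's VarImp: rank features by impurity-based importance.
importance = pd.Series(rf.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```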