ISBN: 9781450392365 (print)
Chinese spam text detection is essential for social media, since spam degrades the experience of Chinese-speaking users and pollutes the community. The underlying text classification methods learn, from annotated or further augmented data, the distinctive character combinations that signal spam. However, exploiting the glyph diversity of Chinese characters, spammers frequently disguise spam content by substituting visually similar characters, which fools the model while the text remains understandable to people. This paper proposes to incorporate the way humans read these adversarial texts into spam detection models by designing a pre-trained model that learns the morphological semantics of Chinese characters and represents their contextual meanings from scratch. The model is pre-trained on a Chinese corpus in a self-supervised manner and fine-tuned on spam-annotated community texts. In addition, to complement the pre-trained model's ability to capture the morphological features of Chinese, a new data perturbation method is introduced that guides optimization towards recognizing the actual meaning of a text after spammers have replaced some characters with visually similar ones. Experimental results show that the proposed methodology notably improves spam text detection performance while maintaining robustness against adversarial samples.
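The perturbation idea in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the confusion table `SIMILAR_GLYPHS`, the function names `perturb` and `augment`, and the `rate` parameter are all hypothetical, chosen only to show how training texts might be paired with copies in which some characters are swapped for visually similar ones while the label is kept.

```python
import random

# Hypothetical table of visually similar characters (illustrative only;
# the abstract does not list the confusion set the authors use).
SIMILAR_GLYPHS = {
    "微": ["徽", "薇"],
    "信": ["倍"],
    "加": ["伽", "咖"],
    "免": ["兔"],
}


def perturb(text, rate=0.3, rng=None):
    """Replace characters that have look-alikes, each with probability `rate`."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        candidates = SIMILAR_GLYPHS.get(ch)
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


def augment(samples, rate=0.3, seed=0):
    """Return the original (text, label) pairs plus one perturbed copy of each.

    The perturbed copy keeps the original label, so a classifier trained on
    the result sees the same meaning behind different glyphs.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for text, label in samples:
        augmented.append((perturb(text, rate, rng), label))
    return augmented
```

Calling `augment([("加微信", 1)])` would return the original pair followed by a perturbed copy with the same label; a real system would need a much larger, glyph-derived confusion set.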
ISBN: 9781467399043 (print)
Hyponymy is one of the most important semantic relations and contributes greatly to semantic dictionaries, information retrieval, and related tasks. This paper proposes a method for extracting hyponymy based on the fusion of multiple data sources, which converts hyponymy extraction into the extraction of hypernyms for target words. First, candidate hypernyms for the target words are mined from search engines, encyclopedia resources and core suffix words. Second, the candidates from these data sources are fused. Finally, a classification algorithm, a mature machine learning technique, is used to filter out noise and extract the hypernyms. A hyponymy relation holds between each target word and its correctly extracted hypernyms. Experimental results show that the highest accuracy of hyponymy extraction with the proposed method reaches 0.832.
ISBN: 9783037858882 (print)
Recently, the number of blogs on the Internet has risen sharply. Mining valuable information from blogs is therefore of practical significance for improving user experience, network services, and so on. This paper proposes an algorithm for mining blog authors' interests based on classification techniques, which introduces a non-empty-intersection evaluation standard. By using the interest set obtained from expanded prediction, the algorithm can also improve the hit ratio of recommendation services based on blog authors' interests, and can therefore achieve a higher degree of user satisfaction. In addition, experiments on data sets from Sina Blog and NetEase Blog show that the algorithm achieves higher accuracy.
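One plausible reading of the non-empty-intersection standard is that a prediction counts as a hit whenever the predicted interest set shares at least one element with the author's actual interests. The sketch below assumes that reading; the abstract does not define the criterion precisely, and the function names `hit` and `hit_ratio` are hypothetical.

```python
def hit(predicted, actual):
    """A prediction is a hit if it shares at least one interest with the truth."""
    return bool(set(predicted) & set(actual))


def hit_ratio(predictions, ground_truth):
    """Fraction of authors whose predicted interests intersect the actual ones."""
    if not predictions:
        return 0.0
    hits = sum(hit(p, a) for p, a in zip(predictions, ground_truth))
    return hits / len(predictions)
```

Under this reading, expanding the predicted set (for example from one interest to several) can only keep or raise the hit ratio, which is consistent with the improvement the abstract describes.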
ISBN: 9781665404785 (print)
This paper proposes a serial architecture for implementing the Two Means Decision Tree (DT), an algorithm that exhibits lower complexity. The architecture is implemented on a Field Programmable Gate Array (FPGA) running at 62 MHz. Simulation results show that the proposed hardware architecture achieves a 10x speed-up over its software implementation and runs 28x faster than the C4.5 algorithm. It also consumes less power than complex algorithms implemented on Graphics Processing Units (GPUs). Hence the architecture is suitable for simple, low-power, high-speed applications.
ISBN: 9781728114101 (print)
The Internet brings great convenience to daily life and has become an important channel for obtaining information. However, a large amount of false information, known as rumors, comes with it. For automatic rumor detection, this paper makes two main contributions: (1) To reduce the impact of imbalanced data on classification, we propose an improved SMOTE algorithm to resample the data. (2) We propose six new features based on Sina microblogs, namely Words with Guidance (WG), Words with Menace (WM), Suspected Topic (ST), Recognition of Information (RI), Degree of Attention to Users (DAU) and Credit Rating (CR), which relate to user-based, content-based, propagation-based and microblog-based features. By building feature subsets with the new features and applying machine learning algorithms including XGBoost, we tested rumor detection performance on a real data set. Experiments show that our rumor detection method significantly outperforms the most advanced method of the same type, with precision, recall and F1 of 0.827, 0.837 and 0.825 respectively, and an AUC of 0.895.
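For context, the baseline SMOTE procedure that the authors improve on creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that standard baseline, not the paper's improved variant, whose details the abstract does not give; the function name `smote` and its parameters are illustrative.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

The synthetic points would be appended to the minority class before training a classifier such as XGBoost, so that both classes are closer in size.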
ISBN: 9781467393799 (print)
Researchers in higher education are beginning to explore the potential of data mining for analyzing data in order to provide quality service and meet the needs of their graduates. Educational data mining has thus emerged as a tool for studying academic data to identify patterns and support decisions that affect education. This paper predicts the employability of IT graduates using nine variables. First, different data mining classification algorithms were tested, and logistic regression, with an accuracy of 78.4, was adopted. Based on the logistic regression analysis, three variables, IT_Core, IT_Professional and Gender, were identified as significant predictors of employability. The data were drawn from the five-year profiles of 515 students randomly selected from the placement office's tracer study.
This paper explores the pros and cons of different algorithmic models on the same selection problem, and then uses combined prediction theory to build a new combined prediction model and examine its prediction accuracy. The practical problem to be solved is helping financial institutions scientifically classify customers who choose financial products. We select the bank data set in the UCI database, which is derived from customer survey data collected by a financial institution in Portugal for a wealth management product. The C5.0 decision tree algorithm, the naive Bayes classification algorithm and the binary logit model are each used individually to carry out single-model empirical studies of financial product customer classification. An empirical analysis of five combination models shows that when the least squares weighting method is used to determine the weights, negative weights appear, which does not conform to the actual situation. The model based on the least squares weighting method and the model based on the simple weighting method are therefore excluded. Among the remaining models, the arithmetic mean weighted model performs better than the reciprocal variance weighted model and the reciprocal mean square model. Its accuracy reaches 89.91%, which is 0.43% higher than the accuracy of a single model. It can be concluded that the model based on arithmetic mean weighting is the better combination forecasting model.
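The three retained weighting schemes have common textbook forms in the combined-forecasting literature, and the sketch below assumes those forms, since the abstract does not state its formulas: arithmetic mean weights are 1/n, reciprocal variance weights are proportional to 1/e_i, and reciprocal mean square weights are proportional to 1/sqrt(e_i), where e_i is the sum of squared errors of model i. All function names are hypothetical.

```python
import math


def arithmetic_mean_weights(n_models):
    """Every model gets the same weight 1/n."""
    return [1.0 / n_models] * n_models


def reciprocal_variance_weights(sse):
    """Weights proportional to 1 / e_i, where e_i is model i's sum of squared errors."""
    inverse = [1.0 / e for e in sse]
    total = sum(inverse)
    return [v / total for v in inverse]


def reciprocal_mean_square_weights(sse):
    """Weights proportional to 1 / sqrt(e_i)."""
    inverse = [1.0 / math.sqrt(e) for e in sse]
    total = sum(inverse)
    return [v / total for v in inverse]


def combine(probabilities, weights):
    """Weighted combination of the single-model predicted probabilities."""
    return sum(w * p for w, p in zip(weights, probabilities))
```

All three schemes produce non-negative weights that sum to one, unlike unconstrained least squares weighting, which the abstract reports can yield negative weights.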
ISBN: 9798350354621 (digital); 9798350354638, 9798350354621 (print)
Physiological state abnormality caused by genetic diseases, excessive exercise, and similar factors is becoming a deadly threat to people's lives and safety because it is hard to detect. The K-Nearest Neighbor (KNN) algorithm is widely used in various fields because it is simple to implement, but when the sample size is too large or there are too many feature attributes, its classification efficiency decreases significantly. This paper proposes an improved KNN (IKNN) algorithm that hierarchically clusters the data during pre-processing, which reduces the algorithm's search space and effectively improves search efficiency. Applied to physiological state abnormality discrimination, the improved KNN algorithm improves both the efficiency and the accuracy of discrimination. The results show that this approach could provide an effective guarantee for the early discovery of abnormal physiological parameters, the timely adoption of response measures, and the protection of people's lives and safety.
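The general idea of shrinking the KNN search space by clustering first can be sketched as follows. This is not the paper's IKNN algorithm: it assumes a simple centroid-linkage agglomerative clustering and a search restricted to the single nearest cluster, and the names `agglomerate` and `classify` are hypothetical.

```python
import math
from collections import Counter


def centroid(points):
    """Component-wise mean of a list of feature tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def agglomerate(samples, n_clusters):
    """Merge the two clusters with the closest centroids until n_clusters remain.

    `samples` is a list of (features, label) pairs.
    """
    clusters = [[s] for s in samples]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(
                    centroid([f for f, _ in clusters[i]]),
                    centroid([f for f, _ in clusters[j]]),
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


def classify(query, clusters, k=3):
    """Run KNN only inside the cluster whose centroid is nearest to the query."""
    nearest = min(
        clusters,
        key=lambda c: math.dist(query, centroid([f for f, _ in c])),
    )
    neighbours = sorted(nearest, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Clustering is done once during pre-processing; each query then compares against the cluster centroids and the members of one cluster rather than the whole training set, which is where the efficiency gain would come from.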
ISBN: 9783319943619; 9783319943602 (print)
With the development of the times, traditional personal credit evaluation is facing a severe test. This paper presents an exploratory study of the practical application and development of personal credit evaluation using MicroBlog data. Drawing on the previous literature on personal credit evaluation, we identify credit-related indicators and summarize them into three major attribute groups: "Attributes of Demographic", "Tweets Content", and "User Relationship Structure". We model the problem with support vector machine (SVM), naive Bayes (NB), logistic regression (LR) and AdaBoost classification algorithms to analyze personal credit from social network data. Compared with the other algorithms, AdaBoost achieves the best AUC value, 0.564, under the equalization setting.
ISBN: 9781728165790 (print)
This paper collects data on the damage to the traffic system caused by earthquakes in China over the past two decades, and trains KNN, SVM, logistic regression, naive Bayes and decision tree algorithms on the data to establish earthquake prediction models. The paper describes the preprocessing, modelling, evaluation, and visualization of the disaster data. An earthquake disaster inversion model based on traffic data has been established, which can predict earthquake intensity from the relevant data provided by the traffic department. The prediction accuracy is relatively high, which is very helpful for earthquake prediction and rescue operations.