This paper describes an empirical research work based on the use of a suitable data structure, named Flow Graph (FG), that can be induced from a supervised training data set. A FG can be approached as a weighted and l...
Today, cancer has become a common disease that can affect one in every three people. Breast cancer is one of the cancer types for which early diagnosis and detection are especially important: the earlier breast cancer is detected, the higher the patient's chances of being treated. Therefore, many early detection or prediction methods are being investigated and used in the fight against breast cancer. In this paper, the aim was to predict and detect breast cancer early with non-invasive and painless methods that use data mining algorithms. All the data mining classification algorithms in Weka were run and compared on a data set obtained from antenna measurements, consisting of frequency bandwidth, dielectric constant of the antenna's substrate, electric field, and tumor information, for breast cancer detection and prediction. Results indicate that the Bagging, IBk, Random Committee, Random Forest, and SimpleCART algorithms were the most successful, with over 90% accuracy in detection. This comparative study of several classification algorithms for breast cancer diagnosis, using antenna measurement data with 10-fold cross-validation, provided a perspective on the relative predictive ability of the data mining methods. The data obtained in this study suggest that if a patient has a breast cancer tumor, detection of the tumor is possible.
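The 10-fold cross-validation protocol mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's Weka setup: the 1-nearest-neighbour classifier merely stands in for an instance-based learner such as IBk, the function names are hypothetical, and the data are synthetic two-cluster points rather than antenna measurements.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x (squared Euclidean)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def cross_val_accuracy(X, y, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        tr_X = [X[i] for i in train]
        tr_y = [y[i] for i in train]
        correct = sum(predict_1nn(tr_X, tr_y, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)

# Synthetic, made-up data: two well-separated clusters standing in for
# "no tumor" (0) and "tumor" (1) feature vectors. Not the paper's data.
rng = random.Random(1)
X = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)] + \
    [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
y = [0] * 50 + [1] * 50
accuracy = cross_val_accuracy(X, y, k=10)
```

Because every sample is tested exactly once by a model that never saw it during training, the averaged score is a fairer basis for comparing classifiers than accuracy on the training data.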
ISBN: 9781538661475 (print)
In recent years, online reviews have played an important role in purchase decisions, because they can provide customers with large amounts of useful information about goods or services. However, to artificially promote products or services, or to damage their reputation, spammers may forge fake reviews. Such behavior misleads customers into making wrong decisions, so detecting fake (spam) reviews is a significant problem. In this paper, we propose two types of features and apply supervised machine learning algorithms to classify Yelp's real-life data. The two new semantic feature sets are readability features and topic features. Our results show that the proposed features are more effective than n-gram features in detecting spam reviews. To improve classification on the real Yelp review data, we also use a set of behavioral features about reviewers and their reviews for learning, which dramatically improves the classification result on real-life opinion spam data. For further improvement, we balance the data by the number of reviewers rather than the number of reviews.
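As an illustration of what a readability feature set might contain, the sketch below computes three simple measures for a review. The abstract does not say which readability measures the authors used, so the choice here (average word length, average sentence length, and the Automated Readability Index) and the function name are assumptions.

```python
import re

def readability_features(text):
    """Crude readability measures for one review (illustrative choice only)."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)       # guard against empty input
    n_sents = max(len(sentences), 1)
    # Count letters and digits only, as the ARI definition does.
    n_chars = sum(c.isalnum() for w in words for c in w)
    chars_per_word = n_chars / n_words
    words_per_sentence = len(words) / n_sents
    return {
        "avg_word_len": chars_per_word,
        "avg_sentence_len": words_per_sentence,
        # Automated Readability Index
        "ari": 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43,
    }

features = readability_features("Great food. Great service! Will come again.")
```

Each review then becomes a small numeric vector that can be concatenated with topic or behavioral features before training a classifier.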
ISBN: 9781538627150 (print)
As the amount of available digital documents keeps growing rapidly, extracting useful information from them has become a major challenge. Data mining, natural language processing, and machine learning are powerful techniques that can be used together to deal with this problem. Depending on the task at hand, many different approaches can be used. The available methods are continuously improved, but not all of them have been tested and compared on a coherent set of problems using supervised machine learning algorithms. For example, what happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss simply by processing more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs involving training data size, learning time, and quality obtained. For this, we propose a performance trade-off framework and apply it to three important tasks: Named Entity Recognition, Sentiment Analysis, and Document Classification. These problems were also chosen because they have different levels of object granularity: words, paragraphs, and documents. For each problem, we selected several supervised machine learning algorithms and evaluated their trade-offs on large publicly available data sets (news, reviews, patents). To explore these trade-offs, we use data subsets of increasing size, ranging from 50 MB to several GB. For the last two tasks, we also consider similar algorithms with two different data sets and two evaluation techniques, to study their impact on the resulting trade-offs. We find that the results do not change significantly and that most of the time the best algorithms are the ones with the fastest processing time. However, we also show that the results for small data (say less than 1
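The kind of measurement behind such a trade-off study can be sketched as a small harness that trains on subsets of increasing size and records training time and quality for each. This is not the authors' framework: the word-vote sentiment learner, the function names, and the four-example data set are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import time
from collections import Counter, defaultdict

def train_word_votes(examples):
    """Toy learner: count how often each word appears with each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for w in text.lower().split():
            counts[w][label] += 1
    return counts

def predict(model, text, default="pos"):
    """Sum the label counts of known words; fall back to a default label."""
    votes = Counter()
    for w in text.lower().split():
        votes.update(model.get(w, {}))
    return votes.most_common(1)[0][0] if votes else default

def measure_tradeoff(train_data, test_data, sizes):
    """For each training size, record learning time and test accuracy."""
    rows = []
    for n in sizes:
        start = time.perf_counter()
        model = train_word_votes(train_data[:n])
        seconds = time.perf_counter() - start
        hits = sum(predict(model, text) == label for text, label in test_data)
        rows.append({"size": n, "seconds": seconds,
                     "quality": hits / len(test_data)})
    return rows

# Made-up data, for illustration only.
train_data = [("good great", "pos"), ("bad awful", "neg"),
              ("great fine", "pos"), ("awful poor", "neg")]
test_data = [("great", "pos"), ("awful", "neg"),
             ("fine", "pos"), ("poor", "neg")]
rows = measure_tradeoff(train_data, test_data, sizes=[1, 2, 4])
```

Plotting `quality` against `seconds` for several learners would show whether a faster method trained on more data can match a slower one trained on less, which is the question the abstract raises.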
Classification is one of the key issues in medical diagnosis. In this paper, a novel approach to pattern classification is presented. The model is called the Associative Memory based Classifier (AMBC). In the experimental phase, the proposed algorithm is applied to help diagnose diseases; in particular, it is applied to seven different diagnostic problems in the medical field. The performance of the proposed model is validated by comparing the classification accuracy of AMBC against that achieved by twenty other well-known algorithms. Experimental results show that AMBC achieved the best performance in three of the seven pattern classification problems. It should also be noted that our proposal achieved the best classification accuracy averaged over all datasets. (C) 2011 Elsevier Ireland Ltd. All rights reserved.