Duplicate bug reporting is a critical problem in the software repositories' mining *** bug reports can lead to redundant efforts, wasted resources, and delayed software ***, their accurate identification is essential for streamlining the bug triage process mining *** researchers have explored classical information retrieval, natural language processing, text and data mining, and machine learning *** emergence of large language models (LLMs) (ChatGPT and Huggingface) has presented a new line of models for semantic textual similarity (STS). Although LLMs have shown remarkable advancements, there remains a need for longitudinal studies to determine whether performance improvements are due to the scale of the models or the unique embeddings they produce compared to classical encoding *** study systematically investigates this issue by comparing classical word embedding techniques against LLM-based embeddings for duplicate bug *** this study, we have proposed an amalgamation of models to detect duplicate bug reports using textual and non-textual information about bug *** empirical evaluation has been performed on the open-source datasets and evaluated based on established metrics using the mean reciprocal rank (MRR), mean average precision (MAP), and recall *** experimental results have shown that combined LLMs can outperform (recall-rate@k 68%–74%) other individual models for duplicate bug *** findings highlight the effectiveness of amalgamating multiple techniques in improving the duplicate bug report detection accuracy.
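The three metrics named in this abstract can be illustrated with a minimal sketch. It assumes each query bug report comes with a ranked list of candidate report IDs and a set of known duplicate IDs. The function names and the sample IDs are hypothetical, and the definitions follow common retrieval usage rather than anything confirmed by the paper.

```python
from typing import List, Sequence, Set, Tuple

# One query: (ranked candidate IDs, set of true duplicate IDs). Hypothetical shape.
Query = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first true duplicate, or 0.0 if none is retrieved."""
    for position, report_id in enumerate(ranked, start=1):
        if report_id in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at each rank where a duplicate appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, report_id in enumerate(ranked, start=1):
        if report_id in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)


def mean_reciprocal_rank(queries: List[Query]) -> float:
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def mean_average_precision(queries: List[Query]) -> float:
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


def recall_rate_at_k(queries: List[Query], k: int) -> float:
    """Fraction of queries with at least one true duplicate in the top k."""
    if not queries:
        return 0.0
    found = sum(
        1 for ranked, relevant in queries
        if any(report_id in relevant for report_id in ranked[:k])
    )
    return found / len(queries)


# Made-up example data, for illustration only.
example_queries: List[Query] = [
    (["b3", "b1", "b7"], {"b1"}),
    (["b2", "b9", "b4"], {"b2", "b4"}),
    (["b5", "b6", "b8"], {"b0"}),
]
```

Under this definition, recall-rate@k counts a query as a success as soon as any one of its duplicates appears in the top k, which is why it can only grow as k increases.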
ISBN: 9781665452786 (print)
About 40% of software bug reports are duplicates of one another, which poses a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug tracking systems, many duplicate bug reports might not be textually similar, and the traditional techniques might fall short for them. In this paper, we conduct a large-scale empirical study to better understand the impacts of textual dissimilarity on the detection of duplicate bug reports. First, we collect a total of 92,854 bug reports from three open-source systems and construct two datasets containing textually similar and textually dissimilar duplicate bug reports. Then we determine the performance of three existing techniques in detecting duplicate bug reports and show that their performance is significantly poor for textually dissimilar duplicate reports. Second, we analyze the two groups of bug reports using a combination of descriptive analysis, word embedding visualization, and manual analysis. We found that textually dissimilar duplicate bug reports often miss important components (e.g., expected behaviors and steps to reproduce), which could lead to their textual differences and poor performance by the existing techniques. Finally, we apply domain-specific embedding to duplicate bug report detection problems, which shows mixed results. These findings warrant further investigation and more effective solutions for detecting textually dissimilar duplicate bug reports.
ISBN: 9781538609927 (print)
Duplicate bug detection is the problem of identifying whether a newly reported bug is a duplicate of an existing bug in the system and retrieving the original or similar bugs from the past. This is required to avoid costly rediscovery and redundant work. In typical software projects, the number of duplicate bugs reported may run into the thousands, making manual intervention expensive in terms of cost and time. This makes the problem of duplicate or similar bug detection an important one in the Software Engineering domain. However, automated solutions are not yet accurate enough in practice, in spite of many reported approaches using various machine learning techniques. In this work, we propose a retrieval and classification model using Siamese Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) for accurate detection and retrieval of duplicate and similar bugs. We report an accuracy close to 90% and a recall rate close to 80%, which makes the practical use of such a system possible. We describe our model in detail along with related discussions from the Deep Learning domain. By presenting the detailed experimental results, we illustrate the effectiveness of the model in practical systems, including for repositories for which supervised training data is not available.
ISBN: 9780769542669 (print)
We present an approach to identify duplicate bug reports expressed in free-form text. Duplicate reports need to be identified to avoid a situation where duplicate reports get assigned to multiple developers. Also, duplicate reports can contain complementary information which can be useful for bug fixing. Automatic identification of duplicate reports (from thousands of existing reports in a bug repository) can increase the productivity of a Triager by reducing the amount of time a Triager spends searching for duplicate bug reports of any incoming report. The proposed method uses a character N-gram-based model for the task of duplicate bug report detection. Previous approaches are word-based, whereas this study investigates the usefulness of low-level features based on characters, which have certain inherent advantages (such as natural-language independence, robustness towards noisy data, and effective handling of domain-specific term variations) over word-based features for the problem of duplicate bug report detection. The proposed solution is evaluated on a publicly available dataset consisting of more than 200 thousand bug reports from the open-source Eclipse project. The dataset includes ground truth (a pre-annotated dataset with bug reports tagged as duplicates by the Triager). Empirical results and evaluation metrics quantifying retrieval performance indicate that the approach is effective.
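To make the idea of character-level features concrete, here is a minimal sketch that ranks existing reports against a new one by the Jaccard overlap of their character 4-gram sets. The abstract does not specify the paper's actual features or scoring, so the choice of n = 4, the Jaccard measure, and all function names and sample reports are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def char_ngrams(text: str, n: int = 4) -> Set[str]:
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Text shorter than n: keep it as a single feature (or nothing if empty).
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(
    query: str, reports: Dict[str, str], n: int = 4
) -> List[Tuple[str, float]]:
    """Return (report ID, similarity) pairs, most similar first."""
    query_grams = char_ngrams(query, n)
    scored = [
        (report_id, jaccard(query_grams, char_ngrams(text, n)))
        for report_id, text in reports.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up reports, for illustration only.
existing_reports = {
    "r1": "Editor crashes when opening large file",
    "r2": "Update dialog shows wrong version",
}
ranking = rank_candidates("Crash on opening a large file in editor", existing_reports)
```

Because "crash" and "crashes" share the 4-grams "cras" and "rash", the two phrasings still overlap without any stemming, which is the kind of term-variation handling the abstract attributes to character-level features.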