A Survey and Taxonomy of Approaches for Mining Software Repositories in the Context of Software Evolution
By: H. Kagdi, M. L. Collard, and J. I. Maletic
Identified 4 dimensions in order to objectively describe and compare the different approaches
The software repositories utilized: what information sources are used?
The purpose of MSR: why mine or what to mine for?
The methodology: how to achieve the purpose of mining from the selected software repositories?
The evaluation of the undertaken approach: how to assess quality?
What types of sources can be considered as software repositories?
source-control systems are used for storing and managing changes to source code artifacts, typically files, under evolution.
defect-tracking systems are used to manage the reporting and resolution of defects/bugs/faults and/or feature enhancements.
archived communications record discussions among project participants, making them sources of information such as change rationales.
repositories have a common goal of supporting software evolution by managing the lifecycle of a software change.
In light of the primacy of source code change => 3 basic categories of information in a software repository that can be mined:
the software artifacts/versions
the differences between the artifacts/versions
the metadata about the software change
Topics of interest below:
4.9. Classification with supervised learning
4.9.1. Maintenance relevance relations
A classification-learning technique is used by Shirabad et al. [37–39] to determine the co-update relations between a pair of source code files, i.e., given two files determine whether a change in one leads to a change in the other. Such types of relations are also termed maintenance-relevance relations. A decision-tree classifier (i.e., model) is produced by a machine-learning (induction) algorithm. A time-based heuristic is employed to assign a relevant or non-relevant relation between a pair of files to form the learning and testing sets. A fixed time period between time T1 and T2 (T2
rate versus true-positive rate), precision, and recall plots imply that the PR attributes generate better classifiers than those of syntactic attributes. The comment attributes generated classifiers do not perform on a par with those generated with the PR attributes. However, they are better than those generated from the syntactic attributes. The classifiers generated from a combination of syntactic and comment attributes produce better results than either of them considered alone.
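The excerpt above is cut off before it spells out the time-based heuristic, so the following is only a minimal sketch under a stated assumption: two files are labeled relevant if they were changed together in at least one update whose timestamp falls in [T1, T2], and not relevant otherwise. The function name `label_file_pairs` and the input layout are hypothetical, not taken from Shirabad et al.

```python
from itertools import combinations


def label_file_pairs(updates, t1, t2):
    """Label file pairs from a list of (timestamp, set_of_files) updates.

    Assumption (not from the source text): a pair is 'relevant' when both
    files appear in the same update inside the window [t1, t2].
    """
    # Keep only the updates that fall inside the fixed time period.
    in_window = [files for ts, files in updates if t1 <= ts <= t2]
    all_files = sorted(set().union(*in_window)) if in_window else []

    # Every pair of files touched by the same update counts as co-updated.
    relevant = set()
    for files in in_window:
        relevant.update(combinations(sorted(files), 2))

    # All remaining pairs of files seen in the window are 'not_relevant'.
    return {
        pair: ("relevant" if pair in relevant else "not_relevant")
        for pair in combinations(all_files, 2)
    }


updates = [
    (1, {"a.c", "b.c"}),
    (2, {"c.c"}),
    (9, {"a.c", "c.c"}),  # outside the window used below
]
labels = label_file_pairs(updates, t1=1, t2=5)
```

The labeled pairs would then be described by attributes (syntactic, comment, or problem-report based) and passed to a decision-tree learner; that step is omitted here.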
4.9.2. Triage bug reports
Anvik et al. used supervised learning (specifically, a support vector machine algorithm) to recommend a list of potential developers for resolving a bug report (BR). Past reports in the Bugzilla repository are used to produce a classifier. The authors developed project-specific heuristics to train the classifier instead of directly using the assigned-to field of a BR. This was done to avoid incorrect labels from BRs with default assignments, which may not reflect the developer who actually resolved the bug. The approach is evaluated on three open-source projects: Eclipse, Firefox, and GCC. Developers who contributed at least nine BR resolutions over the most recent three months were considered in the training set for Eclipse and Firefox. The precision for Eclipse and Firefox was 57% and 64%, respectively, and the recall 7% and 2%, respectively. The precision for GCC was 6% for recommending one developer and 18% for two/three developers. The recall for GCC was 0.3%, 2%, and 3% for recommending one, two, and three developers, respectively.
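To make the precision and recall figures above concrete, here is a minimal sketch of how the two measures can be computed for a ranked list of recommended developers. It assumes that, for each BR, a set of developers considered appropriate is known; the function name and the developer names are hypothetical, and the way Anvik et al. actually built that set is not described in this excerpt.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the top-k entries of a ranked recommendation.

    ranked   -- developers ordered from most to least recommended
    relevant -- developers assumed appropriate for the bug report
    """
    recommended = set(ranked[:k])
    relevant = set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ranked = ["alice", "bob", "carol"]
relevant = {"alice", "carol", "dave", "erin"}
p1, r1 = precision_recall_at_k(ranked, relevant, k=1)
p3, r3 = precision_recall_at_k(ranked, relevant, k=3)
```

Recall can never exceed k divided by the size of the relevant set, so a short recommendation list measured against a large set of appropriate developers yields low recall even when precision is reasonable.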
5. DISCUSSION AND OPEN ISSUES
5.1. MSR on fine-grained entities
One major issue is the disparity between the software-evolution data available in the repositories and the needs of the stakeholders, not just researchers but also software maintainers. The majority of current MSR approaches operate at either the physical level (e.g., system, subsystems, directories, files, lines) or at a fairly high level of logical/syntactic entities (e.g., classes). This is regardless of the primary focus, i.e., changes of properties or artifacts. In part this is due to researchers restricting their approaches/studies to what is directly available and supported by the software repositories (e.g., file and line view of source code and their differences). However, the investigations by Zimmermann et al. have shown the benefits of further processing the information directly available from source code repositories for change-prediction and impact-analysis tasks. In their study, there was no significant difference in precision and recall values between file-based and logical-based entities (i.e., classes, methods, and variables) with respect to change-prediction tasks. However, there is an implicit gain in terms of the context available to the maintainer, for example, the exact location of a predicted change. Predicting a change at an entity level rather than a file level reduces the manual effort, as only the predicted entities (versus the whole file) need to be examined. This leads to the issue of extending current MSR by increasing the source code awareness. The issue of source code awareness could be twofold with regard to the types of MSR questions and the source code artifacts and differences. For example, on one end, a market-basket question is used to find logical/evolutionary couplings between source code entities. These couplings are termed ‘hidden’ dependencies as they are solely based on the historical information of software changes. However, very little attention has been paid to whether these hidden dependencies correspond to relationships present in well-established source code models (e.g., control-flow graphs, dependency graphs, call graphs, and UML models). We feel that a finer-grained understanding of the source code changes is needed to address these types of questions. Fluri et al. analyzed change-sets from a CVS repository to distinguish changes within source code entities such as classes and methods (termed structural changes) from changes such as license updates and white space between source code entities (termed non-structural changes). The goal of their work was to refine evolutionary couplings detected from the version history with this information (i.e., reduce false positives). Their study on an Eclipse plugin found over 31% of change-sets with no structural changes and over 51% of change-sets with at least one non-structural change. In one of the rare cases, Ying et al. defined the interestingness measure of the evolutionary coupling based on source code dependencies such as calls, inheritance, and usage. Their study on Eclipse and Mozilla found evolutionary couplings that were not represented by the source code dependencies they considered. We feel that further utilizing such source code dependencies (such as association and dependency relationships defined in UML) will result in developing heuristics and criteria that would further reduce false evolutionary couplings. It will also help to detect evolutionary couplings that are prevalent but do not exhibit any source code dependencies (e.g., domain or developer induced dependencies). More studies in this direction are needed to realize the exclusive and synergistic contributions of MSR approaches.
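The market-basket idea discussed above can be illustrated with a small sketch: count how often pairs of entities appear in the same change-set, keep pairs above a support threshold, and then check which of those couplings have no counterpart in a set of known source code dependencies. This is an illustrative simplification, not the method of any of the cited studies; the function names, the threshold, and the toy data are assumptions.

```python
from collections import Counter
from itertools import combinations


def evolutionary_couplings(change_sets, min_support=2):
    """Return co-change pairs with their support and confidence values."""
    pair_counts = Counter()
    entity_counts = Counter()
    for change_set in change_sets:
        entities = sorted(set(change_set))
        entity_counts.update(entities)
        pair_counts.update(combinations(entities, 2))

    couplings = {}
    for (a, b), support in pair_counts.items():
        if support >= min_support:
            couplings[(a, b)] = {
                "support": support,
                # Fraction of a's changes in which b also changed, and vice versa.
                "conf_a_to_b": support / entity_counts[a],
                "conf_b_to_a": support / entity_counts[b],
            }
    return couplings


def hidden_couplings(couplings, dependencies):
    """Couplings with no matching (undirected) source code dependency."""
    known = {tuple(sorted(dep)) for dep in dependencies}
    return {pair for pair in couplings if pair not in known}


change_sets = [["A", "B"], ["A", "B", "C"], ["A", "C"], ["B"]]
couplings = evolutionary_couplings(change_sets, min_support=2)
hidden = hidden_couplings(couplings, dependencies=[("B", "A")])
```

In this toy history, A and C change together twice but share no recorded dependency, so the pair is reported as a candidate hidden dependency. Filtering out non-structural changes before counting, in the spirit of Fluri et al., would be a preprocessing step on `change_sets`.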
5.2. Historical context: how many versions?
Software repositories bring a rich history of software development and evolution. One goal of MSR is to uncover the past successes and failures from historical information and improve the evolution process of the software system(s) under consideration. However, one needs to be careful when selecting the amount and period of historical data on which to base tools or models supporting a particular aspect of software evolution. Considering development data from too far back in the history imposes a risk of irrelevant information. The design or operational assumptions of the system may no longer be similar, or worse, may be entirely different. For example, consider a hypothetical system that has undergone 1000 versions. The information about the changes in the first 50 versions may be totally irrelevant for predicting the changes in version 1001. A series of changes from version 50 to version 200 could be attributed to an unstable unit in the system that has now stabilized. On the other hand, considering too few versions of the system imposes the risk of being incomplete or missing important relevant information, thus yielding few useful results. For example, a current version of a system may be in the middle of a refactoring that is achieved by a sequence of changes (versions). The minimum requirement would be the past versions beginning from when the refactoring started, first to confirm the kind of refactoring taking place and then to predict the remaining steps. The number of versions to mine depends on the task and the current state/phase of the system under consideration.
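As a small illustration of the trade-off described above, the sketch below selects a window of history either from a known starting version (for example, the version where a refactoring began) or by keeping only the most recent versions. The function name and parameters are hypothetical; choosing the actual bounds remains a task-specific decision.

```python
def select_history(versions, since=None, last_n=None):
    """Return the slice of history to mine.

    versions -- version identifiers ordered from oldest to newest
    since    -- optional first version to include
    last_n   -- optional cap on the number of most recent versions
    """
    start = versions.index(since) if since is not None else 0
    window = versions[start:]
    if last_n is not None:
        window = window[max(0, len(window) - last_n):]
    return window


versions = list(range(1, 1001))  # the hypothetical 1000-version system
recent = select_history(versions, last_n=800)   # drops versions 1 to 200
refactoring = select_history(versions, since=950)
```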
5.3. Threats to validity in MSR
MSR approaches use a variety of software repositories, ask different questions, and draw conclusions within the context of the conducted study. All of these factors are subject to threats to validity. Gasser et al. identified the challenges associated with the common need among researchers to select, gather, and maintain the raw data of open-source projects for their respective investigations. They suggested a research infrastructure to deal with such challenges and to serve as a benchmark to facilitate comparative and collaborative research. They discussed the infrastructure with regard to representation standards for data and metadata available in various software repositories, linking them, the required tools, and a centralized data repository. German et al. further suggested a set of projects representing various sizes and domains, their extracted source code facts (i.e., syntax and semantic), and the period of considered history and observation for these projects to be benchmarked [10,18]. We call for a comparative framework to objectively compare MSR approaches with regard to the aspects of software evolution, MSR questions, and the results. Such a framework will facilitate more generic conclusions in MSR research. Currently, it is difficult to determine whether two independent MSR investigations are asking equivalent questions or studying the same or similar aspects of software evolution. A benchmark of this nature would help address the expressiveness and effectiveness of MSR in improving software evolution.