The different papers captured in the search are analyzed below. They have been further categorised into five subsections to capture commonly recurring areas, although there may be some overlap between a paper in one section with another section.

Refactoring to improve software quality

Ó Cinnéide and Nixon (1999a) developed a methodology to refactor software programs to apply design patterns to legacy code. They created a tool to convert the design pattern transformations into explicit refactoring techniques that can then be automatically applied to the code. The tool, called DPT (Design Pattern Tool), was implemented in Java and applied the transformations first to an abstract syntax tree that was used to represent the code, before changes were applied to the code itself. The tool would first work out the transformations needed to convert the current solution to the desired pattern (in the paper a plausible precursor was chosen first). It then converted the pattern transformations into a set of minipatterns. These minipatterns would then be further decomposed, if needed, into a set of primitive refactorings. The minipatterns would be reused if applicable for other pattern transformations.

The authors analyzed the (Gamma et al., 1994) patterns to determine whether a suitable transformation could be built with the applicable mini transformations. They found that while the tool generally worked well for the creational patterns, structural patterns and behavioral patterns caused problems. In a different paper (Cinnéide & Nixon, 1999b), more detail was given on the tool and how it can be used to apply the Factory Method pattern, and in another subsequent paper (Cinnéide, 2000), Ó Cinnéide defined further steps of work to test the applicability of the tool. He defined plans to apply the patterns to production software to test whether behavior is truly preserved and to create a tool to objectively measure how suitably the pattern has been applied to the software.

O’Keeffe and Ó Cinnéide (2003) continued to research in the area of SBSE relating to software maintenance by developing a tool called Dearthóir. They introduce Dearthóir as a prototype tool used to refactor Java code automatically using SA. The tool used two refactorings, “pullUpMethod” and “pushDownMethod” to modify the hierarchical structure of the target program. Again, the refactorings must preserve the behavior of the program in order for them to be applicable. They must also be reversible in order to use the SA method. To measure the quality of the solution, the authors employed a small metric suite to analyze the object-oriented structure of the program. The metrics, “availableMethods” and “methodsInherited” were measured for each class in the program and a weighted sum was used to give an overall fitness value for the solution. A case study was employed to test the effectiveness of the tool. A simple 6-class hierarchy was used for the experiment. The tool was shown to restructure the class design to improve cohesion and minimize code duplication.

Further work (O’Keeffe & Cinnéide, 2004) introduced more refactorings and different metrics to the tool. Along with method movement refactorings the ability to change a class between abstract and concrete was introduced and to extract or collapse a subclass from an abstract class, as well as the ability to change the position of a class in the hierarchy of the class design. A method was introduced to choose appropriate metrics to use in the metrics suite of the tool. The metrics used measured the methods and classes of the solution, counting the number of rejected, duplicate and unused methods as well as the number of featureless or abstract classes. Due to the possibility for the metrics to conflict with each other they were then given dependencies and weighted according to the authors’ judgment, as outlined in the method detailed before. Another case study was used to detail the action of the tool and the outcome was evaluated using the value of the metrics before and after the tool was applied. Every metric used either improved or was unchanged after the tool had been applied, indicating that the tool had been successful in improving the structure of the solution.

O’Keeffe and Ó Cinnéide continued to work on automated refactoring by developing the Dearthóir prototype into the CODe-Imp platform (Combinatorial Optimisation for Design Improvement). They introduced it initially as a prototype automated design-improvement tool (O’Keeffe & Cinnéide, 2006) using Java 1.4 source code as input. Like Dearthóir, CODe-Imp uses abstract syntax trees to apply refactorings to a previously designed solution, but it has been given the ability to implement HC (first-ascent or steepest-ascent) as well as SA. They based the set of metrics used in the tool on the QMOOD model of software quality (Bansiya & Davis, 2002). Six refactorings were available initially, and 11 different metrics are used to capture flexibility, reusability and understandability, in accordance to the QMOOD model. Each evaluation function is based on a weighted sum of quotients on the set of metrics.

The authors then conducted a case study to test how effective each function and each search technique is at refactoring software. The reusability function was found to not be suitable to the requirements of search-based refactoring due to the introduction of a large number of featureless classes. The other two evaluation functions were found to be suitable with the understandability function being most effective. All search techniques were found to produce quality improvements with manageable run-times, with steepest-ascent HC providing the most consistent improvements. SA produced the greatest quality improvements in some cases whereas first-ascent hill-climbing generally produced quality improvements for the least computational expenditure. They further expanded on this work (O’Keeffe & Cinnéide, 2008a) to include a fourth search technique (multiple-restart HC) and larger case studies. The functionality of the CODe-Imp tool was also expanded to include six additional refactorings. Similar results were found with the reusability function found to be unsuitable for search-based refactoring and all of the available search techniques found to be effective.

They subsequently (O’Keeffe & Cinnéide, 2007a) used the CODe-Imp platform to conduct an empirical comparison of three methods of metaheuristic search in search-based refactoring; multiple-ascent (as well as steepest-ascent) HC, SA and a GA. To conduct the comparison, four Java programs were taken from SourceForge and Spec-Benchmarks and the mean quality change was measured across the program solutions for each of the three metaheuristic techniques. These results were then normalized for each metaheuristic technique and then compared against each other. They analyzed the results to conclude that multiple-ascent HC was the most suitable method for search-based refactoring due to the speed and consistency of the results compared to the other techniques. This work was also expanded (O’Keeffe & Cinnéide, 2008b) with a larger set of input programs, greater number of data points in each experiment and more detailed discussion of results and conclusions.

At a later point, Koc et al. (2012) also compared metaheuristic search techniques using a tool called A-CMA. They compared five different search techniques by using them to refactor five different open source Java projects and one student project. The techniques used were HC (steepest descent, multiple steepest descent and multiple first descent), SA and artificial bee colony (ABC), as well as a random search for comparison. The results suggest that the ABC and multiple steepest descent HC algorithms are the most effective techniques of the group, with both techniques being competitive with each other. The authors suggested that the effectiveness of these techniques may be due to their ability to expand the search horizon to find higher quality solutions.

Mohan, Greer and McMullan (Mohan et al., 2016) adapted the A-CMA tool to investigate different aspects of software quality. They used combinations of metrics to represent three quality factors; abstraction, coupling and inheritance. They then constructed an experimental fitness function to measure technical debt by combining relevant metrics with influence from the QMOOD suite, as well as the SOLID principles of object-oriented design. The technical debt function was compared against each of the other quality factors by testing them on six different open source systems with a random search, HC and SA. The technical debt function was found to be more successful than the others, although the coupling function was also found to be useful. Of the three searches used, SA was the most effective. The individual metrics of the technical debt function were also compared to deduce which were more volatile.

O’Keeffe and Ó Cinnéide used steepest-ascent HC with CODe-Imp to attempt to refactor software programs to have a more similar design to other programs based on their metric values (O’Keeffe & Cinnéide, 2007b). The QMOOD metrics suite was used to compare against previous results, and an overall fitness value was derived from the sum of 11 different metrics. A dissimilarity function was evaluated to measure the absolute differences between the metric values of the programs tested, where a lower dissimilarity value meant the programs were more similar. CODe-Imp was then used to refactor the input program to reduce its dissimilarity value to the target program. This was tested with three open source Java programs, with six different tests overall (testing each program against the other two). Two of the programs had been refactored to be more similar to the targets, but for the third, the dissimilarity was unchanged in both cases. The authors speculated that this was due to the limited number of refactorings available for the program as well as the low dissimilarity to begin with. They further speculated that the reason for the limited available refactorings was due to the flat hierarchical structure in the program.

Moghadam and Ó Cinnéide (2011) rewrote the CODe-Imp platform to support Java 6 input and to provide a more flexible platform. It now supported 14 different design-level refactorings across three categories; method-level, field-level and class-level. The number of metrics had also been expanded to 28, measuring mainly cohesion or coupling. The platform was also given the option of choosing between using Pareto optimality or weighted sums to combine the metrics and derive fitness values.

Moghadam and Ó Cinnéide used CODE-Imp along with JDEvAn (Xing & Stroulia, 2008) to attempt to refactor code towards a desired design using design differencing (Moghadam & Cinnéide, 2012). The JDEvAn tool is used to extract the UML models of two solutions of code, and detect the differences between them. An updated version of the code is created by a maintenance programmer to reflect the desired design in the code and the tool uses this along with the original design to find the applicable changes needed to refactor the code. The CODe-Imp platform then uses the detected differences to implement refactorings to modify the solution towards the desired model. Six open source examples were used to test the efficiency of these tools to create the desired solutions. The number of refactorings detected and applied in each program using the above approach were collected, and in each case a high percentage of refactorings were shown to have been applied.

Seng, Stammel and Burkhart (Seng et al., 2006) introduced an EA to apply possible refactorings to a program phenotype (an abstract code model), using mutation and crossover operators to provide a population of options. The output of the algorithm was a list of refactorings the software engineer could apply to improve a set of metrics. They used class level refactorings, noting the difficulty of providing refactorings of this type that were behavior preserving. They tested their technique on the open source Java program JHotDraw, using a combination of coupling and cohesion metrics to measure the quality gain in the class structure of the program. For the purposes of the case study, they focused on the “move method” refactoring. The algorithm successfully used the technique to improve the metrics. They also tested the ability of the algorithm to reorganize manually misplaced methods, and it was successfully able to suggest that the methods are moved back to their original position.

Harman and Tratt (Harman & Tratt, 2007) argued how Pareto optimality can be used to improve search-based refactoring by combining different metrics in a useful way. As an alternative to combining different metrics using weights to create complex fitness functions, a Pareto front can be used to visualize the effect of each individual metric on the solution. Where the quality of one solution may have a better effect on one metric, another solution may have an increased value for another. This allows the developer to make an informed decision on which solution to use, depending on what measure of quality is more important for the project in that instant. Pareto fronts can also be used to compare different combinations of metrics against each other. An example was given with the metrics CBO (Coupling Between Objects) and SDMPC (Standard Deviation of Methods Per Class) on several open source Java applications.

Refactoring for testability

Harman (Harman, 2011) proposed a new category of testability transformation (used to produce a version of a program more amenable to test data generation) called testability refactoring. The aim of this subcategory is to create a program that is both more suited to test data generation and improves program comprehension for the programmer, combining the two areas (testing and maintenance) of SBSE. As testability transformation uses refactorings to modify the structure of a program the same technique can be used for program maintenance, although the two aims may be conflicting. Here a testability refactoring will refer to a process that satisfies both objectives. Harman mentioned that these two possibly conflicting objectives form a multi-objective scenario. He explained that the problem would be well suited to Pareto optimal search-based refactoring and also mentioned a number of ways in which testability transformation may be suited to testability refactoring.

Morales et al. (2016) investigated the use of a multi-objective approach that takes into consideration the testing effort on a system. They used their approach to minimize the occurrence of five well known anti-patterns (i.e. types of design defect), while also attempting to reduce the testing effort. Three different multi-objective algorithms were tested and compared; NSGA-II, SPEA2 and MOCell. This approach was tested on four open source systems. Of the three options, MOCell was found to be the metaheuristic that provided the best performance.

Ó Cinnéide, Boyle and Moghadam (2011) used the LSCC (Low-level Similarity-based Class Cohesion) metric with the CODe-Imp platform to test whether automated refactoring with the aid of cohesion metrics can be used to improve the testability of a program. They refactored a small Java program with cohesive defects introduced. Ten volunteers with varying years of industrial experience constructed test cases for the program before and after refactoring, and were then surveyed on certain areas of the program to discern whether it had become easier or harder to implement test cases for them after refactoring. The results were ambivalent but generally there was little difference reported in the difficulty of producing test cases in the initial and final program. The authors suggested that these unexpected results may stem from the size of the program being used. They predicted that if a larger, more appropriate application was being used, then the refactored program may produce easier test cases. The programmers surveyed also mentioned the use of modern IDE’s helped to reduce the issues with the initial code and alleviated any predicted problems with producing test cases for the program in this state.

Testing metric effectiveness with refactoring

Ghaith and Ó Cinnéide (2012) investigated a set of security metrics to determine how successful they could be for improving a security sensitive application using automated refactoring. They used the CODe-Imp platform to test the 16 metrics on an example Java application by using them separately at first. After determining that only four of the metrics were affected with the refactoring selection available, they were combined together to form a fitness function to represent security. To avoid the problems related to using a weighted sum approach to combining the metrics, they instead used a Pareto optimal approach. This ensured that no refactoring would be chosen that would cause a decrease in any of the individual metrics in the function. The function was then tested on the Java program using first-ascent HC, steepest-ascent HC and SA. The results for the three searches were mostly identical except that SA caused a higher improvement in one of the metrics. Conversely, the SA solution entailed a far larger number of refactorings than the other two options (2196 compared to 42 and 57). The effectiveness of these metrics was also analyzed and it was discovered that of the 27% average metric improvement in the program, only 15.7% of that improvement indicated a real improvement in its security. This was determined to be due to the security metrics being poorly formed.

Ó Cinnéide et al. (2012) conducted an investigation to measure and compare different cohesion metrics with the help of the CODe-Imp platform. Five popular cohesion metrics were used across eight different real world Java programs to measure the volatility of the metrics. It was found that the five metrics that aimed to measure the same property disagreed with each other in 55% of the applied refactorings, and in 38% of the cases metrics were in direct conflict with each other. Two of the metrics, LSCC and TCC (Tight Class Cohesion), were then studied in more detail to determine where the contradictions were in the code that caused the conflicts. Different variations of the metrics were used to compare them in two different ways This study was extended (Cinnéide et al., 2016) to use 10 real world Java programs. Two new techniques, exploring areas referred to as Iterative Refactoring Agreement/Disagreement and Gap Opening/Closing Refactoring, were used to compare the metrics and the number of metric pairs compared was increased to 90 pairs. Among the metrics compared, LSCC was found to be the most representative, while SCOM (Sensitive Class Cohesion) was found to be the least.

Veerappa and Harrison (2013) expanded upon this work by using CODe-Imp to inspect the differences between coupling metrics. A similar approach was used to measure the effects of automated refactoring on four standard coupling metrics and to compare the metrics with each other. Eight open source Java projects were used, with all but one of the programs being the same as those used in Ó Cinnéide et al.’s experiment. To measure volatility, they calculated the percentage of refactorings that caused a change in the metrics, and from these a mean value was calculated across the eight projects. The amount of spread between these values was calculated for each metric using standard deviation, as well as the correlation values between each metric. This experiment resulted in less divergence between metrics, with only 7.28% of changes in direct conflicting, but in 55.23% of cases the changes were dissonant, meaning that there was a larger chance that a change in one metric had no effect on another. They also measured the effect of refactoring with the RFC (Response For Class) metric on a cohesion metric and found that after a certain number of iterations, the cohesion will continue to increase as the coupling decreases, minimizing the effectiveness of the changes.

Simons, Singer and White (2015) compared metric values with professional opinions to deduce whether metrics alone are enough to helpfully refactor a program. They constructed a number of software examples and used a selection of metrics to measure them. A survey was then conducted and responded to by 50 experienced software engineers. They were asked on their opinion of the quality of the solutions by asking whether they agree or disagree that a solution was reusable, flexible or understandable. The metrics were corresponded to the quality attributes and correlation plots were produced to measure whether there was any correlation between the engineer’s opinions and the metric values. There was found to be almost no correlation between the two, leading the authors to suggest that metrics alone are insufficient to optimize software quality as they do not fully capture the judgments of human engineers when refactoring software.

Vivanco and Pizzi (2004) used search-based techniques to select the most suitable maintainability metrics from a group. They presented a parallel GA to choose between 64 different object-oriented source code metric. Firstly, they asked an experienced software architect to rank the 366 components of a software system in difficulty, from 1 to 5. The GA was then run for the set of metrics in sequential and parallel, using C++ for the GA and MPI to implement the parallel improvements. Metrics found to be more efficient included coupling metrics, understandability metrics and complexity metrics. Furthermore, the parallel program ran substantially faster than the sequential version.

Bakar et al. (2012a) attempted to outline a set of guidelines to select the best metrics for measuring maintainability in open source software. An EA was used to optimize and rank the set of metrics, which were listed in previous work (Bakar et al., 2012b). An analysis was conducted to validate the quality model using the CK metric suite (Chidamber & Kemerer, 1994) of Object-Oriented Metrics (also known as MOOSE – Metrics for Object-Oriented Software Engineering). The CKJM tool proposed by (Spinellis, 2005) was used to calculate the values of the CK metrics for the open source software under inspection. These values were then used in the EA as ranking criteria in selecting the best metrics to measure maintainability in the software product. This proposed approach had not yet been empirically validated, and had presented the outcome of ongoing research.

Harman, Clark and Ó Cinnéide (2013) wrote about the need for surrogate metrics that approximate the quality of a system to speed up the search. If non-functional properties of the system (e.g. if a mobile device is used) mean limited time or power, then it may be more important for the fitness function to be calculated quickly or with little computational effort, in which case approximate metrics will be more useful than precise ones. The trade-off here is that the metrics will guide the search in the direction of optimality while improving the performance of the search. This ability would be useful in dynamic adaptive SBSE, where self-adaptive systems may take into account functional as well as non-functional properties. Harman et al. had also discussed dynamic adaptive SBSE elsewhere (Harman et al., 2012).

Refactoring to correct software defects

Kessentini et al. (2011) used examples of bad design to produce rules to aid in design defect detection with genetic programming (GP), and then used these rules in a GA to help propose sequences of refactorings to remove the detected defects. The rules are made up of a combination of design metrics to detect instances of blob, spaghetti code or functional decomposition design defects. Before the GA was used, a GP approach experimented with different rules than can reproduce the example set of design defects, with the most accurate rules being returned. Once a set of rules were derived, they could be used to detect the number of defects in the correction approach. The GA could then be used to find sequences of refactorings that reduce the number of design defects in the program. The approach was compared against a different rules-based approach to defects detection with four open source Java programs and was found to be more precise with the design defects found.

Further work with this approach to design smell (defect) correction was then investigated (Kessentini et al., 2011; Ouni et al., 2013; Kessentini et al., 2012). First, (Kessentini et al., 2011) extended the experimental code base to six different open source Java programs, with the results further supporting the approach. Ouni et al. (2013) replaced the GA used in the code smell correction approach with a multi-objective GA (NSGA-II). They used the previous objective function to minimize design defects as one of two separate objectives to drive the search. The second objective used a measure of the effort needed to apply the refactoring sequence, with each refactoring type given an effort value by the authors. Kessentini, Mahaouachi and Ghedira (2012) extended the original approach by using examples of good code design to help propose refactoring sequences for improving the structure of code. Instead of generating refactoring rules to detect design defects and then using them to generate refactoring sequences with a GA, they used a GA directly to measure the similarity between the subject code and the well-designed code. The fitness function used adapted the Needleman-Wunsch alignment algorithm (a dynamic programming algorithm used to efficiently find similar regions between two sequences of DNA, RNA or protein. It can be used to efficiently compare code fragments) to increase the similarity between the two sets of code, allowing the derived refactoring sequences to remove code smells.

Ouni et al. (2012) created an approach to measure semantics preservation in a software program when searching for refactoring options to improve the structure. They used a multi-objective approach with NSGA-II to combine the previous approach for resolving design defects with the new approach to ensure that the resolutions retained semantic similarity between code elements in the program. The new approach used two main methods to measure semantic similarity. The first method measures vocabulary based similarity by inspecting the names given to the software elements and comparing them using cosine similarity. The other method measures the dependencies between objects in the program by calculating the shared method calls of two objects and the shared field accesses and combining them into a single function. An overall objective for semantics similarity is derived from these measures by finding the average, and this is then used to help the NSGA-II algorithm find more meaningful solutions. These solutions were analyzed manually to derive the percentage of meaningful refactorings suggested. The results across two different open source programs were then compared against a previous mono-objective and previous multi-objective approach, and, while the number of defects resolved was moderately smaller, the meaningful refactorings were increased.

Ouni, Kessentini and Sahraoui (2013) then explored the potential of using development refactoring history to aid in refactoring the current version of a software project. They used a multi-objective approach with NSGA-II to combine three separate objectives in proposing refactoring sequences to improve the product. Two of the objectives, improving design quality and semantics preservation, were taken from previous work. The third objective used a repository of previous refactorings to encourage the use of refactorings similar to those applied to the same code fragments in the past. The approach was tested on three open source Java projects and compared against a random search and a mono-objective approach. The multi-objective algorithm had better quality values and semantics preservation than the alternatives, although this approach did not apply the proposed refactorings to the code, leaving the refactoring sequences to be applied manually by the developer.

They further explored this approach (Ouni et al., 2013) by analyzing co-change that identified how often two objects in a project were refactored together at the same time and also by analyzing the number of changes applied in the past to the objects. They also explored the effect of using refactoring history on semantics preservation. Further experimentation on open source Java projects showed a slight improvement in quality values and semantics preservation with these additional considerations. Another study (Ouni et al., 2015) investigated the use of past refactorings borrowed from different software projects when the change history for the applicable project is not available or does not exist. The improvements made in these cases were as good as the improvements made when previous refactorings for the relevant project were available.

Wang et al. (2015) combined the previous approach by (Kessentini et al., 2011) to remove software defects with time series in a multi-objective approach using NSGA-II. The time series was used to predict how many potential code smells would appear in future versions of the software with the selected solution applied. One of the objectives was then measured by minimizing the number of code smells in the current version of the software and estimated code smells in future versions of the software. The other objective aimed to minimize the number of refactorings necessary to improve the software. The approach was tested on four open source Java programs and one industrial Java project. The programs were chosen based on the number of previous versions of the software available, as the success of the approach would depend on this input. The experimental results were compared against previous mono-objective and multi-objective approaches and were found to have better results with less refactorings, but also took longer to run.

Pérez, Murgia and Demeyer (2013) presented a short position paper to propose an approach to resolving design smells in software. They proposed using the version control repository to find and use previously effective refactorings in the code and apply them to the current design as “Refactoring Strategies”. Refactoring strategies are defined as heuristic-based, automation-suitable specifications of complex behavior-preserving software transformations aimed at a certain goal e.g. removing design smells. They described an approach to build a catalogue of executable refactoring strategies to handle design smells by combining refactorings that have been performed previously. The authors claimed that, on the basis of their previous work and other available tools, it would be a feasible approach.

Morales (2015) defined his aim to create an Eclipse plug-in to help with refactoring in a doctoral paper. He aimed to compare different metaheuristic approaches and use a metaheuristic search to detect anti-patterns in code. The plugin would then use automated refactoring to help remove the anti-patterns and improve the design of the code. Morales et al. (2016) addressed this aim with the ReCon approach (Refactoring approach based on task Context). The approach leverages information about a developer’s task, as well as one of three metaheuristics, to suggest a set of refactorings that affect only the entities of the project in the developer’s context. The metaheuristics supported are SA, a GA and variable neighborhood search (VNS). To test the approach, it was applied to three open source Java programs with sufficient information to deduce developer’s context. They adapted the approach to look for refactorings that can reduce four types of anti-pattern; lazy class, long parameter list, spaghetti code, and speculative generality. They also aimed to improve five of the quality attributes defined in the QMOOD model. The results showed that ReCon can successfully correct more than 50% of anti-patterns in a project using less resources than the traditional approaches from the literature. It can also achieve a significant quality improvement in terms of reusability, extendibility and to some extent flexibility, while effectiveness reports a negligible increment.

Mkaouer et al. experimented with combining quality measurement with robustness (Mkaouer et al., 2014) to yield refactored solutions that could withstand volatile software environments where importance of code smells or areas of code may change. They used NSGA-II on six different open source Java programs of different sizes and domains to create a population of solutions that used robustness as well as software quality in the fitness measurement. To measure robustness, they used formulas to approximate smell severity (by prioritizing four different code smell types with scores between 0 and 1) and importance of code smells fixed (by measuring the activity of the code modified via number of comments, relationships and methods) as well as measuring the amount of fixed code smells. They also used a number of multi-objective performance measurements (hypervolume, inverse generational distance and contribution) to compare against other multi-objective algorithms. To analyze the effectiveness of the approach and the trade-offs involved in ensuring robustness, the NSGA-II approach was compared against a set of other techniques. For performance, it was compared to a multi-objective particle swarm algorithm (as well as a random search to establish a baseline), and was found to outperform or have no significant difference in performance in all but one project. It is suggested that since this was the smaller project, the particle swarm algorithm may be more suited to smaller, more restrictive projects. It was also compared to a mono-objective GA and two mono-objective approaches that use a weighted combination of metrics (the same ones used above). It was found that although the technique only outperformed the mono-objective approaches in 11% of the cases, it outperformed them on the robustness metrics in every case, showing that while it sacrificed some quality, the NSGA-II approach arrived at more robust solutions that would be more resilient in a more unstable, realistic environment. This study was extended (Mkaouer et al., 2016) by testing eight open source systems and one industrial project, and by increasing the number of code smell types analyzed to seven.

They also experimented with the newly proposed evolutionary optimization method NSGA-III (Mkaouer et al., 2014), which uses a GA to balance multiple objectives through non-dominated sorting. They used the algorithm to remove detected code smells in seven open source Java programs through a set of refactorings. They tested the algorithm using different amounts of objectives (3, 5, 8, 10 and 15) to measure the scalability of the approach to a multi-objective and many-objective problem set. These results were then compared against other EAs to see how they scaled compared to NSGA-III. The NSGA-III approach improved as the amount of objectives used was increased, whereas the other algorithms did not scale as well. Three other MOEAs were compared; IBEA, MOEA/D and NSGA-II. The other MOEAs were comparable when the amount of objectives used in the search was smaller, but as the amount of objectives used was increased, the results became less competitive with NSGA-III. The search technique was also compared against two other techniques that used a weighted sum of metrics to measure the software. These techniques performed significantly worse than the NSGA-III approach. They extended the study (Mkaouer et al., 2015) by also experimenting on an industrial project and increasing the number of many-objective techniques compared against from two to four. The number of objectives was reduced to eight and changed to represent the quality attributes of the QMOOD suite as well as other aggregate metric functions.

They also looked at many-objective refactoring with the NSGA-III algorithm for remodularization (Mkaouer et al., 2015). They used four open source Java systems in the experimentation along with one industrial system provided by Ford Motor Company. They compared the technique against other approaches by looking at up to seven objectives, using objectives from previous work to look at the semantic coherence of the code and the development history along with structural objectives. Again, the approach outperformed the other techniques and more than 92% of code smells were fixed on each of the open source systems.

More recently, Ouni et al. (2015) adapted the chemical reaction optimization (CRO) algorithm for search-based refactoring and explored the benefits of this approach. They compared this search technique against more standard optimization techniques used in SBSE; a GA, SA and particle swarm optimization (PSO). They combined four different prioritization measures to make up a fitness function that aimed to reduce seven different types of code smells. The four measures were priority, severity, risk and importance. Each of the code smell types were given a priority score of 1 to 7 to represent their opinion of which smells are more urgent from previous experience in the field. The inFusion tool (a design flaw detection tool) was used to deduce severity scores to represent how critical a code smell is in comparison with others of the same type. The risk score was calculated to represent riskier code as code that deviated from well-designed code. Code smells that related to more frequently changed classes were considered more important as code smells that hadn’t undergone changes were considered to have been created intentionally. The approach was applied to five different open source Java programs and was compared against a previous study and a variation of the approach that didn’t use prioritization. The approach was superior using the relevant measures to the other two solutions compared against it. It was also shown to give better solutions in larger systems than the other optimization algorithms tested.

Amal et al. (2014) used an Artificial Neural Network (ANN) to help their approach choose between refactoring solutions. They applied a GA with a list of 11 possible refactorings to generate refactoring solutions consisting of lists of suggested refactorings to restructure the program design. They then utilised the opinion of 16 different software engineers, with programming experiences ranging from 2 to 15 years, to manually evaluate the refactoring solutions generated for the first few iterations by marking each refactoring as good or bad. The ANN used these examples as a training set in order to develop a predictive model to evaluate the refactoring solutions for the remaining iterations. Due to this, the ANN worked to replace the definition of a fitness function. The approach was tested on six open source programs and compared against existing mono-objective and multi-objective approaches, as well as a manual refactoring approach. The majority of the suggested refactorings were considered by the users to be feasible, efficient in terms of improving quality of the design and to make sense. In comparison with the other mono-objective and multi-objective approaches, the refactoring suggestions gave similar scores but required less effort and less interactions with the designer to evaluate the solutions. The approach outperformed the manual refactoring approach.

Refactoring tools

Fatiregun, Harman and Hierons (2004) explored program transformations by experimenting with a GA and HC approach and comparing the results against each other as well as a random search as a baseline. They used the FermaT transformation tool, and the 20 transformations (refactorings) available in the tool, to refactor the program and optimize its length by comparing lines of code before and after. The average fitness for the GA was shown to be consistently better than the random search and the HC search, while the HC technique was, for the most part, significantly better than the random search.

DiPenta (2005) proposed another refactoring framework, Evolution Doctor, to handle clones and unused objects, remove circular dependencies and reorganize source code files. Afterwards, a hybridisation of HC and GAs is used to reorganize libraries. The fitness function of the algorithm was created to balance four factors; the number of inter-library dependencies, the number of objects linked to each application, the size of the new libraries and the feedback given by developers. The framework was applied to three open source applications to demonstrate its effectiveness in each of the areas of design flaw detection and removal.

Griffith, Wahl and Izurieta (2011) introduced the TrueRefactor tool to find and remove a set of code smells from a program in order to increase comprehensibility. TrueRefactor can detect lazy classes, large classes, long methods, temporary fields or instances of shotgun surgery in Java programs and uses a GA to help remove them. The GA is utilized to search for the best sequence of refactorings that removes the highest number of code smells from the original source code. To detect code smells in a program, each source file is parsed and then used to create a control flow graph to represent the structure of the software. This graph can be used to detect the code smells present. For each code smell type, a set of metrics are used to deduce whether a section of the code is an instance of that code smell type. The tool contains a set of 12 refactorings (at class level, method level or field level) that are used to remove the code smells. A set of pre conditions and post conditions are generated for each code smell to ensure that they can be resolved beforehand. The paper used an example program with code smells inserted to analyze the effectiveness of the tool. The number of code smells of each type over the set of iterations was measured along with the measure of a set of quality metrics. In both cases, the values improved initially before staying relatively stable throughout the process. Comparison of initial and final code smells showed that the tool removes a proportion of them and also metric values show that the surrogate metrics are improved. The tool is only able to generate improved UML representations of the code and not refactor the source code itself, and this restriction was identified as an aim for future work.