An example of refactoring from a real (flawed) code base.

Writer, speaker, asker of questions and enthusiast of Extreme Programming - Clare is an ex high school maths teacher and a consultant with 20 years of software engineering experience. She is on a mission to awaken the inner geek in people everywhere - particularly those not traditionally encouraged to geek out.

In this article I walk through a set of refactorings from a real code base. This is not intended to demonstrate perfection, but it does represent reality.

This is a story about refactoring. It's the third item in the TDD red-green-refactor cycle and it's the thing we do all the time, right? Except when we don't.

I have an unruly code base which has suffered from refactoring neglect, so I've been bringing it back into line. In this article I will take a class that is too large, and make it smaller.

Language and approach C# I generally write code in whatever language suits the client, the project and the problem at hand. But the majority of my career has been in the .Net space, so C# is my comfort zone and my default go-to. The examples in this article are written in C#, but I hope the code snippets included are simple enough to follow even if you don't know the language. Test-driven development I'm a big fan of Test-Driven Development (TDD) , and the code presented here was mostly written using that approach. But I have to confess that after 20 years as an engineer, I have at least half a career of bad habits behind me and I don't always get things right.

Outline of the problem The story begins with a boring domestic chore. I've written some personal accounting software - Reconciliate. It works on the command line and performs the following actions: Loads my own comma-separated data: My record of recent bank and credit card transactions.

My predicted monthly and annual transactions (based on data in a central spreadsheet).

Any unreconciled previously-recorded data. Loads third-party comma-separated data (from bank and credit card companies). Reconciles the third-party data against my own data. Writes everything back to a central spreadsheet. I have some bugs that need fixing and some features to add. But my typical workflow is to arrive at the laptop late at night after putting my youngest to bed, with only a small amount of time before it will be my bedtime, and often a few weeks after the last time I saw the code. In these circumstances it's horribly easy to think things like, "Well I know this code is a mess, but I don't have time to get my head round it and fix it now..." Clearly this is not ideal. There's one class in particular - the ReconciliationIntro class - which is giving me a headache whenever I look at it. It's bloated and convoluted and impossible to fit in my head . This has created a nasty feedback loop: "Because this code is so hard to reason about, refactoring will take more time and energy than I have, so I'll put up with it even longer - even though it means I can't even make small changes any more because it takes so long for me to understand the code's current state and decide where the change should go." For instance, I want to add the ability to handle another credit card. In a lot of places I've used polymorphism and the strategy pattern to keep the unique behaviour of each credit card neatly encapsulated. But the ReconciliationIntro class is one place where, because of the poor design of the code and the lack of clear context separation, if I add another credit card I'll be making an already bloated class even worse. It contains four duplicated code paths for each of four types of data (Bank In, Bank Out, Credit Card 1 and Credit Card 2), and if I add Credit Card 3 right now I'll end up following the same anti-pattern. My overall aim is to replace this code with more generic code, via the strategy pattern . But there are four problems standing in my way: This one large class, ( ReconciliationIntro ), is taking too much responsibility. There's a lot of private nested code, which is hard to unit-test because it has no public interface. Within ReconciliationIntro , there's one large method that's doing too much. There are several methods in ReconciliationIntro which use the same repeated pattern but with different details. I plan to fix all of these problems, in the order listed above. This preparatory refactoring will get me to the point where I can easily encapsulate the behaviour of each credit card / account. As Kent Beck says, "Make the change easy, then make the easy change." This article is about tackling the first problem in the above list: This class is too large.

Why refactor? I can't easily reason about the ReconciliationIntro class because it has too many responsibilities. It was originally designed as the entrance hall to the software, and all it did was display a few messages and then instantiate the classes that do the main work. But over time a lot of other code has snuck in. I want to make it easier to add another credit card, so I'm going to start by breaking this large class into smaller classes. The benefits will be several: By moving groups of methods into separate classes I'll create clear contexts and minimise tight coupling. This means that:

I'll know where to go when I want to make changes or fix issues.



When I amend the code, I'll only have to work with small isolated sections that fit in my head - I won't have to understand the whole system.



Any manipulation of state will only happen locally in a small context - so I won't have to worry about one area of the system changing state in another.

How to avoid getting here in the first place Could I have avoided being in this position? Yes, of course. Rather than despairing about a lack of time, I could have made small continuous changes whenever I encountered problem areas. That's part of what the refactor stage of the red-green-refactor cycle is for, and it's a habit you can reinstate at any time. But it's worth acknowledging that when coding under pressure, you're not always in a fit state to make the right choices. Your code can quickly reach a level of crustiness that makes a refactor feel insurmountable. At these times it's always worth using small changes to make things better piece by piece, no matter what state your code is in. How do I do it? To my mind, the most important principle when refactoring (and when writing new code) is to move in tiny steps. I have several connected aims: At every step, I want the code to compile and the tests to run.

If a change causes any tests to fail, I want to be able to fix them instantly. By making small changes and small commits, I can see which change caused the tests to fail and I only need to rewind / examine a small amount of code to find the problem.

I can keep track of where I am in in my head. It's horribly easy to embark on a relatively simple refactor, only to find it has repercussions beyond your original intention. At this point, if you're not in the habit of keeping the code compiling and the tests passing at every step, you can find yourself lost in a rabbit hole that's difficult to exit: Your code isn't compiling and your tests aren't even running, let alone passing. For this to work, I need the code being refactored to be covered by tests. This has already been done before I start refactoring, although it's worth noting that one of my aims in a future refactoring will be to make the code even more testable. Refactoring, by Martin Fowler is a recommended further read and lays out some basic refactoring principles. He also emphasises the value of moving in tiny steps and building the code / running your tests after every small commit. Beyond these basic principles, I'll aim to follow a logical series of steps which are outlined below. How to read this article This is a real code base and a messy code base (hence the refactoring), so there's a lot of code. For those who want to know more, I do link to the code repeatedly. But I've deliberately kept the descriptions high-level, and you don't need to follow those links.

1. Rearrange methods I'll start by using regions to rearrange all the methods in the ReconciliationIntro class into sensible groupings (commit f2d9932) : Figure 1: ReconciliationIntro after methods rearranged into regions Principle: Always commit moves and renames separately from edits. It's important that I move code around like this as a separate task and in a separate commit, without editing the code in any way. When you rearrange methods, the resulting code difference will include many added/deleted chunks of code. If the code has been changed as well as moved, you can't easily tell whether lines have been deleted or moved, and whether moved lines are the same or different. As Dan Terhorst-North says, "A change should be structural or behavioural, but never both." Figure 2: What a commit looks like when you've moved code around Now that I've grouped my methods into what feels like a reasonable set of contexts, I'd like to pull these out into separate classes. But where to start? The file loading code contains the most duplication, and is causing the most pain. Ultimately I want to make this code more generic, but first I want to pull it out so I can see it without distraction. I'll extract a new FileLoader class. The other groupings will also become separate classes but they'll be a lot simpler, so I'll focus most of this article on the file-loading code.

Abbreviated method names The naming of anything - from methods to pubs to children - is a notoriously difficult enterprise. My tendency to verbosity is as evident in my method names as it is in the longwinded stories you'll hear if you meet me in a pub. Before refactoring, the duplication of code meant that method names were being used to differentiate between contexts - and this resulted in some long names. To avoid gigantic images I've abbreviated names in most of the diagrams, as follows: BBI : bank_and_bank_in ("bank" and "bank_in" represent two different views - the bank's, and my own)

: bank_and_bank_in ("bank" and "bank_in" represent two different views - the bank's, and my own) BBO : bank_and_bank_out ("bank" and "bank_out" as with "bank" and "bank_in" above)

: bank_and_bank_out ("bank" and "bank_out" as with "bank" and "bank_in" above) CC1 : cred_card1_and_cred_card1_in_out ("cred_card1" and "cred_card1_in_out" as with "bank" and "bank_in" above)

: cred_card1_and_cred_card1_in_out ("cred_card1" and "cred_card1_in_out" as with "bank" and "bank_in" above) CC2 : cred_card2_and_cred_card2_in_out (as with "cred_card1" above)

: cred_card2_and_cred_card2_in_out (as with "cred_card1" above) Merge_bd : Merge_bespoke_data

: Merge_bespoke_data BM : budgeting_months

: budgeting_months ccds : credit_card_debits

: credit_card_debits wpf : with_pending_file

: with_pending_file CC : cred_card

: cred_card DDs: direct_debits 2. Analyse relationships between file-loading methods So far all I've done is move some code around in the same class. I rebuilt the code and ran all the tests before I committed, but unless I was clumsy I'd expect my previous commit to be trivial. I need to be a little more thoughtful about what I do next. I've used one of my regions to identify methods to pull out into a new FileLoader class, but how can I be sure this will work? Are there any hidden dependencies? I'll discover this by drawing out the relationships between the methods I want to move. The ones to be moved are marked in blue. Figure 3: File-loading methods to be moved (abbreviations) I can see straight away that it's not a strictly self-contained context: Some of the blue methods are calling back out to methods (shown in black and green) that I want to keep in a separate class. There are a couple of ways I could handle this, which I discuss below. But now I'm at a place where I can define a plan of action.

3. Modify the methods that are staying behind I've identified two methods - Set_path and Recursively_ask_for_budgeting_months - which are called by file-loading methods: Figure 5: Methods called by file-loading methods (abbreviations) As long as those methods are not too tightly coupled with the file-loading class, I can either Make them public so I can call back to them from the new FileLoader class.

class. Move the calls out of the file-loading code and into other native ReconciliationIntro methods instead. I'll use the first approach for Recursively_ask_for_budgeting_months and the second for Set_path , which will bring me to the following situation: Figure 6: File-loading methods after moving (abbreviations) Note that the methods marked as FileLoader methods in the diagram will still be in the ReconciliationIntro class at the end of this step, but this will enable me to move them into the FileLoader class, which will happen in step 5 and step 6 below. Recursively_ask_for_budgeting_months will eventually be a public method in another class, but for now I just want to make sure I can call it from elsewhere. It turns out it already is public, which was done so that it could be tested. This in itself is a code smell - it's a sign that it would be better off as part of the public interface of a separate class. The other method called from the file-loading code is Set_path . This changes the value of an internal path variable, so I'll choose option 2: I'll call it separately and pass the resulting data into the file-loading method via a parameter. Principle: Do one thing at a time. Focus! In the process of editing Set_path , I've noticed a couple of other changes I want to make. Firstly, Set_path has the side effect of altering the _path member variable. It would be better if it was self-contained and just returned a new path. Secondly, currently Set_path is being called in two places - just before Create_pending_csvs is called, and much lower down in the chain of events, from within another method called Set_path_and_file_names . If I follow the chain of different methods involved here, I realise they're not called in a sensible order and there's a lot of redundancy. The code is hard to follow, and it would be tempting to refactor it now. But I'm in the middle of another refactoring, so I make a note (on my Trello board) and leave these changes for later. Otherwise I'll get distracted from the task at hand and could get the code into a strange state and lose track of what I'm doing. Focus is important! Note that these might not normally stay as separate commits (I could use micro-commits and then squash them into larger commits), but I'm keeping the small commits intact to make the steps clear. I compile and run the tests at every step: Modify the calling method ( Create_pending_csvs ) so that it takes a parameter, but initially give it a default value (commit acc3519) so that the code still compiles: private void Create_pending_csvs() { // Some code } ⇓ private void Create_pending_csvs( string path = "" ) { // Some code } Call Set_path separately before calling Create_pending_csvs . Take the resulting new _path member variable (see sidebar), and pass it into Create_pending_csvs (commit c5ebc2f) : case "1": { Create_pending_csvs(); } break; ⇓ case "1": { Set_path(); Create_pending_csvs( _path ); } break; Remove the call to Set_path from within Create_pending_csvs , and use the passed-in value instead of the member variable (commit 6df8f97) : private void Create_pending_csvs(string path = "") { try { Set_path(); var pending_csv_file_creator = new PendingCsvFileCreator(_path); ⇓ private void Create_pending_csvs(string path = "") { try { Set_path(); var pending_csv_file_creator = new PendingCsvFileCreator( path ); Finally, remove the default value from the Create_pending_csvs parameter - thereby forcing all clients to pass a value in (commit 2be56ea) . Note that by doing things in this order, I keep the code compiling at all times: private void Create_pending_csvs(string path = "") { // Some code } ⇓ private void Create_pending_csvs(string path = "" ) { // Some code }

4. Create covering tests for the new FileLoader class My first choice of method to move will be Bank_and_bank_out_ _Merge_bespoke_data_with_pending_file . And the first thing I want to do is copy any covering tests into a new test class for the about-to-be-created FileLoader class. There is already one test for this method - M_MergeBespokeDataWithPendingFile_ WillAddMostRecentCredCardDirectDebits - and its job is to make sure this method merges new direct debit data correctly with a 'pending' file (which is being built to contain all new transaction data). Note that I'm only moving tests, not writing new ones. People can get so used to the TDD concept of writing tests before writing code that they assume you need to write new tests whenever you work on code. This is often not the case when refactoring. Ideally my functionality will already be covered by tests, and when refactoring I'm not changing the functionality. So instead of writing new tests, I'm using the existing ones to verify the functionality still works as originally intended. This is a good time to review this test: It's a while since I wrote it, so I should be able to spot quickly whether it makes sense. I want my tests to be clear and easy to read - they should act as documentation for my system behaviour. The first thing I notice is that it contains an assert method - Assert_direct_debit_details_are_correct - whose name is inadequate. What is the definition of "correct"? I rewrite the test to make it easier to read, which involves quite a lot of changes. In the interest of keeping this a digestible read I won't go into the changes made, but you can view them in commit f090f26 and commit 6a6cece. Now that I've refactored the test I'll copy it into a new test class, along with some associated private helper methods. Note that although this and other tests are destined to operate on a new FileLoader class, they'll still act on the old ReconciliationIntro class until I'm sure my new test class has everything it needs. Also note that until everything is safely moved, my test code is duplicated. Automated refactorings It's worth being aware that there are many automatable refactoring tasks which require significantly more care when tools like Resharper are not available. For instance, with Resharper you can quickly and confidently rename a method that has clients throughout a large code base. But without this tool, such a change might be handled differently - for instance by creating a temporary wrapper method. I first create the new test class in the same file as the original (commit 491c795) , so that it's easy to see what I'm copying. Then I can get Resharper and Visual Studio to move everything into a new file for me (commit c9317c0) . Figure 7: Move BBO tests

5. Create new FileLoader class and move two methods Now that my test class is up and running, I can create the new FileLoader class. The method that's furthest down the chain - the lowest leaf in my tree of methods - is Bank_and_bank_out__ Add_most_recent_credit_card_direct_debits . This is a private method with no independent tests (it's tested via the public calling method), so I'm going to move both it and its caller ( Bank_and_bank_out__ Merge_bespoke_data_with_pending_file ). These will be the first two methods to be moved into my new class. Again, I'll move in baby steps to avoid my tests going red and keep my code building at all times. After each of the following steps I make sure the code builds and the tests are passing. This is the situation I begin with. A FileLoaderTests class has been created, but it is testing code that still lives in the ReconciliationIntro class: Figure 8: New FileLoader class part 1 (abbreviations) I start by creating a new FileLoader class. One of my file-loading methods ( Bank_and_bank_out__ Add_most_recent_credit_card_direct_debits ) is private, and will also be private at the end of this process. It will only ever be called by the method that's about to follow it into the new class. It's not individually covered by tests, so I can just copy it over to the new class and have it ready and waiting when its caller is moved (see commit 0341476) . But I will need to make it temporarily public: Figure 9: New FileLoader class part 2 (abbreviations) Now I can create an instance of my new FileLoader class in the ReconciliationIntro class and call its new public method, instead of the old private method. I can also delete the old private method (see commit bde2ae2) : Figure 10: New FileLoader class part 3 (abbreviations) Copy the original caller into the new class (see commit f0a5a59) . Note that at this point it's duplicated: Figure 11: New FileLoader class part 4 (abbreviations) Change the object I'm testing from being a ReconciliationIntro instance to being a new FileLoader instance. Point the tests at the new copy of the original caller. Note that because the private method it calls has also been copied, my tests will pass (see commit 3d573e3) : Figure 12: New FileLoader class part 5 (abbreviations) Now I can call the original caller directly from the ReconciliationIntro class (see commit 77c0b14) : Figure 13: New FileLoader class part 6 (abbreviations) Make the originally-private method private again, and delete the original caller. I'll delete the old test class too, as all its tests have now been duplicated in FileLoaderTests . (see commit 27f1a59) : Figure 14: New FileLoader class part 7 (abbreviations)

6. Move the other methods to the new FileLoader class Now I can move all the other methods. I'll handle them one by one, starting with the simplest and keeping an eye out for dependencies (between methods, and on any member data). I'll need to consider the following: Do I want to rename any methods?

Do I want to replace any parameter lists with new objects?

Are there any redundant parameters?

Are any of the internal nested callees altering state? What impact does this have? I could inline those lower down in the chain before moving them, but if I did I'd break the tests which call them as public methods. So instead I move them on their own. I do it in the order explained below, acting on one at a time and starting with those at the end of the chain - ie the outermost leaves on the following tree: Figure 15: FileLoader method tree (abbreviations) For each one, I use the following approach: Create a copy of the method in the destination class, keeping the original in place. Make the new method public.

Call the new method instead of the original.

Copy any covering tests into the new test class, and make sure they test the new code.

Delete the old method and the old tests. I already moved Bank_and_bank_out__ Merge_bespoke_data_with_pending_file and Bank_and_bank_out__ Add_most_recent_credit_card_direct_debits (commit 0341476 to commit 27f1a59), so now I repeat the same actions for each of the other methods (commit 7cd53f6 to commit 7ab95f2) . I don't commit every tiny step this time, but I still go through the same set of steps. After each step, I make sure the code builds and the tests are passing (apart from when I deliberately make a test fail). I refactored some tests earlier, and I can repeat those changes for tests that follow the same pattern. For some methods the move is very simple, because they have no test coverage. This is part of why I'm doing this refactor, to make that code more easily testable. A note on member variables The four load methods ( Load_bank_and_bank_in , etc) are all using two member variables from ReconciliationIntro : _input_output and _spreadsheet_factory . So when these methods are added to FileLoader , I need to introduce these variables as members in FileLoader too, by injecting them into the constructor. _input_output is already being injected, but when I introduce _spreadsheet_factory I initially give it a default value. This is so the code will still build, and is done before I switch over to using the new methods. This way I can fix all the places that need to inject a spreadsheet factory into FileLoader at my leisure. See commit ce559f2 and commit 7e86292 for the actual code changes , but here is a simplification focusing only on the relevant parts: internal class ReconciliationIntro { private readonly IInputOutput _input_output; public ReconciliationInterface Load_correct_files() { Load_bank_and_bank_in(); } public ReconciliationInterface Load_bank_and_bank_in() { var file_loader = new FileLoader(_input_output); file_loader.Load_bank_and_bank_in(); } } internal class FileLoader { private readonly IInputOutput _input_output; public FileLoader(IInputOutput input_output) { _input_output = input_output; } public ReconciliationInterface Load_bank_and_bank_in() { _input_output.Output_line("Some text"); } } ⇓ // Still compiles, because ISpreadsheetRepoFactory object has default value in constructor. internal class FileLoader { private readonly IInputOutput _input_output; private readonly ISpreadsheetRepoFactory _spreadsheet_factory; public FileLoader( IInputOutput input_output, ISpreadsheetRepoFactory spreadsheet_factory = null) { _input_output = input_output; _spreadsheet_factory = spreadsheet_factory; } public ReconciliationInterface Load_bank_and_bank_in() { var pending_file_io = new FileIO<BankRecord>(_spreadsheet_factory); _input_output.Output_line("Some text"); } } ⇓ // Now ReconciliationIntro can inject member vars into FileLoader and it will all work. // Other clients aren't injecting yet, but some time soon we'll amend them all. // Finally after that we can remove the default value from the FileLoader constructor. internal class ReconciliationIntro { private readonly IInputOutput _input_output; private readonly ISpreadsheetRepoFactory _spreadsheet_factory ; public ReconciliationInterface Load_correct_files() { var file_loader = new FileLoader(_input_output, _spreadsheet_factory ); file_loader.Load_bank_and_bank_in(); } }

7. Extract more new classes A note on the order of the commits My main motivation for this refactoring is to remove duplication and introduce a strategy pattern for the four data types in the file loading code. This means that if you look at the commits you'll see that after extracting the FileLoader class, I moved on to other things. But when I came back to write up this article, I couldn't bear to leave the ReconciliationIntro class in its still-bloated state, so I got on with extracting the rest of the classes - starting from commit 3446a54. When adding regions at the start, I identified some Get budgeting months functionality which creates its own clear context, so I extract these methods out into a new BudgetingMonthService class. This is very quick and simple because these methods only have one public entry point (see commit 6103f0b) . ReconciliationIntro is still too big, but the methods all call one another and now that I've spent more time refamiliarising myself with the code I'm not convinced my two remaining regions of User Instructions and Input and Debug Spreadsheet Operations are the best ways of dividing the remaining code. To help myself think about it, I use a spreadsheet to quickly illustrate the call hierarchy. Figure 16: Call hierarchy This allows me to see that there are THREE self-contained areas of code, not two: User instructions , Gathering file / path info and Debug mode switching code . I remove the original regions (commit 4c57927) and replace them with four new regions (commit 3446a54) . I rearrange the methods to fit in the new regions, and these will now translate into three new classes: Communicator , PathSetter and DebugModeSwitcher ( FileLoader and BudgetingMonthService are not shown on this diagram because they have already been extracted): Figure 17: Identifying final ReconciliationIntro classes I create these new classes gradually and safely using the same principles as for BudgetingMonthService , removing the new regions once they've served their purpose (see commit 7f464a4 to commit 7e118c1). It's worth noting that when extracting new classes from a larger class, I want them to be independent. For this reason I'm deliberately avoiding the anti-pattern which can sometimes arise, where extracted classes are automatically injected as a dependency into the constructor of the original class. The PathSetter class turns out to be non-trivial - see commit 2921220 to commit 398539a . This is due to the currently-slightly-tortuous nature of the path-setting code, something I noticed at the end of step 3. By extracting this code into a separate class and giving it its own clear context, I'm already making this code a bit better - but it will still need some attention. Finally, my ReconciliationIntro class is cleaner and simpler, and the original 41 methods have been reduced to three: Start , Reconciliate and Do_matching : Figure 18: Final ReconciliationIntro class