Everything Is Engineering Now

Three years of Practical Software Engineering at Lifelock

Copyright © 2012 by Cam Riley All Rights Reserved

Why Write This?

I was disappointed when I read "Coders At Work". None of the advice or conversation in the book was something I could use or apply at work the next day. Most of the people in the book were famous, or academics, and did not have the same day-to-day problems that I did as a Technical Manager and Technical Lead. Martin Fowler's book on Refactoring is a good example of something with direct application to software engineering work. After reading a few pages of Refactoring, I could improve how I worked and produced software the very next day.

This is not a tell-all or a gossip story about Lifelock. My time at Lifelock was a positive experience which I enjoyed. What this book is about is the day-to-day engineering issues and consequences we faced in a startup that was only a couple of years old and growing rapidly. It is written for other Software Engineers and hopefully helps explain our decision making: what worked, what didn't, and what we couldn't change due to organizational limitations.

At Lifelock I was employed as a Technical Lead and that position morphed into the position of Technical Manager. I wrote code every day for Lifelock, but the position was more than that. It was part production debugger, part environment debugger, constant support for Infrastructure and QA, as well as driving Engineering to produce higher quality code and spearheading automation efforts. I worked in a cube farm and often there was a line outside my cube of people needing my input for this or that. It was draining, hectic, stressful, rewarding and fun all at the same time.

Hopefully this book helps out other Technical Managers, Technical Leads and Software Engineers in organizations who deal day in and day out with legacy code, with feature sprints, with CTOs, with Vice Presidents, with Offshore groups, with Infrastructure, with QA and maybe even engineers who are unhappily caught in a death march.

Interviewing With Lifelock

In 2008 a software engineering friend, with whom I had shared an office at a previous job, moved across to a contracting firm and was working on a new project at Lifelock. That project became the death march known as Project Renaissance, where the PHP website and the CMS were replaced with an Oracle stack and J2EE technologies. Oddly, the front end systems were re-done in .Net, which meant that engineers could not move up and down the stack as needed. There were specialist .Net front end engineers and specialist Java middleware engineers.

The company I was working for at the time, Shutterfly, had run me through several mini death marches of three to six months in length that I was tiring of. The term death march was coined in the software engineering profession to describe projects where the timelines for completion are too short, the requirements vague and the estimates unrealistic. In these situations management - and the engineers - delude themselves that the go-live date is possible and will often put engineering and QA into crunch mode, where people work long hours seven days a week to give the appearance of the deadline being achievable.

It is a little unfair to call those mini death marches, but they felt like it. We were always late delivering the software and the quality was exceptionally low. The requirements were not vague; it was front-end work and the requirements were well written as wireframes. We tended to be overly aggressive with timelines despite padding hours into the project. Additionally, front-end work is never complete until all the permutations an end user can conceivably click on are identified, and consequently those edge cases often pushed development well into the QA period.

The final straw for me was when a project started to rewrite a front end system in Flex. This became a classic death march rather than just a poor quality system. We blew a timeline and the engineering staff remained in crunch mode. The bugs kept piling up and piling up. My stress levels and nerves at that employer were never great to start with because of the project lengths and quality issues. Going through a divorce at the time contributed to the stress, and the added hopelessness of working through another death march was too much. So I started interviewing.

A major concern I had when interviewing was that I didn't want to change jobs and go immediately into another death march. I was brought in for an interview by Limelight Networks, which was a rising company in the Phoenix area. At the time they were in a small light industrial complex off 48th Street. Now they have the top floors of a brand new multi-story office in downtown Tempe, so their fortunes in the CDN market were definitely on the rise.

The project I was interviewing for had gone through multiple tech leads, had high turnover and was still not finished after over a year of work. They wanted someone to come in, get it completed and out into production. I can recall wincing at the description and asking, "Is this a death march?" I suspect that question got me remembered a couple of years later, when a friend of mine interviewed there and the interviewer mentioned that I had interviewed there too. Phoenix is a small tech community in comparison to Washington DC and Northern Virginia. Nearly everyone has worked with someone else in some capacity, which is a good thing to bear in mind. Never burn bridges in Phoenix.

I interviewed with Lifelock when the Renaissance project was in its early stages. I met with several managers who left soon after I interviewed. I remember saying to the Vice President I interviewed with as I was leaving, "Are we going to work something out? Because I am willing to do this." I cannot remember his exact words, but they were to the effect of: yes, we will try and make this happen. Unfortunately that management group soon left, and with the chaos of the Renaissance death march I think my application was forgotten. Nearly six months after my initial interview I was given a job offer at Lifelock, which I took. I knew some of the people I would be working with there and it sounded like fun; plus, I desperately needed to move on from where I was.

When I handed in my notice to go to Lifelock I left a project that was getting closer to completion but was still in crunch mode. In my last week there I came in on the weekends - both of them, if I recall correctly - to help out the project. I did not see any end in sight and was happy to leave.

Experience In Quality

Given the experiences I had with Spring, Javascript and Flex at Shutterfly, there were things I did not want to let happen again. At Lifelock I would have more freedom to determine how software was developed and what tools and processes we would use. At the previous place we were often hamstrung by the Redwood City office, as Phoenix was treated as a junior office in a lot of respects. The Phoenix office had been started as a near-shoring operation, replacing a Costa Rica offshore operation due to lack of quality. Given the lack of unit testing, the Costa Rican office was probably unfairly blamed for the poor quality code.

I often joked that JIRA and I were not friends, but there was one obvious reality at Shutterfly: no unit testing produced low quality code. To make it worse, when a bug got into JIRA the only way to get it out was to resolve it - whether it was worthwhile or not - or have a Product Owner say it was no longer necessary. The alternative was to keep the bug out of JIRA in the first place by arguing it out. Arguing over each and every bug is a valid management technique for keeping bug counts down, but it causes too much friction. Another tech lead dealt with the low quality and high bug count by leaving a lot of bugs open and closing them a couple of projects later, when the requirement provoking the bug was no longer relevant.

The first project we did at Shutterfly was a rewrite of a front end system where we ported it from Struts to Spring and the Javascript code base from an old hodgepodge system to a new in-house Javascript MVC structure. The Javascript MVC code was very cool and one of the reasons why I wanted to work at Shutterfly. However, we inherited 400+ bugs for that subsystem, and then added something like 400+ new ones of our own, which is horrible. I think by the end of the project it had accrued nearly 1200 bugs. An engineering team cannot operate under that kind of quality chaos.

When we got to the end of that project, we were in constant crunch mode and just closing bugs as fast as we could. I can recall one week where one engineer alone closed one hundred and fifty bugs and I closed just over a hundred. I graphed the rate of new bugs being created against the rate of bugs being closed, and there was one weekend when we won, when we beat the inherent lack of quality of the system. It took too much of an emotional and physical toll though. We lost engineers in Phoenix because of it.

The tools to produce high quality code had been around since the early 2000s, namely JUnit. By 2007 there were test harnesses and ports of JUnit to every language and platform. There was no excuse for engineering not having that as part of its toolbox. When we did the Flex project a workaholic contractor came in to help us out, as most of us were Java or Javascript engineers moving across to the Flex platform. Even though unit testing was not a part of our culture, he set up all the right tools so we could unit test our code - and we decided to write non-testable code instead. It was a failure of leadership on my part in Phoenix, and a cultural failure of the engineering organization, that the accepted norm for an engineering project was a thousand plus bugs generated.

In 2002 I had worked on an NTCIP Driver project which was a lot of fun. The project included a mathematician who was a friend of mine. Java doesn't have unsigned integer types, so I was often running to the mathematician for help with bitwise operations to ensure that the PMPP packets were correct. He did the bitwise operations on his fingers, with each finger representing a bit. After he had worked it out on his fingers we would turn it into a Java method. On that project I was terrified of a bit or a byte being out of place, so we covered it with all manner of unit and functional tests to ensure that it worked as the spec demanded.
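The kind of method we turned his finger arithmetic into can be sketched like this; the class and method names here are mine for illustration, not the project's. Java's byte is signed, so reading an unsigned field out of a packet means masking each byte before combining them:

```java
public class UnsignedBytes {
    // A Java byte is signed (-128..127), so a raw 0xFF byte reads as -1.
    // Masking with 0xFF promotes it to int and recovers the unsigned value.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Combine two bytes (big-endian) into an unsigned 16-bit value,
    // the sort of field a protocol packet is full of.
    static int readUint16(byte hi, byte lo) {
        return (toUnsigned(hi) << 8) | toUnsigned(lo);
    }

    public static void main(String[] args) {
        byte hi = (byte) 0xFF;
        byte lo = (byte) 0xFE;
        System.out.println(toUnsigned(hi));      // 255
        System.out.println(readUint16(hi, lo));  // 65534
    }
}
```

Methods this small are exactly what made the driver easy to cover with unit tests: each one is a pure function of its inputs, so a test can assert the expected value for every tricky bit pattern.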

We were given a variable messaging sign [VMS] to test the new Java NTCIP driver with. These are the large electronic road signs you see with yellow lights in them. They are usually attached to an overpass or to orange trailers on the side of the road. The one we were given had a board in it which could handle the new protocol. We would constantly run functional tests against it, and we discovered the board didn't handle the protocol. It turned out the manufacturer was beta testing on us by putting in their own alpha and beta boards. We ended up swapping several boards in and out before we got the whole thing working. It was the unit tests that gave us the confidence to say it was the board in the variable messaging sign that was not compliant and that our code was handling the spec correctly.

The point of the NTCIP Driver project story was that I wasn't a stranger to unit testing and had used it in the teams I had led prior to Shutterfly. I didn't press for unit testing at Shutterfly and didn't provide the leadership for it there either. That was a failure on my part. I did not want to make that mistake again. I was resolved that at Lifelock I would not give up code quality and allow myself to be in a position where I would shrug off one thousand plus bugs as being normal for a software project.

The other part that made working at Shutterfly unbearable was the hours. I am an aging Software Engineer, 41 at the time of writing. I should have known better, but I was happy to pretend I was the Herculean guy who could work more and get it done by sheer force of willpower. It is conceit. I used to say, "You have to be fit to be a software engineer" because of the hours, stress and project pressures. I was wrong. You have to have the courage to say no and just dig your heels in. There are too many managers who say yes when they should say no and are happy to take advantage of engineers' good nature by having them work impossible hours for the sake of appearance. I resolved that at Lifelock the work/life balance was going to be normal for myself and any engineers I was working with.

If there is anything I am proud of at Lifelock, it is that I achieved those two goals; code quality and work/life balance for the Tempe engineers. It was a hard slog to get there but totally worth it.

Post-Renaissance Lifelock

When I came to Lifelock in August of 2009, the death march Renaissance project had gone out to production and been bulldogged into a stable live system courtesy of long hours from the engineering, infrastructure and QA departments. Lifelock had gone through a boom in 2007 when they started advertising on right wing radio stations. It turned out that the products Lifelock offered were a close match for the demographics and concerns of the right wing radio audience. Lifelock was a startup and one of the leaders in the Identity Theft Protection market. They were constantly trying to find which market wanted their products and what they needed to sell in order to resonate with a larger customer base.

The identity theft market became possible because more and more information was flowing through the Credit Bureaus. These are regulated bodies dominated by a select few companies: Equifax, Experian and TransUnion are the best known. The Fair and Accurate Credit Transactions Act allowed individuals to place alerts on their credit histories if identity theft was suspected. The legislation was intended to make fraudulent applications for credit difficult if not impossible. This meant if you had that flag on your credit history and someone made a request for credit, the credit bureau would ring you and say, "Is this you opening credit for …" which was hugely useful and pro-active. This started the identity theft protection industry, and Lifelock was the one who recognized the market opportunity in that legislation.

Most startups have manual fulfillment and only begin to automate processes as scale dictates or necessitates. Lifelock's initial product was to put the identity theft alerts on your credit history at the Credit Bureaus so the paying customer didn't have to worry about it. The alert only lasted for a short time, so every three months someone from Lifelock would ring up the credit bureaus for each customer and have the identity theft alerts renewed on the customer's credit history.

As Lifelock's customer base grew, more and more of the fulfillment and billing processes needed to be automated, as Member Services were swamped with requests. The billing project was supposed to take all the tasks that did not need to be manual and make them an automated part of the billing system. True to their startup roots, Lifelock was using Paypal as their billing provider up until that time. If I recall correctly, in 2009 Lifelock was Paypal's largest recurring billing customer. By 2010 Paypal could not provide all the services that a growing Lifelock required, so a new billing system was sought out. When I came to Lifelock I was placed on the Billing Project immediately.

The Billing Death March

I had left Shutterfly because of a death march. When I joined Lifelock I managed to drop myself into another one immediately. Edward Yourdon suggests that engineers caught in death marches quit. That is not always feasible, and certainly not in my situation, as I had just left one company for another. It wasn't obvious until about two months into the project that we were dealing with a death march, either. After we got stuck on the Service Bus work we realized the project was going nowhere fast.

Every morning Lifelock had a large scrum in a medium sized conference room. This was a leftover of the Renaissance project days when they were focused on getting the Renaissance software into production. All the engineers, QA, infrastructure, project managers, business analysts and middle managers would shuffle into this conference room and pack every square inch of space. The CTO generally ran the meeting and would go round the middle managers and project managers in turn. The rest of the technical people there were largely an audience except when a specific question was asked of them. I hated it.

I had been to about two weeks of these morning scrums and was feeling impatient. Not only was I being forced to wear uncomfortable business casual clothes but I was jammed into an uncomfortable meeting each morning. Soon after, I approached the new Vice President of Engineering and asked if I could hold the billing scrum separately from the morning meeting. She said that sounded good and that she disliked the big morning meeting as well. That large scrum disbanded and smaller scrums started occurring. The meeting had a place during Renaissance but had outgrown its usefulness.

The vendor for the new billing system was Metranet, who had a .Net based product that was to be housed internally. The original approach was that Metranet would take over the billing and product catalog responsibilities. In Lifelock's existing software the recurring billing was handled in the middleware, and the product catalog was Salesforce with the important pieces of data replicated in the Lifelock database.

With the Renaissance project Lifelock had moved to a Service Oriented Architecture [SOA] structure where web services were the main method for calling between subsystems. The goal was for our middleware and service bus to integrate with Metranet transparently via web services, so that our front ends and partners would not notice that we had a new billing system and product catalog.

I am not sure why Metranet was chosen as the vendor for the billing system. I heard that it was in the middle for price; I also heard that the investors in Lifelock were also investors in Metranet and that it was a case of eating your own dog food. Another thing I heard was that Metranet claimed everything was out of the box when requirements were hashed out, and then when the project started suddenly everything was custom code. I don't know the truth of why Metranet was chosen.

I used to joke that we should have bought Metranet's sales people and thrown out their software. Metranet was very effective in convincing Lifelock that they were the best for our needs. This perception only started changing as it became apparent that the project was a death march and the majority of the quality issues were with Metranet's system and not Lifelock's code.

When I arrived at Lifelock the engineers were over-managed. They were being sucked into meetings all day. Young talented engineers, who should have been punching out code for six hours a day, were spending those same six hours stuck in meetings and not uttering more than five words an hour. Once I settled in, I took the approach that I would go to the meetings and we would only bring in other engineers when we needed them. One young engineer commented that when I arrived his calendar changed from one hundred percent meetings to zero.

I am not a fan of meetings. I dislike them as a forum for discussing issues, as people tend to like to talk and you have to give everyone equal time. At Lifelock, quick hallway meetings to hash things out or to determine consensus were far better, and for the most part I tried to achieve things that way. This method works well when there is limited management, but once more and more management starts piling on, the meetings increase in frequency.

Until early 2012 the management structure of the engineering department at Lifelock was super flat. There was the Tempe group, with myself as the tech lead, and the Irvine group in California, with an engineer as lead, and we all reported to the VP of Engineering. It was remarkably effective: the VP of Engineering set strategy and the engineering groups set about implementing it. The flat structure showed empirical results, as engineering had the highest morale of any group in Lifelock in 2011.

Clothing

When I first started at Lifelock I had been told we had to wear business shirts, business pants and dress shoes. This is not comfortable wear for a software engineer. I also have a muscled build so nothing in the business casual catalog really fitted me that well. I am certain business clothing is designed to be comfortable for the average business male who has a pudgy belly, chicken legs and stooped shoulders.

I had to go out and buy clothes specifically for Lifelock. I had one suit that I got married in, and outside of that I had t-shirts, jeans and work out clothing. I put up with wearing business clothes until I started noticing that a couple of upper managers were wearing plain t-shirts with dress pants and dress shoes. I think that is a massive fashion faux pas; it did not look good, but I decided that if they were wearing t-shirts, so would I.

None of my t-shirts are plain or go with dress pants, so I wore jeans instead. It was like dominoes falling. Within a couple of weeks all the engineers were coming into the office in t-shirts and jeans, and soon after infrastructure was as well. It was unstoppable. Even though it sounds in this telling like I started that process, it was a group consensus and everyone kind of did it at the same time. Engineering and Infrastructure were just itching for an excuse not to wear business shirts.

We couldn't get the dress code relaxed as far as shorts. Phoenix is a hot city, and wearing jeans in the desert when it is 115F is not fun. I always thought it was strange in Australia that people would wear long pants and long shirts for business reasons when the Australian climate - other than Tasmania's - is either hot or muggy. If Australia is hot, Phoenix is even hotter, and during monsoon season it is brutally hot, humid and muggy. In those kinds of environments I think shorts are more than acceptable.

Engineering Gets Macs

Windows machines are essentially crippled for business use courtesy of all the anti-virus software loaded onto them. Trying to run Weblogic on localhost and compile branches on the command line generally made the machine unusable for extended periods. It was incredibly frustrating. When I interviewed with Lifelock I stipulated as part of my employment that I would get a Mac and wouldn't have to use a Windows machine. When I started, that was ignored and I got a standard Windows machine.

The designer guys down the hall had Macs - supposedly because they were incapable of being creative on Windows machines. I chatted to the designers and asked how they got the Macs; I was told they were an exception. I determined that engineering would be the next exception to that policy, and consequently we started politicking for Macs as well. We were rebuffed numerous times.

One day the CTO moved on and the CFO took on both the CFO and CTO roles for the interim period. I had lunch with him one day and in passing talked about our crappy machines and how they took a dive when we ran functional tests against Weblogic on localhost. Our main complaint was that a task which should take thirty minutes ended up consuming a day. Very soon after that lunch we had someone from procurement asking all the engineers what machines we wanted.

Not all the engineers in Tempe wanted Macs. We had two dedicated .Net engineers but they were happy to upgrade their machines to brand new boxes with ample memory, CPU speed and hard drive space. Three of us decided that we wanted a strong separation between work and home. Consequently we refused laptops and took desktops. Two of us asked for Mac Pros, the third was the lone middleware engineer hold out who asked for a Windows desktop.

Once the Mac was in the organization it was unstoppable. Prior to the influx that started with engineering, there had been a dictum that only Windows and Lenovo were supported. The IT group probably could have stopped more Macs coming in after engineering got them, but a curious thing started happening - Macbook Pros that came in started getting light-fingered by executives and directors. One of our engineers loved Macs and was pretty miffed to be the last to get a Macbook Pro: someone higher up the chain saw the new Macbook Pro intended for him and appropriated it for themselves.

Between upper management starting to get Macbook Pros and the rapidly expanding numbers of engineers in Tempe and Irvine with them, the Macs were there to stay. It was probably a good thing, as soon after, iPhones and tablets became commonplace, ousting the Blackberry and the under-powered Lenovo laptops.

We used Subversion for our source code repository. I had been using svn from the command line for managing my source code changes, but the Irvine engineers started using Cornerstone and it became the de facto standard for the Mac users in engineering. It worked well, though sometimes merging was dicey. It wasn't always obvious in Cornerstone whether a conflict had been resolved or not, and you could mark something as resolved that still had >>>> in it. That is tolerable in a java file or anything else that is compiled, because the build breaks immediately, but not so good in an html file or a config file, where it ends up as a runtime error.
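A cheap guard against that failure mode is to scan the non-compiled files for leftover conflict markers before a deploy. A minimal sketch - the class name and marker list are mine for illustration, not anything we shipped:

```java
import java.util.List;

public class ConflictMarkerCheck {
    // The marker lines svn leaves behind when a conflict is mis-resolved.
    private static final String[] MARKERS = {"<<<<<<<", "=======", ">>>>>>>"};

    // Returns true if any line begins with a conflict marker, which in an
    // html or config file would otherwise only surface at runtime.
    static boolean hasConflictMarkers(List<String> lines) {
        for (String line : lines) {
            for (String marker : MARKERS) {
                if (line.startsWith(marker)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Run over html and config files in a pre-deploy or CI step, a check like this turns a silent runtime error into a loud build failure, the same protection the compiler already gives java files.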

Eclipse

Eclipse has been the dominant Integrated Development Environment [IDE] for a while. There was a time when it was slow and unstable, but that ceased to be an issue by about 2005. Prior to that I used emacs and ant off the command line. With the tools for refactoring and introspection that Eclipse has, I could never go back even if I wanted to. It is a remarkably productive environment.

The only issue we hit was the m2eclipse plugin, which appeared to cause instability when importing projects once we changed to a larger branch structure. Upgrading to m2e caused its own issues, as it didn't recognize a lot of the custom goals and plugins we had created to support our build and artifact generation process. The m2e plugin was also not so great at working out exclusions in parent poms that were not in the workspace.

We checked in the .project and .settings files with the projects. When I first came to Lifelock they were not checked in, and there was a text file that was passed around to get the OSB and EAR projects set up correctly in Eclipse. If you ever had to change to a new workspace you had to go and find the text file, then go through all the steps to make sure all the Eclipse projects for the OSB config, the wsdls, the ear, the ejb and the ws projects were correct.

Since everyone was on Eclipse it was a quick win to start checking in the .project files; then it was simple to import a project from the filesystem after it had been checked out. As we went to the platform architecture, with multiple ears, common jars and common EJB modules, it became even more necessary, as otherwise engineers would have spent twenty percent of the week setting up projects in Eclipse rather than coding.

When we moved to maven we continued to check in the .project and .settings files. It became normal for engineers to import existing projects into the workspace, and since no-one used IntelliJ or emacs/vim it was not a big deal. I know there are religious wars over whether to check in .project files, but I have always found it easier and more efficient. Open source projects, with their diverse developer bases, may prefer not to, but in our case it made it easier to focus on developing rather than screwing around with the IDE.

The checking in of .project files extended to our automation projects as well. The majority of our Eclipse-based automation was functional testing for the front end and middletier, though we had python and bash projects which supported different automation tasks. The Linux Engineers did not check in their project files, as most of their work was one-off source files or small groupings of no more than five or so. Additionally, they all used different editors such as vim or komodo.

The automation projects that engineering was involved in had the .project files checked in, despite them being projects with only one or two files. Again, it was more efficient to import the project and have it all set up for you than to muck around with the IDE project settings. If there is anything I have learnt from Eclipse, it is that it is hard to do the same thing twice manually.

One of the things that used to drive me bonkers was people checking in class files or compiled jar files in a target directory. Subversion didn't handle the constant change well, and these files would always appear with the little green squiggle. Subversion is not particularly good at turning a previously checked-in file into one that is now ignored; you really have to identify the ignored files the first time. It is a weakness in Subversion that is annoying, but not really a show stopper.

Branching

After the Renaissance project there were multiple projects in Subversion, but the two important ones were the middleware and the front end. Essentially we all developed off trunk. Tags were made to denote a production release. With the billing project and its incompatibilities with trunk, a new branch was created off trunk for the middleware and front end. Infrastructure also provided a completely new development integration environment for the billing project. This was the start of Lifelock using a feature branch strategy.

When billing was pushed into production in December of 2010, the next large incompatible branch came in and replaced two of our front end projects. From that point on we settled into the Agile approach of two to three week sprints with a production release at the end. Occasionally a larger feature set - such as the sales tax changes or the product platform - would stay out of trunk for three sprints or longer before coming in and being released to production.

We often toyed with the idea of Fowler's continuous integration or the kanban method of code production, but our organizational structure did not support that speed of code movement through our system into production. Conway's law states: "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."

Conway's law is true of processes too. Organizations which design a process will match the process to existing communication structures. The process to get code from a user story to production matched the organizational structure and did not change no matter how much engineering and management tried to speed it.

Processes are not inherently bad. They largely exist to try and make sense of chaos. However, chaos is not uniform; it is often concentrated in small areas, and formalizing a process that is too heavy can penalize the areas that don't really need to be under that umbrella of chaotic inputs and outputs. Chaos is not static either. It can move or be forced out, so a process that made sense six months ago may not make sense now. Unfortunately, once a process is established it becomes its own personal bureaucracy and defies elimination.

What Lifelock called feature branching is a little different from the standard definition of a feature branch; more accurately, Lifelock used sprint branches. When a new sprint was started we would create a new Version in Jira and Greenhopper, add the user stories as JIRA tickets, and then create a new sprint branch out of Jenkins. The Jenkins job would version all the maven poms with the correct version in the Group Artifact Version [GAV] coordinates for the sprint, check the new poms in, and then build the new branch via Continuous Integration in Jenkins. The branch artifacts were then pushed up to Nexus, ready to be deployed to a DEV environment.

If, for example, the new Sprint was for Sales Tax and was number two in the sequence of sprints for that project, the Jenkins job would create a branch named SALESTAX_SP2-SNAPSHOT and then version all the poms in the branch with the same version name. That way, when anything was checked into that branch, the continuous integration job in Jenkins for that branch - the CI job being created by Jenkins when the branch was created - would kick in and build artifacts with the SALESTAX_SP2-SNAPSHOT versioning.

When a sprint was finished, the feature branch would be merged into trunk. The version for trunk was TRUNK-SNAPSHOT. From there a release branch would be cut and those artifacts would go through stage and into production. A Lifelock feature branch included more than one feature. Often it was one to twenty User Stories that could cover the front end, the middleware, the database, batch jobs and business tasks. It was not limited to code changes and was very cross-functional.

One of the downsides of this form of branching was that some projects got a little too far out of sync with trunk and were difficult to merge. They were not impossible, just difficult. Perforce or Git might have made the process easier, but the fact remained that some projects were sufficiently incompatible that they stayed out of trunk for three-plus sprints and had difficulty merging back into trunk.

The Martin Fowler style of continuous integration, where you are constantly merging into trunk and using feature toggles, is a solution, but one our organization could not support. We had engineers who had worked in a Fowler-style environment or in Kanban structures and they loved it because of the constant flow of code getting into production. Lifelock had the hard stop of going through the inspection process with QA that was dictated by the CTO. We could not convince others that the functional testing from engineering, and later the Selenium testing from QA, could cover the majority of the inspection steps. Everyone agreed they wanted test automation; no-one believed we actually had it, and that it had been in use since 2009. Engineering could not convince the other groups that the middletier tests covered most of the inspection situations.

The positive side of Lifelock's branching methodology was that once you got into trunk, you knew your code was going out into production, as no-one wanted to revert a merge. It was too large and messy. Getting into trunk was sometimes the hard part. Which sprint was going out next was often horse-traded in multiple meetings between Project Managers, Product Owners, Vice Presidents and the CTO.

Engineering found that unless you had a Project Manager championing your changes and demanding their project was more important than any other Project Manager's, you had a low probability of getting your feature changes into trunk and hence production. This was a bad thing, as quality improvements were difficult to get out even if new products and features were not.

This limitation led to significant functionality being hidden in sprints or slid in as additional user stories by engineering in order to get those changes to production. Often it would be a subtask in an existing user story rather than its own user story which QA could focus on. It was unfortunate as it 'hid' important functionality and improvements and placed an extra burden on QA, who were generally good natured about the changes as they understood the importance of quality improvements.

Knowing what I do now, and how Lifelock operates, what could we do to improve the branching process? I was heavily involved in the original decision to branch, so I have to take responsibility for what we ended up with. Prior to branching, all Lifelock's code was done in trunk. Often a bad commit would stop development or prevent an artifact from being built and readied for production. Back then there was only really one project going at a time. When the Billing Deathmarch started we had code that was going to be incompatible with trunk for at least six months, so we branched. That practice morphed into the sprint branch structure when Lifelock went to an agile process.

In 2012 we started using feature toggles and they worked really well. The sales tax project was the best example of this, as it was in production for a month before finance wanted the sales tax functionality turned on at midnight of the first calendar day of July. When we flicked it on, everything worked as expected. I know we were way behind the curve there; Facebook and Flickr both have complex feature toggle functionality and can turn features on and off depending on location, IP, etc. A simple on/off toggle gave Lifelock what we needed. It is an excellent mechanism.

If I were to do the billing deathmarch again, I would put it behind a feature toggle and an adapter that could load whichever billing integration piece was required. Development would continue in trunk, but the new development work would be hidden behind the feature toggle, which could be flipped on and off depending on who needed it. This process is nothing remarkable. When prototyping against a couple of different third party APIs, one of our engineers did this instinctively. This is what Martin Fowler calls continuous integration: you are continuously integrating development code with production code. It would require a change of organizational mindset for Lifelock to start developing in that manner, but it is achievable.
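As a sketch, the toggle-plus-adapter idea might look like the following. All the names here (BillingGateway, FeatureToggles and the two implementations) are hypothetical, not Lifelock's actual classes; a real toggle would be backed by a database or a properties file so it could be flipped without a redeploy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The contract both the old and new billing integrations satisfy.
interface BillingGateway {
    String charge(String accountId);
}

class LegacyBilling implements BillingGateway {
    public String charge(String accountId) { return "legacy:" + accountId; }
}

class NewBilling implements BillingGateway {
    public String charge(String accountId) { return "new:" + accountId; }
}

// Minimal on/off registry; toggles default to off.
class FeatureToggles {
    private static final Map<String, Boolean> toggles = new ConcurrentHashMap<>();
    static void set(String name, boolean on) { toggles.put(name, on); }
    static boolean isOn(String name) { return toggles.getOrDefault(name, false); }
}

// The adapter chooses an implementation at call time, so both code
// paths live in trunk and the new one ships to production dark.
class BillingAdapter implements BillingGateway {
    private final BillingGateway legacy = new LegacyBilling();
    private final BillingGateway next = new NewBilling();

    public String charge(String accountId) {
        return FeatureToggles.isOn("new-billing")
                ? next.charge(accountId)
                : legacy.charge(accountId);
    }
}
```

The point of the shape is that flipping "new-billing" is the only production change needed to cut over, and flipping it back is the rollback.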

Unit Testing

When I started at Lifelock I was the new guy and had no idea what everything did. I started unit testing in order to learn the code base. Lifelock had complex rules for different promocodes to make sure that people couldn't fraudulently take advantage of partner relationships we had. I was unit testing one of these when I found a bug through exploratory unit testing. When I showed the issue to another engineer, a fix was made quickly in trunk and it went out with the next release a couple of days later.

That was the beginning of a long hard slog of improving quality through unit testing at Lifelock. We started with zero unit tests and three years later we had close to four thousand unit tests. In August of 2012 we had ten projects with one hundred percent unit test coverage according to Cobertura reports running from Jenkins.

One hundred percent unit test coverage is not necessary, especially as we used a lot of data transfer objects, which were incredibly simple. Even those, however, could be unit tested to ensure they did not cause runtime issues. A lot of our data transfer objects were sent through queues and topics, which meant they needed to be Serializable. Unit testing that a data transfer object was an instance of Serializable became a cheap way of avoiding a runtime issue when that data transfer object was passed through a queue for the first time.
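That cheap check is tiny in code. MemberDto below is a hypothetical stand-in for our DTOs; the instanceof assertion is the guard described above, and the round-trip variant is a slightly stronger version that also catches non-serializable fields nested inside the object.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical DTO of the kind we sent over queues and topics.
class MemberDto implements Serializable {
    private static final long serialVersionUID = 1L;
    String memberId;
}

class SerializableCheck {
    // The cheap guard: fail in the build, not the first time the DTO
    // hits a queue in production.
    static boolean isSerializable(Object o) {
        return o instanceof Serializable;
    }

    // Stronger variant: actually serialize the object, which also
    // fails if any field inside the DTO is not serializable.
    static int serializedSize(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new IllegalStateException("DTO failed to serialize", e);
        }
    }
}
```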

One of the reasons I liked our engineering team to aim for one hundred percent was the element of completeness, professionalism and discipline. It meant the engineers had combed the code to ensure everything that could be tested was, and that mentality had a follow-on effect on code quality. Cobertura is slightly forgiving, as it averages all the different metrics and rounds up. You can have red lines in your code, where a line is not covered, but as long as there are only a few, Cobertura will give you the big one hundred anyway, which is something to be aware of.

I scripted up some reports in Jenkins which noted the number of unit tests in Subversion's trunk. The job went through all the projects for the common libraries, the front end projects and the middleware projects. It then tallied up the unit test code coverage for each project. I would put these into our wiki and note the plus or minus change in percentage from month to month, with negative changes noted in red. This kept an eye on any project that was introducing new code without adding unit tests; it also showed which projects were maintaining high code coverage.

The code coverage report was originally only on trunk. I started adding other projects, such as our product configuration, sales tax and encryption projects, despite their being in feature branches. The reason was that these projects had high code coverage and high numbers of unit tests while they were in development, and they were meeting their code complete sprint dates as well. I thought it was a good example of the speed of development that unit testing enabled. Once they hit trunk it was a simple task to change over their Subversion URL and keep them in the report as production projects.

EJB3 was pretty nice in that the container took care of injecting the appropriate bean when it was referenced in another bean. We took advantage of that by mocking the injected bean and then flipping that reference over to accessible in the unit test setup via reflection, so that the mocked object could replace it before being changed back again. One of our talented engineers created a little utility to do that. Along with JMock, that utility became the standard mechanism for unit testing bean code.
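A minimal sketch of that utility follows. The bean pair (PricingBean, RateLookup) is hypothetical; in the real code the dependency would carry an @EJB annotation and be injected by the container rather than constructed inline. The swap method returns the original value so a test teardown can restore it.

```java
import java.lang.reflect.Field;

// Hypothetical injected dependency.
class RateLookup {
    String rateFor(String plan) { return "live-rate"; }
}

// Hypothetical bean under test, with a private injected reference.
class PricingBean {
    private RateLookup rateLookup = new RateLookup();
    String quote(String plan) { return rateLookup.rateFor(plan); }
}

// The reflection utility: open the private field, swap in the mock,
// and hand back the original so the test can put it back afterwards.
class FieldSwapper {
    static Object swap(Object target, String fieldName, Object replacement) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);          // make the private field writable
            Object original = field.get(target);
            field.set(target, replacement);
            return original;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not swap field " + fieldName, e);
        }
    }
}
```

In our tests the replacement was a JMock mock; here a plain subclass stands in for it.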

When you unit test, you write unit testable code. There is a massive difference between what people write when they don't unit test and when they do. At Lifelock we had trouble getting the culture of unit testing in. Some engineers did not write unit testable code and it showed. When someone who did unit test came to that untested code, it was a nightmare.

Static classes were the big sore thumb. We had one nasty class from the pre-unit testing days called the TypeMap. It was a series of static classes and methods that used reflection. We got around it by creating an interface and then an implementation class that called the static methods. This meant we were not stopped by being unable to mock the TypeMap itself.
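The shape of that workaround is simple. TypeMap here is a hypothetical stand-in (the real one used reflection); the interface is the seam that unit tests can mock, and the delegate is the only code that touches the static method.

```java
// Hypothetical stand-in for the static, unmockable TypeMap.
class TypeMap {
    static String lookup(String key) {
        return "mapped:" + key;   // imagine reflection behind this in the real class
    }
}

// The seam: calling code depends on this interface, which any unit
// test can mock or stub out.
interface TypeLookup {
    String lookup(String key);
}

// The production implementation simply delegates to the static
// method, so nothing else in the code base touches TypeMap directly.
class StaticTypeLookup implements TypeLookup {
    public String lookup(String key) { return TypeMap.lookup(key); }
}
```

With this in place, tests of code that used the TypeMap no longer needed the static class at all; they substituted their own TypeLookup.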

Another issue was long methods. Engineers who don't unit test tend to write methods that go on and on forever. One example of a long method came from a third party system we had to deal with. They had a method with seventy-five if statements, sixteen else statements, eight for statements, three try-catch statements and eight while statements. That system was covered by functional tests, but unmanageable code like that was still in production. This is why you have to both unit test and functional test. One or the other is not enough.

Functional Testing

Functional tests are also known as integration tests, system tests, regression tests, acceptance tests, etc. Functional tests run against the deployed system at runtime. We did this because J2EE containers are notoriously complex, and even though they adhere to the spec there are still vagaries that can bite you unexpectedly. The other benefit is that functional testing exercises the system like an end user, so you can quickly state whether the system is 'working' or not.

I detest the term 'not working' as it can mean anything. Not working can mean nothing is responding in the system; it can mean I tried one little thing, the result wasn't what I expected, so I gave up; or it can mean a 98% pass rate with one or two minor bugs, and hence 'not working'. When a VP or someone else hears 'not working' in a scrum blocker, sirens go off in their head. Functional tests are very good for making 'not working' empirical. Broad statements like that can be qualified very quickly with functional testing.

One of the issues we hit constantly with testing was that engineers were mixing up what was a unit test and what was a functional test. The simplest way I could describe it was: if you are running tests and pull the CAT5 cable out of the back of your machine and the tests fail, then they are functional tests. If you need a network connection for a test to pass, it is not a unit test.

A mechanism we used to stop functional tests being mixed in with unit tests was to limit the connectivity of the servers that Jenkins ran the continuous integration builds on. If a test tries to make a network connection and is denied, it fails. We saw this occur on one feature branch; the problem was that the connection hung, which hung the build as well, and more and more builds piled up behind it. The end result was that the functional test got removed from that branch, but at a bit of an initial cost.

We put our functional tests into their own projects for the Front End and Middle Tier. One of the benefits of running functional tests against the middleware was that those APIs did not change that often. The middletier tests were particularly good for testing a code base on a new environment and for double checking a merge on localhost before checking it all in.

Our functional tests for the middletier would change the state of the system and then go back and check that the system's state was now what we expected. This included double checking the data in our database and in third party systems. When you work in this manner you quickly realize that you need to make the entire system testable. Not just the code around unit tests, but the public APIs as well need to be designed so that the system and its state are testable at any moment.
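The shape of such a test can be sketched as below. In the real tests both sides were a deployed environment and its database; here MemberService and its state dump are in-memory fakes with hypothetical names, so only the change-state-then-verify-state pattern is shown.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the deployed system under test.
class MemberService {
    private final Map<String, String> enrollments = new HashMap<>();

    // The public API the test drives to change the system's state.
    void enroll(String memberId, String promoCode) {
        enrollments.put(memberId, promoCode);
    }

    // Stands in for the secure state-inspection API described in the
    // text: a read path that exists so state is verifiable at any moment.
    Map<String, String> stateDump() {
        return new HashMap<>(enrollments);
    }
}

class EnrollmentFunctionalTest {
    // Change the state, then go back and check it is what we expected.
    static boolean enrollmentLands(MemberService service, String memberId, String promo) {
        service.enroll(memberId, promo);
        return promo.equals(service.stateDump().get(memberId));
    }
}
```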

Consequently we added secure middleware APIs that could tell us the state of the system beyond what our front end customers and third party consumers required. The change of thinking here is that testability becomes a customer. APIs and functionality expand to include the testable nature of the system as well.

The functional tests also expanded into tools. As we made the system more and more interrogatable and testable, we found that many of the tools we had written were of value to other groups as well. We exposed these through numerous dashboards, which helped explain the system beyond the major interfaces through which our customers, partners and member services agents interacted with it.

We created a user interface, written in Wicket, to run the functional tests. It never really caught on, though. There was some clever engineering and cool use of the JUnit API in that application. The functional test package names and test names became clickable in the user interface and fired off the tests at the package and class level. We used Wicket's ajax calls and JUnit's callbacks to put the response in a nice web interface that went green or red depending on the result at the test and package level. It is a shame it never caught on, as it looked great.

We also added the ability to run the middletier functional tests through Bamboo and Jenkins. During the period when we were using Bamboo for continuous integration, another engineer and I worked on the functional tests to make them lighter so we could run small groups of them hourly and daily on our development environments. We had hoped we could have them running constantly, but by that time our development environments had degraded to a level of instability that gave our functional tests too many failures. System instability is better monitored with Nagios than with the overhead of a functional test.

We started adding information to our EARs and WARs to make them interrogatable via a URL, so we could get the branch location and Maven version from them. We used this to fetch the correct set of functional tests for that deployment, and hence environment. This was put together in Jenkins so you could run a Jenkins job with the environment as a drop-down and choose a Maven profile which tested specific functionality. The scripting would then fetch the correct functional tests and run them through JUnit.

We also managed to create a small subset of functional tests that became a Performance Test which ran against production every two hours and emailed out the response times to a mailing list. It became a good mechanism to get the feel of our production systems and how they were responding. These performance tests were good for determining at a glance whether an existing environment was degrading and whether new environments were comparable in performance to the existing ones.

Functional tests were very important at Lifelock for shaking out environmental issues. We had too many non-production environments that needed to be supported, and we never had enough middleware engineers to support all the WebLogic, service bus and Tomcat servers that these non-production environments entailed. Running a quick functional test, or a suite of them, against an environment was a very accurate way of determining any issues with the environment or the deployments on it.

Improving the quality of the functional testing was as easy as having someone from QA sit with you and go over what the functional test does. When you have to explain it to someone, and are checking the change of state in the database and in third party systems, you end up having to re-justify why data is going where it is and how that matches the user stories and other requirements. It works really well in improving what the functional tests are testing.

Functional testing is as important as unit testing. You cannot have one without the other. Functional tests were used constantly by engineering at Lifelock to ensure that the software we developed did what engineering said it did. Using the functional tests this way meant we were able to say with certainty that both old and new features worked as intended before Service Delivery or QA received the artifact.

Code Reviews

We tried several mechanisms for doing code reviews and never really found a simple, maintainable and sustainable way to achieve it. The first thing we tried as a project team was loading all new code up on a projector in a conference room and going over it as a group. This was good, but it meant we needed the time away as a group. In the case of the death march we got inundated with work and were lacking time, so this process kind of drifted away and we relied on unit and functional testing to ensure quality.

When the Atlassian suite was purchased we started using Crucible. One of our principal engineers used Crucible to manage the quality of the code coming from our offshore group when they started the batch migration. He used that as his main point of interaction with the offshore engineers and their code. It did improve quality but at great cost to his productivity. He is a talented engineer and most of his time went on code reviews.

One of the things we found with the Crucible code reviews was that the same errors were being made over and over, so we started documenting these on the wiki. When an issue popped up in a code review we would post the wiki page into the Crucible comment. I am not really a fan of Crucible. It only gives a small view of the code and it lacks the intimacy of a side by side code review. When you leave a comment on Crucible, only that comment gets fixed so the review can pass. It doesn't start any dialog or introspection on the code and why it was written a certain way.

The best code reviews are not through Crucible but with someone sitting next to you while you review the code. We didn't do pair programming, that was by choice, but having someone next to you to review your code can be quite insightful. At the end of one sprint I had another engineer review my code, which was complete as far as I was concerned. The code satisfied the requirements and had unit test and functional test coverage, but once we went over it we managed to get the number of unit tests down, and we improved the production code by reducing the number of methods and making it more obvious what we did. It can be valuable and fun to have another person - no matter their level of engineering skill - look at your code with you.

Unfortunately this is not always possible. Lifelock had an engineering group of about thirty engineers spread across three offices in Arizona, California and India. Doing side by side reviews with the Californian or Indian engineers was not really a possibility, even with modern telecommunications technology and software. Crucible is a poor replacement for side by side code reviews, but it was all we had for reviewing code across groups. The reality is that proximity matters, and I don't see any way to get around that.

Software engineering tends to have a short-term view, which is propagated by the myth of the 'hacker'. Code is quite remarkable in that it can get past QA and into production, and work well enough that a business is sustained and customers are happy, even if the code itself is not that good. A truism of software engineering is that if code gets into production it is going to hang around for a long time. So you have to write your software not just to get past QA next week, but also for the poor schlub of a software engineer who is going to be wading through your code five, ten and even twenty years from now.

That might seem silly, but there are still mainframe systems from the 1970s in production today. I know software that I developed in the early 2000s is still in production ten years later. Consequently, I try to make the code review ensure that the software and its comments are sufficient that I could look at the code in ten years' time and know what it does or is supposed to do. Code reviews are not just about code.

JIRA and Agile

Because we had the entire Atlassian Suite, when we moved to Agile we also adopted their Greenhopper software. This is a plugin for JIRA that can reorganize tickets into something resembling a backlog and sprint. Atlassian's software is not that great and shows its 1990s and early-aughts history. For instance, traveling back and forth in the workflow often gets stuck with "form resubmission" warnings. We made do with it despite there being better software to support sprint workflows.

Often engineers put hours on a User Story as one big lump and assume that it will cover the feature development work, the unit testing, the functional testing, the documentation, etc. They then find that they have run out of time to get the feature complete, so they just do the feature coding work and forget about the rest.

We discovered that adding a series of subtasks under a user story and explicitly adding sub-tasks for unit testing, functional testing, wiki documentation, etc meant that you had a Scrum Master nee Project Manager asking you each morning if you had done your unit testing. It worked quite well as the hours we added as sub-tasks were always accepted - never questioned - and the unit testing, functional testing and documentation all became part of the normal user story completion tasks. We also found that some of the Scrum Masters who previously weren't aware of these tasks started demanding them - even defending them. These subtasks became something to be expected. Which was great.

It wasn't always that way. Agile was introduced by the VP of Engineering, largely in response to the billing death march and the difficulty of getting new code and features into production. It was probably the best thing that could have been done. Different groups had been trying to organize their projects in an agile manner, doing scrums, doing stand ups, etc. The truth is, unless there is buy-in from upper management and the other groups involved in projects, such as Project Management and Product Management, it does not work. A second necessary component is a weekly meeting or document which allows everyone involved to see the state of every sprint, every user story and the burndown charts. This simple layout enables everyone in the organization to know how everyone else is doing.

When the billing death march started I used to get stuck in long days of meetings where a Use Case would be hammered out between us and the third party vendor as to what the requirements were. They were huge lists of text with multiple indents that went on for far too long. The meetings required to build these large use cases were exhausting for everyone.

As the billing project got going we tried to impose a scrum-style morning meeting over it, but the timelines were being blown so often, and there was so much middle management between the engineers and upper management, that the project just ended in stasis. The second issue we faced was that when we did get going and started producing quality code at a good velocity, the vendor system we were integrating with was suffering from poor performance and low quality code, which made our functional testing next to useless. Once we got on top of our own system, that was the story of the project: slow times and low quality.

The changeover to Agile occurred during the last part of the billing death march, when it was being forced into production with the direct involvement of the VP of Engineering. The billing project didn't go agile though; it was still in the struggle of making the system production ready, if not production quality. During the agile introduction Lifelock brought in a Scrum Coach who took everyone in the organization through what going agile meant. Engineering was all for it, as we were sick of heavy requirements that were next to useless and then getting stuck in crunch mode until whatever feature hit its arbitrarily set end date.

The agile manifesto states:

• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan

So did Lifelock's adoption of Agile achieve these goals? Individuals and interactions over processes and tools? I would argue no. Agile was laid like a piece of vellum over the existing organization, processes and structures, with new lines drawn between the old organizational responsibilities and the new ones. There was the promise for a while that engineers would be scrum masters and the product owners would be the end users, but it didn't happen.

Project managers took the scrum master role and the product owners remained specialists like business analysts in older structures. There are places for project managers and product owners, but when someone is working on a billing system, there is no need. The engineers as scrum masters can talk directly to the product owners in finance and billing operations.

Working software over comprehensive documentation we achieved, but that was not due to agile. It was engineering that put in place the rigor of unit and functional testing. It was not a result of the agile process or of project managers wanting it in the sprints. We also documented our projects and modules heavily in the wiki. We did this as part of the sprint so that other groups would know what the batch jobs and the like did without having to look at the code itself.

Customer collaboration over contract negotiation? Because engineers were not scrum masters and product owners were not end users, I have no idea. User stories came to engineering in sprints as a priori givens. They were essentially immutable and we had to take them at face value as being what the end user wanted. If we did question a user story, a product owner would try to give an answer, and if they couldn't they would go back to the end user. I would have loved to have the end user in some of our sprints, but organizationally we did not have the will to do it.

Lifelock's adoption of agile did stop the big ugly contract negotiation pieces. When the billing death march started I got stuck in all day meetings where big nasty happy path use cases were hammered out. This was hopelessly inefficient and the adoption of user stories stopped this style of requirement gathering.

The final agile quality, responding to change over following a plan, did happen. One good thing about the adoption of sprints and user stories was that a user story being late or taking longer than thought was never questioned, and it was ok to push a user story out into the next sprint if it wasn't feasible. The upper management of Products and Technology should be proud of this attitude, as without it death marches happen. Having a realistic approach to how accurately complex systems can be estimated, and to delivering working software, was the greatest benefit of adopting agile.

One thing I took out of the experience was that agile will not get into a company from the bottom up. We tried and failed. Agile as a process did not work in Lifelock until the entire organization accepted it. I am left with the observation that it has to be top down. Edward Yourdon claimed in his book about death marches that if you are in one then leave. It can probably be expanded further: if you are working somewhere that does not do agile, then you will be involved in a death march sooner or later. It might be wiser to leave and work somewhere else under an agile process.

WSDLs

The idea of service oriented architecture is alluring. It offers the promise of a very clean architectural separation of consumer and provider. The reality is that very few services are static; most are changing constantly with each sprint. While we may have decoupled the Front End and Middleware by adopting this approach, we ended up tightly coupling the build environment through Maven to handle the constant change and volatility. We built all our library jars, EJB modules, EARs and front end WARs as part of the same compile and artifact generation process.

Everyone agrees that versioning a WSDL and its services is a good idea, but there is no really good way to do it. You can use UDDI, but that is overly complex and another piece of functionality in the service bus that you cannot unit test. You also have to convince the customers of the web service to go look up a registry first. Considering many customers and partners build the XML as a string with tokenization and send it over HTTP, rather than building a web service client application, a registry lookup is far too complex for most.

The other mechanism is to version through the namespace or the package name. With annotations in Java you can usually do this quickly and effectively with @WebService attributes. This puts the generated client code under a new package name; however, you are left supporting backwards compatible methods and legacy code in your application. I can recall integrating with NetSuite many years ago, and this was how they dealt with the versioning issue.

One possibility we tried was adding a version as an argument to the webmethod, so you could specify something like SomeService#get(id, version), where the version can be iterated. This still leaves you stuck with the return contract, or XSD, being the same no matter what the version. Usually you need versioning because your XSDs are either asking for more information or returning more information, which makes this kind of versioning limited in usefulness.
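Stripped of the SOAP plumbing, the versioned-argument idea looks like the following. SomeService and its two versions are purely illustrative, not our actual WSDL; the point is that the implementation can branch internally per version, but the return contract stays fixed, which is the limitation described above.

```java
// Illustrative versioned service contract.
interface SomeService {
    String get(String id, int version);
}

class SomeServiceImpl implements SomeService {
    public String get(String id, int version) {
        switch (version) {
            case 1:
                return "record:" + id;
            case 2:
                // Version 2 can pack richer content into the response,
                // but the contract (a single string here, one XSD in
                // the real thing) is the same for every version.
                return "record:" + id + ";audit=true";
            default:
                throw new IllegalArgumentException("unsupported version: " + version);
        }
    }
}
```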

Another possibility is to do what Salesforce has done and make an abstract webservice object, the SObject, that can be interacted with. The interface to the SObject does not change, but you lose some of the advantages of the SOAP protocol and generated clients, namely that the objects you are integrating with are strongly typed. With the SObject style of structure you end up passing in a lot of strings and requesting a lot of string names. I found that the SObject format required you to have the SOQL Explorer up all the time so you could correctly identify what you needed.

With the Salesforce-type approach, complexity gets handed off to the developers at either end, and the compiled nature of the generated client code is not as useful. You also end up with public static final strings throughout your code that match some magic string in the system which will return the appropriate type or functionality. One thing you cannot guarantee is the technical expertise of your partners. Often there is one person working in Perl, or PHP, trying to make sense of "this WSDL junk", who ends up just using strings and regular expressions to build the request and response packets.

Static or unchanging WSDLs are sufficient as long as your business model does not change. Contracts between customers and partners are always a reflection of the current business model, and not all the volatile parts of that business model can be abstracted away. I can recall being involved in a government job for a state highway system. The project included numerous deliverables of designs, testing procedures and so on before any coding was done. That department could do that because the business model for providing highway services has been pretty static for the last thirty years or so, even with new technology.

Not all problems are technological in origin. One way of dealing with customers' or partners' demands for their business case to be different, which would cause one WSDL to split into twenty different versions, is simply to resist it. We had one project where the Product Owner didn't allow the different partners to put their own permutation on what they wanted. What we provided was a uniform series of web service contracts across all partners. This is a valid technique for managing WSDL versioning: only have one version, and have the contracts sufficiently abstracted that they cover all cases for the business need. It requires a lot of people skills rather than technological skills.

If you do expose a WSDL to a customer or partner there are two things you have to do: first, generate the client code for the customer; second, write a user guide to remove any ambiguity about how to interact with that generated client code.

We found that when a WSDL was palmed off on us there was no guarantee it would compile. Most companies always had an example or two, but they were in Ant and for an older WSDL, or the example was in PHP or some other scripting language. I had one experience with a third party we were looking to integrate with which had examples on GitHub, but I could not get their WSDL to generate a client jar cleanly with the maven plugins for axis, axis2 or JAX-WS. In the end, I had to use the maven ant plugin to generate the code in ant, and then compile it over in maven.

At that time, another developer and I were trying out competitors for a particular integration. Of the two companies, one supplied a pre-generated client jar while the other supplied just a WSDL. The other engineer had completed a prototype integration with the company that supplied the generated client jar while I was still struggling to generate a client jar for its competitor. The irony was that the company which supplied only a WSDL probably had the better API, but we had to get through the frustration of generating the client jar before we even got to that stage. It is just easier to generate the client jar for your future customers and partners and avoid that initial impression altogether. It is not hard to make generating client jars part of the build and artifact process.
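As a sketch of making client generation part of the build, the JAX-WS maven plugin can run wsimport against the WSDL at compile time; the WSDL URL, package name and plugin version here are placeholders, not the actual Lifelock configuration.

```xml
<!-- Illustrative only: generate client code from the WSDL during the
     build, so the resulting client jar can be published to Nexus and
     handed to partners instead of a bare WSDL. -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>jaxws-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>wsimport</goal>
      </goals>
      <configuration>
        <wsdlUrls>
          <wsdlUrl>https://example.com/services/SomeService?WSDL</wsdlUrl>
        </wsdlUrls>
        <packageName>com.example.someservice.client</packageName>
      </configuration>
    </execution>
  </executions>
</plugin>
```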

The second thing that is necessary is a User Guide. This should document what every field is and what every field does in explicit detail. Some of the best documentation you will come across is from Acquiring Banks who expose APIs to auth, capture and refund money. When money is involved nobody likes ambiguity or confusion; if you are out by even a penny, people get mad. Consequently the Acquiring Bank APIs have excellent documentation. As part of the artifact process it is a good idea to make the User Guides part of the artifact too, created each time with the corresponding client jar.

Given a choice, and knowing what I know now, would I use SOAP/XML or REST/JSON? I have to say that JSON is superior as a document transport format. Using SOAP requires using all the tools that go with it, such as creating client jars and service factories. JSON is quickly readable and, if the content structure is static, can be converted quickly into an intuitive POJO with libraries like Gson or Jackson.

SOAP has a lot of overhead to it and is restrictive in how you can interact with it, whereas JSON is far more flexible. For instance, when we exposed the meta-data about the artifacts that were being deployed we chose JSON because it was easy to query and parse with bash scripts, python scripts and java code. With SOAP you have to have the client, then the service factory, then the module to return the data structure, which is more complexity than a bash script cares for. You could just get the SOAP XML document directly via HTTP, I suppose, but why bother when JSON doesn't require the same overhead to interact with it.
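For illustration, a deploy-metadata document of that kind might look like the following; the field names are hypothetical, not the actual Lifelock schema. Anything from a bash one-liner with curl and grep up to a full Java client can consume it.

```json
{
  "artifact": "billing-ear",
  "version": "3.2.1",
  "svnRevision": "4512",
  "branch": "sprint-42",
  "buildTime": "2012-05-01T10:15:00Z"
}
```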

Courtesy of Java annotations and the J2EE spec it was simple to create a webservice but we still had to generate the JAX-WS java class files and bundle them into the WAR file so they could be used in the EAR file. In our build process this was a weak link. It took a long time and we often had false positives of failed builds when the JAX-WS plugin got confused over the /var/tmp directory.

Interacting with JSON is also less sensitive to changes in the contract. If there had been a change in the WSDL contract then, depending on what was calling the WSDL, the client code would often refuse to start and fail at runtime. Deploying the wrong front end code to an environment which had an incompatible middletier would often give confusing errors. Another issue was that our functional tests used the same WSDLs the front end did, and you could not run functional tests against a middletier with WSDLs of a different contract. Again JSON has it over SOAP/XML in this area, but that advantage can be negated if you generate client code from WADLs.

REST is no panacea though. It has the same versioning issues as SOAP does. It appears that most people version through the URI which is not much different to SOAP being versioned through the namespace. Neither are a particularly good solution in that instance but they obviously work well enough. REST is more a convention so you can guess what the URI will be for a CRUD operation. SOAP is more laborious in this area but there is no real clear cut winner in those situations.
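For illustration, with hypothetical paths and namespace, the two versioning conventions end up looking much the same:

```
# REST: version carried in the URI
GET /api/v2/members/42

# SOAP: version carried in the namespace
targetNamespace="http://example.com/services/member/v2"
```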

Maven

Ant is definitely more flexible than maven, but as a build tool and technology maven is far superior. The simple way that maven works out of the box with Nexus is also really nice, not just for engineering but also for service delivery.

We mavenized our complete system from ant over a period of about a month. It was mainly me doing it, though with help from others. We managed to slide the new build system in with minimal disruption. We found most of the problematic ears in the dev integration environments; since we were doing skinny ears, occasionally we would find missing jars when running functional tests over them, which would lead to a runtime issue.

We could use maven plugins out of the box, but we ended up having to modify several plugins to get the artifacts to come out the way we wanted them. Fortunately the plugins were opensource so we could modify them to our idiosyncratic needs and build them out of Jenkins when we needed another change in them. There was only one plugin we created that did not exist in the maven plugin ecosystem: the OSB deployment plugin, which we ported across from the Ant task we had created. Most other plugins we used were readily available and did not need any modification.

We used the skinny ear methodology of building the ears. All the dependencies were marked as provided in the parent pom's dependencyManagement. The only dependencies marked as compile were in the EAR's pom.
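A sketch of what that looks like in the poms; the artifact chosen here is just an example, not one of the actual Lifelock dependencies.

```xml
<!-- Parent pom: dependencies default to "provided", so modules compile
     against them without packaging them. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
      <version>2.6</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- EAR pom: only here is the jar re-scoped to "compile", so it is
     actually packaged into APP-INF/lib. -->
<dependencies>
  <dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <scope>compile</scope>
  </dependency>
</dependencies>
```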

We had multiple levels of parent poms. The top-level parent defined the nexus location, the scm plugin, universal plugins and reporting such as cobertura and pmd, as well as properties that were relevant to all the projects that inherited from it. The front end and middletier had their own poms, which were where the dependencyManagement was defined. The wars and ears had their own poms that inherited from these and added functionality specific to their needs.

We used the scm subversion revision output to populate the Weblogic-Application-Version in the manifest. Later on we added the maven version and the svn revision number together to make a unique string for the Weblogic-Application-Version. This helped middleware engineering and service delivery identify what was deployed where.

The manifest became the place where we compiled into the artifacts all the meta-data about where the artifact came from: which branch, which sprint, what the last revision number was and what versions of supporting jars and modules were compiled into the jar. The maven properties were used to populate this information. We also exposed the manifest information through the ear as json so we could interrogate the ear and know what it was. We had a Jenkins job which queried for this json across all our environments, identifying which ears were deployed where.
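A sketch of the idea using only the JDK's jar APIs; the Weblogic-Application-Version entry is the one described above, while the version values and the Build-Branch entry are hypothetical.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Illustrative sketch: build the kind of manifest described above and
// read it back, the way a deployment-inventory job might.
public class ManifestMetadata {

    public static Manifest buildManifest(String mavenVersion, String svnRevision) {
        Manifest mf = new Manifest();
        Attributes attrs = mf.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        // Maven version + svn revision gives a unique deployable version string.
        attrs.putValue("Weblogic-Application-Version", mavenVersion + "-r" + svnRevision);
        attrs.putValue("Build-Branch", "sprint-42"); // hypothetical entry
        return mf;
    }

    public static void main(String[] args) {
        Manifest mf = buildManifest("3.2.1", "4512");
        System.out.println(
            mf.getMainAttributes().getValue("Weblogic-Application-Version"));
    }
}
```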

As we lumped more and more into our build branches, the build times for a branch got slower and slower. By 2012 we had the automation building the front end and middletier, and we also had a project that was bringing our batch into the one build process as well.

The main reason for doing this was so we could build the supporting jars with the same project.version into the different projects that the front end, middletier and batch needed. This took the guess work out of it for service delivery. We did this as some of our jars were extremely volatile and were always having new work done in them.

We discovered that maven had a switch where it would work out what could be compiled in parallel on a multicore system. Since we had so many utility jars and ejb modules we were able to take advantage of this switch. Our compile times dropped from twenty to thirty minutes, depending on how heavily the jenkins slaves were being hit, down to ten minutes.
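The switch in question is Maven 3's parallel-build flag; for example:

```
# Let Maven compute a reactor-safe parallel build: one thread per CPU core.
mvn -T 1C clean install
```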

With the addition of batch joining the others we will probably need to make our build script more dynamic and only build the parts of the system that have had changes made to them; currently it builds everything, every time. At Lifelock it is engineering that does all the scripting to support this kind of work. It is also engineering that checks all the continuous integration jobs and spends a lot of time in Jenkins.

My morning routine was checking over the continuous integration jobs and sending out reminders if any branches were in a broken state. Unfortunately the JAX-WS plugin had issues with clearing out the /var/tmp directory on the Jenkins slaves and on local builds on our macs. We got a lot of false positives that way where a branch broke not because of the code, but because of how a plugin interacted with the build machine.

Engineering worked heavily with infrastructure to make the build process for servers automated as well. The infrastructure guys wanted the ability to deploy the applications, such as the wars and ears, as rpms when they built their servers out of Satellite.

There is a maven rpm plugin and we got it working to the point where we could build rpms. Unfortunately I left Lifelock before that work was completed and I never got to see the rpms being pulled down from Nexus and deployed on a fresh virtual machine. One of the issues with Weblogic is that it is 1990s technology. Tomcat is far more robust and flexible. For instance, when we tried rpm'ing our artifacts we did so with the war application first.

The Weblogic container does everything through mbeans, which require it to be running for any change of state to be recognized, whereas the war rpms and their configuration rpms can just be pulled down and, when tomcat is started up, voilà, it all works. Weblogic is not simple. Where do you dump the ear rpm? Worse, Weblogic configuration files get changed when you start up Weblogic. Now that Tomee exists, I don't see any real reason to use a traditional J2EE container like weblogic, websphere, glassfish or jboss.

Remote Interfaces

In J2EE the local interfaces require caller and callee to be in the same JVM. This leads to putting all your ears into the one admin server rather than splitting up the different containers by usage. We mainly used remote interfaces for functional testing, some client access by the batch applications and for talking across ears.

One problem we never really solved was that the remote interfaces got versioned because we used the Weblogic-Application-Version entry in the manifest of the EAR. We did this for the middleware engineering group and service delivery. Middleware wanted to be able to do hot swaps back and forth in production and service delivery liked being able to look into the console and see what had been deployed.

From what we could work out, the Weblogic-Application-Version put that version into the JNDI name for the remote interface. So if an external ear was referencing a versioned JNDI lookup for the remote interface, and the ear that exposed it was no longer available, then the reference would go stale and the referencing ear would start throwing runtime errors.

Webservices between ears are probably safer for that reason, but putting an @EJB annotation with a mappedName on a field was so much easier. We ended up getting around it by bouncing the managed servers after the ears had been deployed so they all came up together and attached to the correct remote interfaces.

One mechanism we came up with to get around this was pluggable EJB modules. We started putting small amounts of work into an EJB module and used maven to pull it into an ear as an internal module so that we could use the local interfaces rather than having to use remote ones. One benefit we got from this approach was being able to share these EJB modules across ears.

Some batch jobs attached as clients to the remote interfaces exposed in the middleware. Spring's injection mechanism made it easy to wire up a remote interface into a batch job even if it was in XML. Lifelock's production middleware systems were not high performers when it came to throughput. Batch relies on getting as much data through as quickly as possible, so only operations which did not impact the middleware system heavily were done via remote lookup.

A major consumer of the remote interfaces was the functional tests. We used these to interrogate the state of the system after a normal business function had been performed. Our front end systems interacted with our middleware through webservices, so using remote interfaces meant that the methods which pulled back the state of the system were hidden from the front end applications. Their contract with the system was through the exposed webservices only.

The functional tests also comprised a lot of tools. Often we bled over from test to tool, as it was easy to correct or remediate the system when it was in a state that was causing issues. A common one was when the data between Salesforce and the main Lifelock database were out of sync. We had tools based around remote interfaces interacting with the system that could kick off the events that would correct the data mismatch. We ended up putting little UIs over these kinds of tools so others could use them.

EARs

The enterprise archive file is a glorified zip file; same as the jar and war file. The EAR contains jar files that have session, messaging and entity beans as well as normal library jar files and war files. We tended to split ears by singular functionality. While some ears were untouched once that functionality got into production, others were touched in every sprint by just about every project. Volatility was not uniform across the ears.

Originally the ears were built with ant but we later migrated to maven to build them. We adopted the skinny ear paradigm of building ears: all our dependencies were marked as provided in the parent poms and only in the ear pom were the necessary dependencies marked as compile. This convention was not always followed, and sometimes clashing jars would end up in the APP-INF/lib.

Maven always carries the problem that even when you are careful, jars that you don't want will end up in the classpath of the ear or war file. Fortunately you can exclude sub-dependencies but it can be a painful and frustrating series of steps to isolate the jar that is causing the ear not to deploy or run. SLF4J is a good example of this. Marking dependencies as compile can lead to conflicting logging libraries being packaged into APP-INF/lib.
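When a transitive jar does cause that kind of clash, the usual fix is an exclusion where the dependency is declared; the coordinates here are placeholders for whatever client jar is dragging the unwanted logging binding in.

```xml
<!-- Illustrative: keep a transitive logging binding out of APP-INF/lib
     by excluding it on the dependency that pulls it in. -->
<dependency>
  <groupId>com.example</groupId>
  <artifactId>some-client</artifactId>
  <version>1.0</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```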

Our client jars that we used for internal and external consumption of WSDLs were in a mix of axis, axis2 and JAX-WS, depending on who did the generation of the client code and whether the WSDL client code was generated first in axis2 or JAX-WS. Getting the namespaces lined up can be frustrating, and it is such a relief just to get a client generating at all that any convention for which mechanism should generate the client jar is usually forgotten.

We had one issue where two client jars were used in the same ear: one was from an axis generated client and the other from an axis2 generated client. The ear compiled and was packaged without issues, and it even ran happily on the admin servers, but refused to deploy on a cluster. We managed to isolate it to a clash in the APP-INF/lib between the axis and axis2 dependency jars. We were careful about this kind of mixing in the future.

The simplest structure for an ear is one EJB module, one WAR file and maybe a client jar with the remote interfaces and the data transfer objects. When we started migrating older ears and placing new functionality in them we used multiple WAR files inside the ear to expose new functionality while keeping the old functionality present. We could have versioned the webservices in the old WAR context, but some of the legacy code was pretty poor and was designed for issues that we no longer had. It was easier to migrate the functionality from one WAR file to another inside the ear.

We started with one EJB module in each ear, but by the time we started splitting more and more functionality out into pluggable EJB modules the ears contained multiple EJB modules. For instance one ear that did a lot of heavy lifting had specific EJB modules to support smaller chunks of functionality such as encryption, configuration and taxation.

As we spent more and more time with the ear structure we started using its ability to support multiple jars, multiple EJB modules and multiple wars. Using maven to build the ear made this approach quite simple and without quality issues.

Weblogic

When Lifelock decided to do the Renaissance project they essentially chose an Oracle stack. The middleware was Weblogic and the back end system was an Oracle database. The front end was done in .Net which seemed odd. Normally it is a good idea to make your whole stack the same technology so engineers can move up and down the stack as needed.

The J2EE container came out of the 1990s era of technology where you bought several multi-million dollar Sun servers with sixteen CPUs and tons of memory. To make up for having one big server, the idea was you would run multiple JVMs on this big iron, which supported admin and managed servers. This was done for resiliency: lose one JVM and the services the machine was exposing did not go down.

Nowadays you don't need multiple JVMs on one big server as vendors like VMWare made it easy to make multiple virtual machines that mimicked an entire server. So you could spool up multiple VMs and put a single JVM on each. Even better, if the managed servers were running hot, you just gave them more CPU and memory from the VM configuration user interface.

With cloud infrastructure the VM concept has been taken further and horizontal elasticity is being done automatically and without an admin having to go in and change things. Even more amazing, infrastructure is being put behind an API with the likes of JClouds and the different toolkits that are coming from Amazon. Infrastructure is now a software engineering problem and consequently has to go into continuous integration, have unit tests, have functional tests, etc.

J2EE was also a response to CORBA and DCOM. These were distributed systems designed for the purpose of having multiple terminals interact with a system that spanned multiple data centers. I worked on a system that was grounded in CORBA. It had been designed with user interaction being done through a Java user interface; if I remember correctly it was AWT.

The internet had overtaken this system and a browser based user interface was put over the top of it. The CORBA IDLs acted as stubs - same as J2EEs remote interfaces - and was how the war application in the tomcat container interacted with the CORBA system.

J2EE is a distributed system. The managed servers do not have to be housed locally with the admin server and using remote interfaces the ears and clients can talk with each other across a contiguous network.

The main issue is that J2EE has come out of 1990s technology and been overtaken to an extent by the internet and more recently cloud technology. The Tomcat container is simpler, and since it does not have to deal with mirroring ears across managed servers, or bean pools, it is consequently more robust. In my opinion tomcat is easier to configure and manage as well.

To cap it off, Tomee is a tomcat container which supports CDI, JMS and JPA, which are amazingly productive technologies. Given the choice, I would not use a J2EE container such as Weblogic. Tomcat is a simpler and more robust solution. That is without even taking the service bus into account; once that is summed in, tomcat has it all over a J2EE stack like Weblogic.

Our problems with Weblogic were legion. When I first arrived at Lifelock we all had windows machines. They were crappy and under powered. It would take a frustrating forty minutes to log in because of all the crap that was on them. I never used to turn my machine off for that reason.

Starting a Weblogic instance on Windows was another exercise in frustration. The Weblogic instance on localhost would take twenty minutes to start up. Then you would make a code change and deploy again, and wait, and then do it again, and wait again. Once we got macs, startup was just a couple of minutes on the Mac Pros, which was like manna from heaven. The Macbook Pro folks had to beg for more RAM before they got that kind of startup time.

We had multiple environments which covered the standard needs; DEV integration, QA and Stage in addition to production. We had six environments in DEV which were a mix of clustered and single managed servers. The QA environments were all on a single managed server and stage mimicked production.

Weblogic is a bear to configure. We moved code with great speed through our system. We often had six to ten sprints going at once and every two weeks a sprint would end, merge into trunk, and then go out into production. We were often adding new configuration changes including queues, topics, supporting libs that needed to be in the Weblogic classpath, etc. Our numerous Weblogic environments were always out of sync and constantly had performance and configuration issues.

The real problem was that we had too many non production environments and too few middleware engineers to support them. Often we would be down to one middleware engineer, who we would then burn out by overworking them, so they would leave and get a new job that was less stressful. For instance, Pearson had something like nine Weblogic engineers in Phoenix while we would have one. We got a bad name for burning middleware engineers out as well, which made it hard to recruit. It was a vicious cycle.

We kept trying to slim down the number of environments to take some of the load off the middleware engineers that were supporting all these environments. We collapsed our backends so that all of DEV shared the same backend and all of QA and Stage shared another backend. This removed a lot of the difficulties in engineering and infrastructure managing multiple backend systems and trying to keep them in sync.

The design for the environments between 2010 and 2012 came through a design document I did. The goal of that document was to slim down the number of environments and back end systems. Consequently the design document only had DEV and STAGE. The idea was that DEV and QA would share the dev environments, working together during sprints in a DEV environment and doing regressions in stage once it had been merged to trunk.

If I was to do that again, I would not call those environments DEV. Language is important, and saying DEV gave those environments the appearance of being owned by engineering. I would have called them something like the SPRINT environments, which has a connotation of shared ownership.

One of the advantages of the DEV environments was that engineering helped manage them, so they tended to be a bit more stable than the QA environments, which were reliant on the overworked middleware engineers to fix any issues. The code in the DEV environments was usually more robust as engineering could deploy there whenever they wanted, so bugs were always fixed there first and didn't require the Configuration Approval Form [CAF] needed to get the code up to the QA or STAGE environments.

The environment design did not include any QA environments, but the Director of QA kept pushing for a Quick Test or QT environment. I would convince everyone that we didn't need it, that we could do that work in DEV or STAGE, but then I would skip a meeting and the QT environment would go back in. Which was frustrating. I talked people out of it three times, but ultimately it went in. I skip a lot of meetings and am not a particularly patient meeting attendee, so the QT environments became part of the structure.

The QT environments became the bane of Lifelock's middleware engineering and service delivery groups. They were rushed out, poorly configured - for instance we found pointbase running on them - and they were a cause of constant problems because they were not used consistently. They were a nightmare, and worse, there was not just one QT environment; there came to be five of them.

It was a bad decision and we were never able to get it changed. Once something like that gets in, it is hard to remove as the organization and processes wrap around it and get stuck doing things that way. Whenever anyone asked, or even when they didn't, I would say, "We should delete all the QT environments." We didn't though.

This is also why simpler and fewer is better. When the design of environments came up again in 2012 I deleted QT and deleted DEV as well. That meant we only had to manage one set of non-production back ends, and one series of non-production weblogic and tomcat servers. It would have made it simpler for everyone and let the middleware engineering folks concentrate on production which is what actually made us money.

One of the problems with Engineers is that we love writing software. So we assume this is the most important part of the whole process. It isn't true though. Software engineering is overhead and any code we produce that is not in production is money poorly spent. In lean development it represents waste and risk.

The environment design I did in 2012 made every environment releasable to production so we did not have the bottleneck of stage. The environments would be close enough to production in structure that we could release from them in confidence. This was becoming possible because of the server automation that was coming out of Infrastructure.

This environment design was not adopted; in fact Lifelock kept plugging along with the existing legacy environment design, with the change that infrastructure was building servers and starting the process of updating Weblogic, Red Hat and the JVM to the latest versions.

Weblogic being a struggle to configure meant that we all had unique installs on our local machines. The Weblogic configuration also wrote the install path right through all the configuration files, and was so effectively obtuse that running a perl script to replace the path was not enough to let you use someone else's Weblogic container and domains.

We decided to check in the Weblogic container and a working, pre-configured domain into subversion. It was about 1.3 GB and was for mac users only, which isolated our couple of windows users. The container was put under /Library so that the path was the same for everyone rather than in a user's home directory.

We had WLST scripts to configure the Weblogic container. Weblogic had to be running to accept the mbean changes that the WLST scripts were performing. The idea of Jython is a good one; all the best things of python with all the best things of java. Python has a lovely terse syntax that is eminently readable. Java has a massive number of libraries outside of the core java system.

Eclipse doesn't support Jython all that well. It certainly does not support WLST with anything approaching consistency. To get a WLST project in eclipse you have to create a Dynamic Web Project and then add the WLST facet. WTF. It is like the tools for the OSB in eclipse; half-arsed.

I wrote a WLST script back in 2010 to put in the datasources and queues. This was expanded by succeeding middleware engineers to include data stores and other queue configuration. We did some other WLST scripts such as flushing queues and listing all the queues in a container. It was Engineering doing those though, not Infrastructure. We also hit issues getting the WLST scripts into Jenkins.

Weblogic is not fun to configure, and WLST is not fun to program for. We toyed with using a maven plugin to configure queues as well. If I was to do it again, I would make Jenkins jobs in Java that used the MBeanServer to get into the Weblogic system and change its configuration.
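A sketch of that last approach using the JDK's JMX APIs against the JVM's own platform MBeanServer; a real Jenkins job would instead connect remotely to Weblogic's domain runtime MBeans over a JMX service URL, which is omitted here.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustrative sketch: plain-Java JMX access of the kind a Jenkins job
// could use to read (or change) container configuration through MBeans,
// instead of scripting it in WLST.
public class MBeanPeek {

    public static String jvmName() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Against Weblogic this would be a domain or server config MBean.
        ObjectName runtime = new ObjectName("java.lang:type=Runtime");
        return (String) server.getAttribute(runtime, "VmName");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(jvmName());
    }
}
```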

Oracle Service Bus

As Lifelock expanded in 2009 contractors were brought in to fill positions in engineering and infrastructure. There were factions amongst the contractors as well as among the middle managers. One dividing line was those that liked the Oracle Service Bus [OSB] and those that did not. The idea behind an Enterprise Service Bus is that you can hide all the internal and third party systems that a modern business uses behind one set of unified interfaces. The benefits are supposed to be loose coupling, plus the ability to aggregate or transform data flowing across the service bus.

The service bus is heavily based on XML from the SOAP protocol and the concept that any XML flowing across the bus can be transformed by XSLT into something else. SOAP dates back to Microsoft embracing XML in the late 1990s and early aughts when it was the next big technology. One problem during that time period was that Microsoft machines were highly prone to being compromised by viruses and trojans when exposed to anything from the internet so Network engineers would close off every port they could; except for port 80. Microsoft's response was to make port 80 the new default remote method invocation port and send XML SOAP packets through it.

The promise of the Service Bus always sounds great. We interviewed one engineer at Lifelock with whom I was discussing our quality, development and support problems with the Service Bus, and he was telling me how awesome it was because you can call PL/SQL, transform the result with XSLT and then send it out as SOAP. Which sounds insanely simple. But it never is. When you actually have to do a project with a Service Bus, it is insanely slow, defies introspection and has poor quality and artifact output. The killer is: you can't unit test a Service Bus, and as a result you cannot guarantee quality.

Another issue with the Service Bus is that Java Engineers always get stuck with it. The number of genuine Service Bus engineers is insanely small and they tend to be contractors that you hire from Oracle and the like. You rarely see positions being advertised for Service Bus engineers; instead the work is always thrown on the software engineers, who have been trained to read Java in IDEs. The crappy tools for reading XML that come with the Oracle Service Bus are a nightmare in comparison. It is more productive, and more efficient, in my opinion to have Java engineers working in Java rather than working in XML and a Service Bus.

Aside from the inability to unit test, the next really disgusting problem is that you cannot automate the creation of an Oracle Service Bus artifact. The sbconfig.jar is the artifact which is deployed to a service bus instance, and it has to be generated out of Eclipse. The crappy mechanism for automating it is to put Eclipse and Weblogic onto the Jenkins server and use them to create the sbconfig.jar on the continuous integration server. This is not a satisfactory solution. We also found that an sbconfig.jar generated out of Windows differed in its files, folders and structure from one generated out of a Mac. That is not acceptable, and grounds for dismissing the OSB on its own.

We had issues with the customization files as well. They had to be generated out of localhost, which is not good for automation as it depends on a human pulling the file from a running OSB. We managed to at least automate their deployment through maven by treating localhost:7001 as a token. Despite automating the deployments of the OSB as much as we could, we still occasionally hit production issues that could only be solved by going into the service bus console and eyeballing the setup.
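The token trick itself is simple. Our actual maven plugin is not public, so this is only an illustrative sketch of the idea: the exported customization file always references the localhost admin address, and at deploy time that string is substituted with the target environment's host and port. All names here are hypothetical.

```java
public class CustomizationTokenizer {
    // The customization file exported from a local OSB always refers to
    // the localhost admin address, so that string can serve as a token
    // to be rewritten per environment at deploy time.
    static final String TOKEN = "localhost:7001";

    public static String retarget(String customizationXml, String envHostPort) {
        return customizationXml.replace(TOKEN, envHostPort);
    }

    public static void main(String[] args) {
        String exported = "<cus:envValue>t3://localhost:7001/sb</cus:envValue>";
        System.out.println(retarget(exported, "osb-prod01:7001"));
        // prints <cus:envValue>t3://osb-prod01:7001/sb</cus:envValue>
    }
}
```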

The architecture for the service bus in the technology stack meant that all the webservices went through the OSB and then to the Weblogic Server [WLS]. Creating web services in the WLS layer is a snap with EJB3: you throw some annotations on a Java file, add the jaxws build step to the maven build, and there they are. Because of the OSB's position we had to use it as a pass through for the majority of the services. Only a small number of services actually used service bus functionality like transformation or aggregation. The rest were just straight pass throughs of what was in the WLS anyway. It took three years to remove the pass through services and have the front end systems hit the WLS directly.
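For anyone who has not seen it, the jaxws build step looked roughly like the fragment below. This is a sketch from memory, not our actual pom: the plugin coordinates and version may differ from what we ran, and MemberService is a hypothetical EJB3 bean annotated with @Stateless and @WebService.

```xml
<!-- Illustrative only: coordinates, version and class name are
     recollections, not the production configuration. -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>jaxws-maven-plugin</artifactId>
  <version>1.12</version>
  <executions>
    <execution>
      <goals>
        <goal>wsgen</goal>
      </goals>
      <configuration>
        <!-- A hypothetical annotated session bean. -->
        <sei>com.example.MemberService</sei>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The point is the contrast: this one build step, plus a couple of annotations, is the entire cost of a WLS web service, versus the hand-built artifacts the OSB demanded.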

The tools Oracle provides to work with the Service Bus are very poor. For Windows users there is the Oracle Workshop, a customized copy of Eclipse 3.3 with the OSB tooling bundled in. Since most of our engineers were on Macs, we had separate copies of Eclipse 3.3 on our systems with the correct plugin installed. One of our engineers checked a customized copy of Eclipse with the OSB plugin installed into Subversion, and a lot of the newer engineers ended up using it for OSB work when they could no longer avoid it.

Another issue we hit with the Service Bus was automated deployments. Originally the build system was ant based; we later mavenized the build process. There were no off the shelf ant or maven plugins that could deploy the sbconfig.jar and the customization file, so it was a manual process from dev, to QA, to Stage and production. Shudder. One of our talented young engineers ended up writing an ant task that used MBeans to deploy the sbconfig.jar and the customization file. He did it while we were receiving training on the new billing system, which was a good use of time; automating deployments was more important. When we moved to maven I ported his ant task across to a maven plugin.
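The MBean approach hinges on the fact that Weblogic exposes its domain runtime MBean server over its t3 protocol at a well-known JNDI path. The sketch below only builds the connection URL, which needs nothing beyond the JDK; the actual connect and deploy calls require the Weblogic client libraries, so they are outlined in comments rather than shown, and the class name is mine, not our plugin's.

```java
import java.net.MalformedURLException;
import javax.management.remote.JMXServiceURL;

public class OsbDeployerSketch {
    // Weblogic publishes the DomainRuntime MBean server over t3 at this
    // JNDI path; connecting to it is how an ant task or maven plugin can
    // push an sbconfig.jar without touching the console.
    public static JMXServiceURL domainRuntimeUrl(String host, int port)
            throws MalformedURLException {
        return new JMXServiceURL("t3", host, port,
            "/jndi/weblogic.management.mbeanservers.domainruntime");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(domainRuntimeUrl("localhost", 7001));
        // With the Weblogic client jar on the classpath, the next steps
        // would be roughly:
        //   JMXConnector c = JMXConnectorFactory.connect(url, credentials);
        //   ... look up the deployment MBeans and push the sbconfig.jar
        //   and customization file ...
    }
}
```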

When we started the billing death march we tried to use the service bus to aggreg