Cutting your losses

Originally published at Lux Group Tech Blog

Back in Feb 2017 I wrote about our decision to rebuild our entire architecture. It is established lore in the software community that rewriting software is something you should never do. This is not only because of Joel Spolsky’s outsized influence in software-land but because so many software engineers have had the experience of hating the system they work on, only to find that when they rewrote it, they simply didn’t make it that much better.

Screenshot of a small part of the legacy Back Office portal: a dazzling array of tabs, icons and drop-down lists, most of which are irrelevant to the task at hand

There are some well-established ways of avoiding the re-write from some of the luminaries of the software world but sometimes you have to come to the conclusion that there’s nothing worth saving. At the Lux Group we were faced with a legacy platform with no discernible architecture, no automated tests, an outsourced engineering team and an overwhelming level of tech debt; we saw no real alternative and suggested the following justifications for doing a strategic rebuild (and doing it properly):

A successful rebuild is normally undertaken under the following conditions:

* The company has gained enough experience to understand its business domain, the customers it is serving and the product that it wants to offer.

* The company is well-resourced, able to invest in more experienced engineers and to invest those engineers in building a product or platform for the future.

* The company stakeholders understand that building a product with solid engineering principles, built-in checks and balances and a high-performing team to run the software takes more time and tends to cost more than Rapid Application Development (RAD), in the same way that an architect-designed home tends to be far more expensive to build and maintain than a kit home you buy and renovate yourself.

* The stakeholders understand that Software Engineering is Expensive and building things to last makes this even more expensive; therefore the company and team need to be very selective about which features they opt to build.

It is now over a year later and the product, luxuryescapes.com, has been released without salvaging a single line of code from the original system (CRM aside). We release code every day, fix bugs before writing new features and the small amount of tech debt we have is under control. In retrospect we can see that the rebuild project could easily have spiralled out of control; the following are some of the principles we applied to ensure it didn’t.

Strong Product Team

We have this policy within our team: “if a feature doesn’t make sense to you, don’t do it”. This might sound blindingly obvious, but engineers and designers are often asked to work on stories and features that don’t make any sense to them. They cannot see why the customers would want that feature or how it benefits the product or the business. A strong Product Team can challenge senior stakeholders and sponsors to get to the root of the problem or design a holistic solution rather than applying band-aid after band-aid. An empowered engineer can expect a good explanation of the strategy underpinning any features they are asked to deliver.

Architecture without an end state

In Architecture Without an End State, Michael Nygard describes how we should build architectures that are designed to be ever-changing, along with the personnel and direction of the business. Lux Group is an ambitious company; by deciding to use microservices we embraced the dynamic nature of the business and the ability to scale teams and iterate quickly. We embraced the challenges microservices created (reporting and atomic transactions) and avoided reverting to a monolithic design when faced with them.

When we sold some of our businesses, bought others and restructured the company and our business model, we had an architecture, team and process that was set up to embrace this change.

Iterative development and Continuous Delivery

From the first month of development we created an MVP that we treated as a production environment even though it was not live to the public. This MVP contained the ‘spine’ of the product (simple implementations of critical features: search, add to cart, purchase, pay vendor). We only showcased software that had been released to that environment, and we released aggressively, with a Continuous Deployment approach. We treated every feature as an MVP, releasing early and iterating. When we failed, we demanded a root-cause analysis (RCA). When we finally launched the product, we already had the processes and discipline in place to continue the same practice.

Screenshot of Back Office portal on new platform: data is cleanly presented with consideration given to UI

Transparency

Every month we showed progress to the sponsors and we were honest about our setbacks and failures. By being honest, the delivery team did not get caught behind a lie and the sponsors saw the true rate of progress and an opportunity to consider our options: scope, time, resources.

Challenge every requirement

When you do a rebuild you get a fresh start. The conventional wisdom suggests you’re going to end up rebuilding every feature you already had so stick with what you’ve got — but this was not the case for us. We had many features that had been built over many years and were no longer that relevant. We were keen to highlight that every feature has a cost of ownership: paying an engineering team for build and maintenance as well as ongoing operational costs. Often the cost of ownership does not justify the profit that will be made from this feature. By costing and challenging every requirement you can help keep your product tight and the business focused on what it really needs. This is beautifully explained in one of my favourite blog posts The Tax of New.

What could we have done better?

We thought we had an elegant plan for solving the inevitable reporting problem that comes from the distributed data of a microservice architecture. It turned out this plan didn’t work, and we had to scramble to build a tactical solution while we learnt a lot more about pub/sub, event sourcing and re-constituting distributed data. If you do embark on microservices, ensure you solve this problem as part of your early MVP.
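To make the event-sourcing idea concrete: a reporting service can subscribe to the events other services publish and fold them into its own denormalised read model, so reports never query those services’ private databases. This is a minimal sketch, not our actual implementation; the event shapes and names (`OrderCompleted`, `vendor_id` and so on) are hypothetical:

```python
from collections import defaultdict

class SalesReportProjection:
    """Builds a per-vendor sales summary by folding domain events."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})

    def apply(self, event):
        # Events are plain dicts here; a real system would use versioned
        # schemas and an ordered, durable log (e.g. Kafka, Kinesis).
        if event["type"] == "OrderCompleted":
            vendor = self.totals[event["vendor_id"]]
            vendor["orders"] += 1
            vendor["revenue"] += event["amount"]
        elif event["type"] == "OrderRefunded":
            vendor = self.totals[event["vendor_id"]]
            vendor["orders"] -= 1
            vendor["revenue"] -= event["amount"]

    def report(self, vendor_id):
        return dict(self.totals[vendor_id])

# Replaying the event log from the start reconstitutes the report,
# which is what makes the read model cheap to rebuild or reshape.
projection = SalesReportProjection()
events = [
    {"type": "OrderCompleted", "vendor_id": "v1", "amount": 250.0},
    {"type": "OrderCompleted", "vendor_id": "v1", "amount": 100.0},
    {"type": "OrderRefunded", "vendor_id": "v1", "amount": 100.0},
]
for e in events:
    projection.apply(e)
```

The key property is that the projection owns its own data and can be rebuilt by replay, which is exactly the part worth proving out in an early MVP.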

Would I do it again?

We can feel confident that no one at the company wishes we’d tried to incrementally work with what we had. We now have in place a high-performing team with a great product and platform that is being extended to build the next chapter of the Lux Group’s story. Joel Spolsky and Martin Fowler are probably correct in most cases; if you can salvage parts of your system while you rewrite others, you probably should. However, some systems are unsalvageable so don’t be afraid to start again from scratch.


Should we be building software?

Originally published as Journey of a Tech Stack on Lux Group’s technical blog


A painting of the Tower of Babel under construction


Software engineering is difficult and expensive: this is well understood and is reflected in the commoditisation of software into hugely scalable packages (usually cloud-based) that allow us to achieve our business goals without actually writing any code.

Just 10 years ago, if you wanted an e-commerce site you had to hire a developer and purchase hardware to run and manage your site. Similarly, if you wanted a brochure site to promote your business, or a blog to publish your thoughts, you hired a developer; soon enough you began to realise how expensive software development was. It’s never set and forget: standards change, fashions change, and sites need ongoing maintenance, development and redesigns.

Nowadays if you want a blog you go to Tumblr or Medium; if you want an e-commerce site you go to Shopify, BigCommerce or Magento; and if you want a brochure site you go to Squarespace or WordPress. Most companies should avoid writing bespoke software unless it is really necessary.

So, who does need to hire developers (or software engineers, as we prefer to be called these days)? Companies and startups trying to solve a new problem, one that hasn’t yet been fully commoditised. They hire engineers, QAs, product owners, visual designers and user-experience experts to create new software that fills a need or solves a problem for their users.

What is the better way to build software?

One of the most popular contemporary approaches to building software for startups is the Lean Startup method, which involves building the smallest increment of business value possible and validating it in the market. It is a relatively modern approach that can be used in companies of all sizes, but it also happens to be what small startups on restricted budgets do by default, and did for many years before Eric Ries had even dreamed of the term “Lean Startup”.

Phase I: The budget startup

At its genesis, a company normally has serious budget constraints and takes the fastest route to building software and delivering value to its customers. Here at Lux Group, this is exactly how our initial technology stack was built: using PHP (the Swiss Army knife of development languages), outsourced engineers and rapid application development, delivering features as quickly as possible, validating them in the marketplace and then moving on to the next feature. Sales figures were the priority; performance and user experience followed closely behind, with technical integrity and system design for an uncertain, ever-changing future well down the list.

In software engineering, Software Entropy increases over time and, unless significant time and resources are invested into maintenance, systems eventually become harder to understand and maintain. If a company is lucky enough to still be in business by this time they may reach Phase II.

Phase II: The strategic rebuild

The world is littered with the corpses of companies that never made it to Phase II. Many of these companies may have over-engineered their tech stack, over-complicating a solution to a problem they did not yet fully understand. Indeed, many of the companies that have been purchased by the Lux Group, such as Brands Exclusive, Living Social and Pinchme, were far more ambitious with their technical solutions than the Lux Group engineers, and yet were unable to compete in the market, while a simple, stripped-down, lean approach suited us, allowing us to develop quickly, absorb these companies and move on to the next challenge.

Twitter famously rebuilt its platform after discovering that the original Ruby on Rails implementation was unable to cope with its traffic, replacing it with a queue-based system that served its purpose better. Facebook took a different approach, creating an extremely sophisticated low-level solution to improve the speed of its PHP code without requiring a full rewrite of its application logic.

A strategic rebuild normally comes when a company is in a strong position. To some dominant companies this market strength can arguably be a disadvantage — in the 9 years I worked in Westfield’s Digital Labs we pivoted to a different business strategy roughly every 2 years, often requiring a complete change in stack, programming language and personnel. In a less well-resourced company this wouldn’t have even been a possibility; Westfield Labs is still actively experimenting with Westfield’s role in the digital world.

A rebuild is normally undertaken when the software is unable to provide the return on investment it once did and the company knows that significant investment is required to return it to a healthy state. When technical debt grows high enough, a system becomes technically bankrupt: unable to deliver the feature growth necessary to implement the business strategy.

A successful rebuild is normally undertaken under the following conditions:

  • The company has gained enough experience to understand its business domain, the customers it is serving and the product that it wants to offer.
  • The company is well-resourced, able to invest in more experienced engineers and to invest those engineers in building a product or platform for the future.
  • The company stakeholders understand that building a product with solid engineering principles, built-in checks and balances and a high-performing team to run the software takes more time and tends to cost more than Rapid Application Development (RAD), in the same way that an architect-designed home tends to be far more expensive to build and maintain than a kit home you buy and renovate yourself.
  • The stakeholders understand that Software Engineering is Expensive and building things to last makes this even more expensive; therefore the company (and team) needs to be very selective about which features they opt to build.

If these principles are understood and adhered to, a strategic rebuild will be far more cost-effective over the long term than a rapid-application approach, as the system will be designed to grow and change over time, allowing the company to grow and change with it and protecting the company and its customers from the shock and upset of prolonged downtime while the system is rebuilt yet again.

Continuous Delivery

So far we have talked about rapid application development on a budget and more strategic development at a higher cost. Recent developments in cloud computing, virtualisation and automation have led to companies being able to iterate rapidly, without compromising quality, security or stability.

How?

By breaking applications into smaller components, writing extensive tests and investing in automation, companies are able to release changes to production many times a day, while managing risk to the stability of the system.
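The mechanics behind “many releases a day” can be sketched as a per-component release gate: every change runs that component’s own automated checks, and only a green build is promoted, limiting the blast radius of any one deploy. This is an illustrative sketch; the check names and the `deploy_if_green` helper are invented, not any specific CI product’s API:

```python
# Hypothetical release gate: run each automated check for one small
# component; any failure blocks the deploy for that component only.

def run_checks(component, checks):
    """Return the names of any checks that failed."""
    return [name for name, check in checks.items() if not check(component)]

def deploy_if_green(component, checks, deploy):
    failures = run_checks(component, checks)
    if failures:
        return {"deployed": False, "failures": failures}
    deploy(component)
    return {"deployed": True, "failures": []}

# Example: a single component with two stand-in checks that a real
# pipeline would replace with test runners and linters.
deployed = []
checks = {
    "unit_tests": lambda c: True,
    "lint": lambda c: True,
}
result = deploy_if_green("search-service", checks, deployed.append)
```

Because the gate is scoped to one component, a red build in one service never blocks releases of the others, which is what keeps the daily release cadence sustainable.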

Overheads

This approach comes at a cost: there is a significant overhead to writing tests for the features you deliver. However, not writing the tests means that you are not managing your risk and are unable to guarantee stability. Dividing applications into smaller components can also make the path to retrieving data more convoluted, which in turn makes some features slower to develop.

High-performing team

How do you maintain your ROI given these increased overheads?

Having a more complex and sophisticated architecture means you require a more sophisticated team to run it. Features are expensive to build, so we should not waste time building features of unproven benefit: Pareto’s Principle suggests that 80% of the value comes from 20% of the features. Changes should be small and incremental, with each change forming an experiment on your customers; your team should be able to gauge the success of that experiment and pursue the most promising route. The simplest example of this is the A/B tests that many companies run.
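Gauging such an experiment can be as simple as comparing conversion rates with a two-proportion z-test, which needs nothing beyond the standard library. This is an illustrative sketch with invented traffic numbers, not a substitute for a proper experimentation platform:

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing conversion rates of A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return {"rate_a": p_a, "rate_b": p_b, "z": z, "p_value": p_value}

# Hypothetical experiment: variant B converted 660/6000 vs A's 600/6000.
result = ab_test(600, 6000, 660, 6000)
significant = result["p_value"] < 0.05
```

A lift that looks healthy on the dashboard (10% to 11% here) can still fail to reach significance at this sample size, which is exactly why the team, not the dashboard, should decide whether to pursue the route further.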

The delivery team needs to have all the components necessary for delivery: planning, design, user experience, engineering, operations etc all working together.

Solve problems rather than build features

Most importantly the delivery team needs to become expert at solving problems and the stakeholders or management team need to be able to give the delivery team the autonomy to solve these problems rather than ask the team to deliver ‘features’.

So, should we be re-building software?

Joel Spolsky describes rewriting software from scratch as the worst thing you can ever do, and I agree that in most scenarios this is indeed the case. However, there are sometimes exceptions: here at Lux Group we have an opportunity to rebuild a subset of features that simplifies and changes our business requirements, and we believe we satisfy the criteria outlined above to enable a successful build that hugely improves our capacity to deliver value to the business.

Should we be doing planning poker?

I got a message from an old colleague today with a link to a wiki page on Poker Planning and a question: “Why didn’t we do this?”

My initial reaction was that we did; we did it for years. But after consideration I realised he was right: by the time he’d started working with us we had stopped using poker planning to estimate, or at least we only used it when we really needed it. So why stop? Did we lose our discipline?

The Scrum approach

Poker planning is a great way of using the wisdom of crowds to estimate stories. Points rather than days are normally used to keep the estimation abstract, and these points are then tracked to calculate a team’s velocity and allow for capacity planning. In our early days of agile experimentation we followed the Scrum dogma quite rigidly: the team would gather together, examine features, help break them down into stories and then use planning poker to come up with an estimate; the Product Owner would then select the stories he wanted the team to work on. More than once he commented that he felt he was given the ‘illusion of choice’: due to story size, team capacity and soft dependencies, his options for selecting and ordering the backlog were severely limited.

This approach led to us having a ‘well-groomed’ backlog, but many of the groomed and estimated stories would languish there for long periods as small tokens of wasted time, growing stale and decreasing the signal-to-noise ratio. Furthermore, production bugs sat in their own backlog, prioritised separately and largely ignored by a product owner focused on building ‘the new’ and a team attempting to maintain its velocity.

A Product Owner’s role is to generate the maximum ROI by picking the stories that return the most business value for the effort expended. The reality, however, is often far more subtle: while each feature or story is technically an independent release of business value, more often than not it is part of a much bigger picture and really needs to be played in a specific order that the team understands. Backlog grooming, planning poker and sprint-stacking can become a transactional (rather than collaborative) affair, leading to a poorly maintained and planned product.

Continuous deployment and a high-performing team

By the time the aforementioned colleague had joined the team we had restructured our approach. The Product Owner was embedded in the team and story and feature selection had become collaborative; there was no formal backlog grooming, and commitments were agreed by small teams based upon the entire team having an in-depth understanding of both the business benefits of each feature and the estimated effort.
The roadmap was understood by the whole team, and any upcoming work had either been spiked already or been analysed by the engineers and collaborated on (with UX and BAs as necessary).

Furthermore, specialisation was respected and ownership of components encouraged: while we all love full-stack engineers, in a small team the reality is you may only have one real JS expert, one back-end expert and one HTML/CSS expert. Working in iterations of at most one week, with a definition of done of “released to production”, the abstraction of poker-planning-driven velocity becomes irrelevant: it’s easier for the team to work out what they can achieve that week, estimating in days if necessary. With the onus on quality and user experience, fixing production bugs is treated as a priority and will often need to be prioritised at short notice, which impacts deliverables; focusing on hard, velocity-based commitments can therefore encourage a compromise on product quality.

So why did we stop Poker Planning?

When a product is well-established and development is lean and iterative, the team should be in control of its own destiny, understanding the vision and goals of the business and driving towards those goals, with a roadmap for guidance if necessary. With the product owner embedded as part of the team and in constant collaboration with his team members, the need for poker- and velocity-based capacity planning disappears and the one important question becomes: “What can you (or, even better, we) commit to delivering to production this week?”

With good stakeholders (or good stakeholder management) the sponsors will ask “what goals or improvements have we achieved” rather than “how many features did we complete”. With code being released to production on a weekly (or daily) basis the transparency that this approach delivers encourages trust between stakeholders, product owners and the delivery team, and negates the need for the transactional approach of poker planning, sprint stacking and velocity tracking.

SOA: An enabler for Continuous Delivery and innovation

Building on my experience at Westfield Labs, this presentation was delivered to Sydney CTO Summit and explores how implementing a Service Oriented Architecture allowed Westfield Labs to embrace a Lean, Agile approach to Product delivery.

The presentation covers how an SOA assisted with:

  • Management of backlog, particularly bugs
  • Managing build times
  • Cross-functional teams
  • Faster iterations
  • The ‘QA Paradox’ of better quality through reducing testers

We are the business

An ex-manager of mine once pointed out that we need to stop talking about “the business.” Doing so gives leverage to those claiming to represent “the business” and limits the influence of the engineering team. His was more a political observation than a call to change our mindset, but ever since then I have noticed how commonplace it is for colleagues to make vague assurances that “the business requested it” or “the business want it like that.”

When someone uses the term “the business” they invoke shadowy high priests with absolute knowledge — but they could be referring to a clueless HiPPO (highest-paid person’s opinion), an opinionated sales or marketing exec or equally to any end user of a piece of software. The engineer has every right to question those requirements and to ask exactly whom “the business” refers to in the scenario.

At Westfield Labs we are very fortunate to work in a ‘digital’ department which combines product, design, end-users and engineering resources as equal collaborators. Most recently we have moved to truly cross-functional teams where the only direction given to us by sponsors and stakeholders is high-level: e.g. bring us more customers and more conversions through focusing on streams x and y. Sure, the product team provide the ultimate direction from a product perspective but only after close consultation and collaboration with all other relevant parties.

In this scenario it is not hard for engineers to think of themselves as part of the business, and I positively encourage my team to stop using the term ‘the business’ to refer to others, as it implies that we are not an equal partner. To take it further, I actively encourage my engineers to question and understand business requirements and to shout out if they don’t make sense.

An engineer from a UK online retailer recently told me that his company “think of themselves as a marketing company, not a software company” and see the engineering department as a necessary expense for realising their feature requests. In a business so dependent on the quality of its implementation, and on iterative improvements to that implementation, it is naive to think that engineers are not equal partners.

Obviously it’s not so easy when you are working in an agency (and let’s face it, sometimes you are working from a spec and clearly not ‘equal partners’), but any enterprise that wishes to succeed in the digital age will ultimately depend on the quality of its implementation – and on the feedback from those doing the implementing. Otherwise it will be made irrelevant by a competitor that does listen.

So, I entreat software developers everywhere: let’s stop talking about “the business” and start talking about: customers, stakeholders, sponsors, sales team, marketers, product team, whatever. Make it clear that we consider ourselves part of ‘the business’.

New look

I just changed the theme. Not because I didn’t like the old Chunk theme, but because it didn’t credit the author on the home page, and I thought that was important now that hissohathair is contributing as well. It appears that editing templates on wordpress.com is not an option, so welcome, Truly Minimal; hopefully we will have a long and fruitful relationship.

While browsing the 300+ themes available on WordPress I was reminded of the elegance with which WordPress separates content from presentation and the experiences I have had moving from a CMS toward a set of APIs that serve ‘structured data’.

At westfield.com.au we threw out our TeamSite CMS a few years ago and moved away from the idea of a metadata-driven ‘universal CMS’ toward an API that serves structured data to our presentation layer(s). As any OO programmer will appreciate, this allows us to store behaviour and business rules with the data, as opposed to treating data as an anonymous blob, and we’ve never looked back. We avoided the problem of an ever-growing database schema and ‘Business Layer’ by implementing an SOA, but more about that another time.

We’ve been able to provide content parity across all devices by fully embracing responsive web design, and we can do this because (like the WordPress themes) there is a complete separation of data from the presentation layer. As we start to move toward providing richer editorial-style content we are really starting to reap the benefits of this approach: we store data in its small component parts and then tag and curate it. This allows us to create simple editorial pages such as the recently launched Curations, but as we discover more about our users and their tastes through the power of big data we can start to tailor our content to their individual tastes and serve it algorithmically rather than by creating pages as one would in traditional publishing.
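As a rough sketch of what serving tagged content components algorithmically might look like: rank each component against a user’s taste profile and assemble the page from the top matches. The component data, tag names and scoring rule here are invented for illustration, not our actual model:

```python
# Hypothetical sketch: content lives as small tagged components, and a
# page is assembled by ranking components against a user's taste tags
# instead of hand-building a fixed editorial page.

def score(component, taste_tags):
    """Rank by tag overlap; curated items get a small editorial boost."""
    overlap = len(set(component["tags"]) & taste_tags)
    return overlap + (0.5 if component.get("curated") else 0.0)

def assemble_page(components, taste_tags, limit=2):
    ranked = sorted(components, key=lambda c: score(c, taste_tags), reverse=True)
    return [c["id"] for c in ranked[:limit]]

components = [
    {"id": "beach-guide", "tags": ["beach", "family"], "curated": True},
    {"id": "ski-deals", "tags": ["snow", "luxury"]},
    {"id": "spa-review", "tags": ["luxury", "spa"], "curated": True},
]
page = assemble_page(components, taste_tags={"luxury", "spa"})
```

Because the data and the presentation are fully separated, the same scoring function can drive a responsive web page, an email digest or an app feed without touching the content itself.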

From my perspective, well-constrained CMSs like WordPress that solve a clearly defined and well-understood requirement will continue to flourish, but if you’re looking to fulfil a unique requirement you’re going to have to write a lot of bespoke code, so why not embrace it and give yourself that freedom and control? Leave the off-the-shelf CMS for the brochure-ware sites and the big publishing companies that have the money and patience to customise them.

 

The philosophy of Continuous Deployment

Over the past few years my team have been on a journey toward Continuous Deployment. This has involved changing our architecture, our attitude towards testing and our tolerance of slow build times, but most fundamentally we have noticed it required a change in philosophy: in our attitude toward quality, responsibility and ownership.

In our previous model our mantra was “you release what you test” and with the right phase gates and environments in place it allowed new releases to flow through the hands of QA and into the responsibility of Production Operations. Despite having a huge body of automated tests, our attempts to refine this process to become faster always foundered:

  1. we could patch additions into the release branch but this delayed the release while QA assured themselves we were ‘ready’

  2. any late patches or bug fixes to the release branch had to be merged back into the Master branch and were sometimes forgotten causing regressions on the next release

  3. the longer the delay to the branch release the more the product team tried to shoehorn more stuff into the release branch instead of Master and the vicious cycle continued

  4. meanwhile QA were always absorbed in getting the branch release out, so work on Master was delayed or quality suffered

A recent pivot allowed us to become far more aggressive toward enforcing a strict “no junk in the trunk (master)” policy and we found that when a team embraces Continuous Deployment, they start to embrace a philosophy of personal responsibility and develop the following behaviours:

  • ownership of services or code bases

  • collaboration with the operations team (devops)

  • making sure their code is well-tested and production ready

  • releasing code in small releasable chunks

  • thinking about backwards compatibility

  • collaborating and communicating with your team and the business users that your code changes will affect

  • accepting that you cannot live in isolation on your long-running branch without paying the price in conflicts

  • realising that the QA team are not your bug-finding monkeys

  • checking your production logs for errors

  • writing the correct level of automated tests to limit regressions without negatively impacting productivity

  • monitoring your system’s vital statistics (in all environments)

  • deciding when to back-out and when to fix-forward (and having a plan for this)

  • having a nuanced understanding of risk, quality assurance, testing and release strategies and that some features and bugs can be released immediately while others need in-depth testing

  • understanding that you no longer need a “definition of done” because your work isn’t done until it’s successfully released into a stable production environment.

All this comes under the umbrella of personal responsibility, which comes from the knowledge that when a developer chooses to release a piece of code, that code will go to production and they will accept responsibility for it. Along with that responsibility comes the pride and reward of knowing that one’s code is released: it’s live, in the wild, and changing users’ lives (hopefully for the better).

Recently we experimented with adding a traditional formal sign-off to our release process from business owners outside the product team and the results have been interesting. We were faced with either applying a moratorium to Master while sign-off was approved or creating a release branch and letting Master flow. Either way we were faced with similar issues:

  1. Developers are no longer able to release their own code so Master starts to differ significantly from what is in production.

  2. A “one size fits all” approach is used for release sign-off – whether it’s fixing a typo or refactoring your payment system you still need sign-off as developers are no longer trusted to weigh the risk themselves.

  3. The QA team and release manager become responsible for the release, not the individual that wrote the code. The theme of personal responsibility is broken: the developer has moved onto a new task.

  4. The release manager starts to manage what goes into the release: features and fixes are forced to languish on branches until the release manager is ready to merge them.

  5. As the amount of code ready to deploy grows, release managers get nervous and need more time to test on the release branch.

  6. More and more time is spent managing ‘the release’ than writing and delivering features.

  7. As stakeholders realise that releases take longer, they start to hold up the release for “one last fix” rather than wait another week. A full regression is then required by a nervous release manager and the vicious cycle continues.

  8. When you push out your release it has batches of changes in it. If production is affected, it is far more difficult to work out which change caused the problem.

The end result: your release cycle is slower, your team is less productive and less engaged, your QA team is over-loaded and your releases are potentially more buggy and harder to fix.

Continuous Deployment is not just about releasing code fast – it’s about having a team that takes pride in their work and feels responsibility for the quality, stability and effectiveness of the live product.

Our focus now is on improving visibility, accountability and automation so we can provide rich, descriptive release notifications to stakeholders and regain the capacity to release from within the team, rather than having to ask permission from an approval body outside the team.