Background

Pipedrive, like many startups, started with a PHP monolith, which has since grown into legacy code that we are trying to get rid of. Legacy code is harder to test and deploy than small Node/Go services, and it's more difficult to grasp because so many of its features were written more than five years ago. Because of this, we have an informal agreement not to add any new functionality to this repo anymore.

Despite its faults, the PHP monolith still serves the biggest chunk of public API traffic (>80%), and more than 20 internal services depend on it, making it critical for the business. The challenging part is keeping performance high as we make it more decentralized.

In a 2018 mission that I participated in, we finally “dockerized” it, but it remained on PHP 7.0. At the beginning of 2019 the planets aligned and a Backend performance boost mission became possible.

Motivation

First, I personally wanted to get into the field of Performance Engineering, so this was a great chance. Second, anyone within the company can pitch and launch a mission, so this was a good opportunity. Finally, as I later found out, our manager was also interested in improving performance.

The triggering moment came by accident: while debugging some endpoint, I decided to try Blackfire to understand how the function calls were made, and I saw a dismal picture of translations taking up a huge share of the total request time. I thought it would be fantastic to rewrite this part and learn how to lead a mission in the process.

Preparations

Since I was afraid of pitching a mission to a 100+ audience without any slides, and because Pipedrive is a data-driven company, I started by researching which metrics I could use for the mission goals.

What percentage of public API requests using the GET method are slower than 400ms? What about POST/DELETE?

What’s the latency of requests to the API used to add data in bulk? What if we look not at the average, but at 3-sigma coverage (99.7% of users)?

What’s the maximum time that an API call can wait? Why?

What’s the % of API calls that make more than 70 DB requests?

How many DB requests are there on average, and for the 99th percentile of users?

What’s the maximum number of DB calls that one API call can generate?

How many API calls are lost due to the DB going down? Due to the process running out of memory?

How much memory does a PHP process consume on average?

I won’t give you the exact numbers I got, but it was a grim picture indeed. Even though we had a pretty low average API latency (160ms), I still saw some of our clients with endpoints loading anywhere from multiple seconds up to several minutes (bulk API).

The main goal of the mission was to decrease the percentage of sluggish GET requests by 20% (from 5.2% to 4.1% of total request count).

The secondary goal was to decrease the number of out-of-memory errors by 80%. From the business side, the goal also aligned with a decrease in client churn rate.

The mission plan was to deal with the global speedup first and then work on individual endpoints.

Non-standard lightweight Trello + Teamgantt to plan the mission

First failure

I had planned the mission for 4 developers, but only one brave soul volunteered, so we started our journey with just the two of us.

First I wanted an easy win and to just get rid of those slow translations. The main complexity was that gettext loaded binary .mo files coming from a third-party translation system and unpacked them on every request. Or so I thought.

After a full rewrite with nice unit-test coverage, I got a 43% speed-up on my machine. But as it turned out, in production gettext loads translations only on the first call and serves them from an in-memory cache afterwards, so at runtime it was still fast.
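To illustrate what tripped me up, here is a minimal sketch of how gettext-based translations are typically wired up in PHP. The locale, text domain and catalogue path are hypothetical placeholders, not the monolith's actual configuration; the point is that the binary .mo catalogue is parsed once per process, not on every request.

```php
<?php
// Minimal gettext wiring sketch; locale, domain and path are hypothetical.
$locale = 'et_EE.UTF-8';
putenv('LC_ALL=' . $locale);
setlocale(LC_ALL, $locale);

// Point gettext at the compiled .mo catalogues,
// e.g. ./locale/et_EE/LC_MESSAGES/messages.mo
bindtextdomain('messages', __DIR__ . '/locale');
bind_textdomain_codeset('messages', 'UTF-8');
textdomain('messages');

// The first _() call parses the binary .mo file; subsequent calls in the
// same long-lived worker are answered from gettext's in-memory cache,
// so the per-request parsing cost I assumed never showed up in production.
echo _('Add deal'), PHP_EOL;
```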

It was a first-week failure, with the only positive being a better understanding of the developer experience.