Lately, I’ve been working on optimizing the memory usage of some of our backend PHP applications, and wanted to share some of the tricks I have come across, especially for dealing with large sets of data in PHP.

Always cap your internal in-memory caching

In day-to-day PHP coding, you have probably come across simple functions that retrieve data from a data source, apply some business logic to transform it into a certain structure, and return the result for a caller to consume.

class Category {
    public function getById($id) {
        // this can be a very expensive db operation
        $raw = $this->getNestedTree($id);
        $processed = $this->process($raw);
        return $processed;
    }
    ...
}

At some later point, you might find that the consumer code calls it repeatedly within a single request. In the example below, products and categories have a many-to-one relationship, so exporting different products that share a category results in the same category data being retrieved and processed multiple times.

class ProductExporter {
    public function export($page) {
        $category = new Category();
        $products = $this->getProductsByPage($page);
        $export = array();
        foreach ($products as $product) {
            // getById is being called N times
            $export['category'] = $category->getById($product['category_id']);
            // get other aspects of a product to be exported
            ...
            $this->writeToCSV($export);
        }
    }
}

Obviously, repeating the same computation for the same data degrades performance. To avoid that, we can simply cache the data in memory, like this:

class Category {
    private $cache = array();

    public function getById($id) {
        if (!isset($this->cache[$id])) {
            $raw = $this->getNestedTree($id);
            $this->cache[$id] = $this->process($raw);
        }
        return $this->cache[$id];
    }
    ...
}

For most normal web requests, this works quite well, given that the amount of data to be cached is small.

However, as soon as you start using it in a long-running background job that handles a large amount of data, you will likely hit your PHP memory limit quickly.

To avoid this potential issue, you can create a cache class that caps your in-memory cache at a fixed size. In PHP, it can look something like this:

class CappedMemoryCache {
    public $cap = 100;
    public $gcProbability = 1;
    public $gcDivisor = 100;
    private $cache = array();

    public function get($key) {
        return isset($this->cache[$key]) ? $this->cache[$key] : false;
    }

    public function set($key, $value) {
        $this->cache[$key] = $value;
        $this->garbageCollect();
    }

    public function remove($key) {
        unset($this->cache[$key]);
    }

    private function garbageCollect() {
        if ($this->shouldGarbageCollect()) {
            // evict the oldest entries (FIFO) until we are back under the cap
            while (count($this->cache) > $this->cap) {
                reset($this->cache);
                $key = key($this->cache);
                $this->remove($key);
            }
        }
    }

    private function shouldGarbageCollect() {
        return (mt_rand() / mt_getrandmax()) <= ($this->gcProbability / $this->gcDivisor);
    }
}

This is a very simple and generic FIFO implementation. There are various other more efficient algorithms (LRU, MRU, LFU just to name a few), but in order to pick the best caching algorithm, you need to know locality/pattern of the data you are caching, which is not always obvious depending on the type of data you are dealing with. No matter what caching algorithm you choose, the simple fact is that you cannot hold arbitrary size data entirely in PHP memory.
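As an illustration of one of those alternatives, an LRU variant can be sketched in a few lines by exploiting the fact that PHP arrays preserve insertion order: re-inserting a key on every hit moves it to the end of the array, so the first key is always the least recently used. The class name and details below are illustrative, not part of the original code.

```php
class CappedLruCache
{
    private $cap;
    private $cache = array();

    public function __construct($cap = 100)
    {
        $this->cap = $cap;
    }

    public function get($key)
    {
        if (!isset($this->cache[$key])) {
            return false;
        }
        // Re-insert the key so it moves to the end (most recently used).
        $value = $this->cache[$key];
        unset($this->cache[$key]);
        $this->cache[$key] = $value;
        return $value;
    }

    public function set($key, $value)
    {
        unset($this->cache[$key]);
        $this->cache[$key] = $value;
        // Evict from the front of the array: the least recently used key.
        while (count($this->cache) > $this->cap) {
            reset($this->cache);
            unset($this->cache[key($this->cache)]);
        }
    }
}
```

Unlike the probabilistic FIFO version above, this enforces the cap on every write, which trades a little per-write overhead for a hard memory bound.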

Use layered caching

With the cache capped to a certain size, some data inevitably has to be discarded to make room for new data, and when the discarded data is needed again, you have to run the expensive query again. Using layers of caches can mitigate this issue.

By inserting another external caching layer (e.g. one with more space and the ability to set a TTL on the data, such as redis), data is written through all the layers, and faster caches can be refilled by pulling data from the next slower layer. This setup can potentially help you avoid expensive repeat queries altogether.

class Layered_Cache {
    private $_layers = array();

    public function addCache(Cache $cache) {
        $this->_layers[] = $cache;
        return $this;
    }

    public function set($key, $value, $expires = null) {
        foreach ($this->_layers as $cache) {
            $cache->set($key, $value, $expires);
        }
        return true;
    }

    public function get($key) {
        foreach ($this->_layers as $cache) {
            $value = $cache->get($key);
            if ($value !== false) {
                // key found, place it into faster layers if necessary and then return it
                foreach ($this->_layers as $faster) {
                    if ($faster === $cache) {
                        break;
                    }
                    $ttl = $cache->ttl($key);
                    $faster->set($key, $value, $ttl);
                }
                return $value;
            }
        }
        return false;
    }
    ...
}
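To make the promotion behavior concrete, here is a minimal, self-contained sketch of the same idea, using a trivial array-backed cache as a stand-in for both the in-memory layer and redis. The `ArrayCache` and `LayeredCache` names below are illustrative assumptions, not part of the original code.

```php
// Trivial stand-in for a cache layer; a real slow layer would be redis etc.
class ArrayCache
{
    private $data = array();
    private $ttls = array();

    public function get($key)
    {
        return isset($this->data[$key]) ? $this->data[$key] : false;
    }

    public function set($key, $value, $expires = null)
    {
        $this->data[$key] = $value;
        $this->ttls[$key] = $expires;
    }

    public function ttl($key)
    {
        return isset($this->ttls[$key]) ? $this->ttls[$key] : null;
    }
}

class LayeredCache
{
    private $layers = array();

    public function addCache($cache)
    {
        $this->layers[] = $cache;
        return $this;
    }

    // Write-through: every layer receives the value.
    public function set($key, $value, $expires = null)
    {
        foreach ($this->layers as $cache) {
            $cache->set($key, $value, $expires);
        }
    }

    // Read from the fastest layer that has the key, then promote the
    // value (with its remaining ttl) into all faster layers.
    public function get($key)
    {
        foreach ($this->layers as $cache) {
            $value = $cache->get($key);
            if ($value !== false) {
                foreach ($this->layers as $faster) {
                    if ($faster === $cache) {
                        break;
                    }
                    $faster->set($key, $value, $cache->ttl($key));
                }
                return $value;
            }
        }
        return false;
    }
}
```

After a value evicted from the fast layer is found in the slow layer, the next lookup hits the fast layer again, which is the whole point of the setup.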

DB Resources

Call mysqli_free_result() as soon as you are done with the result data, especially in long-running functions, where the resource would otherwise not be freed until the function exits.

function longRunningProcess() {
    ...
    $result = mysqli_query($link, $query);
    ...
    // process the raw db data into the format you want
    $data = formatData($result);
    ...
    // free the result before doing more heavy lifting
    mysqli_free_result($result);

    // this can potentially run for a long time
    heavy_lifting($data);
}

In the above example, the mysqli_query() call runs the query in MySQL and transfers the entire data set from MySQL to PHP’s internal buffer, ready to be consumed by PHP.

Although we are already done with the DB result after formatData(), the memory holding the buffered data is not freed until the function returns, so it sits there completely wasted and unavailable to the heavy_lifting() function.

Calling mysqli_free_result() explicitly notifies PHP that the buffered DB result can be freed now, and that memory can be reclaimed for heavy_lifting().

Free large blocks of memory

Similar to freeing the db resource as early as possible, it can be a good idea to free variables holding large chunks of data as well (especially in a long-running loop).

for ($i = 0; $i < 5; $i++) {
    $data = openssl_random_pseudo_bytes(1000000);
    echo "peak_memory_usage = " . memory_get_peak_usage(true) . "\n";
    doSomething($data);
    //unset($data);
}
echo "for loop completed, memory now at " . memory_get_usage(true) . "\n";

function doSomething($data) {
    echo "size:" . strlen($data) . "\n";
}

Now when you run the above script, you get something similar to this:

peak_memory_usage = 1310720
size:1000000
peak_memory_usage = 2359296
size:1000000
peak_memory_usage = 2359296
size:1000000
peak_memory_usage = 2359296
size:1000000
peak_memory_usage = 2359296
size:1000000
for loop completed, memory now at 1310720

Notice how the peak memory doubles from the second iteration onward? That is because, on each subsequent iteration, the $data variable still holds the large chunk of memory computed in the previous iteration while openssl_random_pseudo_bytes() allocates another large chunk for its new result.

If you limit your PHP memory to 2M, you’ll get an out-of-memory error:

$ php -d memory_limit=2M test.php
peak_memory_usage = 1310720
size:1000000

Fatal error: Allowed memory size of 2097152 bytes exhausted (tried to allocate 1000001 bytes) in test.php on line 4

The unset() call tells PHP to free up the memory we know we no longer need, reducing peak memory usage. Now uncomment the unset($data) call and re-run the script; it should succeed this time, with constant memory usage.

$ php -d memory_limit=2M test.php
peak_memory_usage = 1310720
size:1000000
peak_memory_usage = 1310720
size:1000000
peak_memory_usage = 1310720
size:1000000
peak_memory_usage = 1310720
size:1000000
peak_memory_usage = 1310720
size:1000000
for loop completed, memory now at 262144

This is not a case of “memory leaking”; the work simply requires too much memory. If the size of the data is not bounded, you will still hit the memory limit eventually, and you need to find a better way to carry out the work without generating such a huge data set in the first place. This trick is meant as a quick fix for situations where rewriting the data-processing code would be too complex or impossible.
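When you can rewrite the processing code, one common restructuring is to stream the work in chunks with a generator, so only one chunk is ever in memory at a time. The sketch below uses a hypothetical fetchPage() helper standing in for a real paged DB query; it is an assumption for illustration, not from the original code.

```php
// Stand-in for a real paged DB query; returns one page of rows.
function fetchPage($page, $pageSize)
{
    static $total = 10; // pretend the data source has 10 rows
    $start = $page * $pageSize;
    if ($start >= $total) {
        return array();
    }
    return range($start, min($start + $pageSize, $total) - 1);
}

// Generator: holds at most one page of rows in memory at a time,
// instead of materializing the entire data set up front.
function streamRows($pageSize)
{
    for ($page = 0; ; $page++) {
        $rows = fetchPage($page, $pageSize);
        if (empty($rows)) {
            return;
        }
        foreach ($rows as $row) {
            yield $row;
        }
        unset($rows); // free the chunk before fetching the next one
    }
}
```

The caller iterates with a plain foreach, and peak memory is bounded by the page size rather than by the total number of rows.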