In the last week, one of my websites – recently launched – suffered from major performance problems, even though it has little traffic. The website is built on Symfony2 and acts as a directory for financial/forex companies. I started investigating, and the reason was quickly identified: the pagination was the culprit!

One of the tables pulls RSS feeds from multiple financial news sources and saves them in our CMS content table, and in about one month that table accumulated approximately 50,000 rows. That number of rows is not really that much, but the queries used by the pagination functionality (the knp_paginator bundle) were causing problems, especially since I am deploying the site on a VPS with 2GB of RAM.

Background on limit/offset performance

Many developers dealing with MySQL misunderstand limit/offset functionality: a query doesn't necessarily become efficient just because you apply a limit to it. MySQL processes the query conditions and applies the ordering first, and only then retrieves the desired number of rows. Internally, MySQL may perform a full table scan and use a filesort on the result set (in a file or a temporary table) before retrieving the desired rows, so applying limit/offset comes at a late stage of query execution.

If you are using an index on your columns (especially for the ORDER BY clause), you will get better performance, but with a large offset (let's say an offset of 100,000 with a limit of 10), MySQL still has to process 100,000 rows and throw them away just to return your desired 10 rows!
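As a sketch of that behavior (using a hypothetical articles table indexed on published_at, not one of the tables from this project):

```sql
-- Fast: MySQL walks the first 10 index entries and stops.
SELECT id, title FROM articles
ORDER BY published_at DESC
LIMIT 10 OFFSET 0;

-- Slow: MySQL still walks 100,010 index entries from the start,
-- discards the first 100,000, and returns only the last 10.
SELECT id, title FROM articles
ORDER BY published_at DESC
LIMIT 10 OFFSET 100000;
```

Running EXPLAIN on each of these shows the growing rows estimate for the second query; the offset work is invisible in the SQL text but very visible in the execution.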

That’s why it’s problematic to paginate large datasets in MySQL, and when you are using an ORM you have to make some extra effort to optimize your pagination queries.

Paging Problems in Symfony2 knp_paginator

Usually when we need to paginate one of our tables in Doctrine ORM with Symfony2, we use knp_paginator, since it's easy to use and provides simple paging functionality with just a couple of lines of code. However, when I looked more closely at the way it performs, I found some points that form performance bottlenecks in the way it operates, especially for large data sets.

For the purpose of clarification, I will be using a cms_content table that has the following structure:



```
mysql> desc cms_content;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| category_id  | int(11)      | NO   | MUL | NULL    |                |
| is_published | tinyint(1)   | NO   | MUL | NULL    |                |
| slug         | varchar(255) | NO   | MUL | NULL    |                |
| created_at   | datetime     | NO   | MUL | NULL    |                |
| ....                                                                |
+--------------+--------------+------+-----+---------+----------------+
```

Counting Query:

The columns I frequently use in queries are is_published and category_id, and the ordering is usually based on the created_at column.

In order to get the number of pages available, any paging library constructs a query that counts the results based on the parameters passed. The simplest counting query looks something like this:



```sql
SELECT COUNT(id) FROM cms_content
WHERE category_id = 3 AND is_published = true
ORDER BY created_at DESC;
```

Here I have created an index on the frequently used columns (is_published, category_id, created_at) to utilize indexing and improve all of these queries. These are the indexes applied:

```
mysql> show index from cms_content;
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table       | Non_unique | Key_name             | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cms_content |          0 | PRIMARY              |            1 | id           | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | IDX_A0293FB812469DE2 |            1 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            1 | is_published | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            2 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            3 | created_at   | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.25 sec)
```

When you EXPLAIN this query to check its performance, you will see:

```
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
| id | select_type | table       | type | possible_keys                        | key             | key_len | ref         | rows  | Extra       |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | cms_content | ref  | IDX_A0293FB812469DE2,secondary_index | secondary_index | 5       | const,const | 13972 | Using index |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
```

As you can see, this query is fairly optimized, because it uses a covering index (note "Using index" in the Extra column of the EXPLAIN result). That means MySQL answers the query by looking only at the index, without reading the full data rows of the table.
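For reference, a composite index like the secondary_index shown above can be created with a statement along these lines (a sketch; the column order matters, because the index can only serve the ORDER BY on created_at after the two equality filters):

```sql
ALTER TABLE cms_content
    ADD INDEX secondary_index (is_published, category_id, created_at);
```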

However, when I looked at the count query used by the knp_paginator bundle, I found this complex query:



```sql
SELECT COUNT(*) AS dctrn_count
FROM (
    SELECT DISTINCT id0
    FROM (
        SELECT c0_.id AS id0, c0_.is_published AS is_published1, c0_.slug AS slug2,
               c0_.guid AS guid3, c0_.rss_link AS rss_link4,
               c0_.category_id AS category_id5, c0_.created_at AS created_at6
        FROM cms_content c0_
        WHERE c0_.is_published = true AND c0_.category_id = 3
        ORDER BY c0_.created_at DESC
    ) dctrn_result
) dctrn_table
```

```
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
| id | select_type | table      | type | possible_keys                        | key                  | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | NULL       | NULL | NULL                                 | NULL                 | NULL    | NULL |  NULL | Select tables optimized away |
|  2 | DERIVED     | <derived3> | ALL  | NULL                                 | NULL                 | NULL    | NULL | 26775 | Using temporary              |
|  3 | DERIVED     | c0_        | ALL  | IDX_A0293FB812469DE2,secondary_index | IDX_A0293FB812469DE2 | 4       |      | 27944 | Using where; Using filesort  |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
```

As the EXPLAIN result above shows, this query is not optimized at all! The use of a subquery eliminates the use of the indexes and requires a filesort and a temporary table to process the result. There is also an unnecessary DISTINCT keyword that imposes extra overhead. On my machine this query took 1.04 seconds to execute, compared with 0.01 seconds for the simple count query, although they implement the exact same logic.

Fetching Data Query:

The query that retrieves the data for a page is usually implemented by simply adding limit/offset clauses with the desired parameters. As I explained earlier, this can be a performance problem for large page numbers, because it causes MySQL to process too many rows and then discard most of them.

For example, here is the default query to retrieve the data for page number 5,000 of my cms_content table:



```sql
SELECT DISTINCT c0_.id AS id0, c0_.created_at AS created_at1, c0_.slug AS slug2
FROM cms_content c0_
WHERE c0_.is_published = ? AND c0_.category_id = ?
ORDER BY c0_.created_at DESC
LIMIT 5 OFFSET 24995
```

This query may look simple, but it forces MySQL to process a large number of rows and perform a filesort, and it leads to heavy random I/O, especially if you retrieve more data columns such as title, guid, etc. This cost keeps increasing as the page number gets larger.

One of the best ways to optimize such queries is to use a deferred join, that is, joining with a smaller result set, as follows:



```sql
SELECT id, created_at, slug
FROM cms_content
INNER JOIN (
    SELECT id FROM cms_content
    WHERE category_id = 3 AND is_published = 1
    ORDER BY created_at DESC
    LIMIT 5 OFFSET 24995
) AS small_tbl USING (id)
```



Despite how this query is structured, it is a very efficient one, because it minimizes the cost of I/O operations. You can verify that by checking last_query_cost after executing it and comparing it to the cost of the plain limit/offset approach:

```sql
SHOW STATUS LIKE 'last_query_cost';
```
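The comparison can be sketched as follows (the cost values are MySQL's internal optimizer estimates in arbitrary units, not milliseconds; the literal values 3 and 1 match the examples above):

```sql
-- Plain limit/offset version:
SELECT DISTINCT id, created_at, slug FROM cms_content
WHERE is_published = 1 AND category_id = 3
ORDER BY created_at DESC LIMIT 5 OFFSET 24995;
SHOW STATUS LIKE 'last_query_cost';

-- Deferred-join version:
SELECT id, created_at, slug FROM cms_content
INNER JOIN (
    SELECT id FROM cms_content
    WHERE category_id = 3 AND is_published = 1
    ORDER BY created_at DESC LIMIT 5 OFFSET 24995
) AS small_tbl USING (id);
SHOW STATUS LIKE 'last_query_cost';
```

The second run should report a much lower cost, because its driving sub-query is satisfied entirely from the covering index.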

This deferred join makes MySQL read as little data as possible: the sub-query uses the covering index and retrieves a small result set at very low cost (few I/O operations), and then the full data rows are fetched only for the small number of IDs returned by the sub-query.

Doctrine Code

In order to apply the optimized pagination queries in my Symfony project, I had to write my own code and remove the knp_paginator calls.

First, I wrote a method to get the count of results used in paging, placing it in the ContentRepository class:



```php
public function getContentCountByCategory($cat_id)
{
    $qb = $this->createQueryBuilder('c')
        ->select('count(c.id)')
        ->where('c.categoryId = :category_id')
        ->andWhere('c.isPublished = :published')
        ->setParameter('category_id', $cat_id)
        ->setParameter('published', true);

    return $qb->getQuery()->getSingleScalarResult();
}
```

This method produces the optimized version of the count query.

Second, I had to write a method that retrieves results with a deferred join. I tried to use the query builder, but unfortunately Doctrine doesn't support subqueries in the FROM clause, so I wrote a native SQL query with a ResultSetMapping, as follows:



```php
public function getContentQueryPyCategory($cat_id, $lang, $page = 1, $limit = 10)
{
    $offset = ($page - 1) * $limit;

    $rsm = new ResultSetMapping();
    $rsm->addEntityResult('Webit\CMSBundle\Entity\Content', 'c');
    $rsm->addFieldResult('c', 'id', 'id');
    $rsm->addFieldResult('c', 'created_at', 'createdAt');
    $rsm->addFieldResult('c', 'slug', 'slug');

    $query = $this->getEntityManager()->createNativeQuery(
        "select id, created_at, slug from cms_content inner join " .
        "(select id from cms_content where category_id = ? and is_published = ? " .
        "order by created_at desc limit ? offset ?) as small_tbl using (id)",
        $rsm
    );
    $query->setParameter(1, $cat_id);
    $query->setParameter(2, true);
    $query->setParameter(3, $limit);
    $query->setParameter(4, $offset);

    return $query->getResult();
}
```

This method produces an efficient query that reduces I/O operations inside MySQL, and its cost stays minimal even as the number of pages increases.
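Putting the two repository methods together in a controller action might look like the following sketch (the controller name, repository alias, route parameters, and template path are assumptions for illustration, not taken from the original project):

```php
public function listAction($cat_id, $page = 1)
{
    $limit = 10;
    // Repository alias assumed from the entity class used above.
    $repo = $this->getDoctrine()->getRepository('WebitCMSBundle:Content');

    // One cheap covering-index COUNT for the pager...
    $total = $repo->getContentCountByCategory($cat_id);
    $pages = (int) ceil($total / $limit);

    // ...and one deferred-join query for the current page's rows.
    $items = $repo->getContentQueryPyCategory($cat_id, 'en', $page, $limit);

    return $this->render('WebitCMSBundle:Content:list.html.twig', array(
        'items' => $items,
        'page'  => $page,
        'pages' => $pages,
    ));
}
```

The pager template then only needs $page and $pages to render the page links; neither query touches more than a handful of full data rows.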

After implementing this solution, the queries used to produce page number 3,000 took about 46ms, compared to approximately 440ms with the original paginator code. More importantly, MySQL's memory usage improved, because the queries use the indexes in a much more efficient way.

Conclusion

In this post, I discussed the performance problems of pagination in a Doctrine & Symfony2 project, and applied some techniques that handle limit/offset operations in pagination more efficiently. There are other workarounds that also give good performance, like this good post. This piece of code made my application perform better; I hope it helps others who face a similar situation.