Follow @vlad_mihalcea Imagine having a tool that can automatically detect if you are using JPA and Hibernate properly. Hypersistence Optimizer is that tool!

Introduction

In this article, we are going to how the JPQL DISTINCT keyword behaves depending on the underlying entity query type.

Domain Model

Let’s assume we are using the following Post and PostComment entities in our application:

The Post entity is mapped as follows:

@Entity(name = "Post") @Table(name = "post") public class Post { @Id @GeneratedValue private Long id; private String title; @Column(name = "created_on") private LocalDate createdOn; @OneToMany( mappedBy = "post", cascade = CascadeType.ALL, orphanRemoval = true ) private List<PostComment> comments = new ArrayList<>(); //Getters and setters omitted for brevity public void addComment(PostComment comment) { comments.add(comment); comment.setPost(this); } }

The addComment method is useful for synchronizing both ends of the bidirectional @OneToMany association. For more details, check out this article.

And the PostComment entity looks as follows:

@Entity(name = "PostComment") @Table(name = "post_comment") public class PostComment { @Id @GeneratedValue private Long id; @ManyToOne(fetch = FetchType.LAZY) private Post post; private String review; public PostComment() {} public PostComment(String review) { this.review = review; } //Getters and setters omitted for brevity }

By default, all @ManyToOne and @OneToOne associations are to be fetched eagerly, which is most often a terrible idea. That’s the reason we used the FetchType.LAZY strategy for the @ManyToOne association.

Test data

Now, let’s add some test data which we are going to use to demonstrate how the DISTINCT keyword works based on the underlying JPQL query type:

Post post1 = new Post(); post1.setTitle( "High-Performance Java Persistence eBook has been released!" ); post1.setCreatedOn( LocalDate.of(2016, 8, 30) ); entityManager.persist(post1); post1.addComment(new PostComment("Excellent!")); post1.addComment(new PostComment("Great!")); Post post2 = new Post(); post2.setTitle( "High-Performance Java Persistence paperback has been released!" ); post2.setCreatedOn( LocalDate.of(2016, 10, 12) ); entityManager.persist(post2); Post post3 = new Post(); post3.setTitle( "High-Performance Java Persistence Mach 1 video course has been released!" ); post3.setCreatedOn( LocalDate.of(2018, 1, 30) ); entityManager.persist(post3); Post post4 = new Post(); post4.setTitle( "High-Performance Java Persistence Mach 2 video course has been released!" ); post4.setCreatedOn( LocalDate.of(2018, 5, 8) ); entityManager.persist(post4);

DISTINCT with JPQL scalar queries

When using a scalar projection like the one in the following example:

List<Integer> publicationYears = entityManager .createQuery( "select distinct year(p.createdOn) " + "from Post p " + "order by year(p.createdOn)", Integer.class) .getResultList(); LOGGER.info("Publication years: {}", publicationYears);

The DISTINCT keyword is needed to be passed to the underlying SQL statement, and Hibernate is going to output the following result:

SELECT DISTINCT extract(YEAR FROM p.created_on) AS col_0_0_ FROM post p ORDER BY extract(YEAR FROM p.created_on) -- Publication years: [2016, 2018]

Therefore, for scalar queries, the DISTINCT JPQL keyword is needed to be passed to the underlying SQL query since we want the result set to remove duplicates.

DISTINCT with JPQL entity queries

The DISTINCT keyword has a different purpose when it comes to entity queries. Without using DISTINCT , the JPA specification states that the returning entities resulting from a parent-child JOIN might contain object reference duplicates.

To visualize this behavior, consider the following JPQL query:

List<Post> posts = entityManager .createQuery( "select p " + "from Post p " + "left join fetch p.comments " + "where p.title = :title", Post.class) .setParameter( "title", "High-Performance Java Persistence eBook has been released!" ) .getResultList(); LOGGER.info( "Fetched the following Post entity identifiers: {}", posts.stream().map(Post::getId).collect(Collectors.toList()) );

When running the JPQL query above, Hibernate generate the following output:

SELECT p.id AS id1_0_0_, pc.id AS id1_1_1_, p.created_on AS created_2_0_0_, p.title AS title3_0_0_, pc.post_id AS post_id3_1_1_, pc.review AS review2_1_1_, pc.post_id AS post_id3_1_0__ FROM post p LEFT OUTER JOIN post_comment pc ON p.id=pc.post_id WHERE p.title='High-Performance Java Persistence eBook has been released!' -- Fetched the following Post entity identifiers: [1, 1]

As illustrated by the log message, the returned posts List contains two references of the same Post entity object. This is because the JOIN duplicates the parent record for every child row that’s going to be fetched.

To remove the entity reference duplicates, we need to use the DISTINCT JPQL keyword:

List<Post> posts = entityManager .createQuery( "select distinct p " + "from Post p " + "left join fetch p.comments " + "where p.title = :title", Post.class) .setParameter( "title", "High-Performance Java Persistence eBook has been released!" ) .getResultList(); LOGGER.info( "Fetched the following Post entity identifiers: {}", posts.stream().map(Post::getId).collect(Collectors.toList()) );

When executing the JPQL query above, Hibernate will now generate the following output:

SELECT DISTINCT p.id AS id1_0_0_, pc.id AS id1_1_1_, p.created_on AS created_2_0_0_, p.title AS title3_0_0_, pc.post_id AS post_id3_1_1_, pc.review AS review2_1_1_, pc.post_id AS post_id3_1_0__ FROM post p LEFT OUTER JOIN post_comment pc ON p.id=pc.post_id WHERE p.title='High-Performance Java Persistence eBook has been released!' -- Fetched the following Post entity identifiers: [1]

So, the duplicates were removed from the posts List , but the DISTINCT keyword was also passed to the underlying SQL statement. For this SQL query, the DISTINCT keyword serves no purpose since the result set will contain unique parent-child records.

If we analyze the execution plan for the previous SQL statement, we can see that a quicksort execution is being added to the plan:

Unique (cost=23.71..23.72 rows=1 width=1068) (actual time=0.131..0.132 rows=2 loops=1) -> Sort (cost=23.71..23.71 rows=1 width=1068) (actual time=0.131..0.131 rows=2 loops=1) Sort Key: p.id, pc.id, p.created_on, pc.post_id, pc.review Sort Method: quicksort Memory: 25kB -> Hash Right Join (cost=11.76..23.70 rows=1 width=1068) (actual time=0.054..0.058 rows=2 loops=1) Hash Cond: (pc.post_id = p.id) -> Seq Scan on post_comment pc (cost=0.00..11.40 rows=140 width=532) (actual time=0.010..0.010 rows=2 loops=1) -> Hash (cost=11.75..11.75 rows=1 width=528) (actual time=0.027..0.027 rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on post p (cost=0.00..11.75 rows=1 width=528) (actual time=0.017..0.018 rows=1 loops=1) Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text) Rows Removed by Filter: 3 Planning time: 0.227 ms Execution time: 0.179 ms

The quicksort execution adds an unneeded overhead to our statement execution since we don’t need to eliminate any duplicates since the result set contains unique parent-child row combinations.

Using the hibernate.query.passDistinctThrough JPQL query hint

To avoid passing the DISTINCT keyword to the underlying SQL statement, we need to activate the hibernate.query.passDistinctThrough JPQL query hint as illustrated by the following example:

List<Post> posts = entityManager .createQuery( "select distinct p " + "from Post p " + "left join fetch p.comments " + "where p.title = :title", Post.class) .setParameter( "title", "High-Performance Java Persistence eBook has been released!" ) .setHint("hibernate.query.passDistinctThrough", false) .getResultList(); LOGGER.info( "Fetched the following Post entity identifiers: {}", posts.stream().map(Post::getId).collect(Collectors.toList()) );

When running the JPQL with the hibernate.query.passDistinctThrough hint activated, Hibernate executes the following SQL query:

SELECT p.id AS id1_0_0_, pc.id AS id1_1_1_, p.created_on AS created_2_0_0_, p.title AS title3_0_0_, pc.post_id AS post_id3_1_1_, pc.review AS review2_1_1_, pc.post_id AS post_id3_1_0__ FROM post p LEFT OUTER JOIN post_comment pc ON p.id=pc.post_id WHERE p.title='High-Performance Java Persistence eBook has been released!' -- Fetched the following Post entity identifiers: [1]

Therefore, the DISTINCT keyword is no longer passed to the SQL query, but entity duplicates are removed from the returning posts List .

If we analyze the execution plan for the last SQL query, we can see that the quicksort execution is no longer added to the execution plan:

Hash Right Join (cost=11.76..23.70 rows=1 width=1068) (actual time=0.066..0.069 rows=2 loops=1) Hash Cond: (pc.post_id = p.id) -> Seq Scan on post_comment pc (cost=0.00..11.40 rows=140 width=532) (actual time=0.011..0.011 rows=2 loops=1) -> Hash (cost=11.75..11.75 rows=1 width=528) (actual time=0.041..0.041 rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on post p (cost=0.00..11.75 rows=1 width=528) (actual time=0.036..0.037 rows=1 loops=1) Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text) Rows Removed by Filter: 3 Planning time: 1.184 ms Execution time: 0.160 ms

Conclusion

Because the DISTINCT JPQL keyword has two meanings based on the underlying query type, it’s important to pass it through to the SQL statement only for scalar queries where the result set requires duplicates to be removed by the database engine.

For parent-child entity queries where the child collection is using JOIN FETCH , the DISTINCT keyword should only be applied after the ResultSet is got from JDBC, therefore avoiding passing DISTINCT to the SQL statement that gets executed.

Insert details about how the information is going to be processed DOWNLOAD NOW