Retail Reference Architecture Part 1: Building a Flexible, Searchable, Low-Latency Product Catalog

Series: Building a Flexible, Searchable, Low-Latency Product Catalog | Approaches to Inventory Optimization | Query Optimization and Scaling | Recommendations and Personalizations

Product catalog data management is a complex problem for retailers today. After years of relying on multiple monolithic, vendor-provided systems, retailers are now reconsidering their options and looking to the future.

In today’s vendor-provided systems, product data must frequently be moved back and forth using ETL processes to ensure all systems are operating on the same data set. This approach is slow, error prone, and expensive in terms of development and management. In response, retailers are now making data services available individually as part of a centralized service-oriented architecture (SOA).

This is a pattern that we commonly see at MongoDB, so much so that we’ve begun to define some best practices and a reference architecture specifically targeted at the retail space. As part of that effort, today we’ll be taking a look at implementing a catalog service using MongoDB as the first of a three-part series on retail architecture.

Why MongoDB?

Many different database types are able to fulfill our product catalog use case, so why choose MongoDB?

Document flexibility: Each MongoDB document can store data represented as rich JSON structures. This makes MongoDB ideal for storing just about anything, including very large catalogs with thousands of variants per item.

Dynamic schema: JSON structures within each document can be altered at any time, allowing for increased agility and easy restructuring of data when needs change. In MongoDB, these multiple schemas can be stored within a single collection and can use shared indexes, allowing for efficient searching of both old and new formats simultaneously.

Expressive query language: The ability to perform queries across many document attributes simplifies many tasks.
This can also improve application performance by lowering the required number of database requests.

Indexing: Powerful secondary, compound and geo-indexing options are available in MongoDB right out of the box, quickly enabling features like sorting and location-based queries.

Data consistency: By default, all reads and writes are sent to the primary member of a MongoDB replica set. This ensures strong consistency, an important feature for retailers, who may have many customers making requests against the same item inventory.

Geo-distributed replicas: Network latency due to geographic distance between a data source and the client can be problematic, particularly for a catalog service which would be expected to sustain a large number of low-latency reads. MongoDB replica sets can be geo-distributed, so that they are close to users for fast access, mitigating the need for CDNs in many cases.

These are just a few of the characteristics of MongoDB that make it a great option for retailers. Next, we’ll take a look at some of the specifics of how we put some of these to use in our retail reference architecture to support a number of features, including:

Searching for items and item variants
Retrieving per store pricing for items
Enabling catalog browsing with faceted search

Item Data Model

The first thing we need to consider is the data model for our items. In the following examples we are showing only the most important information about each item, such as category, brand and description:

    {
      "_id": "30671", // main item ID
      "department": "Shoes",
      "category": "Shoes/Women/Pumps",
      "brand": "Calvin Klein",
      "thumbnail": "http://cdn.../pump.jpg",
      "title": "Evening Platform Pumps",
      "description": "Perfect for a casual night out or a formal event.",
      "style": "Designer",
      ...
    }

This type of simple data model allows us to easily query for items based on the most demanded criteria.
For example, using db.collection.findOne, which returns a single document that satisfies a query, and db.collection.find, which returns a cursor over every matching document:

Get item by ID:

    db.definition.findOne({_id: "30671"})

Get items for a set of product IDs:

    db.definition.find({_id: {$in: ["30671", "452318"]}})

Get items by category prefix:

    db.definition.find({category: /^Shoes\/Women/})

Notice how the second and third queries use the $in operator and a regular expression, respectively. When these queries run against properly indexed fields, MongoDB is able to provide high throughput and low latency.

Variant Data Model

Another important consideration for our product catalog is item variants, such as available sizes, colors, and styles. Our item data model above only captures a small amount of the data about each catalog item. So what about all of the available item variations we may need to retrieve, such as size and color?

One option is to store an item and all its variants together in a single document. This approach has the advantage of being able to retrieve an item and all variants in a single query. However, it is not the best approach in all cases. It is an important best practice to avoid unbounded document growth. If the number of variants and their associated data is small, it may make sense to store them in the item document.

Another option is to create a separate variant data model that can be referenced relative to the primary item:

    {
      "_id": "93284847362823", // variant SKU
      "itemId": "30671", // references the main item
      "size": 6.0,
      "color": "red",
      ...
    }

This data model allows us to do fast lookups of specific item variants by their SKU number:

    db.variation.find({_id: "93284847362823"})

As well as all variants for a specific item by querying on the itemId attribute:

    db.variation.find({itemId: "30671"}).sort({_id: 1})

In this way, we maintain fast queries on both our primary item for displaying in our catalog, as well as every variant for when the user requests a more specific product view.
We also ensure a predictable size for the item and variant documents.

Per Store Pricing

Another consideration when defining the reference architecture for our product catalog is pricing. We’ve now seen a few ways that the data model for our items can be structured to quickly retrieve items directly or based on specific attributes. Prices can vary by many factors, like store location. We need a way to quickly retrieve the specific price of any given item or item variant. This can be very problematic for large retailers, since a catalog with a million items and one thousand stores means we must query across a collection of a billion documents to find the price of any given item.

We could, of course, store the price for each variant as a nested document within the item document, but a better solution is to again take advantage of how quickly MongoDB is able to query on _id. For example, if each item in our catalog is referenced by an itemId, while each variant is referenced by a SKU number, we can set the _id of each document to be a concatenation of the itemId or SKU and the storeId associated with that price variant. Using this model, the _id for the pair of pumps and its red variant described above would look something like this:

    Item: 30671_store23
    Variant: 93284847362823_store23

This approach also provides a lot of flexibility for handling pricing, as it allows us to price items at the item or the variant level.
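
The key scheme described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the helper names priceId and candidatePriceIds are hypothetical, the underscore separator is taken from the example keys, and the sketch assumes that item IDs, SKUs, and store labels never contain an underscore themselves.

```javascript
// Hypothetical helper (not a MongoDB API): build the composite _id for one
// price document from an itemId or SKU plus a pricing scope such as "store23".
function priceId(itemOrSku, scope) {
  return `${itemOrSku}_${scope}`;
}

// Hypothetical helper: list every price _id that could apply to an item and
// one of its variants, for the scopes (store, store group, ...) in play.
function candidatePriceIds(itemId, sku, scopes) {
  const ids = [];
  for (const key of [itemId, sku]) {
    for (const scope of scopes) {
      ids.push(priceId(key, scope));
    }
  }
  return ids;
}

// Example: the pumps and their red variant, priced for one store.
const exampleIds = candidatePriceIds("30671", "93284847362823", ["store23"]);
console.log(exampleIds.join("\n"));
```

Because each key is a plain string, a price lookup is an exact match or prefix match on _id, which is always indexed, so no additional index is needed for this access pattern.
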
We can then query for all prices or just the price for a particular location:

All prices:

    db.prices.find({_id: /^30671_/})

Store price:

    db.prices.find({_id: /^30671_store23/})

We could even add other combinations, such as pricing per store group, and get all possible prices for an item with a single query by using the $in operator:

    db.prices.find({_id: {$in: [
      "30671_store23",
      "30671_sgroup12",
      "93284847362823_store23",
      "93284847362823_sgroup12"
    ]}})

Browse and Search Products

The biggest challenge for our product catalog is to enable browsing with faceted search. While many users will want to search our product catalog for a specific item or criteria they are looking for, many others will want to browse, then narrow the returned results by any number of attributes. Building a faceted browse page of this kind presents several challenges:

Response time: As the user browses, each page of results should return in milliseconds.

Multiple attributes: As the user selects different facets (e.g. brand, size, color), new queries must be run on multiple document attributes.

Variant-level attributes: Some user-selected attributes will be queried at the item level, such as brand, while others will be at the variant level, such as size.

Multiple variants: Thousands of variants can exist for each item, but we only want to display each item once, so results must be de-duplicated.

Sorting: The user needs to be allowed to sort on multiple attributes, like price and size, and that sorting operation must perform efficiently.

Pagination: Only a small number of results should be returned per page, which requires deterministic ordering.

Many retailers may want to use a dedicated search engine as the basis of these features. MongoDB provides an open source connector project, which allows the use of Apache Solr and Elasticsearch with MongoDB. For our reference architecture, however, we wanted to implement faceted search entirely within MongoDB.
To accomplish this, we create another collection that stores what we will call summary documents. These documents contain all of the information we need to do fast lookups of items in our catalog based on various search facets.

    {
      "_id": "30671",
      "title": "Evening Platform Pumps",
      "department": "Shoes",
      "category": "Shoes/Women/Pumps",
      "price": 149.95,
      "attrs": [{"brand": "Calvin Klein"}, ...],
      "sattrs": [{"style": "Designer"}, ...],
      "vars": [
        {
          "sku": "93284847362823",
          "attrs": [{"size": 6.0}, {"color": "red"}, ...],
          "sattrs": [{"width": 8.0}, {"heelHeight": 5.0}, ...]
        },
        ... // many more SKUs
      ]
    }

Note that in this data model we are defining attributes and secondary attributes. While a user may want to be able to search on many different attributes of an item or item variant, there is only a core set that is most frequently used. For example, given a pair of shoes, it may be more common for a user to filter their search based on available size than by heel height. By using both the attrs and sattrs fields in our data model, we are able to make all of these item attributes available to search, but incur only the expense of indexing the most used attributes by indexing only attrs.

Using this data model, we would create compound indices on the following combinations:

    department + attrs + category + _id
    department + vars.attrs + category + _id
    department + category + _id
    department + price + _id
    department + rating + _id

In these indices, we always start with department, and we assume users will choose the department to refine their search results. For a catalog without departments, we could have just as easily begun with another common facet like category or type.
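
The five index combinations listed above could be written down as index key patterns along the following lines. This is a sketch rather than the reference implementation: the names summaryIndexSpecs and ensureSummaryIndexes are hypothetical, the field names follow the example summary document (attrs and vars.attrs), and ascending order on every field is an assumption, since the article does not specify a sort direction. A stand-in object is used in place of a real collection so the snippet runs without a database.

```javascript
// Index key patterns mirroring the five combinations in the text.
// Ascending order (1) on every field is an assumption.
const summaryIndexSpecs = [
  { department: 1, attrs: 1, category: 1, _id: 1 },
  { department: 1, "vars.attrs": 1, category: 1, _id: 1 },
  { department: 1, category: 1, _id: 1 },
  { department: 1, price: 1, _id: 1 },
  { department: 1, rating: 1, _id: 1 },
];

// Hypothetical helper: request each index on any collection-like object
// that exposes createIndex(keys), such as a mongo shell collection.
function ensureSummaryIndexes(collection) {
  return summaryIndexSpecs.map((keys) => collection.createIndex(keys));
}

// Dry run against a stand-in that only records the requested key order.
const recorded = [];
const fakeCollection = {
  createIndex: (keys) => {
    recorded.push(Object.keys(keys).join(" + "));
    return keys;
  },
};
ensureSummaryIndexes(fakeCollection);
console.log(recorded.join("\n"));
```

Each pattern contains at most one array-valued field (attrs or vars.attrs), which matters because MongoDB does not allow a compound index to cover more than one array field of the same document. Ending every pattern with _id gives the deterministic ordering that pagination needs.
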
We can then perform the queries needed for faceted search against the summary collection and quickly return the results to the page:

Get summary from itemId:

    db.summary.find({_id: "30671"})

Get summary of specific item variant:

    db.summary.find({"vars.sku": "93284847362823"}, {"vars.$": 1})

Get summaries for all items by department:

    db.summary.find({department: "Shoes"})

Get summaries with a mix of parameters:

    db.summary.find({
      "department": "Shoes",
      "vars.attrs": {"color": "red"},
      "category": /^Shoes\/Women/
    })

Recap

We’ve looked at some best practices for modeling and indexing data for a product catalog that supports a variety of application features, including item and item variant lookup, store pricing, and catalog browsing using faceted search. Using these approaches as a starting point can help you find the best design for your own implementation.

Learn more

To discover how you can re-imagine the retail experience with MongoDB, read our white paper. In this paper, you'll learn about the new retail challenges and how MongoDB addresses them.

Learn more about how leading brands differentiate themselves with technologies and processes that enable the omni-channel retail experience. Read our guide on the digitally oriented consumer.

Read Part 2 >>