The curious challenge of categorizing women’s clothing

A multi-part series on delivering structured e-commerce data on demand

Categorization can be a bane or a boon to retailers. Categories are the backbone of any product catalog: a neat, organized map of every product in your inventory that users can browse.

Beauty in miniature. A sample of our extensive master category list

Just like in a library, organizing your products neatly by category is incredibly important as it is a crucial product discovery tool for users. If the search box on your retailer site doesn’t provide satisfactory results, users often turn to the category list to further refine results. If your customers don’t find their product, or even if they get the impression that the product isn’t available, they will head over to another retailer.

In short, if you have poor categorization, you're effectively handing over your revenue to competing retailers.

But that doesn't mean that categorization is an easy task, by any means. In fact, it can be incredibly challenging to algorithmically categorize a product.

Take, for instance, pants.

Pants are ubiquitous. It's very evident that pants are pants, just by looking at them.

But what if the product in question also has the word “belt” in its title?

That’s a bit trickier. If you’re running a keyword analysis, the algorithm might pick up “belt” and misclassify the pants as a belt. This is relatively easy to fix for an individual product, but when you’re dealing with similar levels of ambiguity across millions of products, the task becomes a lot harder than you might think.

Of course, to a human, it's pretty evident that these are pants. But to an algorithm, this presents a conundrum: given just the name and no contextual knowledge, how should it categorize a product when both “belt” and “pant” turn up in the title? Remember: if you miscategorize, you’ll miss out on opportunities to sell this product in the category where it belongs.
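The keyword ambiguity described above can be illustrated with a minimal sketch. The keyword map, the category names and the product title are all invented for illustration, and this is not a description of how Semantics3 categorizes products; it only shows how a naive first-match rule goes wrong and how collecting every matching category at least makes the conflict visible.

```python
import re

# Hypothetical keyword-to-category map, made up for this example.
KEYWORDS = {
    "belt": "Belts",
    "pant": "Pants",
    "pants": "Pants",
}


def tokenize(title):
    """Lowercase the title and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())


def first_match_category(title):
    """Naive rule: return the category of the first keyword in the title."""
    for token in tokenize(title):
        if token in KEYWORDS:
            return KEYWORDS[token]
    return None


def candidate_categories(title):
    """Return every matching category, in order of appearance, without duplicates."""
    candidates = []
    for token in tokenize(title):
        category = KEYWORDS.get(token)
        if category is not None and category not in candidates:
            candidates.append(category)
    return candidates


title = "Belt Detail Wide Leg Pants"  # invented product title
print(first_match_category(title))   # the naive rule picks "Belts"
print(candidate_categories(title))   # both "Belts" and "Pants" are candidates
```

A title that yields more than one candidate category is exactly the kind of product that needs extra context, or a human, to resolve.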

The obvious way to handle this problem would be to hand it off to the best computers ever: human beings. This is easier said than done, though. To begin with, categorizing hundreds of thousands of SKUs can take months of manpower, which comes at a high cost; Mechanical Turk, SamaSource and the like offer great solutions if you can afford them.

The other problem is that manual categorization involves tagging a product with one among thousands of category keywords. Human beings can’t actively recall lists this large, so the less popular categories are likely to suffer from poor categorization.

Furthermore, when you attempt to increase the specificity of categories (i.e., assign products to deeper sub-categories), accuracy drops, sometimes exponentially. Knowing when you have reached the limit of accuracy is important as well.
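One way to see why the drop can be exponential is a simplified model, offered here as an assumption rather than a measured result: suppose each level of the category tree is decided correctly with the same probability, independently of the other levels. A product is then fully correct only if every level along its path is correct, so accuracy compounds with depth. The 0.95 per-level figure and the 0.80 target below are illustrative numbers, not real data.

```python
def path_accuracy(per_level_accuracy, depth):
    """Chance that all `depth` levels are right, assuming independent levels."""
    return per_level_accuracy ** depth


def deepest_level_meeting(per_level_accuracy, target, max_depth=20):
    """Deepest level whose compounded accuracy still meets `target`.

    `max_depth` caps the search so a perfect per-level accuracy terminates.
    """
    depth = 0
    while (depth < max_depth
           and path_accuracy(per_level_accuracy, depth + 1) >= target):
        depth += 1
    return depth


# Illustrative numbers only: 95% per-level accuracy, 80% overall target.
for depth in range(1, 6):
    print(depth, path_accuracy(0.95, depth))

print(deepest_level_meeting(0.95, 0.80))
```

Under these assumptions, a classifier that looks strong at the top level can fall below an acceptable threshold only a few levels down, which is one way to estimate where further specificity stops paying off.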

Unfortunately, this problem remains open-ended. At Semantics3, we adopt a variety of approaches to keep our categorization relatively accurate.