Introduction

A common use case for ML is item classification on a particular dataset. For example: classifying fruits for an online greengrocer. However, in certain scenarios, classification is more complex than choosing the category that best describes an object from a given list. Categories can have relationships with each other that dictate how they are selected and how they are used to define an object. I propose a way to organize a dataset into a hierarchical structure of categories, and a novel Boolean logic ensemble method to classify such a dataset.

Problem Statement

I was tasked with building a system for a food app in which dishes could be classified into cuisine categories using only the names of the dishes themselves. I was provided a user-labelled dataset of 2000 dish items, and had to overcome the following limitations:

- The labelled dataset was extremely unbalanced.
- The user labels were not always reliable: if one user labelled dish a as belonging to cuisine x, another user might label a similar dish b as belonging to cuisine y.
- The dataset contained only 2000 datapoints, which is far too small to train any sort of neural network [2] (although certain neural networks perform better on unbalanced datasets [1]).
- The system had to be as computationally efficient as possible.

Approach

It is important to understand that cuisines can be grouped by their geographical locations and then organized into a hierarchical structure. I surveyed the most common cuisines in my city and, using their geographical locations, developed the following tree hierarchy:

Cuisine Tree

The categories now have an ancestor/descendant relationship with one another. Additionally, the categories are divided into 4 groups, from level 1 to level 4, corresponding to the depth of each category in the tree.

The first issue to tackle was finding a way to standardize the labels generated by the users and to weed out any ambiguity in the dataset. The dataset was in the following format:

| ID | Dish Name | Cuisine |
|----|-----------|---------|
| 1 | Chicken Tikka | Pakistani |
| 2 | Chocolate Brownie Sizzler | Western |

The dataset was then edited to fit the hierarchy defined in the cuisine tree above, by splitting the single cuisine label into 4 labels corresponding to the levels.

| ID | Dish Name | Level 1 | Level 2 | Level 3 | Level 4 |
|----|-----------|---------|---------|---------|---------|
| 1 | Chicken Tikka | Eastern | Asian | Desi | Pakistani |
| 2 | Chocolate Brownie Sizzler | Western | – | – | – |

Every dish in the dataset now belongs to at least one of the two level 1 categories, Eastern or Western, because all other categories are descendants of one of these two.
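As a concrete illustration, here is a minimal Python sketch of this label expansion, assuming the cuisine tree is stored as a child-to-parent map. The PARENT dictionary below covers only a slice of the tree, and the helper names are my own, not part of the original system:

```python
# Illustrative child -> parent map for part of the cuisine tree.
PARENT = {
    "Eastern": None, "Western": None,
    "Asian": "Eastern", "African": "Eastern", "European": "Western",
    "Desi": "Asian", "East Asian": "Asian", "Italian": "European",
    "Pakistani": "Desi", "Chinese": "East Asian", "Sudanese": "African",
}

MAX_LEVEL = 4  # depth of the cuisine tree

def lineage(cuisine):
    """Walk parent pointers from a cuisine up to its root, root first."""
    path = []
    while cuisine is not None:
        path.append(cuisine)
        cuisine = PARENT[cuisine]
    return path[::-1]

def to_levels(cuisine):
    """Return the four level labels, padding missing depths with None."""
    path = lineage(cuisine)
    return path + [None] * (MAX_LEVEL - len(path))

print(to_levels("Pakistani"))  # ['Eastern', 'Asian', 'Desi', 'Pakistani']
print(to_levels("Western"))    # ['Western', None, None, None]
```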

The next challenge was to deal with the unbalanced nature of the dataset [3]. The dataset was analysed to see which categories had a large enough presence in the population for an ML model to identify them confidently. For example: out of the 2000 datapoints in the dataset, only one is categorized as Sudanese, meaning only 0.05% of the dataset contains Sudanese dishes. We remove Sudanese as a category and collapse it into its parent category, African.

The dish categorization is changed from:

| ID | Dish Name | Level 1 | Level 2 | Level 3 | Level 4 |
|----|-----------|---------|---------|---------|---------|
| 203 | Kuindiong | Eastern | African | Sudanese | – |

To:

| ID | Dish Name | Level 1 | Level 2 | Level 3 | Level 4 |
|----|-----------|---------|---------|---------|---------|
| 203 | Kuindiong | Eastern | African | – | – |

In a normal classifier, this would result in a loss of information and make the classifier less useful. However, I will show how this loss is mitigated when I return to this example later on.
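A rough sketch of this trimming pass, assuming the dataset lives in a pandas DataFrame with one column per level; the 1% cutoff below is illustrative, since the exact threshold used is not specified here:

```python
import pandas as pd

THRESHOLD = 0.01  # illustrative cutoff: at least 1% of the dataset
LEVELS = ["Level 1", "Level 2", "Level 3", "Level 4"]

def trim(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse rare categories into their parents by blanking the cell;
    the parent label already sits one level column to the left."""
    df = df.copy()
    for i, col in enumerate(LEVELS):
        counts = df[col].value_counts()
        rare = counts[counts / len(df) < THRESHOLD].index
        mask = df[col].isin(rare)
        # Blank the rare label and every deeper cell on those rows, so
        # each row keeps a contiguous lineage ending at its deepest
        # surviving category.
        for deeper in LEVELS[i:]:
            df.loc[mask, deeper] = None
    return df
```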

By following the above steps for all categories, we successfully “trim” our cuisine tree. The result is as follows:

Trimmed Cuisine Tree

Once the ambiguity and imbalance in the dataset are resolved, it might be tempting to put the data through a multi-class multi-label classifier. However, this would present two issues:

1. There is no simple way to make an ML model understand the relationships between categories, and to ensure that the predictions for the different levels follow the same “lineage” [4].

Wrong prediction lineage

| Dish Name | Level 1 | Level 2 | Level 3 | Level 4 |
|-----------|---------|---------|---------|---------|
| Roasted Peking Duck | Eastern | European | Desi | Chinese |

Correct prediction lineage

| Dish Name | Level 1 | Level 2 | Level 3 | Level 4 |
|-----------|---------|---------|---------|---------|
| Roasted Peking Duck | Eastern | Asian | East Asian | Chinese |

2. The more significant issue is that there are a lot of null values in the dataset. Dropping all the datapoints with null values would significantly shrink the already small dataset, so the null values would have to be filled with a placeholder value.

Dataset with placeholder value “Neither”.

| Dish Name | Level 1 | Level 2 | Level 3 | Level 4 |
|-----------|---------|---------|---------|---------|
| Chicken Wild Mushroom Crepe | Western | European | Italian | Neither |

This presents another issue. The placeholder value would be treated the same as the other categories, even though it does not exist within the hierarchy of the cuisine tree. Additionally, the placeholder values have no correlation with any features of the predictor variable (the dish name), because the null values they replace exist either due to unreliable user labelling or due to the “trimming” performed on the dataset. The result is a dataset with too much noise for any simple classifier to handle.

Solution

The solution is to use multiple smaller homogeneous classifiers, one for each “level” of categories in the cuisine tree. Each classifier is a multi-class single-label classifier, which simplifies the classification problem while also reducing the noise introduced to each classifier.

Training Classifiers
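For illustration, here is how that per-level training could look in scikit-learn. The article does not name the model used, so the TF-IDF plus logistic regression pipeline below is an assumption on my part:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LEVELS = ["Level 1", "Level 2", "Level 3", "Level 4"]

def train_level_classifiers(df):
    """Train one multi-class, single-label classifier per tree level.
    Rows with a null label at a given level are simply excluded from
    that level's training set, so no placeholder class is needed."""
    classifiers = {}
    for level, col in enumerate(LEVELS, start=1):
        labelled = df.dropna(subset=[col])
        model = make_pipeline(
            # Character n-grams cope well with short, noisy dish names.
            TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(labelled["Dish Name"], labelled[col])
        classifiers[level] = model
    return classifiers
```

Note how this sidesteps the placeholder problem entirely: each level's classifier only ever sees rows that genuinely carry a label at that level.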

When a dish name is fed to the classifiers, the individual predictions are passed to a logic system that accepts or rejects each prediction based on the predictions from the other classifiers and the hierarchy defined in the cuisine tree. The logic operates as follows:

1. Set recent_ancestor = None
2. Set confidence_level = x
3. Set n = 1
4. Set max_level = y (in this example, y = 4 because we have 4 levels within the cuisine tree hierarchy)
5. Get predictions from the level n classifier
6. If n = 1:

6.1. Get the best prediction p from the predictions

6.2. Set recent_ancestor = p

6.3. Set n = n + 1

7. Else if n > 1 AND n <= max_level:

7.1. Sort the predictions from the level n classifier by confidence score, in descending order

7.2. Iterate through the sorted predictions

7.3. If any prediction p is a descendant of recent_ancestor AND has a confidence score >= x:

7.3.1. Set recent_ancestor = p

7.4. Set n = n + 1

8. If n > max_level:

8.1. Output recent_ancestor as the final prediction

8.2. Exit

9. Go to step 5

Logic Flow
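Here is a Python sketch of this logic, assuming each fitted per-level model exposes scikit-learn's predict_proba and classes_, and reusing the illustrative PARENT map from the earlier sketch; the 0.5 default confidence level is likewise an assumption:

```python
def is_descendant(category, ancestor):
    """True if `category` lies below `ancestor` in the cuisine tree."""
    node = PARENT.get(category)
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False

def ensemble_predict(dish_name, classifiers, confidence_level=0.5):
    """Boolean logic ensemble: accept a level's best prediction only if
    it descends from the accepted prediction of the level above and
    clears the confidence threshold."""
    recent_ancestor = None
    for n in range(1, len(classifiers) + 1):
        model = classifiers[n]
        probs = model.predict_proba([dish_name])[0]
        ranked = sorted(zip(model.classes_, probs), key=lambda t: -t[1])
        if n == 1:
            recent_ancestor = ranked[0][0]  # best level 1 prediction
        else:
            for category, score in ranked:
                if score >= confidence_level and is_descendant(category, recent_ancestor):
                    recent_ancestor = category  # accept and descend
                    break
    return recent_ancestor
```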

Once a final result is obtained, the lineage of the result can be determined using a depth-first search of the cuisine tree [5]. For example: if the final result is Chinese, the resulting lineage would be: Eastern, Asian, East Asian, Chinese.

Correct Cuisine Lineage

Similarly, if the final result is East Asian, the resulting lineage would be: Eastern, Asian, East Asian. This means that none of the predictions from the level 4 classifier met the criteria set in the logic above, and hence all were rejected.
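A small sketch of that depth-first lineage lookup, over an illustrative root-to-children adjacency map of the full cuisine tree:

```python
# Illustrative root -> children adjacency map (partial, untrimmed tree).
CHILDREN = {
    "Eastern": ["Asian", "African"], "Western": ["European"],
    "Asian": ["Desi", "East Asian"], "African": ["Sudanese"],
    "European": ["Italian"], "Desi": ["Pakistani"],
    "East Asian": ["Chinese"],
}

def dfs_lineage(target, node, path=()):
    """Depth-first search; returns the root-to-target path, or None."""
    path = path + (node,)
    if node == target:
        return list(path)
    for child in CHILDREN.get(node, []):
        found = dfs_lineage(target, child, path)
        if found:
            return found
    return None

# The tree has two roots, so search from each.
for root in ("Eastern", "Western"):
    result = dfs_lineage("Chinese", root)
    if result:
        print(result)  # ['Eastern', 'Asian', 'East Asian', 'Chinese']
```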

Using this method, even if a dish's exact category cannot be predicted by the classifiers, its parent category can still be predicted with a high degree of confidence. The system is designed to output a more generalized but accurate result in the absence of a precise one.

Let's take the example of the Sudanese dish “Kuindiong” from earlier in the article. When the cuisine tree was trimmed, the Sudanese category was collapsed into its parent category, African. This means the system cannot classify any dish as Sudanese, but it can classify a Sudanese dish as African. This result is preferred over incorrectly predicting any dish as Sudanese.

An example use case for the end user would be as follows: a user searches for Sudanese dishes on a website whose dishes are labelled using this system. The system has no dish labelled as Sudanese. Using the full, non-trimmed cuisine tree, it finds the nearest parent category of Sudanese, which is African, and shows the user all dishes labelled as African. The user can now choose from a more relevant pool of dishes, even if they cannot be shown precisely what they searched for. This is the best possible outcome given the limitations of the dataset the classifiers were trained on.
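That fallback can be expressed as a short walk up the untrimmed tree, again using the illustrative PARENT map; search_dishes and labelled_categories below are hypothetical names, not part of the original system:

```python
def search_dishes(query_cuisine, labelled_categories):
    """Walk up the full (untrimmed) tree from the queried cuisine to the
    nearest ancestor category the system actually labels dishes with."""
    node = query_cuisine
    while node is not None and node not in labelled_categories:
        node = PARENT.get(node)  # e.g. Sudanese -> African
    return node  # category whose dishes to show, or None if no match

# e.g. search_dishes("Sudanese", {"African", "Desi"}) -> "African"
```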

Limitations and Future Work

The most obvious limitation of this system is that the relationships between categories have to be strict, known, and representable via data structures. The implementation above has the further limitation that it only works as long as the relationships form a tree-shaped hierarchy.

The other limitation is that the above implementation does not support true multi-label classification, i.e., multi-label classification at each level. Taking the example of cuisines, the system cannot classify fusion dishes, which mix multiple cuisines, such as Italian and American.

Even with these constraints, I believe the system handled the limitations of the dataset quite well.

Moving forward I want to test the system on a more complex yet still strict hierarchy. For example: Using the already established animal taxonomy tree to classify newly discovered species of animals.

I also want to modify the system so that it is able to classify fusion dishes, possibly by using multiple multi-class multi-label classifiers at each level [6].

Finally, I want to apply this concept to a classification task in which the categories are organized in more complex relationships than a strict hierarchy. For example: predicting the best treatment path for a patient, where the categories can have multiple converging and diverging relationships with each other. These relationships could be represented via a semantic graph [7], and fuzzy logic could be used instead of Boolean logic to handle the increase in complexity.

The relevant code for this system can be found in my GitHub repo.

References

[1] Wang, Shoujin, et al. “Training Deep Neural Networks on Imbalanced Data Sets.”

[2] “Impact of Dataset Size on Deep Learning Model Skill and Performance Estimates.” Machine Learning Mastery, 6 Aug. 2019, machinelearningmastery.com/impact-of-dataset-size-on-deep-learning-model-skill-and-performance-estimates/.

[3] Luque, Amalia, et al. “The Impact of Class Imbalance in Classification Performance Metrics Based on the Binary Confusion Matrix.” Pattern Recognition, vol. 91, July 2019, pp. 216–231, http://www.sciencedirect.com/science/article/pii/S0031320319300950, 10.1016/j.patcog.2019.02.023. Accessed 5 Oct. 2019.

[4] Rodriguez, Jesus. “What's New in Deep Learning Research: Neural Networks That Detect Relationships Between Objects.” Medium, Towards Data Science, Aug. 2018, towardsdatascience.com/whats-new-in-deep-learning-research-neural-networks-that-detect-relationships-between-objects-4758e07b7e64. Accessed 5 Oct. 2019.

[5] Wikipedia Contributors. “Depth-First Search.” Wikipedia, Wikimedia Foundation, 13 May 2019, en.wikipedia.org/wiki/Depth-first_search.

[6] Nooney, Kartik. “Deep Dive into Multi-Label Classification..! (With Detailed Case Study).” Medium, Towards Data Science, 8 June 2018, towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff. Accessed 5 Oct. 2019.

[7] “Semantic Graphs: What They Are and Why You Should Care – DavidVandegrift.” Davidvandegrift.com, 2016, davidvandegrift.com/blog?id=62. Accessed 5 Oct. 2019.


As always, if you have any questions, suggestions, requests, corrections, criticisms, any feedback at all, please comment below. Any and all feedback is highly appreciated. Thank you for reading and I hope you found this article insightful.