Ontology2: the Real Semantics Book


Notes on the Microsoft Concept Graph

Introduction

For a long time there has been a battle in Natural Language Processing between statistically oriented systems, which are trained from large quantities of text, and grammar-oriented systems motivated by the modern linguistics associated with Professor Noam Chomsky. Some modern NLP systems start by parsing sentences, using something like the Stanford Parser. I've long been interested in the problem of Named Entity Resolution and, in particular, the approach used by DBpedia Spotlight, which collects a large number of possible surface forms for each concept (for instance, both "NYPD" and "New York City Police Department" refer to the same concept). Spotlight is different from some other systems in that it resolves entities to concepts, rather than simply marking phrases in the text that correspond to particular roles (for instance, in a sentence like "Frank Boltz was an employee of the NYPD", the first phrase is a person's name while the second is the name of an organization -- this assignment could be made even if we had no idea who or what these entities are). I like to think of this kind of system as a "magic magic marker", which highlights phrases in text to tag them with either general or specific meanings. A system like Spotlight works in two phases, first finding places where the surface form dictionary matches the text, and then determining which interpretations are correct. (For instance, the word "Kate" could be a surface form for the first name of a woman having one of over 1,000 names derived from Catherine -- an inspection of the context is necessary to narrow it down to a particular Kate.) The Microsoft Concept Graph is a database of surface forms and possible interpretations. Rather than resolving phrases to specific concepts (say "Pikachu" to Pikachu), it tags phrases with general concepts such as "Pokemon" or "Character".
Produced with technology similar to that used to create word embeddings such as word2vec, the concept graph is positioned as a tool for understanding short texts such as search queries and tweets. This chapter is based on my notes from a preliminary investigation of the Microsoft Concept Graph, intended as a rapid evaluation of the product for text analysis applications.

Overview

Let's look at the first 15 lines of the file to get a quick sense of the contents:

label                    surface form                     score
factor                   age                              35167
free rich company datum  size                             33222
free rich company datum  revenue                          33185
state                    california                       18062
supplement               msm glucosamine sulfate          15942
factor                   gender                           14230
factor                   temperature                      13660
metal                    copper                           11142
issue                    stress pain depression sickness  11110
variable                 age                               9375
information              name                              9274
state                    new york                          8925
social medium            facebook                          8919
material                 plastic                           8628
supplemental material    cds                               8175

(Note that I added the header at the top; the actual file has no header.)

The file as a whole has 33,377,320 lines and is sorted by descending score. The label is a concept that could be applied to a text phrase, the surface form is the phrase, and the score is a measure of the strength of the association between the two. In the top 15 lines we can already see some of the diversity of concepts, such as "factor", which represents various properties that an object could have, as well as "state" and "material". We can also see a few examples where the results are strange, such as the concept "free rich company datum" (which seems to represent a property that a company could have) and the issue "stress pain depression sickness" (which seems ill-formed and a bit verbose).
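A preview like the one above takes only a few lines of Python. This sketch assumes the graph is distributed as a tab-separated file with no header and with the columns in (label, surface form, score) order; the filename in the usage comment is an assumption, not a documented name.

```python
import csv
from itertools import islice

def preview(path, n=15):
    """Yield the first n (label, surface form, score) rows of the graph file.

    Assumes a headerless, tab-separated file sorted by descending score.
    """
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for label, surface, score in islice(reader, n):
            yield label, surface, int(score)

# Usage (filename is an assumption -- adjust to your download):
# for row in preview("data-concept-instance-relations.txt"):
#     print(*row, sep="\t")
```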

Analysis of labels

Already we see one quality that most classifications based on unsupervised machine learning lack: the categories (labels) are meaningfully named, at least mostly. The system assigns 5,376,526 different labels. The most commonly assigned labels are here:

rank  number of members  label
 1    364111             factor
 2    203549             feature
 3    201986             issue
 4    172106             product
 5    158829             item
 6    142963             area
 7    137435             topic
 8    133715             service
 9    122903             activity
10    112387             information
11    110915             event
12    108940             company
13    102032             common search term
14     92337             program
15     91842             technique
16     88835             application
17     88342             organization
18     84534             case
19     83271             method
20     82397             name
21     80643             project
22     77880             option
23     75264             parameter
24     73788             tool
25     68767             group
26     64969             term
27     62168             problem
28     61827             material
29     61768             variable
30     56243             technology
31     55607             place
32     55161             measure
33     54641             artist
34     53449             community
35     51183             element
36     50445             aspect
37     50411             player
38     49954             condition
39     48435             concept
40     47923             system
41     46679             function
42     44516             task
43     43694             brand
44     42806             initiative
45     42075             device
46     41886             component
47     41500             datum
48     39146             person
49     37843             site
50     37505             resource

The most common labels are short words with broad meanings. The one multi-word label that appears in the top 50, "common search term", turns out to be a strange one, containing search terms that would be used to find pirated software, such as "adobe photoshop crack" and "age of empires 3 serial".
A look at some of the least used labels shows that there is plenty of room at the bottom:

rank     number of members  label
5376507  1                  0168monuments
5376508  1                  01527 1441
5376509  1                  012 agonists
5376510  1                  0 10v control application
5376511  1                  00v diode
5376512  1                  00portable item
5376513  1                  00 later flst
5376514  1                  00db features
5376515  1                  00 construction equipment vehicle
5376516  1                  0067j once pmma particle
5376517  1                  0 027 in delivery microcatheter
5376518  1                  00163192contaminated debris
5376519  1                  0 014 inch guidewire
5376520  1                  000 square foot building
5376521  1                  00054j a medical device embodiment
5376522  1                  0003j non invasive medical imaging technique
5376523  1                  0002j thermoplastic
5376524  1                  0002j hydrocarbon
5376525  1                  0002highly absorbent article
5376526  1                  00 01 04

There are 2,364,966 labels which have only one surface form and, as the examples above show, these are often gibberish; for practical work, it's clear that a large number of junk records could be removed.
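A first pass at that cleanup is straightforward: count surface forms per label in one pass, then drop the labels that have only one. A minimal sketch over (label, surface form, score) tuples, with toy data standing in for the real file:

```python
from collections import Counter

def label_counts(rows):
    """Count how many surface forms each label has."""
    return Counter(label for label, _, _ in rows)

def drop_singleton_labels(rows, counts):
    """Keep only rows whose label has more than one surface form."""
    return [row for row in rows if counts[row[0]] > 1]

# Toy data: two real-looking rows and one singleton junk label.
rows = [("factor", "age", 35167),
        ("factor", "gender", 14230),
        ("0002j thermoplastic", "peek", 1)]
counts = label_counts(rows)
print(counts.most_common(1))                 # [('factor', 2)]
print(drop_singleton_labels(rows, counts))   # only the two "factor" rows survive
```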

Deep drill into a few labels

Let's take a look at the top 25 "factors":

label   surface form  score
factor  age           35167
factor  gender        14230
factor  temperature   13660
factor  size           6709
factor  stress         6433
factor  education      6256
factor  cost           5661
factor  smoking        5532
factor  location       5247
factor  diet           5205
factor  ph             5160
factor  weather        4604
factor  weight         4157
factor  genetic        3844
factor  climate        3756
factor  income         3727
factor  ethnicity      3708
factor  obesity        2975
factor  humidity       2945
factor  time           2940
factor  culture        2904
factor  environment    2688
factor  type           2268
factor  experience     2211
factor  lifestyle      2177

Note that the "factors" are related to what we could call "predicates" in the RDF world, being attributes that something could have -- it reads like a list of possible independent variables that could affect people. If we look at the other labels assigned to "age", we get a list of very similar looking concepts:

label                       surface form  score
factor                      age           35167
variable                    age            9375
characteristic              age            4494
demographic variable        age            3703
information                 age            3465
risk factor                 age            3154
demographic datum           age            2682
demographic characteristic  age            2579
demographic factor          age            2541
demographic information     age            2433
datum                       age            1834
patient characteristic      age            1834
demographic                 age            1573
continuous variable         age            1374
parameter                   age            1321
personal information        age            1261
personal characteristic     age            1109
confounding factor          age            1086
covariate                   age            1071
patient factor              age             773
baseline characteristic     age             649
potential confounder        age             641
criterion                   age             630
clinical datum              age             574
issue                       age             566

These concepts form a messy categorization much like a folksonomy. For instance, the distinction between a continuous and a discrete variable is potentially interesting:

label                surface form                         score
continuous variable  age                                  1374
continuous variable  bmi                                   133
continuous variable  weight                                125
continuous variable  height                                 86
continuous variable  income                                 74
continuous variable  blood pressure                         56
continuous variable  patient age                            54
continuous variable  birth weight                           44
continuous variable  body mass index                        41
continuous variable  temperature                            35
continuous variable  hemoglobin                             31
continuous variable  tumor size                             26
discrete variable    gender                                 26
continuous variable  time                                   25
continuous variable  expenditure                            21
continuous variable  age at diagnosis                       20
continuous variable  vital sign                             20
continuous variable  education                              19
continuous variable  gestational age                        19
continuous variable  biochemical result                     19
continuous variable  patient s age                          19
discrete variable    marital status                         19
discrete variable    thickness                              14
discrete variable    group status                           10
discrete variable    count datum                             8
discrete variable    brazil nut harvest method               8
discrete variable    mortality                               7
discrete variable    road                                    7
discrete variable    river access                            7
discrete variable    specific management practice            7
discrete variable    return of spontaneous circulation       6
discrete variable    presence of somatic mutation            6
discrete variable    land cover                              6
discrete variable    burnt area                              6
discrete variable    location                                5
discrete variable    presence                                5
discrete variable    asa                                     5
discrete variable    anticipated career choice               5
discrete variable    fare class                              5
discrete variable    method of research used                 5

The concept graph picks up this distinction, but a careful look shows that "thickness" isn't necessarily a discrete variable. Although a number of interesting categories exist in the graph, a considerable amount of cleanup would be necessary to create useful classifications.
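Each of these drill-downs is just a filter over the same three-column data: select the rows for one label and sort by score. A sketch with toy rows standing in for the real file:

```python
def top_surface_forms(rows, label, k=5):
    """Return the k highest-scoring surface forms for a given label.

    rows are (label, surface form, score) tuples, not assumed pre-sorted.
    """
    matches = [(surface, score) for lab, surface, score in rows if lab == label]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:k]

rows = [("factor", "age", 35167),
        ("variable", "age", 9375),
        ("factor", "temperature", 13660),
        ("factor", "gender", 14230)]
print(top_surface_forms(rows, "factor", k=2))  # [('age', 35167), ('gender', 14230)]
```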

Precision and Recall Analysis

Let's take a detailed look at a label which ought to have a well-defined list of values, specifically, "chemical element":

rank  label             surface form                   score
  1   chemical element  carbon                         137
  2   chemical element  oxygen                         112
  3   chemical element  nitrogen                        77
  4   chemical element  iron                            63
  5   chemical element  gold                            49
...
 27   chemical element  fluoride                        11
...
 34   chemical element  oxygen carbon gold molybdenum    6
...
 39   chemical element  heavy metal                      5
...
 50   chemical element  cr                               3
...
 64   chemical element  trace element                    2
...
 88   chemical element  7 li                             1
...
143   chemical element  helium                           1
144   chemical element  neon                             1

Note that the surface forms fit into a number of categories: (i) chemical elements by name, (ii) chemical elements by abbreviation, (iii) phrases that could stand in for some class of chemical element (ex. "heavy metal", "rare earth"), (iv) isotopes (ex. "7 li"), (v) names of ions (ex. "fluoride"), and (vi) crazy misses (ex. "oxygen carbon gold molybdenum"). Many of these are examples of the kind of "near miss" situations that turn up in any kind of classification, particularly when language is involved. If we imagine, however, that we're looking for a list of chemical elements, and we're willing to consider abbreviations to be valid surface forms, there turn out to be 70 elements identified by name, 9 identified by abbreviation, and 65 surface forms that are not chemical elements.
That gives us a precision of 79/144 = 54.9%, and a recall of 70/118 = 59.3% for names and 9/118 = 7.6% for abbreviations. Considering that the concept graph discovered the concept of "chemical element", this is impressive, but if you really need a list of chemical elements you're better off getting them out of DBpedia, where the query

    select count(*) { ?element dct:subject dbc:Chemical_elements }

gets 100% recall with 92% precision, because it returns a few spurious results such as Chemical_Element and Transfermium_Wars. In either case one would need to apply curation to get a perfect list: DBpedia comes much closer for this well-defined concept, but the Microsoft Concept Graph discovers concepts on its own.
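The arithmetic behind those figures is ordinary precision and recall: of the 144 surface forms labeled "chemical element", 70 + 9 = 79 are correct hits, and there are 118 real elements to be found. A quick check:

```python
def precision_recall(true_positives, retrieved, relevant):
    """Precision = TP / retrieved items; recall = TP / relevant items."""
    return true_positives / retrieved, true_positives / relevant

# 144 surface forms carry the label; 70 are element names, 9 are
# abbreviations; 118 chemical elements exist to be found.
precision, _ = precision_recall(70 + 9, 144, 118)
name_recall = 70 / 118
abbrev_recall = 9 / 118
print(f"{precision:.1%} {name_recall:.1%} {abbrev_recall:.1%}")  # 54.9% 59.3% 7.6%
```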

A few sample surface forms

Just to give you a sense of what you will find, I'll show a few examples of what you can get if you look at the labels assigned to surface forms. "Foot" is a good example of polysemy, that is, a word having multiple meanings:

label             surface form  score
area              foot          210
extremity         foot          190
body part         foot          180
symptom           foot          126
animal disease    foot           83
unit              foot           66
personal weapon   foot           60
measurement       foot           42
part              foot           35
physical feature  foot           34
feature           foot           31
side effect       foot           30
dry area          foot           29
site              foot           28
event             foot           27

Note that the current version of the Microsoft Concept Graph makes no attempt to distinguish between polysemous concepts, though this is planned for the next phase. Such a classification isn't as simple as picking "the right choice", because a particular use of the word foot could be an "area", "extremity", "body part", and "personal weapon" all at one time.

The Concept Graph is rich in knowledge about the biomedical domain. We get a nice set of categories for a common drug:

label                                       surface form  score
antidiarrheal agent                         loperamide    120
antimotility drug                           loperamide     88
medication                                  loperamide     86
antimotility agent                          loperamide     66
antidiarrheal drug                          loperamide     32
over the counter medication                 loperamide     31
antidiarrheal medication                    loperamide     30
agent                                       loperamide     28
compound                                    loperamide     26
antidiarrheal                               loperamide     26
anti diarrheal agent                        loperamide     23
antidiarrhoeal drug                         loperamide     22
medicine                                    loperamide     19
anti cancer drug                            loperamide     19
opiate                                      loperamide     18
over the counter medicine                   loperamide     18
over the counter anti diarrheal medication  loperamide     18
antiperistaltic agent                       loperamide     17
antidiarrhoea medicine                      loperamide     17
synthetic opiate                            loperamide     16

I also often see nice (if repetitive) results when the surface form is a drug category:

label                        surface form             score
medication                   calcium channel blocker  468
vasodilator                  calcium channel blocker  116
agent                        calcium channel blocker  108
antihypertensive drug        calcium channel blocker   45
medicine                     calcium channel blocker   39
pharmacological agent        calcium channel blocker   39
antihypertensive             calcium channel blocker   30
antihypertensive medication  calcium channel blocker   29
compound                     calcium channel blocker   27
vasodilators                 calcium channel blocker   23
therapy                      calcium channel blocker   22
pharmacologic agent          calcium channel blocker   21
smooth muscle relaxant       calcium channel blocker   20
blood pressure medication    calcium channel blocker   13
cardiac medication           calcium channel blocker   13
antihypertensives            calcium channel blocker   12
cardiovascular drug          calcium channel blocker   11
nonantimicrobial medication  calcium channel blocker   11
combination                  calcium channel blocker   10
antihypertensive agent       calcium channel blocker   10

(The trouble, however, is that medical applications are going to be held to account for errors, so a high level of accuracy will be required.)

One area where precision is less important is pop culture, and good results can be had for relatively obscure topics:

label                                   surface form  score
artist                                  mf doom       15
rapper                                  mf doom        3
producer                                mf doom        2
successful artist                       mf doom        2
american emcee                          mf doom        2
underground hip-hop producer            mf doom        2
hip hop legend                          mf doom        2
name                                    mf doom        1
act                                     mf doom        1
musician                                mf doom        1
hip hop artist                          mf doom        1
record producer                         mf doom        1
signee                                  mf doom        1
talented artist                         mf doom        1
rap artist                              mf doom        1
contemporary musician                   mf doom        1
others artist                           mf doom        1
intelligent conscious, talented rapper  mf doom        1
prestigious great-producer-okay-rappers mf doom        1
influence                               mf doom        1
powerhouse artist                       mf doom        1
popular alternative rapper              mf doom        1

Finally, I'll show an example of what I call a "critical error", the kind of small mistake which has an outsized effect:

label                    surface form  score
dictator                 adolf hitler  27
person                   adolf hitler  25
leader                   adolf hitler  22
historical figure        adolf hitler  21
powerful speaker         adolf hitler  12
individual               adolf hitler  11
good leader              adolf hitler   9
nazi leader              adolf hitler   7
name                     adolf hitler   6
charismatic leader       adolf hitler   4
ruler                    adolf hitler   4
german leader            adolf hitler   4
high ranking nazi leader adolf hitler   4
man                      adolf hitler   3
figure                   adolf hitler   3
key individual           adolf hitler   3
military dictator        adolf hitler   3
famous leader            adolf hitler   3
madman                   adolf hitler   3
psychopath               adolf hitler   3

Note that the system classifies Adolf Hitler as a "good leader" along with many other things; 19 of these labels look good (95% precision), but the one bad label could deeply offend somebody -- a different situation from the gobbledygook (but not actively harmful) labels we see on so many topics.
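All of the lookups in this section are the same query run in the other direction: instead of filtering by label, index the file by surface form. A sketch with toy data in place of the real 33-million-line file:

```python
from collections import defaultdict

def index_by_surface_form(rows):
    """Map each surface form to its (label, score) pairs, best score first.

    rows are (label, surface form, score) tuples.
    """
    index = defaultdict(list)
    for label, surface, score in rows:
        index[surface].append((label, score))
    for pairs in index.values():
        pairs.sort(key=lambda pair: pair[1], reverse=True)
    return index

rows = [("unit", "foot", 66), ("body part", "foot", 180), ("area", "foot", 210)]
index = index_by_surface_form(rows)
print(index["foot"][0])  # ('area', 210)
```

For the full file this in-memory index would be large; in practice one would load the data into a database or key-value store instead, but the query is the same.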