SIAM Data Mining 2012 Conference

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28 I had the pleasure of attending the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD 2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance. Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything I could out of Disneyland. Seeing adults wearing Mickey ears and carrying Mickey-shaped balloons, and seeing girls dressed up as their favorite Disney princesses, screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line, get my food, eat it, and run back to the conference in 90 minutes on a weekend. After lunch on the first two days was another plenary session followed by breakout sessions. The evening of the first two days was reserved for poster sessions. Saturday hosted half-day and full-day workshops.

Below is my summary of the conference. Of course, such a summary is very high level, so my description may miss things, or may not be entirely correct if I misunderstood the speaker.

Plenary Talks





Bharat Rao from Siemens gave the first plenary talk bright and early on the first day of the conference. I only got to see the second half as I could not wake up in time. His talk was about privacy-preserving data mining in medicine using matrix factorization. Although privacy has become an important issue in data mining, I do not totally buy that it is entirely necessary. The idea is that observations should not be personally identifiable. I personally do not agree that such privacy measures are necessary when only a computer system is using the data, and not an individual person. Besides, with such massive amounts of data, someone digging through gigs and gigs of personally identifiable data to find one person’s data does not seem like a viable threat. My thoughts are similar to those on the Netflix grand challenge dataset lawsuit.

The second plenary talk came from Noshir Contractor. The main point of his work seemed to be how to build effective teams using graphs and data about each of the candidates for such a team. This did not excite me in itself, but the data his team used and some of the things they learned from it did. The first part of the talk discussed research into NSF grants and the types of collaboration that are more likely to lead to the awarding of such grants. His group found that women were more likely to be collaborators on awarded proposals and that multidisciplinary teams were more likely to be funded. Some analogous work involved the detection of “gold farmers” in the MMORPG EverQuest II. Gold farming involves gathering virtual goods and selling them for real cash. Interestingly, Contractor’s group found that the graph signatures present in gold farming are remarkably similar to those present in drug trafficking. There were a few other interesting tidbits that the group found. They found that a great number of players only play with friends and are somewhat disconnected from the rest of the game graph. Also, male-male and female-male graph links were very common, but female-female links were uncommon. Contractor hypothesized that the male-male relationships were the obvious case (men are more likely to play computer games) and that women often play the game with men because it is the only way for them to get time with their significant others.

The last plenary talk came from Susan Dumais from Microsoft Research, who discussed temporal dynamics and information retrieval. The talk basically discussed how to mine important concepts over time from data streams. One part of her research was discovering the staying power of certain words. Susan has noticed four distinct word behaviors based on how the density of the word’s usage changes over time: fast, hybrid, medium, and slow. Susan’s research also studies how often people revisit certain webpages and why. Presumably revisits are an alternative measure of influence to the in-links and out-links used in PageRank (remember, Microsoft has its own anti-Google search engine). Studying the temporal behavior of web visits and keyword usage is important because current methods consider only a snapshot of the web with very little evolution. Susan stated that a great page is defined as a mixture of bags of words that are formed based on page changes. Such research is important because query relevance changes over time. For example, a query of US Open refers to golf at certain times of the year and tennis at others. The query March Madness should probably return ticket prices before the event, scores during the event, and Wikipedia or sports articles recapping the event after the event.

Social Media





Pattern Mining





The Thursday afternoon session I attended had a very generic name considering all of data mining is about finding patterns. Really, it should have been called “association rule mining.” Unfortunately, this session was fairly dry and was my least favorite of the conference. The one talk that really stood out to me discussed how to mine association rules out of long temporal events. Such association rules consisted of “episodes”, which were partial orders on the graph of the event. The type of association rules considered were basically motifs: subsequences of interesting events that occurred within a long event.
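To make the idea of an “episode” a little more concrete, here is a minimal sketch of my own, not the method from the talk. It handles only the simplest special case, a serial episode (a fixed order of events, gaps allowed), and measures how often it shows up in sliding windows over a longer event sequence. The event names, the window width, and the function names are all made up for illustration.

```python
def occurs_in_order(window, episode):
    """Return True if the events of `episode` appear in `window` in order (gaps allowed)."""
    it = iter(window)
    # Each `any` consumes the iterator up to the next match, which enforces the ordering.
    return all(any(event == target for event in it) for target in episode)


def episode_frequency(events, episode, width):
    """Fraction of sliding windows of `width` events that contain `episode`."""
    windows = [events[i:i + width] for i in range(len(events) - width + 1)]
    if not windows:
        return 0.0
    hits = sum(occurs_in_order(window, episode) for window in windows)
    return hits / len(windows)


# Hypothetical event log; a real one would be far longer and timestamped.
events = ["login", "search", "view", "cart", "search", "view", "buy", "logout"]
freq = episode_frequency(events, ("search", "view"), width=3)
print(f"'search' then 'view' appears in {freq:.0%} of windows")
```

An association rule over episodes would then compare frequencies like this one, for example how often a window containing “search then view” also contains “buy”. Handling general partial orders, as in the talk, takes more machinery than this.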

Kernels and Classification





The first two talks in this session discussed multi-label classification, which is distinct from multi-class classification. In multi-class classification, we have multiple classes and each instance can belong to one and only one class. In multi-label classification, each instance can belong to one or more classes/labels. Multi-label classification exploits correlation information among labels whereas independent classifiers do not. The first talk discussed how to use multi-label classification when there are multiple objectives. For example, when buying a cell phone, we may want to minimize price and maximize battery life. The second talk discussed dimension reduction for multi-label classification and coupling feature selection with modeling. Another talk attempted to study the theoretical principles behind pruning and grafting in decision trees. The C4.5 software does pruning and grafting, but its theoretical properties are not well understood. The last talk discussed augmenting matrix factorization with graph information and other metadata prior to building a model. For example, for a movie recommendation problem, one factor would be a movie and another factor would be a user. These factors can be combined into a Bayesian model that can be scaled up better than other existing methods.
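The multi-class versus multi-label distinction from the start of this section is easiest to see in how the targets are encoded. The sketch below is my own toy illustration, not taken from any of the talks, and the label names and data are invented. A multi-class target has exactly one active entry, a multi-label target can have several, and counting which labels fire together gives a crude view of the correlation information that independent per-label classifiers ignore.

```python
from collections import Counter
from itertools import combinations

LABELS = ["comedy", "romance", "action"]  # hypothetical label set


def one_hot(label):
    """Multi-class target: exactly one label is active."""
    return [1 if name == label else 0 for name in LABELS]


def multi_hot(labels):
    """Multi-label target: any non-empty subset of labels can be active."""
    return [1 if name in labels else 0 for name in LABELS]


def cooccurrence(rows):
    """Count how often each pair of labels is active on the same instance."""
    counts = Counter()
    for row in rows:
        active = [LABELS[i] for i, flag in enumerate(row) if flag]
        counts.update(combinations(active, 2))
    return counts


multiclass_targets = [one_hot("comedy"), one_hot("action")]
multilabel_targets = [
    multi_hot({"comedy", "romance"}),
    multi_hot({"action"}),
    multi_hot({"comedy", "romance", "action"}),
]

pairs = cooccurrence(multilabel_targets)
print(pairs.most_common(1))
```

In this toy data “comedy” and “romance” co-occur more often than any other pair, which is the kind of signal a multi-label method can exploit and a set of independent binary classifiers cannot.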

Transfer Learning





As I mentioned earlier, the goal of transfer learning is to map a model used in one domain to another similar domain. The classic example is classifying images using models trained on text data and some labeled images, where both domains are reduced to a common set of concepts. The talks in this session mainly covered advances in latent variable analysis. I kept finding myself confused and wondering, “why is this considered groundbreaking?” The work presented in this session basically used existing models for transfer learning. The first few talks discussed using Latent Dirichlet Allocation (LDA) to map data into concepts, and then the third talk discussed Hierarchical Latent Dirichlet Allocation (hLDA), which could be used for taxonomies and hierarchies of concepts. Although transfer learning is very useful, I did not find it to be all that groundbreaking. Of course, using text and images as the source and target domains is not incredibly interesting. I think transfer learning could be revolutionary if it could be applied to two very different domains.

Full Day Workshop: Text Mining





Of course, if there is a text mining talk, I will attend it. The workshop was led by David W. Berry from the University of Tennessee, Knoxville. The keynote speaker was Malu Castellanos from Hewlett-Packard Labs. Malu’s talk was amazing. She discussed a live customer intelligence system that is used for intent and sentiment analysis on various channels. Working with text is not easy. She began with a discussion of the many challenges in sentiment analysis including deceitful adjectives (despicable is negative, but Despicable Me is a proper noun that is not negative), dependency relations (wicked as slang for “good” vs. wicked witch), comparisons (x is better than y), spam, sarcasm, coreferences (use of the word it), special expressions and emoticons (LOL, ;-)), and context dependencies (predictable movie is negative whereas predictable weather may be positive). What was particularly illuminating about Malu’s talk was that she was fairly candid about how complex HP’s sentiment analysis system is. The system does not use one model for sentiment. Different models are used to handle different kinds of tweets, and based on their classifications, these tweets are ushered off to other models for further classification. For example, comparative statements are treated distinctly by the system. There may be a naive Bayes step that classifies the text as comparative or not, and then sends the tweet on for further processing. She mentioned something about using special processing such as linear programming and generalized additive models (GAM) to take words such as BUT, AND, etc. into account. GAMs seem rare to encounter in text mining. Some other features of the system include sentiment intensity (really good vs. good) and clustering similar words by using temporal histograms (tomorrow and 2morrow have similar usage patterns).
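To give a feel for the “one model hands off to another” design, here is a minimal sketch of a first-stage gate. It is entirely my own guess at the shape of such a step and has nothing to do with HP’s actual system. It trains a tiny naive Bayes classifier (with add-one smoothing) to decide whether a text is comparative, then routes it to a named downstream model. The training sentences, the class `TinyNaiveBayes`, the function `route`, and the model names are all hypothetical.

```python
import math
from collections import Counter

# Invented training data; a real gate would need far more examples.
TRAIN = [
    ("x is better than y", "comparative"),
    ("this phone is worse than that one", "comparative"),
    ("faster than the old model", "comparative"),
    ("i love this movie", "plain"),
    ("the battery is terrible", "plain"),
    ("great phone", "plain"),
]


class TinyNaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, examples):
        self.word_counts = {}
        self.doc_counts = Counter()
        for text, label in examples:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def route(text, gate):
    """Pick the (hypothetical) second-stage model based on the gate's decision."""
    if gate.predict(text) == "comparative":
        return "comparative_model"
    return "polarity_model"


gate = TinyNaiveBayes().fit(TRAIN)
print(route("the camera is better than before", gate))
print(route("i love the battery", gate))
```

The appeal of this layout is that each downstream model only has to deal with one kind of statement. The cost is that a mistake at the gate sends the text to the wrong specialist.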

The first talk was from David Skillicorn, who recently published a book about mining large datasets. He discussed how to pick the most interesting documents out of a corpus. The second talk was given by a brave undergraduate student on query expansion. He did a very good job, but what was strange about this talk was that it used… Latent Semantic Indexing (…from 1990…) rather than one of the more useful and iterative models such as LDA. This brings me to my first personal “weird moment” about this workshop. There was very little discussion about modern (post-2000) topic models. This is very strange to me. Just a few months earlier, topic models were all the rage at KDD 2011. After the lunch break, there were talks about incremental online clustering of documents and discovery of patent trolls. The final sessions of the afternoon discussed extraction of hierarchies for increasing the performance of multi-label classifiers and automatically evaluating text summarizers. Only one of the presentations in this workshop seemed to be attached to a paper.

I do not want to be critical because I am sure a lot of work goes into planning such events. I just found this workshop to be a bit weird. A lot of the methods used in the papers were quite old fashioned for text mining (LSI, regression) and the applications were also quite old-school (patents and legal documents just scream the old-fashioned use of information retrieval… library cataloging). It also seemed like a disproportionate number of the speakers had a prior relationship with the workshop chair. I am also not used to a workshop with so few associated papers.