We recently caught up with Kang Zhao, Assistant Professor at the Management Sciences department, Tippie College of Business, the University of Iowa. His work applying Machine Learning to the world of online dating has generated significant coverage (Forbes, MIT Technology Review, UPI, among others), so we wanted to know more!...



Hi Kang, firstly thank you for the interview. Let's start with your background...

Q - What is your 30 second bio?

A - As you mentioned, I am an Assistant Professor at the Management Sciences department, Tippie College of Business, University of Iowa. My research focuses on business analytics and social computing, especially in the context of social networks and social media. I also hold a PhD in Information Sciences and Technology from Penn State University.



Q - How did you get interested in Data Science / Machine Learning?

A - That dates back to my grad school days. I was involved in research projects that leveraged data from online social networks and social media. It is amazing that nowadays all the large-scale and distributed interactions among people are available online thanks to the advances of online social networking/social media sites. Such data not only reveals who is talking to whom (i.e., helps us build a regular social network based on "knowing" or simple interaction), but also the time and the content of their online communication, which enables us to build other social networks based on the nature of interactions (such as support networks and information-spread networks). All of this made me believe that the availability of such data will bring a brand new perspective to the study of people's social behaviors and interactions.



Q - What was the first data set you remember working with? What did you do with it?

A - My first research project using a real-world dataset was about collecting and analyzing data about humanitarian agencies and their networks. The scale of the data was actually "tiny" (several megabytes), but the data did show us some interesting patterns in the topological similarities between different networks among these organizations (e.g., communication and collaboration networks), which inspired us to develop a simulation to model the co-evolution of multi-relational networks.





Kang, very interesting background and context - thank you for sharing! Next, let's talk more about Machine Learning in Social Networks and Social Media.



Q - What excites you most about bringing Machine Learning and Social Networks / Social Media together?

A - It is about the opportunity to do better prediction. With larger-scale data from more sources on how people behave in a network context becoming available, there are a lot of opportunities to apply ML algorithms to discover patterns on how people behave and predict what will happen next. Such prediction can help to validate/test existing theories about people's social behaviors at an unprecedented scale. It is also possible to derive new social science theories from dynamic data through computational studies. Besides, the education component is also exciting as industry needs a workforce with data analytics skills. That's also why we at the University of Iowa have started a bachelor's program in Business Analytics and plan to roll out a Master's program in this area as well.



Q - What are the biggest areas of opportunity / questions you want to tackle?

A - I want to better understand and predict social network dynamics at different scales. For example, dyadic link formation at the microscopic level, the flow of information and influence at the mesoscopic level, as well as how network topologies affect network performance at the macroscopic level.



Q - What Machine Learning methods have you found most helpful?

A - It really depends on the context, and it is hard to find a silver bullet for all situations. I usually try several methods and settle on the one with the best performance.



Q - What are your favorite tools / applications to work with?

A - I use JUNG, a Java framework for graph analysis, MALLET for topic modeling, LingPipe for text analysis, and Weka for data mining jobs.



Q - What publications, websites, blogs, conferences and/or books are helpful to your work?

A - I usually keep an eye on journals such as IEEE Intelligent Systems, numerous IEEE and ACM Transactions, Decision Support Systems, among many others. As for conferences, I found the following helpful for my own research: ICWSM, WWW, KDD, and Workshop on Information Technologies and Systems. I also enjoy several conferences related to social computing, such as SocialCom and SBP.





Improving our ability to make predictions is definitely very compelling! Now, let's discuss how this applies in some of your research...



Q - Your recent work on developing a "Netflix style" algorithm for dating sites has received a lot of press coverage ... what question / problem were you trying to solve?

A - We try to address user recommendation for the unique situation of reciprocal and bipartite social networks (e.g., dating, job seeking). The idea is to recommend dating partners whom a user will like and who will like the user back. In other words, a recommended partner should match a user's taste, as well as attractiveness.



Q - How did Machine Learning help?

A - In short, we extended the classic collaborative filtering technique (commonly used in item recommendation for Amazon.com or Netflix) to accommodate the match of both taste and attractiveness.



Q - What answers / insights did you uncover?

A - People's behaviors in approaching and responding to others can provide valuable information about their taste, attractiveness, and unattractiveness. Our method can capture these characteristics in selecting dating partners and make better recommendations.



Editor Note - If you are interested in more detail behind the approach, both Forbes' recent article and a feature in the MIT Technology Review are very insightful. Here are a few highlights:

Recommendation Engine (from MIT Tech Review) - These guys have built a recommendation engine that not only assesses your tastes but also measures your attractiveness. It then uses this information to recommend potential dates most likely to reply, should you initiate contact. The dating equivalent [of the Netflix model] is to analyze the partners you have chosen to send messages to, then to find other boys or girls with a similar taste and recommend potential dates that they've contacted but who you haven't. In other words, the recommendations are of the form: "boys who liked this girl also like these girls" and "girls who liked this boy also liked these boys".



The problem with this approach is that it takes no account of your attractiveness. If the people you contact never reply, then these recommendations are of little use. So Zhao and co add another dimension to their recommendation engine. They also analyze the replies you receive and use this to evaluate your attractiveness (or unattractiveness). Obviously boys and girls who receive more replies are more attractive. When it takes this into account, it can recommend potential dates who not only match your taste but ones who are more likely to think you attractive and therefore to reply. "The model considers a user's 'taste' in picking others and 'attractiveness' in being picked by others," they say.



Machine Learning (from Forbes) - "Your actions reflect your taste and attractiveness in a way that could be more accurate than what you include in your profile," Zhao says. The research team's algorithm will eventually "learn" that while a man says he likes tall women, he keeps contacting short women, and will unilaterally change its dating recommendations to him without notice, much in the same way that Netflix's algorithm learns that you're really a closet drama devotee even though you claim to love action and sci-fi.



"In our model, users with similar taste and (un) attractiveness will have higher similarity scores than those who only share common taste or attractiveness," Zhao says. "The model also considers the match of both taste and attractiveness when recommending dating partners" ... After the research team's algorithm is used, the reciprocation rate improves to about 44% - a better than 50% jump.



Finally, for more technical details, the full paper can be found here.

Editor Note - Back to the interview!...

Q - What are the next steps / where else could this be applied?

A - We want to further improve the method with different datasets from either dating or other reciprocal and bipartite social networks, such as job seeking and college admission. How to effectively integrate users' personal profiles into recommendation to avoid cold start problems without hurting the method's generalizability is also an interesting question we want to address in future research.





That all sounds great - good luck with the next steps!… You are also working on other things - your work on sentiment influence in online social networks (developing a "Good Samaritan Index" for cancer survivor communities) has been well documented ... could you tell us a little more about this work?



Q - What question / problem were you trying to solve?

A - We tried to identify the influential users in an OHC (Online Health Community). Here we directly measure one's influence, i.e., one's ability to alter others' sentiment in threaded discussions.



Q - How did Machine Learning help?

A - Sentiment analysis is the basis for our new metric. We developed a sentiment classifier (using AdaBoost) specifically for OHCs among cancer survivors. We did not use an off-the-shelf word list because sentiment analysis should be specific to the context. Some words may carry a different sentiment in this context than usual. For example, the word "positive" may be a bad thing for a cancer survivor if the diagnosis is positive. The accuracy rate of our classifier is close to 80%.
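Editor Note - For readers curious how boosting can pick up context-specific wording, here is a small, self-contained sketch of AdaBoost over word-presence decision stumps. The interview only says that AdaBoost was used; the stump features, the number of rounds, and the six example posts below are our own assumptions, invented purely for illustration.

```python
import math


def tokenize(text):
    """Lowercase a post and return its set of whitespace-separated words."""
    return set(text.lower().split())


def stump_predict(stump, words):
    """A stump votes `polarity` if its word is present, else the opposite."""
    word, polarity = stump
    return polarity if word in words else -polarity


def train_adaboost(posts, labels, rounds=3):
    """Return a list of (alpha, stump) pairs. Labels are +1 or -1."""
    docs = [tokenize(p) for p in posts]
    vocab = sorted(set().union(*docs))
    weights = [1.0 / len(docs)] * len(docs)
    model = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error on the current weights.
        best, best_err = None, float("inf")
        for word in vocab:
            for polarity in (1, -1):
                stump = (word, polarity)
                err = sum(w for w, d, y in zip(weights, docs, labels)
                          if stump_predict(stump, d) != y)
                if err < best_err:
                    best, best_err = stump, err
        err = min(max(best_err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, best))
        # Increase the weight of misclassified posts, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(best, d))
                   for w, d, y in zip(weights, docs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def classify(model, text):
    """Weighted vote of all stumps: +1 (positive sentiment) or -1 (negative)."""
    words = tokenize(text)
    score = sum(alpha * stump_predict(stump, words) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Invented training posts: +1 = positive sentiment, -1 = negative sentiment.
posts = [
    "scan came back positive again",
    "biopsy was positive and i am scared",
    "my results were negative such a relief",
    "tests negative feeling grateful",
    "so scared about the surgery",
    "grateful for all your support",
]
labels = [-1, -1, 1, 1, -1, 1]

model = train_adaboost(posts, labels, rounds=3)
print(classify(model, "the lump was positive"))      # prints -1
print(classify(model, "scan negative so grateful"))  # prints 1
```

Because the stumps are learned from in-domain posts, the word "positive" ends up voting for negative sentiment here, which is the opposite of what a general-purpose word list would assume.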



Q - What answers / insights did you uncover?

A - When finding influential users, the amount of contributions one has made matters, but how others react to one's contributions is also extremely valuable, because it is through such reactions that interpersonal influence is reflected and thus measured.
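Editor Note - The contrast between "how much someone posts" and "how others react" can be sketched in a few lines of Python. This is not the published metric; it is a simplified illustration in which a reply counts as influential if the thread starter's sentiment improves the next time they post. The thread format, the function name, and the toy threads are hypothetical.

```python
from collections import Counter


def influence_counts(threads):
    """Count replies and 'influential' replies per user.

    threads: list of threads; each thread is a time-ordered list of
    (author, sentiment) posts, with sentiment +1 or -1 (hypothetical format).
    Returns two Counters: (contributions, influential).
    """
    contributions = Counter()
    influential = Counter()
    for thread in threads:
        if not thread:
            continue
        starter, last_sentiment = thread[0]
        repliers = set()  # users who replied since the starter last posted
        for author, sentiment in thread[1:]:
            if author != starter:
                contributions[author] += 1
                repliers.add(author)
                continue
            # The starter posted again: did their sentiment improve?
            if sentiment > last_sentiment:
                for replier in repliers:
                    influential[replier] += 1
            last_sentiment = sentiment
            repliers = set()
    return contributions, influential


# Toy threads (invented for illustration).
threads = [
    [("ann", -1), ("bea", 1), ("ann", 1)],
    [("cat", -1), ("dee", 1), ("dee", 1), ("cat", -1)],
    [("eve", -1), ("bea", 1), ("dee", -1), ("eve", 1)],
]

contributions, influential = influence_counts(threads)
print(contributions["dee"], influential["dee"])  # prints 3 1
print(contributions["bea"], influential["bea"])  # prints 2 2
```

In the toy data "dee" posts more often than "bea", but "bea" is followed by an improvement in the thread starter's sentiment more often, so a reaction-based measure ranks the two users differently than a simple post count would.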



Q - What are the next steps / where else could this be applied?

A - We would like to further investigate the nature of support in OHCs, so that we can build users' behavioral profiles and better design such communities to help their members.





Very interesting - look forward to following all of your different research paths in the future! Finally, it is advice time!...



Q - What does the future of Machine Learning look like?

A - This is a tough question. I don't know the exact answer, but I guess ML will develop in two directions. The first would be on the algorithm side - better and more efficient algorithms for big data, as well as machine learning that mimics human intelligence at a deeper level. The second would be on the application side - how to make ML understandable and available to the general public? How to make ML algorithms as easy to use as MS Word and Excel?



Q - Any words of wisdom for Machine Learning students or practitioners starting out?

A - I am not sure whether my words are of real wisdom, but I'd say that for a beginner, it is certainly important to understand ML algorithms. Meanwhile, it is equally important to develop the right mindset: a data scientist needs to be able to come up with interesting and important ideas/questions when given some data. In other words, one must learn how to answer the question, "Now we have the data, what can we do with it?" This is very valuable in the era of big data.





Kang - Thank you so much for your time! Really enjoyed learning more about your research and its application to real-world problems.

Kang can be found online at his research home page and on Twitter.



Readers, thanks for joining us!



If you enjoyed this interview and want to learn more about

what it takes to become a data scientist

what skills do I need

what type of work is currently being done in the field

then check out Data Scientists at Work - a collection of 16 interviews with some of the world's most influential and innovative data scientists, who each address all the above and more! :)