Event Details

Information Retrieval and the Semantic Web





Schedule

Tuesday







9:30am Welcome by Hari Vasudev (Head, Yahoo India R&D)

9:40am Opening by Ron Brachman (Head, Yahoo! Labs) (Video-recorded)

9:50am School overview by Peter Mika (Senior Scientist, Yahoo! Labs)





10:00am - 12:45pm Tom Heath (Open Data Institute)







From Data *on* the Web to Data *in* the Web







From structured data for search engine optimisation of Web pages to open data released by governments and corporations, the volume of data accessible for reuse has exploded in recent years. Mirroring this explosion has been a growing understanding of the economic, social, and environmental benefits of data sharing, and the Web remains unparalleled as a discovery and distribution medium for this data. This talk will examine the Web from a data-centric perspective and introduce the key concepts, principles and technologies that underpin the evolution from a Web of documents to a Web of data. Along the way I'll attempt to unpick where the classic Web and classic information retrieval differ from what we're trying to achieve with a Web of linked and open data, and conclude by discussing open research questions in the field.

Lunch: 12:45-1:45pm

1:45pm - 5:00pm Roi Blanco (Yahoo! Labs)

IR & Web search introduction

Information retrieval is at the core of modern search engines, which makes it one of the most engaging modern technologies and the dominant form of information access. Information retrieval techniques are broad in scope, spanning tools from information theory to optimized data structures for fast data access. In this tutorial we will review the roots of information retrieval and web search systems, from both a theoretical and a practical perspective. We will mostly focus on search over unstructured data: how to store it, how to access it, and how to model user search behavior. We will review the foundations of probabilistic models of how users find information and give an overview of the key ingredients of modern search engines, which range from crawling to indexing to fighting spam.

Wednesday

9:30am - 12:45pm Peter Mika (Yahoo! Labs)

Semantic Search introduction

Semantic search is not a single type of application but rather a broad range of systems that involve the use of semantics. In this tutorial, we aim to provide a comprehensive overview of the different types of semantic search systems and discuss the differences in the techniques underlying them. The tutorial covers both the application of Semantic Web technologies to the IR problem and, vice versa, the application of IR techniques to Semantic Web problems. In particular, focus is given to three topics of semantic search that have attracted much interest recently.

• The first is Semantic-enabled Document Retrieval, i.e. the application of Semantic Web technologies to the IR problem.

• The second is Semantic Data Retrieval, which concerns (the application of IR techniques to) the retrieval of semantic data.

• While the use of semantics is the essential theme in these two major components of the tutorial, Hybrid Search is a complementary part that illustrates the convergence of search paradigms.

We will discuss the query and data representations typically used in modeling these tasks and some of the common ranking approaches applied in the literature.

Lunch: 12:45-1:45pm

1:45pm - 5:00pm Edgar Meij (Yahoo! Labs)

Entity Linking and Retrieval



The session begins with a detailed overview of entity linking, which addresses identifying and disambiguating entity occurrences in unstructured text. I will introduce the fundamental concepts and principles underlying entity linking, and detail state-of-the-art algorithms including unsupervised solutions, graph-based methods, and feature-based approaches in a machine learning setting. I will continue with applications of entity linking for IR and conclude this part with a discussion of evaluation methodologies and initiatives. The second part focuses on entity retrieval and begins with a study of scenarios where explicit representations of entities are available in the form of, e.g., Wikipedia pages or RDF triples. I will then continue in a setting with more complex queries, requiring evidence to be collected and aggregated from massive volumes of unstructured textual data (with the potential help of some structured data). Such complex queries require a combination of techniques from both entity linking and entity retrieval. Two main families of models are discussed: generative language models and discriminative feature-based models. Both the entity linking and entity retrieval parts are anchored in recent evaluation efforts conducted at standard benchmarking campaigns such as INEX, TAC, and TREC; test collections, tasks, evaluation methodology, and experimental results from these evaluation initiatives are discussed.

Thursday

9:30am - 12:45pm Yi Chang (Yahoo! Labs)

Machine learning for IR



In this section, we will cover: 1) general machine learning ranking algorithms for web search engines; 2) machine learning ranking for time-sensitive content: how to formulate an effective machine learning framework to handle freshness together with topical relevance and other factors. We will cover the steps including content crawling, feature generation, ranking evaluation and learning algorithms.

Lunch: 12:45-1:45pm

1:45pm - 5:00pm Maya Ramanath (IIT Delhi)

RDF Knowledge-Bases

In this part, we will focus on RDF knowledge-bases: their construction, querying and usability. For construction, we will look at some of the different techniques people have been working on: i) extraction from unstructured text, ii) extraction from semi-structured web sources, such as Wikipedia, iii) human computing techniques, iv) subject-specific knowledge-bases (such as opinion-bases). For querying and usability, we will start with querying using SPARQL, and motivate the need for ranking. The framework will be to apply IR ranking techniques, which are traditionally used for unstructured queries on unstructured data, to structured queries on structured data. Further, we will look at some common features that search engines currently provide, and see how they can be adapted to the setting of knowledge-bases. We will conclude with how querying can be made easier by allowing users to pose queries as keywords or natural language questions.

5:15-6:15PM Instructors/Speakers

Plenary Panel on Future Research Directions

Friday

9:30am - 12:45pm Ganesh Ramakrishnan (IIT Bombay)

Scaling up entity extraction and search over entities and relations

Entity relationship search at Web scale, or even at the enterprise level, depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog must support thousands of entity types and tens to hundreds of millions of entities. These targets raise many challenges, the major ones being (i) fast and effective entity extractors and disambiguators, (ii) the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and of highly compressed disk-based annotation indices, and (iii) the use of annotations and efficient indices for effective and efficient entity-oriented search.

After providing a brief introduction to our prior work on entity annotation, disambiguation and entity-based search, we will focus on specific approaches we explored for scaling them up. In particular, we present three of our approaches geared toward scaling up operations in this area: (a) the translation of rule-based annotation to operations on the inverted index, to achieve an order of magnitude speedup (EMNLP 2006, ICDE 2008, Infoscale 2008, CIKM 2008) over the standard document-at-a-time rule-based annotation paradigm; (b) the design of RAM data structures for spotting and disambiguating entity mentions (WWW 2012), and highly compressed disk-based annotation indices (WWW 2011). These data structures cannot be readily built upon standard inverted indices. We present a Web-scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. We present how the disk-based annotation index enables entity-centric, snippet-oriented search (WWW 2011).

Lunch: 12:45-1:45pm

1:45pm - 4:00pm Product/Business Challenges (Hanumantha Rao Susarla, Shankar Umamaheshwaran, Rahul Singh, Rajiv Verma: Yahoo! India R&D)

Entity extraction driving commerce from content (30 minutes)

Inspirational content, such as an article on home remodeling, a news item on an automobile launch, or a narration of a 'do it yourself' (DIY) project, has enormous potential to drive online commerce. For example, there is an opportunity to upsell hardware tools, materials and other products when a user looks at a DIY article on 'How to Fix a Leaking Tap'. Extraction of commerce entities from content is critical in order to realize this use case. This is challenging since it is not a pure dictionary-based syntactic match but rather involves understanding the context around the entities. This presentation takes a detailed look at the use cases for contextual content analysis and highlights the challenges. A high-level architecture and a prototype implementation for solving the extraction problem will be presented. The presentation will also touch upon machine learning techniques that can be used to solve the contextual entity extraction problem.

Media Entity Experiences and Strategy (30 minutes)

This talk will cover how the Yahoo! Media team went about implementing entities, primarily the design and architectural trade-offs. The talk will also cover the experience gained from this exercise and the major pain points from both an architectural and an operational perspective. We will also talk about solutions to the pain points and discuss short-term and long-term strategies.





Knowledge graph: leveraging a common repository (30 minutes)





Facilitating answering of questions in a community based question answering (CQA) environment (30 minutes)

High-volume CQA sites usually depend on a small community of answerers who answer questions out of a sense of recognition, altruism, or a wish to return something to the community. But the scope of questions in a general-purpose CQA site is broad, and answerers need to be shown questions that match their areas of expertise, which are usually narrow. At the same time, established CQA sites have a corpus of answered questions which can also serve as an excellent source for answering new questions, because a high percentage of questions are repeats, and these often go unanswered. For the first scenario (Question Recommendation), the presentation covers the use case and the lessons learned as we refined the algorithm over multiple iterations to get high-quality answers. For the second scenario, we present a prototype for automatic answering using machine learning techniques.





4-5PM Student Research Colloquium (Ravee Malla (IIT Delhi), Shashank Gupta (IIT Bombay))

News Graphs as a way to represent context and evolution of News Stories (30 minutes)







Web-scale entity search (30 minutes)



Saturday

9:30am - 12:45pm Georges Dupret (Yahoo! Labs)

Information Retrieval evaluation

We will address the problem of evaluating the performance of ranking algorithms in general. First, we will present the metrics and the document collections commonly used in the field, and how these collections have been annotated by humans to make evaluation possible. Second, we will review the advantages and limitations of the traditional metrics and collections. We will discuss collection size and homogeneity, and touch on the case of XML and semi-structured collections. We will illustrate how difficult it is for human editors to assess the relevance of a document. Some alternatives to the reliance on editors have been proposed. We will review the use of crowd-based judgments, obtained either from Mechanical Turk workers or from the click logs of search engines. Regarding the metrics themselves, we will look more closely at widely accepted metrics like DCG and MAP to reveal the implicit assumptions they make. This will give insights into how to improve them, but also into how to use click logs to improve both ranking algorithms and evaluation. We will also see what can be done when only a subset of the documents has received an editorial label.

Finally, in most practical situations model tuning and evaluation are done incrementally: a previous model exists and we need to assess whether a new candidate is really better. We will see different techniques, such as A/B testing, interleaving and absence time analysis, that have the potential to give more detailed and accurate answers.
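
To make the DCG metric mentioned in this abstract concrete, here is a minimal Python sketch. It assumes one common formulation, in which the gain of a result is its graded relevance label and the discount is log2(rank + 1); the function names `dcg` and `ndcg` and the example labels are illustrative and are not taken from the tutorial. The discount is an example of an implicit assumption built into the metric: a result counts for less the lower it is ranked, independently of what is ranked above it.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevance labels.

    Assumes gain = label and discount = log2(rank + 1), with ranks starting at 1.
    """
    if k is not None:
        relevances = relevances[:k]
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical editorial labels for the top four results of one query.
    ranking = [3, 2, 0, 1]
    print("DCG: ", dcg(ranking))
    print("NDCG:", ndcg(ranking))
```

Swapping the gain (for example to 2^label - 1) or the discount function changes which rankings the metric prefers, which is one way to explore the assumptions discussed in the session.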

Lunch: 12:45-1:45pm

1:45pm - 3:45pm Yahoo! Labs, India showcase

(Nikhil Rasiwasia, Uma Sawant)

Cross-modal Retrieval: Retrieval across different content modalities (1.5 hours)

Multimedia data such as images, web pages, videos, music, etc. are now available in abundance. The increasing availability demands the development of novel representations to tackle the unique challenges posed by multimedia content. The primary challenge is the heterogeneous nature of the content (data with multiple information modalities), e.g. web pages which contain both images and text, videos which contain both images and audio, songs with associated lyrics, etc. In almost all these situations, different representations are adopted for different modalities, thereby making it nearly impossible to operate across them using traditional retrieval approaches.

In this talk, the problem of cross-modal retrieval from multimedia repositories is considered. This problem addresses the design of retrieval systems that support queries across content modalities, e.g., using an image to search for texts. Two hypotheses are then investigated. The first is that low-level cross-modal correlations should be accounted for. The second is that the joint space should enable semantic abstraction. Three new solutions to the cross-modal retrieval problem are then derived from these hypotheses: correlation matching (CM), an unsupervised method which models cross-modal correlations, semantic matching (SM), a supervised technique that relies on semantic representation, and semantic correlation matching (SCM), which combines both. It is concluded that both hypotheses hold, in a complementary form, although the evidence in favor of the abstraction hypothesis is stronger than that for correlation.

Learning joint query interpretation and response ranking (30 minutes)

Thanks to information extraction and semantic Web efforts, search on unstructured text is increasingly refined using semantic annotations and structured knowledge bases. However, most users cannot become familiar with the schema of knowledge bases and ask structured queries. Interpreting free-format queries into a more structured representation is of much current interest. The dominant paradigm is to segment or partition query tokens by purpose (references to types, entities, attribute names, attribute values, relations) and then launch the interpreted query on structured knowledge bases. Given that structured knowledge extraction is never complete, here we choose a data representation that retains the unstructured text corpus, along with structured annotations (mentions of entities and relationships) on it.

We propose two new, natural formulations for joint query interpretation and response ranking that exploit bidirectional flow of information between the knowledge base and the corpus. One, inspired by probabilistic language models, computes expected response scores over the uncertainties of query interpretation. The other is based on max-margin discriminative learning, with latent variables representing those uncertainties. In the context of typed entity search, both formulations bridge a considerable part of the accuracy gap between a generic query that does not constrain the type at all, and the upper bound where the "perfect" target entity type of each query is provided by humans. Our formulations are also superior to a two-stage approach of first choosing a target type using recent query type prediction techniques, and then launching a type-restricted entity search query.





Speaker bio

Tom Heath is Data Scientist at the Open Data Institute, a non-profit organisation at the forefront of research into data sharing on the Web. He joined the ODI from Talis Group, where he led internal research and data science programmes in the fields of Linked Data and the Semantic Web, and was instrumental in ensuring that Talis became synonymous with these terms. He is a long-standing contributor to the Linked Open Data movement, and co-author of key reference papers and books on the subject including "Linked Data: Evolving the Web into a Global Data Space" and "Linked Data - The Story So Far" with Tim Berners-Lee and Christian Bizer. Tom has a BSc in Psychology and a PhD in Computer Science from the Open University, where he developed novel recommender systems based on trust relationships in social networks. In 2009 he was named 'PhD of the Year' by STI International and in 2010 one of 'Ten to Watch in Artificial Intelligence' by IEEE Intelligent Systems Magazine.

Peter Mika is a Senior Research Scientist at Yahoo!, based in Barcelona, Spain. Peter is working on the applications of semantic technology to Web search. He received his MSc and PhD in computer science (summa cum laude) from Vrije Universiteit Amsterdam. He is the author of the book 'Social Networks and the Semantic Web' (Springer, 2007). In 2008 he was selected as one of "AI's Ten to Watch" by the editorial board of the IEEE Intelligent Systems journal. Peter is a regular speaker at both academic and technology conferences and serves on the advisory board of a number of public and private initiatives. He represents Yahoo! in the leadership of the schema.org collaboration with Google, Bing and Yandex.

Roi Blanco is a Senior Research Scientist in Yahoo! Labs Barcelona, where he has been working since 2009. He is interested in applications of natural language processing for information retrieval, web search and mining and large-scale information access in general, publishing at international conferences in those areas. He also contributes to different industrial products like Yahoo! Search. Previously he taught computer science at A Coruña University, from which he received his Ph.D. degree (cum laude) in 2008.

Edgar Meij is a research scientist at Yahoo! Research. Before this, he was a postdoc at the Intelligent Systems Laboratory of the University of Amsterdam, where he obtained his PhD in Computer Science. His PhD work focused on applying conceptual knowledge from ontologies, thesauri, tags, annotations, or any other structured knowledge source to improve information access. His current research focuses on entity linking and semantic search. As an active member of the information retrieval and natural language processing research communities, he has published over 40 research papers and serves on the program committees of ECIR, SIGIR, CIKM, and WSDM. He regularly teaches at the graduate and post-graduate level, including courses such as Advanced Information Retrieval and tutorials such as Statistical Language Modeling for Information Access. He is a co-organizer of various entity-related NLP and IR workshops, including Reputation 2012 (Language Engineering for Online Reputation Management) and RepLab (an entity-oriented evaluation initiative at CLEF).

Georges Dupret graduated as a civil engineer in Applied Mathematics from the University of Louvain, Belgium. He obtained his PhD from the University of Tsukuba, Japan for his work on spatial data forecasting based on Artificial Neural Networks. At the same time, he worked for the IBM TRL Research Laboratory in the field of Information Retrieval on the automatic identification of relevant topics and issues in call center transaction data. This involved the efficient approximation of the Singular Value Decomposition of huge, sparse matrices. He continued this work in the IBM Research Laboratory in Zürich, Switzerland and extended it to Quality Insurance transactions. After two years as a researcher in the Center for Web Research in Santiago, Chile, he joined Yahoo! Labs in 2008, working in Chile and the United States. His topics of interest are web mining, clickthrough data analysis and modeling, query disambiguation and recommendation, and automatic taxonomy construction.

Yi Chang joined Yahoo! in 2006. He leads the ranking science team, working on multiple vertical search and relevance ranking projects. His research interests include information retrieval, applied machine learning, natural language processing and social computing. Yi has more than 50 publications in top journals and conferences. His recent academic activities include serving as associate editor of the Neurocomputing journal and Pattern Recognition Letters. Yi served as poster chair of WWW 2012, workshop co-chair of ICML 2010 and KDD 2012, and senior PC member of CIKM 2011-2013.

Maya Ramanath is an assistant professor in the Dept. of Computer Science and Engg. at IIT-Delhi. Previously, she was a researcher in the Database and Information Systems group at the Max-Planck Institute for Informatics, Saarbruecken, Germany. Her research interests are in the area of database and information retrieval techniques for semantic web data management, information extraction and opinion mining. She is a PC member of CIKM 2013, EDBT 2014 and PVLDB 2014.

Ganesh Ramakrishnan is an alumnus of IIT Bombay (BTech/PhD in Computer Science and Engineering), where he also serves as an Assistant Professor. Prior to joining IIT Bombay in 2009, Ganesh worked with the IBM India Research Labs (2004-2009). Ganesh's areas of interest include (i) entity extraction and precision search in enterprise domains and (ii) efficient feature induction. He is part of projects such as (a) academic and enterprise search, (b) cross-lingual information retrieval, (c) entity extraction and disambiguation (IBM Faculty award), (d) efficient feature induction and (e) analytics in collaborative development across educational institutes in India (www.techpedia.in). For more information, visit www.cse.iitb.ac.in/~ganesh.







Nikhil Rasiwasia received the B.Tech degree in electrical engineering from the Indian Institute of Technology Kanpur (India) in 2005. He received the MS and PhD degrees from the University of California, San Diego in 2007 and 2011 respectively, where he was a graduate student researcher at the Statistical Visual Computing Laboratory, in the ECE department. Currently, he is working as a scientist for Yahoo! Labs Bangalore, India. In 2008, he was recognized as an 'Emerging Leader in Multimedia' by IBM T. J. Watson Research. He also received the best student paper award at the ACM Multimedia conference in 2010. His research interests are in applying machine learning solutions to computer vision problems.

NOTE: YAHOO! WILL NOT BEAR ANY TRAVEL, BOARDING OR LODGING EXPENSES OF THE ATTENDEES.