Sentence Ordering in Multi-document Summarization

In multi-document summarization (MDS), the goal is to create a single coherent summary from a given set of documents. Usually, all documents to be summarized describe a particular event (e.g. a set of newspaper articles about a particular news story). In typical extractive summarization, an MDS system first identifies the important sentences in the given set of documents and selects a subset of those sentences as the summary. In doing so, the system must avoid duplicated information while maintaining good coverage of the source documents. Usually, the number of sentences to be selected is given to the system. However, it is not obvious how to order the selected sentences so that the summary as a whole is coherent. This research focuses on this problem: given a set of sentences, how should they be ordered coherently? For further details, refer to this paper.

We proposed a novel measure, average continuity, to evaluate sentence ordering in MDS. A script to compute average continuity is available here. Intuitively, the more continuous segments a particular ordering of sentences shares with a human-ordered reference summary, the better its average continuity score. Note that this approach to evaluating an ordered set of items differs from popular rank correlation coefficients such as Kendall's tau or Spearman's rank correlation coefficient, which focus on discordant (reversed) pairs rather than continuous sequences. We manually ordered sentences for training and testing purposes in this research. The data set is available here.
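The intuition behind average continuity can be illustrated with a minimal sketch. It assumes the measure is computed as a smoothed geometric mean of n-gram precisions over runs of n consecutive sentences (n = 2 to k); the function names, the default k = 4, and the smoothing constant alpha = 0.01 are assumptions made for illustration and are not taken from the script mentioned above.

```python
import math


def sentence_ngrams(ordering, n):
    """Return all runs of n consecutive sentence IDs in an ordering."""
    return [tuple(ordering[i:i + n]) for i in range(len(ordering) - n + 1)]


def average_continuity(candidate, reference, k=4, alpha=0.01):
    """Sketch of an average-continuity style score (assumed formulation).

    candidate, reference: orderings of the same unique sentence IDs.
    k: longest run length considered (assumed default).
    alpha: small constant that keeps log() defined when nothing matches.
    """
    if sorted(candidate) != sorted(reference):
        raise ValueError("both orderings must contain the same sentences")
    max_n = min(k, len(candidate))
    if max_n < 2:
        raise ValueError("at least two sentences are required")

    log_sum = 0.0
    for n in range(2, max_n + 1):
        reference_runs = set(sentence_ngrams(reference, n))
        candidate_runs = sentence_ngrams(candidate, n)
        matched = sum(1 for run in candidate_runs if run in reference_runs)
        precision = matched / len(candidate_runs)
        log_sum += math.log(precision + alpha)
    # Geometric mean of the smoothed precisions for n = 2..max_n.
    return math.exp(log_sum / (max_n - 1))


reference = [1, 2, 3, 4]
same_order = average_continuity([1, 2, 3, 4], reference)
swapped_halves = average_continuity([3, 4, 1, 2], reference)
reversed_order = average_continuity([4, 3, 2, 1], reference)
```

Under these assumptions, an ordering identical to the reference scores highest, a fully reversed ordering scores lowest, and an ordering that keeps some adjacent sentences together (such as the swapped halves) falls in between.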

Person Name Disambiguation and Name Alias Detection

To correctly resolve an entity, one must overcome two fundamental problems: lexical ambiguity (different entities with the same name) and referential ambiguity (the same entity being referred to by different aliases). This research addresses both problems in the special case of personal names. For example, if you search for Jim Clark on Google, you will find several different people named Jim Clark (e.g. the Formula One champion and the founder of Netscape). This is the namesake disambiguation problem: given an ambiguous name, find the different entities it refers to. On the other hand, "fresh prince" is an alias used for Will Smith. This is the alias detection problem: given the real name of an entity, find its aliases. For further details on the name disambiguation problem, refer to the ECAI-2006 paper. For further details on name alias detection, refer to the GoTAL-2008 paper. Download the name alias datasets used in the papers. The personal name disambiguation system was reported by New Scientist Tech and ACM TechNews.

Semantic Similarity

Given two words A and B, how can we measure the semantic similarity between them? This is the problem that I focus on in this research. Semantic similarity is a concept that extends beyond synonymy and is often called semantic relatedness in the literature. For example, a certain degree of semantic similarity can be observed not only between synonyms (e.g. car and automobile), but also between meronyms (e.g. car and tire), hyponyms (e.g. jaguar and cat), related words (e.g. Apple and Computer), and even antonyms (e.g. hot and cold). In this research, I attempt to measure the degree of semantic similarity between two given words using the information available on the Web. I use various kinds of information that web search engines provide for the two words and attempt to learn the concept of semantic similarity from it. For further details on this research, refer to this paper.
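As one illustration of how search engine information might be turned into a similarity score, the sketch below computes Jaccard and Dice coefficients from page counts. Treating page counts as the input is an assumption made for this example; the function names, the `min_cooccurrence` noise threshold, and the counts themselves are hypothetical and not taken from the paper.

```python
def web_jaccard(count_a, count_b, count_ab, min_cooccurrence=0):
    """Jaccard coefficient from page counts (illustrative sketch).

    count_a, count_b: page counts for the queries "A" and "B".
    count_ab: page count for the conjunctive query "A AND B".
    min_cooccurrence: hypothetical threshold below which co-occurrence
        is treated as noise and the similarity is set to zero.
    """
    if count_ab <= min_cooccurrence:
        return 0.0
    denominator = count_a + count_b - count_ab
    if denominator <= 0:  # guard against inconsistent, noisy counts
        return 0.0
    return count_ab / denominator


def web_dice(count_a, count_b, count_ab, min_cooccurrence=0):
    """Dice coefficient from the same three page counts."""
    if count_ab <= min_cooccurrence or count_a + count_b <= 0:
        return 0.0
    return 2 * count_ab / (count_a + count_b)


# Made-up page counts, used only to show the call pattern.
jaccard_score = web_jaccard(count_a=100, count_b=50, count_ab=25)
dice_score = web_dice(count_a=100, count_b=50, count_ab=25)
```

Scores of this kind could serve as individual features; learning how to combine several such signals is a separate step that this sketch does not cover.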

Relational Similarity

Similarity can be broadly categorized into two types: attributional and relational. In attributional similarity, we compare the attributes of two objects (e.g. entities or words). When measuring the relational similarity between two pairs of objects, on the other hand, we compare the relations that hold between the two objects in each pair. For example, there is a high degree of relational similarity between the word pairs (ostrich,bird) and (lion,cat) because the implicit relation IS_A_LARGE exists between the two words in each pair. In contrast, ostriches and lions do not have many attributes in common and would not have a high degree of attributional similarity. The semantic similarity problem introduced above is in fact a case of attributional similarity. In this work, I look at how to measure the relational similarity between two word pairs (or named-entity pairs) using the information available on the Web. For example, we would expect a high degree of relational similarity between the entity pairs (Google,YouTube) and (Microsoft,Powerset) because the implicit semantic relation ACQUIRES exists between the two entities in each pair. For further details on this work, refer to the WWW-2009 paper. Download the ENT dataset described in the paper.
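The idea of comparing relations rather than attributes can be sketched as follows. The example assumes that the relation in each pair is represented by a vector of lexical pattern frequencies (how often a pattern such as "X acquired Y" links the two items in text) and that two pairs are compared by cosine similarity. This representation, the patterns, and all counts are invented for illustration and are not presented as the method of the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(pattern, 0.0) for pattern, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pattern counts for each pair (X, Y); the numbers are made up.
google_youtube = {"X acquired Y": 12, "X bought Y": 7, "Y , owned by X": 3}
microsoft_powerset = {"X acquired Y": 5, "X bought Y": 2, "X to buy Y": 4}
ostrich_bird = {"X is a large Y": 9, "X , the largest Y": 4}

acquisition_similarity = cosine_similarity(google_youtube, microsoft_powerset)
unrelated_similarity = cosine_similarity(google_youtube, ostrich_bird)
```

With these invented counts, the two acquisition pairs share patterns and receive a positive score, while the acquisition pair and (ostrich,bird) share no patterns and score zero.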