Text Summarization Based on Semantic Graphs

Semantic Summarization
Automatic text summarization is a major research area in natural language processing. The ability to digest a large number of documents and produce a concise textual summary is a hallmark of artificial intelligence. Most existing research focuses on extractive approaches, in which important sentences are identified, optionally compressed, and concatenated to form a summary. This project seeks to leverage Abstract Meaning Representation (AMR), a graph-based semantic representation of natural language, to develop text summarization approaches. Specifically, the input documents are merged into a source semantic graph, from which one or more summary graphs that carry the important semantic content are identified. Natural language sentences are then generated from the summary (target) graphs to form an abstractive summary.
Publications: NAACL 2015
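The merge-then-select pipeline described above can be illustrated with a minimal sketch. It assumes each sentence graph is already available as a list of (head, relation, tail) triples; the hand-written triples, the function names `merge_graphs` and `select_summary_graph`, and the frequency-based selection are all illustrative assumptions, not the project's actual parser output or learned selection model.

```python
from collections import Counter


def merge_graphs(sentence_graphs):
    """Merge sentence graphs into one source graph.

    Nodes with the same concept label are treated as the same node, so
    identical triples from different sentences collapse into one edge.
    The returned Counter records in how many sentences each edge appears.
    """
    edge_counts = Counter()
    for graph in sentence_graphs:
        edge_counts.update(set(graph))  # count each edge once per sentence
    return edge_counts


def select_summary_graph(edge_counts, max_edges):
    """Pick the most frequent edges as a toy summary graph.

    Frequency ranking is a stand-in (assumption) for a real selection
    model; ties are broken alphabetically to keep the result stable.
    """
    ranked = sorted(edge_counts.items(), key=lambda item: (-item[1], item[0]))
    return [edge for edge, _ in ranked[:max_edges]]


# Hypothetical AMR-style triples for three input sentences.
sentence_graphs = [
    [("chase-01", "ARG0", "dog"), ("chase-01", "ARG1", "cat")],
    [("chase-01", "ARG0", "dog"), ("chase-01", "ARG1", "ball")],
    [("sleep-01", "ARG0", "cat")],
]

source_graph = merge_graphs(sentence_graphs)
summary_graph = select_summary_graph(source_graph, max_edges=2)
print(summary_graph)
```

Generating sentences from the selected summary graph is a separate step and is not covered by this sketch.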

Extracting Key Practices from Website Privacy Policies

Semantic Summarization
Website privacy policies are legally binding agreements between end users and website operators. They are verbose, too long to read, and difficult to understand. Few people attempt to read them, and those who do struggle to understand what they mean. This project seeks to extract key privacy practices by combining natural language processing, machine learning, and crowdsourcing techniques. Specifically, privacy experts pose a number of questions, and the program aims to answer them automatically for a given privacy policy. Example questions include: “Does the policy state that the website might collect contact/location/health/financial information about its users?” One of the papers from this project was presented at the 25th International World Wide Web Conference (WWW), held in Montreal, Canada, in 2016, and was selected as one of five Best Paper Finalists.
Publications: COLING 2014, ACL 2014, WWW 2016

Social Media Text Analysis

Social Media Analytics
Social media chats pose great challenges to existing text analytics software. The chats are short, informal, and generated in great volume. The text is rife with abbreviations, acronyms, spelling errors, and other nonstandard word spellings (e.g., tmrw for tomorrow). This project has two goals: 1) to develop algorithms that automatically translate nonstandard word spellings into standard words, and 2) to generate a real-time text summary of an event of interest using the Twitter data stream as input. Research results include scholarly publications in leading academic conferences (ACL 2011, ACL 2012, NAACL 2013) and three patent applications filed by Bosch Research (one patent was issued in 2016). Our text normalization system achieves 90% word coverage on four datasets of SMS and Twitter messages.
Publications: ACL 2011, ACL 2012, NAACL 2013
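To make the first goal concrete, here is a toy sketch of mapping nonstandard spellings such as tmrw to standard words. It is not the project's normalization system: the lookup table, the vocabulary, and the "same first letter plus subsequence" heuristic are assumptions chosen only to illustrate the task.

```python
def is_subsequence(short, long):
    """Return True if the characters of `short` appear in `long` in order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def normalize_token(token, vocabulary, lookup):
    """Map one token to a standard word, or return it unchanged.

    Order of attempts (all illustrative assumptions):
    1. the token is already a standard word;
    2. the token has an entry in a hand-built lookup table;
    3. the token is a subsequence of a vocabulary word with the same
       first letter (abbreviations often drop letters, e.g. vowels).
    """
    lower = token.lower()
    if lower in vocabulary:
        return lower
    if lower in lookup:
        return lookup[lower]
    candidates = [
        word for word in vocabulary
        if lower and word[0] == lower[0] and is_subsequence(lower, word)
    ]
    if candidates:
        # Prefer the shortest candidate; break ties alphabetically.
        return min(candidates, key=lambda word: (len(word), word))
    return token


def normalize(message, vocabulary, lookup):
    """Normalize a whitespace-tokenized message token by token."""
    return " ".join(normalize_token(t, vocabulary, lookup) for t in message.split())


# Hypothetical resources for the example.
vocabulary = {"see", "you", "tomorrow", "together", "tonight"}
lookup = {"u": "you"}

print(normalize("see u tmrw", vocabulary, lookup))  # see you tomorrow
```

A heuristic like this overgenerates on very short tokens, which is one reason a real system would rank candidates with a trained model and surrounding context.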

Summarizing Speech Conversations

Text Summarization
Human conversations are being recorded at an unprecedented scale. Examples include telephone speech, meetings, broadcast conversations, lectures, call center dialogues, and doctor-patient conversations. This project focuses on multi-party meeting conversations using a dataset collected by ICSI Berkeley. The goal is to automatically generate a textual summary of an hour-long meeting conversation. The speech recordings are transcribed by both human annotators and an automatic speech recognizer (ASR) developed by SRI. We investigate different approaches that generate textual summaries by extracting and/or compressing the transcribed spoken utterances. The approach yields a 5% absolute performance gain on both human and ASR transcripts, as measured by ROUGE-1 F-scores.
Publications: IEEE T-ASLP 2013, IEEE T-ASLP 2011
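For readers unfamiliar with the metric mentioned above, the following is a simplified sketch of a ROUGE-1 F-score: the harmonic mean of unigram precision and recall between a candidate summary and one reference summary. It assumes lowercased whitespace tokenization, a single reference, and no stemming or stopword removal, so its numbers will not match the official ROUGE toolkit exactly. The function name and example sentences are illustrative.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F-score between two strings.

    Overlap is the clipped unigram match count: each word counts at most
    as many times as it appears in both the candidate and the reference.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the meeting discussed the budget"
candidate = "the budget was discussed"

# 3 matching unigrams: precision 3/4, recall 3/5.
print(round(rouge1_f(candidate, reference), 3))  # 0.667
```

Under this definition, a 5% absolute gain means the F-score rises by 0.05, for example from 0.60 to 0.65.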