Speaker: Misagh Falahi
Affiliation: Ph.D. Candidate - UCLA
Abstract: The development of the Web has, among its other direct influences, provided a vast amount of data to researchers across several disciplines. While in the early stages of its growth this data often went unseen and was secondary to the other products the Internet made available, in the past decade it has become a primary resource for a large number of online applications and has enabled many analyses and studies. Text data in particular has been a cornerstone of this work in the attempt to better understand human knowledge and behavior. This work focuses on analyzing the process of writing documents and the abstract underlying contexts that drive it. Contexts are inherently conceptual entities with several possible representations. We propose a generative model for documents based on psychological models of human memory search, and from it we define structures that can represent these abstract contexts. Recent work in the psychology literature suggests that the brain's memory search process can be modeled as a random walk on a semantic network. The vast body of research on random walks in different disciplines, and more recently their use in analyzing the structure of the web and building search engines, makes this model particularly appealing for understanding and simulating the brain's process of vocabulary selection and document generation. It can also drive lexical applications and automated text analyses, such as exploring the structures inherent in a language and the relationships between words.
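The memory-search model mentioned above can be sketched as a uniform random walk over a word-association graph. The toy semantic network, word list, and step count below are illustrative assumptions for exposition only, not data from the talk:

```python
import random
from collections import Counter

# Toy semantic network (assumed example data): each word maps to its
# associated words; edges represent undirected word associations.
SEMANTIC_NET = {
    "dog":    ["cat", "bone", "park"],
    "cat":    ["dog", "mouse", "milk"],
    "bone":   ["dog"],
    "park":   ["dog", "tree"],
    "mouse":  ["cat", "cheese"],
    "milk":   ["cat", "cheese"],
    "tree":   ["park"],
    "cheese": ["mouse", "milk"],
}

def memory_search(network, start, steps, seed=0):
    """Simulate memory search as a uniform random walk on the network.

    Returns the sequence of visited words; visit frequencies approximate
    how readily each word is retrieved when starting from `start`.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    word = start
    visited = [word]
    for _ in range(steps):
        word = rng.choice(network[word])  # step to a random neighbor
        visited.append(word)
    return visited

walk = memory_search(SEMANTIC_NET, "dog", steps=1000)
freqs = Counter(walk)  # visit counts per word
```

For a long enough walk, the visit frequencies approach the walk's stationary distribution, which on an undirected network is proportional to node degree; this is the sense in which network structure shapes which words the model "retrieves" most often.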
In this work, we present a network approach to describing document generation and discovering contexts. We form an associative network of words based on co-occurrence, with the tie between two words weighted by the number of documents in the corpus in which they appear together. By inspecting the hierarchical modularity of this network and applying community detection algorithms based on random walks, we can find communities of words that form contextually homogeneous groups. Within a context defined by one of these groups, the relative importance of every other word can be determined by creating a contextually biased word association network and applying Google's PageRank algorithm, which emphasizes nodes with higher centrality. We use these context profiles to form a context-term matrix representing semantic traces in memory. We then study the hierarchical structure of contextually significant word clusters across different layers of the network by examining the layer blocks of the context-term matrix. A closely related line of work is topic modeling, the unsupervised learning of patterns of words and phrases that can represent "topics". The mainstream view in topic modeling regards a topic as a distribution over a known vocabulary. Latent Dirichlet allocation (LDA), for instance, finds a given number of topics within a text corpus, each represented by a distribution over all words; LDA essentially fits a latent-variable distribution over word combinations to a set of observed documents. While not the main goal of the model presented here, we also extend the approach to compute vector representations of topics that summarize the information contained in the corpus. These vector representations are obtained by factorizing the context-term matrix, and the resulting summary also reveals important sub-structures of the large hierarchical structure.
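Two steps of the pipeline above, building the co-occurrence network and ranking words within a context via a biased (personalized) PageRank, might be sketched as follows. The tiny corpus, the damping factor, and the chosen context words are illustrative assumptions; the talk's community-detection step and the context-term matrix factorization are omitted:

```python
from itertools import combinations
from collections import defaultdict

# Assumed toy corpus: each document is a set of words.
docs = [
    {"network", "random", "walk", "community"},
    {"network", "pagerank", "centrality"},
    {"memory", "search", "random", "walk"},
    {"topic", "model", "corpus", "network"},
]

# Co-occurrence weights: each pair's tie counts the documents
# in which both words appear (stored symmetrically).
weights = defaultdict(int)
for doc in docs:
    for u, v in combinations(sorted(doc), 2):
        weights[(u, v)] += 1
        weights[(v, u)] += 1

vocab = sorted({w for doc in docs for w in doc})

def biased_pagerank(context, damping=0.85, iters=100):
    """Personalized PageRank by power iteration: teleport only to words
    in `context`, yielding contextually biased importance scores."""
    rank = {w: 1.0 / len(vocab) for w in vocab}
    teleport = {w: (1.0 / len(context) if w in context else 0.0)
                for w in vocab}
    # Total outgoing weight per word, used to normalize transitions.
    out = {w: sum(weights[(w, v)] for v in vocab) for w in vocab}
    for _ in range(iters):
        new = {}
        for w in vocab:
            inflow = sum(
                rank[u] * weights[(u, w)] / out[u]
                for u in vocab if out[u] > 0 and weights[(u, w)] > 0
            )
            new[w] = (1 - damping) * teleport[w] + damping * inflow
        rank = new
    return rank

# Score every word relative to an assumed {"random", "walk"} context;
# one such score vector would form one row of the context-term matrix.
scores = biased_pagerank({"random", "walk"})
```

Repeating this for each detected community produces the context profiles that stack into the context-term matrix described above; a low-rank factorization of that matrix (e.g. non-negative matrix factorization) would then yield the topic-like vector summaries the abstract mentions.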
Biography: Misagh Falahi is a Ph.D. candidate in Electrical Engineering at UCLA under the supervision of Prof. Vwani Roychowdhury. He received his B.S. degree in Electrical Engineering from Sharif University, Tehran, Iran, in 2012, and his M.S. degree in Electrical Engineering from the University of California, Los Angeles, in 2014. Misagh's research interests lie in the areas of Information Retrieval, Natural Language Processing, and Complex Networks.
For more information, contact Prof. Vwani Roychowdhury.
Date(s) - Sep 08, 2017
1:00 pm - 3:00 pm