Semantics-driven Analysis of Social Media

Please check out the comprehensive tutorials from our group:

at WWW 2011 conference: Meena Nagarajan, Amit Sheth, and Selvam Velmurugan. Citizen Sensor Data Mining, Social Media Analytics and Development Centric Web Applications (Slides)

at ICWSM 2013 conference: Hemant Purohit, Carlos Castillo, Patrick Meier, and Amit Sheth. Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citizen Roles for Crisis Response. (Slides)

at SDM 2014 conference: Hemant Purohit, Carlos Castillo, and Fernando Diaz. Leveraging Social Media and Web of Data for Crisis Response Coordination.

Quick overview of our broader Vision for Citizen Sensing research:

Computing for Human Experience: Semantics-Empowered Sensors, Services, and Social Computing on the Ubiquitous Web

- In the interdisciplinary context, our work is part of a broader agenda of analyzing citizen sensing to understand, inform policy or decision makers and develop tools to help manage important social and human development issues/challenges, including:

  • Coordination during disasters
  • We investigate massive social media communities during disasters via psycholinguistic theories to assist coordination functions of demand, supply and engagement while answering: with whom to coordinate, why and how.
  • Harassment on social media
  • We analyze social media conversations to identify and monitor harassment by understanding language of source and target users for mining intention, sentiment tone and emotions evoked. [Highlight: Fast Company's article on our cursing behavior research]
  • Prescription drug abuse
  • We mine social media to capture the knowledge, attitudes and behaviors of prescription drug abusers through the automatic extraction of semantic information (including entities, relationships, triples and other intelligible constructs such as sentiments, emotions, intervals, frequency, dosage, etc.) [Highlight: article on this research]
  • Depressive disorders
  • We leverage social sharing behaviors to mine depression and other mental health issues in this area. [Highlight: Collaboration with domain experts at Mayo Clinic]
  • Gender-based violence
  • We model gender-based dynamics in the social data stream across the world to inform policy decision making of development agencies, in collaboration with the Subject Matter Experts at UN. [Highlight: Joint research with UNFPA experts to directly impact the policy actions]

Short summaries of various aspects of interdisciplinary Social Media research at Kno.e.sis:

Over the last few years, there has been a growing public fascination with 'social media' and its role in modern society. At the heart of this fascination is the ability for users to create and share content via a variety of platforms such as blogs, micro-blogs, collaborative wikis, multimedia sharing sites, social networking sites etc. Our research primarily focuses on the analysis of various aspects of User-Generated Content (UGC) that are central to understanding inter-personal communication on social media. More recently, our interdisciplinary collaboration is studying People-Content-Network, and sentiment-emotion-subjectivity analyses. The objective of our work on semantic content analysis is to bring structure and organization to unstructured chatter on social media centered around the following questions:

  • What are people talking about: What are the Named Entities and topics that people are making references to? How are cultures interpreting any situation in local contexts and supporting them in their variable observations on a social medium?
  • How are they expressing themselves: What do word usages tell us about an active population or about individual allegiances or non-conformity to group practices?
  • Why do they scribe: What are the diverse intentions that produce the diverse content on social media? Can we understand why we share by looking at what we predominantly do with the medium? What emotions are people sharing about something?

On one hand, the social context surrounding the production, consumption, and sharing of user-generated content has opened several opportunities for enriching user interaction with content. On the other hand, this same social aspect to content production has introduced new challenges in terms of the content's informal nature.

User-generated textual content in social media has unique characteristics that set it apart from the traditional content we find in news or scientific articles. Due to social media's personal and interactive communication format, user-generated content is inherently less formal and unmediated. Off-topic discussions are common, making it difficult to automatically identify context. Content is often fragmented, doesn't always follow English grammar rules, and relies heavily on domain- or demographic-specific slang, abbreviations, and entity variations (using "skik3" for "SideKick 3", for example). Some user-generated content is also terse by nature, such as in Twitter posts, which leaves minimal clues for automatically identifying context. All of these factors make the process of automatically identifying what a social media snippet is actually about much harder.

Moreover, the recent popularity of the Web and online social media has completely revolutionized human interaction. The Web has an unprecedented scale of user interactions, as the Internet has become an integral part of human life. This shift in human social dynamics requires research, as the applicability of traditional research methods in social sciences is not clear at this new scale of social interactions. Therefore, we need an interdisciplinary approach in which Computer Science methods build on insights from existing theories in other disciplines (Social Science, Psychology, Linguistics, etc.). Such an approach can enable us to perform quantitative analysis of this massive social data and help us understand the evolving human social dynamics.

Our current phase of the work is identifying micro-level variables from the People, Content and Network dimensions of a social network, which will ultimately result in an understanding of the macro-level behavior of social phenomena such as information dissemination, influence spread, community evolution and sustainability, and coordination in communities. It will also allow us to validate the classical theories of Social Sciences in this digital realm.

Our work specifically spans nine closely aligned projects:

  1. Named Entity Recognition
  2. Language usage in Social Media
  3. Monetization/Targeted Content Delivery on Social Networks
  4. Exploration of People, Content and Network dynamics in the online social networks
  5. Assisting Crisis Response Coordination by Leveraging Social Media Communities 
  6. Trust Networks: Interpersonal, Social and Sensor
  7. Mining Opinion and Emotion from Social Media
  8. TWITRIS: A System for Mining Collective Intelligence from Citizen-Sensor Data
  9. Linked Open Social Signals

Depending on the context, we have studied, analyzed and evaluated a variety of content from Twitter, Facebook, MySpace and Wikipedia, and we have used a variety of relevant background knowledge to advance traditional computational techniques.

Named Entity Recognition (What do people write)

Our work in identifying what people talk about on social media focuses on recognizing named entities, with particular focus on a class of entities called Cultural Entities - those that refer to artifacts of culture, for example, names of movies, TV shows, songs and book titles. In addition to referring to multiple real-world entities (e.g. "The Lord of the Rings" can refer to multiple instances of movies, different video games and a number of novels), cultural entities are particularly hard to extract because of their use of fragments from everyday language.

In a recent work with Amir Padovitz at Microsoft Research, we explored a feature-based approach to improve the accuracy of existing named entity classifiers in identifying such cultural entities. We hypothesized that knowing how hard it is to extract an entity is useful for learning better entity classifiers. With such a measure, entity extractors become "complexity aware", i.e. they can learn to respond differently to signals depending on the entity's extraction difficulty. We proposed and developed an unsupervised algorithm to extract this prior using graph-based spreading activation and clustering techniques. We conducted evaluations in identifying movie named entities in informal weblog posts and found overwhelming evidence that this new prior improves extraction accuracy, supporting our hypothesis about engineering 'complexity aware' classifiers.

Another of our investigations in identifying cultural entities, conducted with researchers at IBM Almaden, utilized MusicBrainz, a rich source of domain knowledge about music entities and their relationships (encoded in RDF), to annotate artist and track/album mentions in UGC from MySpace music forums [ISWC09a]. In this work, we showed that eliminating parts of the domain model, using constraints implied in the content and metadata from the domain model, effectively reduces entity disambiguation scenarios and improves spotting precision. For example, a comment such as 'Saw you last night in Denver' indicates that the artist is still alive, allowing us to rule out parts of the ontology mentioning artists who are not. We also showed that simple Machine Learning classifiers built over such pruned models, and learning over a variety of feature types (a combination of Natural Language features, domain-related words such as music, song, concert, etc., and sentiment expression features), yielded better results than using any of them alone.

In other related efforts, we worked on providing spatio-temporal-thematic summaries of chatter on Twitter using contextual information from the social medium [WISE09]. As part of a targeted content delivery platform, we used an information theory based algorithm to eliminate off-topic chatter in user-generated content on MySpace and Facebook forums to detect the main topic of discussion [WI09]. All of these studies highlighted pertinent challenges that informal UGC brings to text analytics.

Language Usage in Social Media (How do People Write)

In collaboration with Prof. Marti Hearst at UC Berkeley, we conducted an analysis on how men and women self-present in their online dating profiles [ICWSM09]. We studied language usage by quantifying usages of words from linguistic, personal and psychological categories from the Linguistic Inquiry Word Count (LIWC) Dictionary; next, using a multivariate statistical approach called Exploratory Factor Analysis to identify systematic co-occurrence patterns among LIWC variables; and finally grouping user profiles on the basis of their shared multi-dimensional features to compare and contrast self-presentation strategies.
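The first step described above, quantifying word usage by category, can be sketched roughly as follows. This is only an illustration: the LIWC dictionary itself is not reproduced here, so `TOY_CATEGORIES` and `category_proportions` are hypothetical stand-ins, and the factor analysis and profile grouping steps are not shown.

```python
import re
from collections import Counter

# Hypothetical stand-in for a category dictionary; the real LIWC
# dictionary is far larger and is not reproduced here.
TOY_CATEGORIES = {
    "tentative": {"maybe", "could", "perhaps"},
    "affect": {"love", "happy", "hate"},
}


def category_proportions(text, categories=TOY_CATEGORIES):
    """Return the share of tokens in `text` that fall in each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {name: counts[name] / total for name in categories}


profile = "Perhaps we could meet; I love hiking and maybe cooking."
proportions = category_proportions(profile)
```

Each profile then becomes a vector of category proportions, which is the kind of input a multivariate method such as factor analysis would operate on.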

We found interesting results in language usage by the gender groups that deviated from results of past studies in the general speech and writing habits of men and women. In particular, we found men to be using a higher proportion of tentative words (e.g., 'may be', 'could' and 'perhaps'), a class of words typically attributed to the feminine discourse.

We also found many similarities in actual word usages, more so in the use of open-class category words (like affect and verb groups). We believe it may be the case that self-expression tends towards attempting homophily in online dating profiles, given the tendency to 'imitate and impress' in courtship.

Besides people's language use in online dating profiles, we also examined people's cursing on Twitter [CSCW04]. On social media, people can chat instantly with friends without face-to-face interaction, usually in a more public fashion and with messages broadly disseminated through a highly connected social network. Will these distinctive features of social media lead to a change in people's cursing behavior? We examined the characteristics of cursing activity on a popular social media platform, Twitter, through the analysis of about 51 million tweets and about 14 million users. In particular, we explored a set of questions that prior studies have recognized as crucial for understanding cursing in offline communications, including the ubiquity, utility, and contextual dependencies of cursing.

Monetization/Targeted Content Delivery on Social Networks (Why do People Write)

Of the many messages that users post on social media every day, an important task for an application trying to respond to a message is to identify the underlying intent. Our work in the identification of intents behind user posts caters to monetization of user activity on social networks [WI09]. Unlike web search, the presence of an entity alone does not classify intent accurately. Any of these intentions could occur with a product X: 'i am thinking of getting X' (transactional); 'i like my new X' (information sharing); and 'what do you think about X' (information seeking). Our approach to automatic identification of intents relied on using 'action patterns', i.e., patterns of words surrounding entity X. Using a set of seed 'action patterns' indicating intent, we developed a minimally supervised bootstrapping algorithm that learns new intent-revealing patterns from an un-annotated corpus of 10K user posts from MySpace. Intent tendencies of new patterns are computed using semantic similarity (based on communicative functions of words from the LIWC dictionary) and distributional similarity with seed patterns.
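As a rough illustration of how seed 'action patterns' might be applied, the sketch below matches the three example patterns from the text against a post after replacing the entity mention with a slot marker. The function name `detect_intent` and the plain substring matching are assumptions made for illustration; the bootstrapping step that learns new patterns is not shown.

```python
import re

# Seed patterns taken from the examples in the text; "X" marks the entity slot.
SEED_PATTERNS = {
    "thinking of getting X": "transactional",
    "like my new X": "information sharing",
    "what do you think about X": "information seeking",
}


def detect_intent(post, entity, patterns=SEED_PATTERNS):
    """Return the intent of the first seed pattern found around `entity`, else None."""
    # Lowercase the post, then put an uppercase slot marker where the entity was,
    # so the marker cannot collide with any other (lowercased) word.
    slotted = post.lower().replace(entity.lower(), " X ")
    # Drop punctuation and normalise whitespace.
    normalised = " ".join(re.findall(r"[A-Za-z0-9']+", slotted))
    for pattern, intent in patterns.items():
        if pattern in normalised:
            return intent
    return None


intent = detect_intent("I am thinking of getting Sidekick 3 soon", "Sidekick 3")
```

Substring matching is deliberately naive here; a fuller version would match on token boundaries and score candidate patterns rather than returning the first hit.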

As part of a targeted content delivery application [WI09], we found that the new learned patterns were effective in identifying monetizable posts, i.e., those with information seeking and transactional intents. We also found that users were 8 times more likely to click on ads that were generated from their monetizable footprints left on the network (in wall posts, forum messages) than those generated from their profile information (hobbies, activities etc.). An example below shows a user post and contextual ads generated for it using Google AdSense when content has off-topic keywords (upper half). The bottom half shows ads after off-topic keywords have been eliminated.

Figure 1. Difference in the targeted ads after eliminating off-topic keywords

Exploration of People, Content and Network Dynamics in the Online Social Networks

Understanding information, influence or popularity propagation in online social networks is a challenging problem with several contributing factors such as the timeliness of the information, the participants responsible for sharing the information, the structure of the network etc. We believe that there is a three-dimensional dynamic at play in how information propagates through a network - the people involved (passionate advocate or an objective observer), the content being propagated (fact-sharing or emotionally charged) and the connections between the people, all play a role in how information spreads. Understanding these micro-level variables and their interactions will shed light on macro-level consequences, e.g., political decisions or consumer behaviors.

In a recent study of tweeting practices [ICWSM10], we analyzed viral content on Twitter, i.e., tweets that were passed around or retweeted the most in a given period of time.

We focused on topical tweets generated by communities that gather on Twitter around real-world events. Our study includes popular events of the year 2009: the Iran Election, the Health Care Reform debate and the International Semantic Web Conference (ISWC). These events were selected to represent varied characteristics in terms of social significance, attracting different populations, spanning different time periods and lengths of time, and representing a good variety of Twitter activity.

We analyzed the 250 most viral tweets, which were found based on n-gram analysis of the tweet content. For each tweet, we gathered all instances of the tweet in the dataset and plotted the retweet connections between the authors of the tweets as a directed network. For example, if user A retweets user B, an edge is drawn from node B to node A, indicating the direction of information diffusion. We found two clear patterns in analyzing the networks.

  1. A certain class of tweets, which we classified as making a call for action, crowd-sourcing or collective group identity-making, generated sparse retweet graphs. In other words, although the content was being retweeted, author attribution was largely absent. Figure 2 (A1) shows one such example for a "call for action" tweet: Join @MarkUdall @RitterForCO and @BennetForCO to support an up-or-down vote on the public option... The corresponding follower graph (A2), in which authors are connected by follower links, was however well-connected, implying that people did (potentially) see these tweets from their network but did not feel compelled to credit the sender or the original author.
  2. In contrast to the above, we found that tweets that shared information or a fact generated a denser retweet/attribution network. Figure 2 (B1) shows an example of an information sharing tweet: Iran Election Crisis: 10 Incredible YouTube Videos..

Figure 2. Graphs showing sparse (A) and dense (B) RT networks and their corresponding follower graphs for 'call for action' and 'information sharing' types of tweets
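The retweet network construction described above can be sketched as follows, using the stated edge convention (if A retweets B, the edge runs from B to A). The density measure is an assumption added here to make "sparse" and "dense" concrete; the text does not say how sparseness was quantified. The names `build_retweet_graph` and `density` are hypothetical.

```python
def build_retweet_graph(retweets):
    """Build a directed edge set from (retweeter, credited_author) pairs.

    If user A retweets user B, the edge runs from B to A, i.e. in the
    direction of information diffusion. A credited_author of None models
    a copy of the tweet that carries no attribution, so it adds no edge.
    """
    edges = set()
    for retweeter, author in retweets:
        if author is not None and author != retweeter:
            edges.add((author, retweeter))
    return edges


def density(edges, nodes):
    """Fraction of possible directed edges that are present (assumed measure)."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return len(edges) / (n * (n - 1))


users = {"a", "b", "c", "d"}
# Toy data: three attributed retweets and one unattributed copy.
observed = [("a", "b"), ("c", "b"), ("d", "a"), ("c", None)]
graph = build_retweet_graph(observed)
graph_density = density(graph, users)
```

Passing the full set of posting users to `density` keeps authors who reposted without attribution in the node count, which is what makes an unattributed cascade show up as a sparse graph.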

These patterns were consistent across the events, even though the events varied in the populations they attracted and the goals of their communities. This categorization is certainly not exhaustive but suggests an important finding: the content being tweeted plays a key role in what an explicit retweet network will look like and, in many cases, whether it will be traceable at all. We believe that the suggestive relationship between the tweet type and its retweet pattern will contribute to the study of link-based diffusion models.

In another study, we explored the dynamics of user engagement in social media [SoME'11,WWW'11]. The shift in information consumption on today's social media demands effective methods to understand the new forms of user engagement, the factors impacting them, and the fundamental reasons for such engagement. We performed exploratory analysis on Twitter to understand the dynamics of user engagement by studying what attracts a user to participate in discussions on a topic. We identified various factors that might affect user engagement, ranging from content properties and network topology to user characteristics on the social network, and used them to predict user joining behavior. Rather than studying these factors separately, as is traditional, we organized them in our framework, People-Content-Network Analysis (PCNA), which is designed to enable understanding of human social dynamics on the Web. We performed experiments on various Twitter user communities formed around topics from diverse domains, with varied social significance, duration and spread. Our findings suggest that the capabilities of content, user and network features vary greatly, which motivates incorporating all of these factors in user engagement analysis through the PCNA framework. Our study also reveals a certain correlation between the types of events behind discussion topics and the impact of user engagement factors.

Assisting Crisis Response Coordination by Identifying and Matching Requests and Offers of Needs on Social Media

Disaster-affected communities are increasingly turning to social media for communication and coordination. This includes reports on needs (demands) and offers (supplies) of resources required during emergency situations. Identifying and matching such requests with potential responders can substantially accelerate emergency relief efforts. The current work of disaster management agencies is labor intensive, and there is substantial interest in automated tools that assist response coordination and enhance situational awareness by mining this new source of information, social sensing. We exploit psycholinguistic cues to examine whether language usage on online social media has conversational properties similar to those of face-to-face communication. By studying Twitter datasets of different events, we observed that linguistic features studied in the past for offline behavior can also determine the conversational nature of online social media. We create classifiers using those linguistic features to filter coordination-assisting data out of massive social media streams [CHB'2013].

Figure 3. Matching requests and offers by mining messages in the ad-hoc social media communities to assist crisis response coordination

Going further in depth by focusing on functions that assist coordination, we created machine-learning methods to automatically identify and match needs and offers communicated via social media for items and services such as shelter, money, clothing, etc. For instance, a message such as 'we are coordinating a clothing/food drive for families affected by Hurricane Sandy. If you would like to donate, DM us' can be matched with a message such as 'I got a bunch of clothes I'd like to donate to hurricane sandy victims. Anyone know where/how I can do that?' Compared to traditional search, our results can significantly improve the matchmaking efforts of disaster response agencies [FM'2014]. This work involves collaboration with social scientists, Profs. Valerie Shalin and John Flach at WSU, and our international collaborators Drs. Patrick Meier and Carlos Castillo at QCRI in the humanitarian computing space.
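To make the matching idea concrete, the sketch below pairs the two example messages from the text by the resource type they mention. It is a keyword-based stand-in, not the machine-learning method described above; `RESOURCE_KEYWORDS`, `resources_in` and `match_requests_to_offers` are hypothetical names, and the request/offer split is assumed to have been done beforehand.

```python
import re

# Hypothetical keyword lexicon for a few resource types named in the text.
RESOURCE_KEYWORDS = {
    "clothing": {"clothes", "clothing"},
    "food": {"food", "meals"},
    "shelter": {"shelter", "housing"},
    "money": {"money", "cash", "funds"},
}


def resources_in(message):
    """Return the set of resource types whose keywords appear in the message."""
    tokens = set(re.findall(r"[a-z']+", message.lower()))
    return {name for name, words in RESOURCE_KEYWORDS.items() if tokens & words}


def match_requests_to_offers(requests, offers):
    """Pair each request with every offer that shares at least one resource type."""
    pairs = []
    for request in requests:
        needed = resources_in(request)
        for offer in offers:
            shared = needed & resources_in(offer)
            if shared:
                pairs.append((request, offer, sorted(shared)))
    return pairs


request = ("we are coordinating a clothing/food drive for families affected by "
           "Hurricane Sandy. If you would like to donate, DM us")
offer = ("I got a bunch of clothes I'd like to donate to hurricane sandy victims. "
         "Anyone know where/how I can do that?")
matches = match_requests_to_offers([request], [offer])
```

A learned classifier would replace both the keyword lookup and the prior request/offer labelling; the pairing step on shared resource types is the part this sketch is meant to show.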

A quick summary of our analysis frameworks and systems is available here

Trust Networks: Interpersonal, Social and Sensor

Trust relationships occur naturally in many diverse contexts such as ecommerce, social interactions, social networks, ad hoc mobile networks, distributed systems, decision-support systems, (semantic) sensor web, emergency response scenarios, etc. As the connections and interactions between humans and/or machines (collectively called agents) evolve, and as the agents providing content and services become increasingly removed from the agents that consume them, miscreants attempt to corrupt, subvert or attack existing infrastructure. This in turn calls for support for robust trust inference (e.g., gleaning, aggregation, propagation) and update (also called trust management). Unfortunately, there is neither a universal notion of trust that is applicable to all domains nor a clear explication of its semantics in many situations. Because Web, social networking and sensor information often provide complementary and overlapping information about an activity or event that are critical for overall situational awareness, there is a unique need for developing an understanding of and techniques for managing trust that span all these information channels. Currently, we are pursuing research on trust and trustworthiness issues in interpersonal, social, and sensor networks, to potentially unify and integrate them for exploiting their complementary strengths.

Mining Opinion and Emotion from Social Media

Social media serves as a platform for people to speak their minds more freely, which leads to a growing volume of opinionated data that can be used by: (1) individuals, as suggestions and recommendations; (2) companies, for making marketing strategies and other decisions; and (3) governments, for monitoring social phenomena, staying aware of potentially dangerous situations, etc.

As a popular social media service, Twitter provides a convenient and instant way for people to share their opinions through tweets anytime, anywhere. To make better use of such information, there should be methods for automatically identifying opinionated tweets with respect to specific topics and classifying them according to their sentiment polarity. We addressed the problem of topical sentiment analysis of tweets and proposed a novel approach to construct a topic- and context-aware sentiment lexicon for this task [ICWSM12]. The proposed technique has been evaluated on multiple domains, and the results show that it outperforms several baseline methods significantly.

In recent years, there has been a surge of interest in building systems that harness the power of social data to understand public opinions and predict what is about to happen. We studied the spectrum of Twitter users who participated in the online discussion of elections, and examined the predictive power of different user groups [SocInfo12]. We presented a method to predict the "vote" of a user based on target-specific sentiment analysis of his/her tweets. The study showed that Twitter users were not equal in predicting elections, and demonstrated the importance of identifying likely voters and user sampling in electoral predictions.

Besides looking into people's opinions about things, we also looked into people's emotional states. We explored how to detect the emotions that people express in sentences from suicide notes [BII12]. We designed a hybrid system consisting of machine learning and rule-based classifiers. For the machine learning classifier, we investigated a variety of lexical, syntactic and knowledge-based features, and showed through experiments how much these features contribute to the performance of the classifier. For the rule-based classifier, we proposed an algorithm to automatically extract effective syntactic and lexical patterns from training examples.

In the process of detecting emotions from suicide notes, we realized that our system's performance can be affected due to the relatively small size of annotated emotion dataset. To overcome this bottleneck, we have automatically created a large emotion-labeled dataset (of about 2.5 million tweets) by harnessing emotion-related hashtags available in the tweets [SocialCom12]. We have applied two different machine learning algorithms for emotion identification, to study the effectiveness of various feature combinations as well as the effect of the size of the training data on the emotion identification task. Our experiments demonstrate that a combination of unigrams, bigrams, sentiment/emotion-bearing words, and parts-of-speech information is most effective for gleaning emotions.
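The hashtag-based labelling idea can be sketched as below. The hashtag-to-emotion mapping is hypothetical (the actual hashtag list is not given in the text), as are the names `label_tweet` and `ngrams`; the sketch only shows how a tweet might receive a label and how unigram and bigram features could then be derived from it.

```python
import re

# Hypothetical hashtag-to-emotion mapping; the list used in the study
# is not given in the text.
EMOTION_HASHTAGS = {
    "#happy": "joy",
    "#joy": "joy",
    "#sad": "sadness",
    "#angry": "anger",
}


def label_tweet(tweet):
    """Return (text_without_emotion_hashtags, label), or None if the tweet
    carries no emotion hashtag or hashtags for more than one emotion."""
    hashtags = re.findall(r"#\w+", tweet.lower())
    labels = {EMOTION_HASHTAGS[h] for h in hashtags if h in EMOTION_HASHTAGS}
    if len(labels) != 1:
        return None
    # Remove the label-bearing hashtags so a classifier cannot simply read them.
    words = [w for w in tweet.split()
             if w.lower().rstrip(".,!?") not in EMOTION_HASHTAGS]
    return " ".join(words), labels.pop()


def ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


labeled = label_tweet("Finally got the job #happy")
features = ngrams(labeled[0], 1) + ngrams(labeled[0], 2)
```

Discarding tweets with conflicting emotion hashtags is a design choice made here to keep the automatically derived labels unambiguous.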

TWITRIS: A System for Mining Collective Intelligence from Citizen-Sensor Data

Check out our Twitris Wiki page and a short summary of Twitris' Capabilities. Twitris is a Semantic Web application that facilitates understanding of social perceptions by semantics-based processing of massive amounts of event-centric data. It addresses challenges in large-scale processing of social data, preserving spatio-temporal-thematic properties. Twitris 2.0 also covers context-based semantic integration of multiple Web resources and exposes semantically enriched social data to the public domain. Semantic Web technologies enable the system's integration and analysis abilities.

Why Twitris?

The emergence of microblogging platforms such as Twitter, FriendFeed, etc. has revolutionized how unfiltered, real-time information is disseminated and consumed by citizens. Twitter has therefore emerged as the preeminent medium for sharing citizen-sensor observations, as was demonstrated in a variety of situations ranging from the Mumbai terrorist attack to the Iran elections.

While the decentralized information diffusion model offered by Twitter has gained momentum and has created avenues for experiential data sharing, millions of observations, shared through tweets, create significant information overload. In many cases it becomes nearly impossible to make sense of the information around a topic of interest. This problem is further compounded by the fact that tweets increasingly integrate other social networking sites (Flickr, Twitpic) and general Web content (news, Wikipedia, blogs) through embedded links and metadata. Given this data deluge, it can be extremely challenging to analyze the numerous social signals carried by tweets and associated content to find out what is being said about an event (theme), where (spatial) and when (temporal), how key concerns (topics of discussion) are changing over a period of time, and whether there are regional differences in the opinions on a given topic.

What is Twitris?

In response to this growing data deluge, we have developed Twitris (currently Twitris 3.0) with the vision of performing semantics-empowered analysis of a broad variety of social media content. Specifically, Twitris aims to capture semantics (i.e., meaning and understanding) with spatial, temporal and thematic dimensions, user intentions and sentiments, networking behavior (user interaction patterns and features such as information diffusion and centrality) and other information present in social media. Semantic Web technologies enable its core integration, analysis and data/knowledge sharing abilities. Twitris focuses on people-content-network centric analysis, leveraging the relevant Semantic Web technologies, background knowledge, languages and tools where appropriate.

Twitris is a Semantic Social Web approach to detect social signals by analyzing massive, event-centric data through:

  1. Analysis of casual text with spatio-temporal-thematic (STT) bias, to extract event descriptors.
  2. Capturing semantics from contexts associated with tweets.
  3. Use of deep semantics (using automatically created domain models) to understand the meaning of standard event descriptors.
  4. Use of shallow semantics (semantically annotated entities) for knowledge discovery and representation.
  5. Exposure of processed social data to the public domain, complying with semantic Web standards.
  6. Semantic Integration of multiple external Web resources (news, articles, images and videos) utilizing the semantic similarity between contexts.
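The first item in the list above, extracting event descriptors with a spatio-temporal-thematic bias, might be approximated as in the sketch below: tweets are grouped by place and day, and the most frequent n-grams in each group serve as descriptors. This is a simplified assumption about the approach, not Twitris' actual algorithm; `top_descriptors` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def top_descriptors(tweets, n=2, k=3):
    """Group (place, day, text) records by (place, day) and return the
    k most frequent word n-grams in each group."""
    grouped = defaultdict(Counter)
    for place, day, text in tweets:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            grouped[(place, day)][" ".join(tokens[i:i + n])] += 1
    return {
        key: [gram for gram, _ in counts.most_common(k)]
        for key, counts in grouped.items()
    }


# Toy records, invented for illustration only.
records = [
    ("place-1", "day-1", "iran election results"),
    ("place-1", "day-1", "iran election protest"),
    ("place-2", "day-1", "health care debate"),
]
descriptors = top_descriptors(records, n=2, k=1)
```

Keying the groups on both place and day is what lets the same theme yield different descriptors across regions and over time.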

Twitris has been developed as a multi-layered system where each component acts as part of a pipeline. Here is a Functional Overview of Twitris.

The system is currently being used for a number of People-Content-Network study experiments and is being extended to integrate with SMS and other Web data used by a number of widely deployed open source projects. These include applications used by non-governmental organizations (NGOs) in developing countries for crisis management. Twitris is also being extended with Twarql technology for limited real-time support and is being adapted for a cloud platform for much higher scalability.

TWITRIS is part of a larger research agenda on Computing for Human Experience at the Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis) at Wright State University, Dayton, Ohio (other key themes include semantics-enriched services computing and the sensor Web). Read more about Twitris.

Complementary projects: Twarql, Semantic Sensor Web


Research @Meena

Additional collaborators

Projects supported in part by:

  • Microsoft's "Beyond Search - Semantic Computing and Internet Economics" 2008 Award.
  • IBM's 2007 UIMA Innovation Grant - "UIMA-based Infrastructure for Summarizing Casual, Unstructured Text"
  • NSF Semdis and STT.
  • Trusted Semantic Sensor Web.
  • Interdisciplinary NSF project: SoCS: Social Media Enhanced Organizational Sensemaking in Emergency Response (Read more).