Analysis of Content from Social Media

Popular Web 2.0 technologies and social media software, such as tagging, blogging, bookmarking, social networking, and image- and video-sharing sites, have allowed people to consume, produce, and share information easily, making this new class of user-generated content one of the richest forms of data available on the Web today.

On one hand, the social context surrounding the production, consumption, and sharing of user-generated content has opened several opportunities for enriching user interaction with content. On the other hand, this same social aspect of content production has introduced new challenges that stem from the content's informal nature.

User-generated textual content in social media has unique characteristics that set it apart from the traditional content we find in news or scientific articles. Due to social media's personal and interactive communication format, user-generated content is inherently less formal and unmediated. Off-topic discussions are common, making it difficult to automatically identify context. Content is often fragmented, doesn't always follow English grammar rules, and relies heavily on domain- or demographic-specific slang, abbreviations, and entity variations (using "skik3" for "SideKick 3", for example). Some user-generated content is also terse by nature, such as in Twitter posts, which leaves minimal clues for automatically identifying context. All of these factors make the process of automatically identifying what a social media snippet is actually about much harder.

Our work in this space spans three closely aligned projects:

1. Semantics Empowered Social Computing

2. Targeted Content Delivery for Social Media Content

3. Monetizing User Activity on Social Networks

Semantics Empowered Social Computing +

The larger scope of this project is to supplement traditional statistical and natural language processing (NLP) techniques with contextual information that is intrinsic and extrinsic to the text being analyzed. This contextual information could be derived from similar documents in a corpus, user-assigned tags, folksonomies, taxonomies, or even a full-fledged ontology.

In a recent article, Semantics-Empowered Social Computing [2], we examined the role of background domain knowledge in the analysis of informal text and the newer breed of contextually aware applications it will empower.

We also implemented a content-analysis system that mined music-artist popularity from user comments on MySpace artist pages. We designed an artist and music annotator to spot artists, albums, tracks, and other music-related mentions (such as labels, tours, shows, and concerts) in user posts, and a sentiment annotator to detect sentiment expressions and measure their polarities [5].

We backed the artist and music annotator with MusicBrainz, a knowledge base of musical artists, genres, albums, and tracks. The annotator compared artist or track mentions in user comments against artist entries and associated track entries in the knowledge base to gain more context. In addition to this, the annotator used results of a syntactic parse of the comment and corpus statistics to annotate a track or artist mention. The sentiment annotator used a syntactic parse of comments to extract adjectives and verbs as potential sentiment expressions. It then consulted a slang dictionary (UrbanDictionary.com) to verify the expression's validity and ascertain polarity (positive or negative).
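The two annotators described above can be illustrated with a deliberately simplified sketch. It is not the system's implementation: the small dictionaries `KNOWLEDGE_BASE` and `SLANG_POLARITY` are hypothetical stand-ins for MusicBrainz and the slang dictionary, the function names are invented for illustration, and plain token lookup replaces the syntactic parse and corpus statistics that the real annotators used.

```python
import re

# Hypothetical stand-in for a MusicBrainz-style knowledge base:
# artist name -> set of known track titles (all lowercase).
KNOWLEDGE_BASE = {
    "madonna": {"hung up", "like a prayer"},
    "rihanna": {"umbrella"},
}

# Hypothetical stand-in for a slang dictionary: expression -> polarity.
SLANG_POLARITY = {"sick": 1, "dope": 1, "wack": -1, "lame": -1}


def spot_mentions(comment, page_artist):
    """Return the known tracks of page_artist that the comment mentions.

    Restricting the lookup to the artist whose page the comment was
    posted on is a simple way of using the knowledge base for context.
    """
    text = comment.lower()
    tracks = KNOWLEDGE_BASE.get(page_artist.lower(), set())
    return sorted(
        t for t in tracks
        if re.search(r"\b" + re.escape(t) + r"\b", text)
    )


def comment_polarity(comment):
    """Sum the polarities of known slang expressions in the comment."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    return sum(SLANG_POLARITY.get(tok, 0) for tok in tokens)


if __name__ == "__main__":
    comment = "Umbrella is sick!!"
    print(spot_mentions(comment, "Rihanna"))
    print(comment_polarity(comment))
```

In this sketch the slang lookup is what lets "sick" count as a positive expression, which a general-purpose sentiment lexicon would likely score as negative.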

For both annotators, the combination of techniques proved to be more useful than using techniques in isolation. We aggregated positive and negative sentiments for all artists to generate a ranked list of the top X artists ordered by the number of positive sentiment comments. By observing popularity trends over time and the patterns that stand out in the user activity of such online communities, we were also able to forecast what was going to be popular tomorrow.
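The aggregation step can be sketched in a few lines. This is a minimal illustration under assumptions: the input format (one `(artist, polarity)` pair per annotated comment) and the function name `rank_artists` are hypothetical, and only the ranking criterion, the number of positive-sentiment comments, comes from the description above.

```python
from collections import Counter


def rank_artists(annotated_comments, top_x):
    """Rank artists by their number of positive-sentiment comments.

    annotated_comments: iterable of (artist, polarity) pairs, where a
    polarity greater than zero marks a positive comment.
    Returns a list of (artist, positive_count) pairs, best first.
    """
    positive = Counter(
        artist for artist, polarity in annotated_comments if polarity > 0
    )
    return positive.most_common(top_x)


if __name__ == "__main__":
    comments = [
        ("Artist A", 1), ("Artist B", 1), ("Artist A", 1),
        ("Artist C", -1), ("Artist B", -1), ("Artist A", -1),
    ]
    print(rank_artists(comments, top_x=2))
```

Running the same aggregation over successive time windows would give the popularity trends mentioned above; the forecasting step itself is not sketched here.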

Targeted Content Delivery for Social Media Content *

Content delivery is the task of complementing content that a user is viewing (Web search results or a Web page) with related content such as advertisements, similar articles, RSS feeds, images, and tags. Suggested content is pushed to a user because it is deemed relevant to the content the user is viewing, with the goal of minimizing the user's information-seeking effort. Typically, content delivery involves spotting keywords in the content being consumed and matching those with keywords in the content being delivered. More sophisticated techniques augment spotted keywords with synonyms or category-level metadata to deliver additional content.

Given the interactional purpose of communication in social media, fragmented sentences, misspellings, and entity variations are commonplace. Users are also typically sharing an experience, which results in the main message being overloaded with off-topic content. These characteristics, more prevalent in social media than elsewhere on the Web, reduce the accuracy of identifying contextual keywords, i.e., keywords that are relevant to the main discussion. This in turn affects content suggestions that are matched against identified keywords. Poor suggestions impair user experience, are intrusive, and over time reduce user attention.

The contribution of this project is a simple yet effective algorithm to accurately identify contextual keywords, i.e., keywords that are relevant to the main discussion in the content a user is viewing, and eliminate off-topic keywords [3]. The goal was to assist content delivery systems in generating more relevant or targeted content suggestions. The algorithm is based on well-founded principles of information theory and is applied after keywords have been identified in content and before suggestions are made. The example figure shows a user post with the contextual ads that Google AdSense generated for it when the content contained off-topic keywords (upper half) and after the off-topic keywords were eliminated (bottom half).

Using Google AdSense for content delivery, we evaluated the targeted nature of content suggestions with and without using our algorithm. According to user evaluations over 57 posts from MySpace forums, our algorithm resulted in 22% more targeted content suggestions.
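The position of the off-topic elimination step, after keyword spotting and before suggestion, can be illustrated with a small sketch. The algorithm itself is not specified on this page beyond being grounded in information theory, so the measure used here, pointwise mutual information (PMI) computed over a background collection, is an assumption chosen for illustration and may differ from the method in the report. The names `pmi`, `contextual_keywords`, and `threshold` are hypothetical.

```python
import math


def pmi(a, b, docs):
    """Pointwise mutual information of terms a and b.

    docs is a list of term sets acting as a background collection.
    Returns -inf when the two terms never co-occur.
    """
    n = len(docs)
    count_a = sum(1 for d in docs if a in d)
    count_b = sum(1 for d in docs if b in d)
    count_ab = sum(1 for d in docs if a in d and b in d)
    if count_ab == 0:
        return float("-inf")
    # log2( P(a, b) / (P(a) * P(b)) ), with the 1/n factors simplified.
    return math.log2((count_ab * n) / (count_a * count_b))


def contextual_keywords(keywords, docs, threshold=0.0):
    """Keep keywords that are strongly associated with at least one
    other spotted keyword; the rest are treated as off-topic."""
    kept = []
    for k in keywords:
        scores = [pmi(k, other, docs) for other in keywords if other != k]
        if scores and max(scores) > threshold:
            kept.append(k)
    return kept


if __name__ == "__main__":
    background = [
        {"camera", "lens", "zoom"},
        {"camera", "lens"},
        {"camera", "zoom"},
        {"pizza", "cheese"},
        {"pizza"},
        {"lens", "zoom"},
    ]
    spotted = ["camera", "lens", "pizza"]
    print(contextual_keywords(spotted, background))
```

In this toy collection "pizza" never co-occurs with the other spotted keywords, so it is dropped before any ads would be requested.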

Monetizing User Activity on Social Networks *

Current advertising approaches to monetizing user content on social networks are profile-based contextual advertisements, demographic-based ads, or a combination of the two. Content-based or contextual ads are generated by automatically finding relevant keywords on a network page and displaying ads based on those keywords. Demographic-based ads target an individual by age, gender, and location. Content-based ad delivery was made popular on the Web, where ads matched the content that a user was viewing on a web page. Not surprisingly, this model was a good contender for social networking sites (SNSs), where ads need to be highly targeted to the content in order to trump the value of networking. However, the utility of the ad models proposed to date on SNSs is not yet apparent to their members.

Besides issues of trust, privacy and scattered user attention on SNSs, the content that is being exploited for ad generation is also an important point of concern. Content-based advertising on SNSs uses information on member profiles such as interests and activities for delivering ads. While profile information might be useful for launching product campaigns and micro-targeting customers, it does not necessarily contain current user interests or purchase intents. Ads generated from such content are inherently less relevant to a user. Over time, this leads to a scenario where ad campaigns see several ad impressions but very few click-throughs. In this work, we posit that in addition to using profile information, ad programs should generate profile ads (ads shown on a user profile) from user activity on public venues (such as forums, marketplaces and groups) on SNSs.

Monetization of such user content, however, is possible only when ads directly cater to a user's expressed needs. This entails understanding what the user is talking about (extracting key words and phrases) and what the intents behind the post are (looking for a product vs. sharing an opinion). In this project, we developed a content analysis system that

  • identifies monetizable user activity or posts (those that contain explicit information-seeking or transactional intents) for profile ad generation
  • eliminates off-topic noise in these user posts, so that only the most relevant keywords are used for generating ads.
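The first of these two steps, selecting posts with explicit information-seeking or transactional intent, might be approximated as follows. This is only a rough sketch: the cue phrases in `INTENT_PATTERNS` and the function names are hypothetical, and the page does not describe which features the actual system uses to recognize intent.

```python
import re

# Hypothetical cue phrases for the two intent types named above.
INTENT_PATTERNS = {
    "information-seeking": re.compile(
        r"\b(where can i|does anyone know|how do i|"
        r"any (suggestions|recommendations))\b"
    ),
    "transactional": re.compile(
        r"\b(looking (for|to buy)|want to (buy|sell)|for sale|selling)\b"
    ),
}


def detect_intents(post):
    """Return the sorted names of the intents whose cues appear in the post."""
    text = post.lower()
    return sorted(
        name for name, pattern in INTENT_PATTERNS.items()
        if pattern.search(text)
    )


def is_monetizable(post):
    """Treat a post as monetizable if it shows at least one explicit intent."""
    return bool(detect_intents(post))


if __name__ == "__main__":
    for post in [
        "Looking for a cheap laptop, any suggestions?",
        "I love this band so much",
    ]:
        print(post, "->", detect_intents(post), is_monetizable(post))
```

Posts that pass such a filter would then go through off-topic keyword elimination before ads are generated from the remaining keywords.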

Preliminary user studies suggest two important results. First, user activity outside a user's profile is more representative of the user's current interests and is monetizable. Second, eliminating off-topic noise in user activity generates more targeted ads than using the content as is [4].

Publications

1. D. Gruhl, M. Nagarajan, J. Pieper, C. Robson, and A. Sheth, Multimodal Social Intelligence in a Real-Time Dashboard System, Kno.e.sis Technical Report (submitted and under revision for the VLDB Journal), August 2009.

2. A. Sheth and M. Nagarajan, Semantics-Empowered Social Computing, IEEE Internet Computing, Jan/Feb 2009, 76-80.

3. M. Nagarajan et al., Targeted Content Delivery for Social Media Content, Tech. Report, Kno.e.sis Ctr., Wright State Univ., 2008.

4. M. Nagarajan et al., Monetizing User Activity on Social Networks, Tech. Report, Kno.e.sis Ctr., Wright State Univ., 2008.

5. J. Grace et al., Artist Ranking through Analysis of Online Community Contents, Tech. Report, Kno.e.sis Ctr., Wright State Univ., 2007.

* Supported in part by Microsoft's "Beyond Search - Semantic Computing and Internet Economics" 2008 Award

+ Supported in part by IBM's 2007 UIMA Innovation Grant - "UIMA-based Infrastructure for Summarizing Casual, Unstructured Text"