
CHAPTER 1

INTRODUCTION
As social media and the Web environment continue to
grow and evolve, with a large number of users joining social
networking sites, the Web has become a medium where millions
of users share their views about various domains. Content
analysis is one of the most powerful methodologies for
examining such content. Today, a variety of content
analysis measurement techniques are available to support
content analysis tasks within different disciplinary contexts in
distinct ways. Communication researchers have heavily relied
on both human- and computer-based content analyses to
examine symbols of communication and to make valid
inferences about communication. However, the trade-offs
between human and computer coding in terms of validity,
reliability, and large-scale data processing often mar
researchers' ability to identify characteristics within the text
and draw inferences from mediated messages. Human coding
methods maximize the validity of measurement, but are often
limited in their ability to deal with large databases. Computer-
coding methods maximize reliability and can efficiently deal
with large collections of data, but have traditionally been
criticized in terms of their ability to understand the subtle latent
meanings of opinion expression in the same way as human
coders. This inherent tension between human- and computer-
based coding is nothing new, but is only beginning to be
adequately addressed by communication scholars. The main
purpose of this study is to argue the need for communication
scholars to rely on emerging tools for content analysis that
capitalize on the strengths of human coding and intelligent
algorithms. As one exemplar, we use a supervised, learning-based
hybrid method developed by Hopkins and King (2010) (referred
to henceforth as HK), one of several similar software programs
that are currently available. The hybrid approach advocated for
here preserves semantic validity, a strength of human-based
coding, while also being applicable to large quantities of data.
Using nuclear power and nanotechnology as
our focal issues, we tracked opinion expression on Twitter
before and after the Fukushima Daiichi disaster using the HK
content analysis method. The demonstrated hybrid method is a
supervised machine learning technique relying on a software-
based algorithm. This study discusses the various advantages
of the hybrid supervised learning technique as compared with
traditional content analysis tools.
CHAPTER 2
METHODOLOGICAL ADVANCEMENTS
In the Web 2.0 environment, researchers are now able to
retrieve a seemingly infinite amount of digital content for
analysis. Not surprisingly, this has resulted in increased interest
in sentiment analysis or opinion mining, a specific form of
content analysis that identifies how sentiments, opinions, and
emotions are expressed about a given subject in text-based
documents, such as social media messages. Importantly, the
creation of data in the new media environment has greatly
outpaced the capacity of conventional sentiment content
analysis approaches. Here, we first review the key characteristics
and challenges of human- and computer-based content analysis
before introducing why combining the merits of these two
approaches is a necessity in the Web 2.0 environment. The
comparisons will focus on three key areas: reliability, validity,
and efficiency.

CHAPTER 3
Comparisons of human- and computer-
based content analysis
3.1 Reliability

Reliability, defined as agreement among coders (or within a single
coder over time) when classifying content, is crucial to content
analysis. A high level of reliability can serve to minimize human
biases and maximize the reproducibility of findings, while
inconsistencies in how a single coder or a group of coders
categorize content can hurt the reproducibility of coding
results. Achieving high reliability in human-coded content
analysis is often challenging, especially when analyzing large
volumes of data, as it increases the likelihood that coders will
make mistakes. Moreover, when relying on the subjective
judgments of human coders, achieving perfect reliability is
almost impossible. Unlike their human-coded counterparts,
computer-based content analyses are deterministic, which
eliminates many of the potential uncertainties outlined above
and, more importantly, promises perfect reliability. The high reliability
that computer analysis offers, however, comes at a cost as
most computational algorithms are incapable of reading data
precisely the same way humans do. As a result, validity, the
second area of emphasis, is often impaired or questioned.
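To make the notion of intercoder agreement concrete, the following is a minimal sketch, with invented labels from two hypothetical coders, that computes simple percent agreement and the chance-corrected Cohen's kappa using scikit-learn; other coefficients, such as Krippendorff's alpha, could be substituted.

```python
# Minimal sketch of intercoder reliability: percent agreement and Cohen's kappa
# for two hypothetical coders labeling the same ten messages.
from sklearn.metrics import cohen_kappa_score

coder_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
coder_b = ["pos", "neg", "pos", "pos", "neg", "pos", "neu", "neu", "pos", "neu"]

agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
kappa = cohen_kappa_score(coder_a, coder_b)  # chance-corrected agreement

print(f"Percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```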

3.2 Validity

Validity is defined as the extent to which the coding
adequately reflects the real meaning of the concepts being
measured. Manifest content analyses focus on content that is
generally easy for coders to recognize and classify (for
example, simple keyword counts) and typically results in high
levels of reliability. Latent content analyses, on the other hand,
refer to the practice of coding for the underlying meanings
embedded in content, usually a pattern or theme that must be
interpreted by the analysts. Some scholars have argued that
the coding of textual data should be restricted to manifest
content because latent meaning analysis requires too much
interpretation of the data by coders, which can hamper
reliability. However, manifest content analyses, such as word
counts and character string recognition, are often too simplistic
to be of interest to researchers (Riffe et al., 2005). In practice,
the choice between manifest and latent content analysis is
dependent not only on the goals of the researcher, but also on
the inherent nature of human- and computer-coding methods.
Human coders use assertions, ranging from a single word to a
phrase or series of sentences, as the unit of analysis. Unlike
computational programs, human coders are proficient in
understanding complex expressions and in interpreting the
context of messages. Therefore, people are often viewed as
superior to computers for coding latent content. On the other
hand, computer-based content analysis programs typically
recognize only words and character strings as opposed to the
underlying meanings of those words and characters. The
simplest and most common form of computer-based content
analysis measures the number of times a list of keywords
representing a specific category or theme occurs in a collection
of texts (Morris, 1994). Unfortunately, the true meaning of
those words is often lost when the words are removed from
their surrounding context within the text. Compared to human-
based coding techniques, which can uncover the nuanced
meaning of words, computerized content analyses oftentimes
struggle to reveal the true underlying meaning of a text.

3.3 Efficiency

Reliance on the assistance of computer programs in conducting
content analysis has become a trend due to recent
technological advances. Most importantly, computational
programs are capable of analyzing large volumes of data with
astonishing speed. In Web 2.0 environments, seemingly endless
amounts of textual data are produced in digital form each day,
providing social scientists a wealth of data to work with.
Computational software not only assists in the efficient
processing of digital text, but also allows researchers to tap into
online databases and quickly process messages across
numerous online media platforms. Compared with computer-
assisted coding, human coding is not only labor-intensive but
also error-prone as coders' ability to concentrate throughout
long and oftentimes tedious coding sessions decreases over
time. As a result, hand-coding of large volumes of unstructured
textual content has been considered almost infeasible.

CHAPTER 4
CONVENTIONAL METHODS
4.1 Dictionary Approach

In the current big-data environment, computer-aided content
analysis techniques have become increasingly attractive to
researchers. One common traditional method of computer-
aided content analysis is the dictionary approach (Krippendorff,
2012). This approach relies on computational programs assigning a
list of search keywords to groups according to not only
predefined standard dictionaries, but also a self-defined
categorization system (Riffe et al., 2005). Dictionary-based
methods have been called theory-driven approaches to content
analysis as categories are constructed on the basis of
theoretical expectations originating outside of the texts
themselves (Simon & Xenos, 2004). Several dictionary-based
software programs, such as TextPack, WordStat, and VBPro,
allow researchers to create customized categories that can be
used either in conjunction with, or independently of, predefined
categories to facilitate greater precision in the application of
computer coding when answering context-specific research
questions (see Krippendorff, 2012; Matthes & Kohring, 2008;
Short, Broberg, Cogliser, & Brigham, 2010). Through this
approach human coders can manually create a set of
sophisticated syntactical rules that can be used by the
computational software to capture the underlying meaning of
texts. However, as each word combination can have only one
defined meaning for use across all texts, these manually
created word indices are often inadequate and limited (Matthes
& Kohring, 2008).
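To illustrate the dictionary approach in its simplest form, the sketch below counts keyword matches against self-defined category lists; the categories and keyword lists are invented for illustration and do not reproduce any of the programs cited above.

```python
# Minimal sketch of dictionary-based categorization: count how often keywords
# from self-defined category dictionaries occur in each text.
# The categories and keyword lists here are hypothetical.
import re
from collections import Counter

DICTIONARY = {
    "health": {"radiation", "cancer", "thyroid", "leukemia"},
    "food": {"groceries", "food", "contamination"},
}

def categorize(text: str) -> Counter:
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for category, keywords in DICTIONARY.items():
        counts[category] = sum(1 for tok in tokens if tok in keywords)
    return counts

print(categorize("Worried about radiation in our food and groceries after the accident"))
# Counter({'food': 2, 'health': 1})
```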

4.2 Statistical Approach

Another common method, the statistical association approach
(Krippendorff, 2012), has been developed to detect the words
that tend to occur together in specific communication texts on
the basis that certain word relationships may more precisely
represent a particular theme or frame within a text (e.g., Illia,
Sonpar, & Bauer, 2014; Miller, 1997). One such example is
semantic network analysis, which examines the relationships
and patterns among frequently occurring words in order to
identify the meaning in a text (Doerfel & Barnett, 1999).
Popular examples of semantic network analysis programs
include Catpac (for an overview, see Woelfel & Fink, 1980) and
Wordlink (for an overview, see Danowski, 1993). These software
programs possess a high level of objectivity in concept
extraction because no a priori categories are assumed by
human researchers. Rather, the program identifies clusters of
words from which meanings can be inferred. As a result, these
methods typically fall into the so-called data-driven approaches
to content analysis as researchers adopt categories as they
emerge from the texts themselves (Simon & Xenos, 2004). Of
course, such programs are not without key limitations. Notably,
these programs may treat synonymous concepts as entirely
different ones, or fail to differentiate among different uses or
connotations of the same word (Doerfel & Barnett, 1996; Züll &
Landmann, 2004). Unlike human coders who can comprehend
the nuances of language, software programs that arrive at
textual meanings from groupings of words are not perceived as
tapping the true meaning behind the relationships between
those words.
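A minimal sketch of the word co-occurrence counting that underlies such statistical association approaches is shown below; the messages are invented, and the pair counts it produces would form the edges of a simple semantic network. It is not a reimplementation of Catpac or Wordlink.

```python
# Minimal sketch of word co-occurrence counting, the basic building block of
# semantic network analysis: count how often word pairs appear in the same message.
import itertools
import re
from collections import Counter

messages = [
    "radiation levels near the plant are rising",
    "radiation in food worries parents",
    "food safety checks after the plant accident",
]

cooccurrence = Counter()
for msg in messages:
    words = sorted(set(re.findall(r"[a-z]+", msg.lower())))
    for pair in itertools.combinations(words, 2):
        cooccurrence[pair] += 1

# Frequently co-occurring pairs form the edges of the semantic network.
for pair, count in cooccurrence.most_common(5):
    print(pair, count)
```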

4.3 Lexical Approach

In response to the limitations noted above, there have been
extensive developments in novel sentiment analysis
approaches that have led to a number of new commercial and
academic content analysis software programs appearing on the
market (e.g., Annett & Kondrak, 2008; Nasukawa & Yi, 2003;
Pang & Lee, 2008). One common sentiment analysis technique,
the lexical approach, relies on dictionaries of pre-tagged words
that are used as a basis for analyzing the contents of texts.
When a dictionary word is found in the text, its polarity value
is added to a total polarity score. For instance, the word
"great" might have a positive polarity score and, therefore,
increase the total polarity score of the text in question, while
the word "terrible" might have a negative polarity score and
decrease the total score of the text (Annett & Kondrak, 2008).
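This polarity tally can be sketched in a few lines; the tiny pre-tagged lexicon below is a hypothetical stand-in for the much larger dictionaries that actual lexical tools rely on.

```python
# Minimal sketch of lexical (pre-tagged dictionary) sentiment scoring:
# each matched word adds its polarity value to the text's total score.
# The lexicon below is a tiny hypothetical stand-in for a real polarity dictionary.
import re

POLARITY = {"great": 1.0, "good": 0.5, "terrible": -1.0, "bad": -0.5}

def polarity_score(text: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(POLARITY.get(tok, 0.0) for tok in tokens)

print(polarity_score("The response was great, but the coverage was terrible"))  # 0.0
print(polarity_score("Great news, good progress"))                              # 1.5
```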
Within sentiment analysis classification, another popular
research avenue is supervised learning methods (Annett &
Kondrak, 2008; Boiy & Moens, 2009; Liu & Zhang, 2012). This
technique typically uses a set of chosen feature vectors to label
a training set, also known as a corpus. In turn, the corpus is
used to train a supervised sentiment classifier from manually
classified data, which can then be applied to large quantities of
unlabeled documents using different classification techniques
(Annett & Kondrak, 2008; Liu & Zhang, 2012). These emerging
hybrid-focused supervised learning approaches combine
computation- and human-based methods to take advantage of
human intelligence in the analysis of large-scale data sets. This
study employs just one of the automated supervised learning
techniques, the HK method. We argue that these types of
hybrid approaches to content analysis can be of great
usefulness to communication scholars.
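For contrast with the lexical approach, the sketch below trains a supervised sentiment classifier on a small hand-labeled corpus and applies it to unlabeled documents; the texts, labels, and the choice of bag-of-words features with a naive Bayes classifier are illustrative assumptions rather than the configuration used in this study.

```python
# Minimal sketch of supervised sentiment classification: train on a small
# hand-labeled corpus, then classify unlabeled documents.
# The corpus below is hypothetical and far smaller than a real training set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "nuclear power is a safe and clean energy source",
    "renewables plus nuclear can cut emissions",
    "the reactor leak is terrifying and dangerous",
    "shut down every plant before another disaster",
]
train_labels = ["support", "support", "oppose", "oppose"]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

unlabeled = ["another leak would be a disaster", "clean energy needs nuclear"]
print(classifier.predict(unlabeled))
```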

CHAPTER 5
An example of a hybrid content analysis
method: The HK method
The HK method (the specific hybrid method employed in our
study) is based on a two-step process. The first stage involves
intensive human input in reading and classifying a sample of
social media corpora that are randomly extracted on the basis
of a set of user-determined keywords. At the second stage, the
HK algorithm learns from the subsample of online texts labeled
by human coders as being representative of particular types of
opinion categories. The trained classifier is then used to derive
the aggregate distribution of classifications of all unread
documents using an automated analysis. To understand the
aggregate proportion of opinion classifications across a set of
population documents, the single documents (e.g., tweets) are
first decomposed into a set of word stems that can then be
represented by a list of stemmed unigram vectors to create
word stem profiles for each document (see Hopkins & King,
2010 for further details). Mathematically, the HK algorithm can
be expressed by the following formula:

P(D) = P(S|D)^(-1) P(S)    (1)

The goal of the HK method is to obtain an estimate of P(D),
which is the multinomial frequency distribution of opinion
across all population documents that fall into each possible
opinion category, as the collection of all possible opinion
categories among population documents is indicated by D. The
word stem files, expressed by S, consist of a set of variables
that provide a summary of all the word stems that appear in
each document. Estimation of the multinomial frequency
distribution of word terms P(S) can be directly achieved through
tabulating all the population documents. However, the
conditional probability of each word stem profile occurring in
the population in each opinion category, indicated by P(S|D),
cannot be directly observed. Consequently, the HK method
makes a key theoretical assumption that the estimation of P(S|
D) in the population is the same as the conditional frequency
distribution of word stem profiles in the hand-coded training
samples, Ph(S|D). In other words, the texts of the human-coded
training set are assumed to be homogeneous across the
population set of documents, which suggests that the
conditional distribution of P(S|D) can be estimated by referring
to the human-labeled training set of texts. We can then rewrite
Equation (1) to the following formula that uses estimates of
P(S) and Ph(S|D) to derive the target estimation of P(D):

P(D) = Ph(S|D)^(-1) P(S)    (2)

Outside of the ability to provide reliable and
efficient automated textual analysis, a major advantage of the
HK method is its ability to perform human-based
computer-aided content analysis. Human coders are better
prepared than computers to make sense of ambiguous
materials with thwarted or negated expressions, to spot irony in
posts, and to recognize alternative forms of expression (e.g.,
abbreviations and neologisms). Another major characteristic of
the HK method is that it provides population-level estimates of
the aggregate proportion of content (e.g., tweets, blogs, or
Facebook posts) within each opinion category, rather than
conducting individual-level classification of a single document.
That is, the method does not classify individual tweets or posts,
which oftentimes may express multiple sentiments, but
estimates the proportion of content across the categories of
interest.
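To make Equations (1) and (2) concrete, the following rough sketch (not the authors' software) builds word-stem indicators for a handful of invented tweets, estimates Ph(S|D) from the hand-labeled subset, and recovers the aggregate category proportions P(D) for the full set by non-negative least squares; for simplicity it summarizes documents by individual word stems rather than full word stem profiles.

```python
# Rough sketch of the aggregate-proportion logic behind the HK method.
# For simplicity, documents are summarized by individual word-stem indicators
# rather than full word-stem profiles, and the data are invented.
import re
import numpy as np
from scipy.optimize import nnls

def stem(word: str) -> str:
    # Crude stand-in for a real stemmer (e.g., Porter stemming).
    return re.sub(r"(ing|ed|s)$", "", word)

def stem_vector(text: str, vocab: list) -> np.ndarray:
    stems = {stem(w) for w in re.findall(r"[a-z]+", text.lower())}
    return np.array([1.0 if v in stems else 0.0 for v in vocab])

labeled = [("reactors are melting down, evacuate now", "oppose"),
           ("nuclear plants stay our cleanest option", "support"),
           ("radiation leaks keep spreading fear", "oppose"),
           ("modern reactors are safe and clean", "support")]
population = ["evacuate the towns near the leaking reactors",
              "clean nuclear energy beats coal",
              "radiation fear is spreading online",
              "safe modern plants can power the grid"]

vocab = sorted({stem(w) for t, _ in labeled for w in re.findall(r"[a-z]+", t.lower())})
categories = sorted({c for _, c in labeled})

# Ph(S|D): average stem indicators per hand-coded category (one column per category).
PhSD = np.column_stack([
    np.mean([stem_vector(t, vocab) for t, c in labeled if c == cat], axis=0)
    for cat in categories])
# P(S): average stem indicators over all population documents.
PS = np.mean([stem_vector(t, vocab) for t in population], axis=0)

# Solve P(S) ~ Ph(S|D) P(D) for nonnegative P(D), then normalize to proportions.
PD, _ = nnls(PhSD, PS)
PD = PD / PD.sum()
print(dict(zip(categories, PD.round(2))))
```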
CHAPTER 6
Bayesian Network and Recursive
Partitioning Analysis on Health and
Environment Keywords
Since keywords on health and environment topics directly
concern the safety of the human body and the environment,
these topics received much more public attention than the
other topics, namely those on location and nuclear power. We
now examine the mention rates classified by keyword via
Bayesian networks. In this graphical model, nodes represent
random variables and arrows represent the probabilistic
dependencies among nodes that are identified by recursive
partitioning analysis, a statistical method for multivariable
analysis of each topic. This model helps us understand a
sparse set of direct dependencies among keywords.
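As a rough illustration of what an edge in such a network encodes, the sketch below tabulates conditional mention probabilities between keywords from invented per-tweet indicators; the study itself identifies the dependencies through recursive partitioning rather than this direct tabulation.

```python
# Minimal sketch of the kind of dependency a Bayesian network edge encodes:
# conditional probabilities of mentioning one keyword given another,
# estimated from hypothetical per-tweet keyword indicators.
import pandas as pd

# 1 = keyword mentioned in the tweet, 0 = not mentioned (invented data).
tweets = pd.DataFrame({
    "radiation_exposure": [1, 1, 1, 0, 1, 0, 1, 0],
    "groceries":          [1, 0, 1, 0, 1, 0, 0, 0],
    "cancer":             [1, 0, 0, 0, 1, 0, 1, 0],
})

p_cancer = tweets["cancer"].mean()
p_cancer_given_radiation = tweets.loc[tweets["radiation_exposure"] == 1, "cancer"].mean()
p_cancer_given_both = tweets.loc[
    (tweets["radiation_exposure"] == 1) & (tweets["groceries"] == 1), "cancer"].mean()

print(f"P(cancer)                         = {p_cancer:.2f}")
print(f"P(cancer | radiation exposure)    = {p_cancer_given_radiation:.2f}")
print(f"P(cancer | radiation & groceries) = {p_cancer_given_both:.2f}")
```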

6.1 Discourse on Health Topics

Figure 1 shows the pie chart encompassing each keyword in the
health topic. Radiation exposure is the most frequently
mentioned keyword, followed by cancer, groceries, health, food,
thyroid, and leukemia in order of frequency.
6.2 Bayesian Network and Tree Structure

In Figure 2, the boxes at the bottom represent the frequency
distribution. The recursive partitioning analysis may proceed
until the tree is saturated in the sense that the offspring nodes
can no longer be subjected to further division (Zhang and
Singer, 2010). The root node, groceries, was split into nodes 2
and 3 with a splitting criterion of 7.5. In node 2, 34,952 people who
mentioned radiation exposure talked about groceries. If the
mention of groceries is 7.5 times or more, the node is split into
two further parts based on the number of comments about
cancer. According to nodes 4 and 5, only 45 people mentioned
all of the keywords: radiation exposure, groceries, and cancer.
The Bayesian networks drawn at the top right of Figure 2 are
representative of the relationships among the three keywords.
Radiation exposure is a parent node of both groceries and
cancer nodes, as well as a forebear node of cancer. Also,
groceries is a parent node of cancer. Radiation exposure has
achieved the most dominant position within the health topic
and the remaining keywords have played a subsidiary role. This
illustrates the concern that radiation exposure would lead to
food contamination, which eventually leads to the topic of
cancer. In particular, the public expressed worries about the
incidence of diseases such as thyroid and leukemia.
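The recursive partitioning step can be approximated with an off-the-shelf decision tree, as in the sketch below; the per-user mention counts are simulated, so the learned thresholds will not reproduce the 7.5 criterion, but the splitting logic is the same.

```python
# Minimal sketch of recursive partitioning with an off-the-shelf decision tree:
# split users by their keyword mention counts (simulated data) and print the tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n_users = 500

# Hypothetical per-user mention counts for two keywords.
groceries = rng.poisson(5, n_users)
cancer = rng.poisson(2, n_users)
# Hypothetical outcome: radiation-exposure mentions rise with both keywords.
radiation = rng.poisson(1 + 0.8 * groceries + 1.5 * cancer)

X = np.column_stack([groceries, cancer])
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, radiation)
print(export_text(tree, feature_names=["groceries", "cancer"]))
```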
CHAPTER 7
CONCLUSION
In the Web 2.0 environment, communication scholars are
confronted with great challenges and opportunities in
comprehensive and systematic content analysis. This study
uses a combination of coding principles and algorithmic
technology to demonstrate how communication scholars can
analyze large quantities of online data. The hybrid content
analysis approach presented in the article can serve as a
methodological blueprint for our discipline as we face
challenges and opportunities for analyzing rapidly growing
bodies of electronic communication. These challenges call for
efficient algorithms and tools for conducting content analyses
on large quantities of data for scholarly application in the
communication field. The combination of computational
processing power with human intelligence ensures high levels
of reliability and validity for the analysis of latent content,
particularly in an environment where character restrictions can
heighten the importance of context when analyzing content.
The main goal of the Bayesian network analysis was to measure
and analyze public awareness of and concern about environmental
pollution.
