net/publication/339984770
3 authors:
Ruihong Wang
Texas A&M University Central Texas
All content following this page was uploaded by Subasish Das on 20 March 2020.
Abstract: Many studies have examined vehicle-pedestrian and vehicle-bicycle crash patterns. However, there is a surprisingly limited amount of research focused on pedestrian-bicyclist collisions, and research about the tension between pedestrians and cyclists is even rarer. The lack of research on this subject is partly due to the limited number of pedestrian-cyclist crashes; it is also due to the fact that the consequences of these crashes are typically less severe than those of vehicle-involved crashes. Despite the lack of research focus on this area, pedestrian-cyclist crashes could lead to a serious social crisis. The largest video sharing website, YouTube.com, contains many pedestrian-cyclist collision videos and associated comments and replies. The application of content analysis and text mining to the comments from these videos can provide insight into potential interactions. The findings of this study show that the emotion patterns of comments and replies differ. This study also provides word shift plots that show the trend of the emotion used in comments and replies. Additionally, the co-occurrence plots show the reason behind the use of negative emotions. The findings of this study will provide additional insights into the ongoing debate on ‘pedestrian collisions with bicyclists’ issues.
Index Terms: autonomous vehicles, content analysis, text mining, sentiment analysis.
I. INTRODUCTION
In recent years, non-motorized travel modes (walking and biking) have been gaining popularity in the U.S. with non-motorist travelers. To improve the safety of pedestrians and bicyclists, researchers have made multiple efforts to reduce vehicle-bicycle and vehicle-pedestrian crashes. A less explored research area is the collision between a bicyclist and a pedestrian. A recent study examined the incidence of pedestrians injured by cyclists in California from 2005 to 2011 and in New York from 2004 to 2011 [34]. The findings show that despite the increasing number of cyclists, there was a decline in the rate of pedestrians injured in collisions with cyclists. One of the primary reasons for the decrease is improved cycling infrastructure. In the absence of cycling infrastructure, cyclists often use pedestrian facilities for part or all of their journey. Pedestrians feel insecure in the presence of speeding bicyclists, especially in dense urban locations. In recent years, antagonism between pedestrians and bicyclists has gained more attention through social media and mainstream media websites. It is easy to find videos and articles online that contain furious comments about careless pedestrians or bicyclists. The tensions between pedestrians and bicyclists are hard to resolve because both are vulnerable road users. Pedestrians may seem more vulnerable to most of the public, but this belief leads to a violent, negative image of bicyclists [1]. Mainstream websites present contradicting arguments between the bicyclists and the rest of the public. However, related academic studies are lacking. This study aims to bridge the gap by exploring the causes of tensions between bicyclists and pedestrians through topic modeling and text mining.
In recent years, the role that social media can have in shaping the opinions of individuals on various products and issues has gained attention. Because videos allow the visualization of information, concepts, and dialogues and permit user-generated communications, videos have come to shape public perceptions. YouTube.com has more than 1 billion users and is the largest online platform for open-access video content [24]. As such, this platform plays a significant role in generating public opinion on many issues. Compared with other social media platforms, YouTube contains the highest number of public opinions on video-related interactions.
To perform knowledge discovery on the motives of user participation and consumption on YouTube videos associated with ‘pedestrian collisions with bicyclists,’ this study applies a natural language processing (NLP) framework. This study aims to explore two research questions: 1) R1: What are the key patterns and trends of bicyclist vs. pedestrian interactions? and 2) R2: Do comments and replies vary in sentiment scores by video types?
II. EARLIER WORK AND RESEARCH CONTEXT
Very little research has concentrated on collisions between cyclists and pedestrians. Moreover, any mention of tension between pedestrians and cyclists in research is rare because of the limited number of crashes; the exception is in some urban-core ped-bike active zones. More serious issues could result from the tension between the two groups [2], [34].
Tuckel et al. conducted a study to examine the trend of pedestrian injuries in bicyclist-pedestrian (bike-ped) collisions and to investigate possible explanations [34]. The New York Times published an article called “The Cyclist-Pedestrian Wars,” detailing the rising tensions between bicyclists and pedestrians in the Upper East Side of New York [15]. An
Fig. 1: Flowchart of data collection and analysis.
Amsterdam study showed that traffic conflicts with bicycle paths are a specific safety problem for crossing pedestrians [35]. The bicyclist blames the pedestrian for stepping into the bike lane illegally, and the pedestrian blames the bicyclist for riding too fast and without looking [30]. In Santa Monica, California, the tension between pedestrians and bikers was explained in a published news article [19]. Werneke et al. found that pedestrians crossing the bicycle path without looking were the primary cause of the majority of bike-pedestrian incidents [36].
In recent years, several studies have incorporated text mining in transportation engineering research: consumer complaint analysis [14], [21], social media mining [7], [13], [16], [28], opinion mining on safety enhancement and bike-sharing [5], [10], topic modeling on transportation engineering conference papers and journals [3], [6], [9], [17], [33], and crash narrative investigation [4], [8]. An investigation into the textual content related to YouTube videos on ‘pedestrian collisions with bicyclists’ has not yet been conducted. There is an ongoing debate on whether social media data are representative and unbiased enough for a robust study. This study examines whether a collection of about 25,000 comments and replies can provide an accurate representation of public opinion and sentiments on this issue.
III. DATA DESCRIPTION
A. Data Collection
To collect the ‘bicycle hitting pedestrian’ related videos on YouTube, a detailed list of keywords was developed by using the following terms: “walking biking collision,” “biker hits ped,” “bicyclist hit pedestrian,” “pedestrian bike crash or incident or accident,” “pedestrian bicyclist crash or incident or accident.” The researchers automated the data collection process (extracting the video information as well as the related comments) by using an open-source R software package called “tuber” [32]. Another online YouTube comment scraper was also used [20]. The research team used several open-source R software packages to perform the analysis [22], [29], [31], [37]. The flowchart of data collection and analysis is shown in Fig. 1.
Table I provides descriptive statistics of the top ten most viewed videos. After removing redundant and non-English comments, the final number of comments was 26,122. Among the top ten videos, one video was released earlier (in 2010).
[Figure panels: (a) Comment; (b) Reply]
The overall number of views for all videos was 6,799,938 (mean: 679,994; standard deviation: 1,098,276). The number of likes across all videos was higher than the number of dislikes (55,482 vs. 7,670). The number of comments was 26,122 (mean: 2,612; standard deviation: 4,901). In these videos, participants also replied to the comments. These replies were also collected and analyzed in this study. The reply corpus contains around 2,000 replies.
IV. METHODOLOGY
Natural language processing (NLP), a popular branch of computer science, tends to view the process of text and language analysis as being divided into a number of stages by drawing theoretical linguistic distinctions among syntax or parsing (relationship between words), semantics (meaning of a textual content), and pragmatics (process of extracting information from text). Typically, sentences retrieved from a text or document are first analyzed in terms of the associated syntax, which provides a structure that is more amenable to analysis in terms of semantics and other associated meaning [18]. All required NLP tasks, such as tokenization, lemmatization, part-of-speech tagging, and dependency parsing, were conducted before starting the next steps.
A. Term Frequency-Inverse Document Frequency
Instead of using word or word-group frequencies, another approach is to look at a term’s inverse document frequency (IDF). Spärck Jones first introduced this concept in 1972 [31]. Table II shows the TF-IDF values for the two categories based on interaction types. All comments or replies for each of these videos are combined by video id to determine the TF-IDF measures. As unigrams are not suitable for explaining the intent of the topics, bigrams are considered in this analysis. A threshold of 200 counts is used as the baseline for the comment corpus; for the reply corpus, the threshold is 20. The majority of the bigrams overlap in both categories. Bigrams related to intersections, signal phases, bike lanes, and lighting conditions are the most common in both categories. The bigrams ‘the crosswalk’ and ‘parents fault’ appear in the comment category analysis. In the reply category, two unique bigrams are ‘walk on’ and ‘walk in.’
B. Sentiment Analysis
Mining subjective texts that contain opinion or sentiment can contribute to understanding perception towards a product. In other words, the objective of sentiment analysis is to determine which words or sentences express opinions, feelings, and sentiments. The sentiment score can be calculated as the number of positive words or sentences minus the number of negative words or sentences. The research team used built-in functions of ‘udpipe’ to determine the sentiment scores [37]. The descriptive statistics of the sentiment scores by video id are shown in Table III. Each video id is listed with the maximum, minimum, mean, and standard
TABLE II: TF-IDF of the Top Bigrams from Comment and Reply Corpora
VID Bigram TF IDF TF-IDF
Comments
sYWPHHo0fPU the light 0.00331 1.20397 0.00398
zR4Okh23Zlo bike lane 0.00567 0.69315 0.00393
sYWPHHo0fPU the intersection 0.00197 1.60944 0.00317
sYWPHHo0fPU yellow light 0.00124 2.30259 0.00286
sYWPHHo0fPU red light 0.00232 1.20397 0.00279
sYWPHHo0fPU light was 0.00152 1.60944 0.00245
zR4Okh23Zlo bike path 0.00218 0.91629 0.002
zR4Okh23Zlo parents fault 0.0007 2.30259 0.00162
sYWPHHo0fPU the crosswalk 0.00146 0.91629 0.00134
sYWPHHo0fPU slow down 0.00177 0.69315 0.00123
Replies
sYWPHHo0fPU the light 0.004216 1.504077 0.006342
zR4Okh23Zlo walk on 0.001681 2.197225 0.003694
zR4Okh23Zlo bike path 0.001639 2.197225 0.003602
sYWPHHo0fPU the intersection 0.00233 1.504077 0.003505
sYWPHHo0fPU light was 0.001498 2.197225 0.003291
sYWPHHo0fPU was red 0.001387 2.197225 0.003047
zR4Okh23Zlo cycle lane 0.001303 2.197225 0.002863
sYWPHHo0fPU red light 0.002441 1.098612 0.002682
zR4Okh23Zlo walk in 0.001177 2.197225 0.002586
sYWPHHo0fPU slow down 0.002164 1.098612 0.002377
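The TF, IDF, and TF-IDF columns of Table II can be illustrated with a short sketch. It assumes that each video's combined text is one document, that TF is a bigram's count divided by the total number of bigrams in that document, and that IDF is the natural logarithm of the number of documents divided by the number of documents containing the bigram. The natural-log form is an inference from the table (for example, ln(10/3) ≈ 1.20397), not something the text states. The function names and the whitespace tokenizer are hypothetical simplifications.

```python
import math
from collections import Counter


def bigrams(text):
    """Split on whitespace and return adjacent word pairs as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def bigram_tf_idf(docs):
    """docs maps a video id to its combined text.

    Returns {(video_id, bigram): (tf, idf, tf_idf)} using idf = ln(N / df).
    """
    counts = {vid: Counter(bigrams(text)) for vid, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: number of videos in which each bigram occurs.
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())

    scores = {}
    for vid, counter in counts.items():
        total = sum(counter.values())
        for bigram, n in counter.items():
            tf = n / total
            idf = math.log(n_docs / doc_freq[bigram])
            scores[(vid, bigram)] = (tf, idf, tf * idf)
    return scores


# Toy documents (invented for illustration, not taken from the study data).
toy_docs = {
    "a": "the light was red",
    "b": "the light was green",
    "c": "bike lane ahead",
}
toy_scores = bigram_tf_idf(toy_docs)
```

Under this definition, a bigram that occurs in every document has an IDF of zero, which is why bigrams shared by many videos rank lower than bigrams concentrated in one video.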
deviation of each comment and reply. The video with the highest comment average is dnkErN9N8KY (Pedestrian hit by bicycle in San Francisco) with 0.39. It also has the highest maximum, minimum, and standard deviation measures.
1) Valence Shift Word Graphs: In their study, Dodds and Danforth [11] showed the importance of the ‘Valence Shift Word Graph.’ Consider two texts $T_{\mathrm{ref}}$ (for reference) and $T_{\mathrm{comp}}$ (for comparison) with sentiment scores $s^{(\mathrm{ref})}_{\mathrm{mean}}$ and $s^{(\mathrm{comp})}_{\mathrm{mean}}$. A word’s sentiment is signified relative to text $T_{\mathrm{ref}}$ by + (positive sentiment) and − (negative sentiment), and its relative abundance in text $T_{\mathrm{comp}}$ versus text $T_{\mathrm{ref}}$ by ↑ (more prevalent) and ↓ (less prevalent). Combining these two binary possibilities leads to four cases [12]:
\[
\delta s_{\mathrm{mean},i} = \frac{100}{\left| s^{(\mathrm{comp})}_{\mathrm{mean}} - s^{(\mathrm{ref})}_{\mathrm{mean}} \right|} \underbrace{\left[ s_{\mathrm{mean}}(w_i) - s^{(\mathrm{ref})}_{\mathrm{mean}} \right]}_{+/-} \underbrace{\left[ p^{(\mathrm{comp})}_i - p^{(\mathrm{ref})}_i \right]}_{\uparrow/\downarrow}
\]
where $\sum_i \delta s_{\mathrm{mean},i} = \pm 100$, depending on the sign of the difference in sentiment between the two texts, $s^{(\mathrm{comp})}_{\mathrm{mean}} - s^{(\mathrm{ref})}_{\mathrm{mean}}$, and the terms to which the symbols +/− and ↑/↓ apply are indicated. Here $s_{\mathrm{mean}}(w_i)$ is the sentiment of word $w_i$ and $p_i$ is its normalized frequency in the indicated text. The quantity $\delta s_{\mathrm{mean},i}$ is called the per-word sentiment shift of the $i$th word. Figure 4 can be interpreted as follows:
• Words on the right contribute to an increase in positive emotions in the corpus
• A yellow bar on the right with a down arrow indicates that a negative emotion was used less
• A purple bar on the right with an up arrow indicates that a positive emotion was used more
• Words on the left contribute to a decrease in positive emotions in the corpus
• A yellow bar on the left with an up arrow indicates that a negative emotion was used more
• A purple bar on the left with a down arrow indicates that a positive emotion was used less
The word shift plots are not significantly different between the corpora (plural of corpus) developed for comments and replies. However, the degree of negative emotions is lower in the replies. Some terms, such as bike, are scored as positive emotion because conventional sentiment lexicons were used. There is a need to develop a transportation safety-related sentiment lexicon that precisely captures domain-specific sentiments and emotions; this is currently outside the scope of the present study.
C. Emotion Mining
Emotion mining, a method similar to sentiment analysis, detects, analyzes, and evaluates humans’ feelings towards different issues, topics, and scenarios. A specific direction of emotion mining is text emotion mining, which refers to the examination of people’s emotions based on their writing. By manual annotation through Amazon’s Mechanical Turk service, Mohammad and Turney [23] compiled a large English term–emotion association lexicon named EmoLex. The lexicon focuses on the emotions of joy, anger, sadness, fear, disgust, trust, surprise, and anticipation, which many have argued to be the prototypical, basic emotions [25]–[27].
For the emotion mining tasks, this study considers eight major emotion types and their negations. The trends of the emotions are shown at the sentence level (Fig. 3). This method uses sentiment lexicons to find emotion words and then computes the emotion propensity per sentence [37]. The x-axis indicates the number of documents in percentage form. For example, if the analysis is conducted on 100 documents, 25 percent indicates the 25th document, and if a vertical line is drawn at 25%, the intersecting points are the emotion propensity scores for that particular sentence.
D. Co-occurrence of Negative Terms
The majority of the sentiment analysis and emotion mining studies perform only n-gram related studies to determine the
TABLE III: Descriptive Statistics of Sentiment Scores by Videos
VID Max-Comment Max-Reply Min-Comment Min-Reply Mean-Comment Mean-Reply STD-Comment STD-Reply
0Lm9TPym9A4 4.10 0.80 -6.00 -2.65 -0.43 -0.71 1.22 0.88
5Qurlf05YYI 2.25 2.80 -3.25 -0.75 -0.41 0.75 1.01 1.20
dnkErN9N8KY 5.85 5.85 -1.50 -2.50 0.39 -0.03 1.53 1.72
dXpmxmFW164 1.05 0.80 -1.00 -1.00 0.26 -0.07 0.75 0.90
G4K8AjNIVPA 3.10 3.10 -8.60 -4.85 -0.28 -0.30 1.13 1.22
s-PuD8fSI-I 1.40 1.40 -1.75 -1.40 -0.13 -0.04 0.84 0.89
sYWPHHo0fPU 7.60 – -7.15 – -0.37 – 1.06 –
uRoU826ywjw 2.80 4.80 -0.75 -7.15 0.68 -0.32 1.15 1.27
Wq6rpVMcyas 3.00 1.00 -4.50 -4.10 -0.33 -0.38 0.85 0.84
zR4Okh23Zlo 5.30 5.30 -10.00 -6.45 -0.49 -0.42 1.15 1.32
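The per-video summary in Table III can be sketched in a few lines. The sketch assumes the simplest form of the score described in the sentiment analysis section (number of positive words minus number of negative words) and a tiny invented lexicon. The fractional values in Table III suggest that the scoring actually used applies additional weighting, so this sketch is not expected to reproduce the table. The names `POSITIVE`, `NEGATIVE`, `sentiment_score`, and `describe_by_video` are hypothetical.

```python
from statistics import mean, stdev

# Toy lexicon, invented for illustration only (an assumption, not the lexicon used in the study).
POSITIVE = {"good", "safe", "careful"}
NEGATIVE = {"bad", "stupid", "fault", "careless"}


def sentiment_score(text):
    """Count of positive words minus count of negative words."""
    tokens = [token.strip(".,!?;:'\"") for token in text.lower().split()]
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def describe_by_video(comments):
    """comments is an iterable of (video_id, text) pairs.

    Returns {video_id: {"max", "min", "mean", "std"}} over the comment scores.
    """
    by_video = {}
    for video_id, text in comments:
        by_video.setdefault(video_id, []).append(sentiment_score(text))
    summary = {}
    for video_id, scores in by_video.items():
        summary[video_id] = {
            "max": max(scores),
            "min": min(scores),
            "mean": mean(scores),
            # The sample standard deviation needs at least two scores.
            "std": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


toy_comments = [
    ("v1", "Good and safe riding."),
    ("v1", "Stupid, careless rider!"),
    ("v1", "bad"),
    ("v2", "Be careful"),
]
toy_summary = describe_by_video(toy_comments)
```

Running the same function separately on comments and on replies would yield the paired Comment and Reply columns of the table.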