Edited by:
Olga Moreira
Arcler Press
www.arclerpress.com
Advanced Techniques for Collecting Statistical Data
Olga Moreira
Arcler Press
224 Shoreacres Road
Burlington, ON L7L 2H2
Canada
www.arclerpress.com
Email: orders@arclereducation.com
This book contains information obtained from highly regarded resources. Reprinted material sources are indicated. Copyright for individual articles remains with the authors as indicated and published under a Creative Commons License. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data; views articulated in the chapters are those of the individual contributors and not necessarily those of the editors or publishers. The editors and publishers are not responsible for the accuracy of the information in the published chapters or the consequences of its use. The publisher assumes no responsibility for any damage or grievance to persons or property arising out of the use of any materials, instructions, methods, or thoughts in the book. The editors and the publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so we may rectify the omission.
Notice: Registered trademarks of products or corporate names are used only for explanation and identification, without intent to infringe.
Arcler Press publishes a wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at www.arclerpress.com.
DECLARATION
Some content and chapters in this book are open-access, copyright-free published research works, published under a Creative Commons License and indicated with a citation. We are thankful to the publishers and authors of this content and these chapters, as without them this book would not have been possible.
ABOUT THE EDITOR
Appendix B. Internal Validity Test of Vignette Responses ........................ 114
Author Contributions ............................................................................. 115
References ............................................................................................. 116
Chapter 8 Wiki Surveys: Open and Quantifiable Social Data Collection ............... 155
Abstract ................................................................................................. 155
Introduction ........................................................................................... 156
Wiki Surveys .......................................................................................... 157
Case Studies .......................................................................................... 165
Discussion ............................................................................................. 171
Acknowledgments ................................................................................. 173
Author Contributions ............................................................................. 173
References ............................................................................................. 174
Limitations ............................................................................................. 202
Conclusions ........................................................................................... 202
Acknowledgements ............................................................................... 204
Authors’ Contributions ........................................................................... 204
References ............................................................................................. 205
Chapter 10 Mobile Data Collection: Smart, but Not (Yet) Smart Enough ................ 209
Background ........................................................................................... 209
Smart Mobile Data Collection................................................................ 210
Smarter Mobile Data Collection in the Future ........................................ 212
Conclusions ........................................................................................... 214
Author Contributions ............................................................................. 215
Acknowledgments ................................................................................. 215
References ............................................................................................. 216
Principles of Big Data Intelligent Fusion................................................. 259
Experimental Simulation Analysis .......................................................... 262
Conclusion ............................................................................................ 265
References ............................................................................................. 266
Conclusion ............................................................................................ 326
Acknowledgements ............................................................................... 326
Authors’ Contributions ........................................................................... 326
References ............................................................................................. 327
LIST OF CONTRIBUTORS
Albine Moser
Faculty of Health Care, Research Centre Autonomy and Participation of Chronically Ill
People, Zuyd University of Applied Sciences, Heerlen, The Netherlands
Faculty of Health, Medicine and Life Sciences, Department of Family Medicine,
Maastricht University, Maastricht, The Netherlands
Irene Korstjens
Faculty of Health Care, Research Centre for Midwifery Science, Zuyd University of
Applied Sciences, Maastricht, The Netherlands
Barbara B. Kawulich
University of West Georgia, Educational Leadership and Professional Studies Department, 1601 Maple Street, Room 153, Education Annex, Carrollton, GA 30118, USA
Bence Ságvári
Computational Social Science—Research Center for Educational and Network Studies
(CSS–RECENS), Centre for Social Sciences, Tóth Kálmán Utca 4, 1097 Budapest,
Hungary
Institute of Communication and Sociology, Corvinus University, Fővám tér 8, 1093
Budapest, Hungary
Attila Gulyás
Computational Social Science—Research Center for Educational and Network Studies
(CSS–RECENS), Centre for Social Sciences, Tóth Kálmán Utca 4, 1097 Budapest,
Hungary
Júlia Koltai
Computational Social Science—Research Center for Educational and Network Studies
(CSS–RECENS), Centre for Social Sciences, Tóth Kálmán Utca 4, 1097 Budapest,
Hungary
Department of Network and Data Science, Central European University, Quellenstraße
51, 1100 Vienna, Austria
Faculty of Social Sciences, Eötvös Loránd University of Sciences, Pázmány Péter
Sétány 1/A, 1117 Budapest, Hungary
Eric Badu
School of Nursing and Midwifery, The University of Newcastle, Callaghan, Australia
Rebecca Mitchell
Faculty of Business and Economics, Macquarie University, North Ryde, Australia
Matthew J. Salganik
Department of Sociology, Center for Information Technology Policy, and Office of
Population Research, Princeton University, Princeton, NJ, USA
Karen E. C. Levy
Information Law Institute and Department of Media, Culture, and Communication,
New York University, New York, NY, USA and Data & Society Research Institute,
New York, NY, USA
C. A. Piña-García
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Departamento de
Ciencias de la Computación, Universidad Nacional Autónoma de México, Ciudad de
México, México
Carlos Gershenson
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Departamento de
Ciencias de la Computación, Universidad Nacional Autónoma de México, Ciudad de
México, México
Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México,
Circuito Maestro Mario de la Cueva S/N, Ciudad Universitaria, Ciudad de México,
04510 México
SENSEable City Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue,
Cambridge, 02139 USA
MoBS Lab, Network Science Institute, Northeastern University, 360 Huntington av
1010-177, Boston, 02115 USA
ITMO University, Birzhevaya liniya 4, St. Petersburg, 199034 Russia
J. Mario Siqueiros-García
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Departamento de
Ciencias de la Computación, Universidad Nacional Autónoma de México, Ciudad de
México, México
Alexander Seifert
University Research Priority Program “Dynamics of Healthy Aging”, University of
Zurich, Zurich, Switzerland
Matthias Hofer
University Research Priority Program “Dynamics of Healthy Aging”, University of
Zurich, Zurich, Switzerland
Department of Communication and Media Research, University of Zurich, Zurich,
Switzerland
Mathias Allemand
University Research Priority Program “Dynamics of Healthy Aging”, University of
Zurich, Zurich, Switzerland
Department of Psychology, University of Zurich, Zurich, Switzerland
Xiang Huang
Guangdong University of Finance and Economics, College of Entrepreneurship Education, Guangzhou 510320, China
Hongying Liu
Department of Computer Science and Engineering, Guangzhou College of Technology
and Business, Guangzhou 510850, China
Hyeongju Ryu
Biomedical Research Institute, Seoul National University Hospital, Seoul 03080, Korea
Meihua Piao
Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
Heejin Kim
Clinical Trials Center, Seoul National University Hospital, Seoul 03080, Korea
Wooseok Yang
Clinical Trials Center, Seoul National University Hospital, Seoul 03080, Korea
Carlos Baquero
U. Minho and INESC TEC, Braga, Portugal
Paolo Casari
Department of Information Engineering and Computer Science, University of Trento,
Trento, Italy
Amanda García-García
IMDEA Networks Institute, Madrid, Spain
Davide Frey
Inria Rennes, Rennes, France
Augusto Garcia-Agundez
Multimedia Communications Lab, TU Darmstadt, Darmstadt, Germany
Chryssis Georgiou
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
Benjamin Girault
Department of Electrical and Computer Engineering, University of Southern California,
Los Angeles, CA, United States
Antonio Ortega
Department of Electrical and Computer Engineering, University of Southern California,
Los Angeles, CA, United States
Mathieu Goessens
Consulting, Rennes, France
Harold A. Hernández-Roig
Department of Statistics, UC3M & UC3M-Santander Big Data Institute, Getafe, Spain
Nicolas Nicolaou
Algolysis Ltd, Nicosia, Cyprus
Efstathios Stavrakis
Algolysis Ltd, Nicosia, Cyprus
Oluwasegun Ojo
IMDEA Networks Institute and UC3M, Madrid, Spain
Julian C. Roberts
Skyhaven Media, Liverpool, United Kingdom
Ignacio Sanchez
InqBarna, Barcelona, Spain
Ashwin A. Phatak
Institute of Exercise Training and Sport Informatics, German Sports University,
Cologne, Germany
Franz-Georg Wieland
Institute of Physics, University of Freiburg, Freiburg im Breisgau, Germany
Kartik Vempala
Bloomberg LP, New York, USA
Frederik Volkmar
Institute of Exercise Training and Sport Informatics, German Sports University,
Cologne, Germany
Daniel Memmert
Institute of Exercise Training and Sport Informatics, German Sports University,
Cologne, Germany
Suppawong Tuarob
Faculty of Information and Communication Technology, Mahidol University, Nakhon
Pathom 73170, Thailand
Poom Wettayakorn
Faculty of Information and Communication Technology, Mahidol University, Nakhon
Pathom 73170, Thailand
Ponpat Phetchai
Faculty of Information and Communication Technology, Mahidol University, Nakhon
Pathom 73170, Thailand
Siripong Traivijitkhun
Faculty of Information and Communication Technology, Mahidol University, Nakhon
Pathom 73170, Thailand
Sunghoon Lim
Department of Industrial Engineering, Ulsan National Institute of Science and
Technology, Ulsan 44919, Republic of Korea
Institute for the 4th Industrial Revolution, Ulsan National Institute of Science and
Technology, Ulsan 44919, Republic of Korea
Thanapon Noraset
Faculty of Information and Communication Technology, Mahidol University, Nakhon
Pathom 73170, Thailand
Tipajin Thaipisutikul
Faculty of Information and Communication Technology, Mahidol University, Nakhon
Pathom 73170, Thailand
We live in an age of Big Data. This is changing the way researchers collect and
preprocess data. This book aims to provide a broad view of the current methods and
techniques, as well as automated systems for statistical data collection. It is divided into
three parts, each focusing on a different aspect of the statistical data collection process.
The first part of the book introduces readers to qualitative data collection methods. Chapters 1 to 4 comprise a practical guide by Moser & Korstjens (2017) to designing qualitative studies and to sampling, collecting, and analyzing data about people, processes, and cultures in qualitative research. Chapters 5 and 6 focus on observation-based methods, specifically participant observation.
Chapter 1 introduces the concept of “qualitative research” from the point of view of
clinical trials and healthcare sciences. Qualitative research is seen as “the investigation
of phenomena, typically in an in-depth and holistic fashion, through the collection
of rich narrative materials using a flexible research design”. Chapter 2 answers frequently asked questions about the context, research questions, and design of qualitative research. Chapter 3 covers sampling strategies, as well as data collection and analysis plans. Chapter 4 reflects upon the trustworthiness of the collected data. Chapter 5 presents various definitions of participant observation and the purposes for which it is used, along with exercises for teaching observation techniques. Chapter 6 presents an exploratory study conducted in Hungary that used a factorial design-based online survey to explore willingness to participate in a future research project based on active and passive data collection via smartphones.
The second part of the book is focused on data mining of information collected from
clinical and social studies surveys, as well as from social media. Chapter 7 includes a
review of methods used in clinical research, from study design to sampling and data
collection. Chapters 8 and 9 present data collection methods that facilitate the quantification of information from online survey respondents and social media. Chapter 8 presents a new
method for data collection and data analysis for pairwise wiki surveys using two proof-
of-concept case studies involving the free and open-source website www.allourideas.
org. Chapter 9 proposes a methodology for carrying out an efficient data collection process via three random strategies: Brownian, Illusion, and Reservoir, and shows that this new methodology can be used to collect global trends on Twitter. Chapters 10 and 11 are focused on mobile data collection methods, and Chapters 12 and 13 are focused on big data
collection systems. Chapter 10 reflects on many challenges of mobile data collection
with smartphones, as well as on the interesting avenues the future development of this
technology can provide for clinical research. Chapter 11 compares a web-based mobile
phone automated system (MPAS) with the traditional paper and email-based data
collection (PEDC). It demonstrates that MPAS has the potential to be a more effective
and acceptable method for improving the overall management, treatment compliance,
and methodological quality of clinical research. Chapter 12 proposes an analytical
framework, which considers the decision-making of big data objects participating in
the big data collection process. This new framework aims to reflect on factors that
can improve the participation willingness of big data objects. Chapter 13 proposes a Java3D-based big data network multi-resolution acquisition method, which has lower acquisition costs, shorter completion times, and higher acquisition accuracy than most current data collection and analysis systems.
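For readers unfamiliar with the strategies named above, the short sketch below illustrates classic reservoir sampling, a standard way of keeping a uniform random sample of fixed size from a stream of items (for example, incoming posts) without storing the whole stream. It is an assumed, generic illustration of the technique and is not taken from Chapter 9; the chapter's own 'Reservoir' strategy may differ in its details.

    import random

    def reservoir_sample(stream, k, rng=random):
        """Keep a uniform random sample of k items from an arbitrarily long stream."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)      # fill the reservoir with the first k items
            else:
                j = rng.randint(0, i)       # replace an existing item with decreasing probability
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Hypothetical usage: sample 5 posts from a stream of 10,000.
    posts = (f"post-{n}" for n in range(10_000))
    print(reservoir_sample(posts, 5))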
The third and last part of this book is focused on the current efforts to optimize and
automate data collection procedures. Chapter 14 presents the development of a mobile
application for collecting subject data for clinical trials, which is shown to increase the efficiency of clinical trial management. Chapter 15 describes the CoronaSurveys system, developed to facilitate COVID-19 data collection. The proposed system comprises multiple components and processes: the web survey; the mobile apps; the cleaning and aggregation of survey responses; data storage and publication; data processing and estimate computation; and visualization of the results. Chapter
16 is focused on machine learning algorithms for data collection, data mining and
knowledge discovery in sports and healthcare. It proposes an artificial intelligence-based body sensor network framework (AIBSNF) for the strategic use of body sensor networks (BSNs), which combines a real-time location system (RTLS) with wearable biosensors to collect multivariate, low-noise, and high-fidelity data.
Chapter 17 introduces DAViS, an automated system for real-time data collection, analysis, and visualization for stock market prediction. The proposed stock forecasting
method outperforms a traditional baseline and confirms that leveraging an ensemble
scheme of machine learning methods with contextual information improves stock
prediction performance.
Chapter 1
SERIES: PRACTICAL GUIDANCE TO QUALITATIVE RESEARCH. PART 1: INTRODUCTION
ABSTRACT
In the course of our supervisory work over the years, we have noticed that
qualitative research tends to evoke a lot of questions and worries, so-called
Frequently Asked Questions. This journal series of four articles intends to
provide novice researchers with practical guidance for conducting high-
Citation: (APA): Moser, A., & Korstjens, I. (2017). Series: Practical guidance to qualitative research. Part 1: Introduction. European Journal of General Practice, 23(1), 271-273. (4 pages)
Copyright: © This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).
INTRODUCTION
In the course of our supervisory work over the years, we have noticed that
while many researchers who conducted qualitative research for the first time
understood the tenets of qualitative research, knowing about qualitative
methodology and carrying out qualitative research were two different
things. We noticed that they somehow mixed quantitative and qualitative
methodology and methods. We also observed that they experienced many
uncertainties when doing qualitative research. They expressed a great need
for practical guidance regarding key methodological issues. For example,
questions often heard and addressed were, ‘What kind of literature would I
search for when preparing a qualitative study?’ ‘Is it normal that my research
question seems to change during the study?’ ‘What types of sampling can I
use?’ ‘What methods of data collection are appropriate?’ ‘Can I wait with my
analysis until all data have been collected?’ ‘What are the quality criteria for
qualitative research?’ ‘How do I report my qualitative study?’ This induced
us to write this series providing ‘practical guidance’ to qualitative research.
QUALITATIVE RESEARCH
Qualitative research has been defined as the investigation of phenomena,
typically in an in-depth and holistic fashion, through the collection of rich
narrative materials using a flexible research design [1]. Qualitative research
aims to provide in-depth insights and understanding of real-world problems
and, in contrast to quantitative research, it does not introduce treatments,
manipulate or quantify predefined variables. Qualitative research
encompasses many different designs, which however share several key
features as presented in Box 1.
ACKNOWLEDGEMENTS
The authors wish to thank the following junior researchers who have been
participating for the last few years in the so-called ‘think tank on qualitative
research’ project, a collaborative project between Zuyd University of Applied
Sciences and Maastricht University, for their pertinent questions: Erica
Baarends, Jerome van Dongen, Jolanda Friesen-Storms, Steffy Lenzen,
Ankie Hoefnagels, Barbara Piskur, Claudia van Putten-Gamel, Wilma
Savelberg, Steffy Stans, and Anita Stevens. The authors are grateful to Isabel
van Helmond, Joyce Molenaar and Darcy Ummels for proofreading our
manuscripts and providing valuable feedback from the ‘novice perspective’.
REFERENCES
1. Polit DF, Beck CT. Nursing research: generating and assessing evidence for nursing practice. 10th ed. Philadelphia (PA): Lippincott, Williams & Wilkins; 2017.
2. Hepworth J, Key M. General practitioners learning qualitative research: a case study of postgraduate education. Aust Fam Physician. 2015;44:760–763.
3. Greenhalgh T, Annandale E, Ashcroft R, et al. An open letter to the BMJ editors on qualitative research. BMJ. 2016;352:i563.
Chapter 2
SERIES: PRACTICAL GUIDANCE TO QUALITATIVE RESEARCH. PART 2: CONTEXT, RESEARCH QUESTIONS AND DESIGNS
ABSTRACT
In the course of our supervisory work over the years, we have noticed that
qualitative research tends to evoke a lot of questions and worries, so-called
frequently asked questions (FAQs). This series of four articles intends to
provide novice researchers with practical guidance for conducting high-quality
Citation: (APA): Korstjens, I., & Moser, A. (2017). Series: Practical guidance to qualitative research. Part 2: Context, research questions and designs. European Journal of General Practice, 23(1), 274-279. (7 pages)
Copyright: © This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).
INTRODUCTION
In an introductory paper [1], we have described the key features of qualitative
research. The current article addresses frequently asked questions about
context, research questions and design of qualitative research.
CONTEXT
RESEARCH QUESTIONS
Box 1. Searching the literature for qualitative studies: the SPIDER tool.
Based on Cooke et al. [3].
S Sample: qualitative research uses smaller samples, as findings are not intended to be generalized to the general population.
PI Phenomenon of Interest: qualitative research examines how and why certain experiences, behaviours and decisions occur (in contrast to the effectiveness of intervention).
D Design: refers to the theoretical framework and the corresponding method used, which influence the robustness of the analysis and findings.
E Evaluation: evaluation outcomes may include more subjective outcomes (views, attitudes, perspectives, experiences, etc.).
R Research type: qualitative, quantitative and mixed-methods research could be searched for.
depth and richness of your findings and the required samples, methods,
techniques and efforts for data collection and analyses. These choices lead
to different research questions, for example:
• ‘What are GPs’ and patients’ attitudes and perspectives towards
discussing family abuse and violence?’ Or:
• ‘How do GPs behave during the communication and follow-
up process when a patient’s signals suggest intimate partner
violence?’
Box 2. The ‘big three’ approaches in qualitative study design. Based on Polit
and Beck [2].
If you want to study the interaction between GPs and family caregivers
to generate a theory of ‘trust’ within caring relationships, your research
question might be ‘How does a relationship of trust between GPs and family
caregivers evolve in end-of-life care for people with COPD?’ Grounded
theory might then be the design of the first choice. In this approach, data
are collected mostly through in-depth interviews, but may also include
observations of encounters, followed by interviews with those who were
observed. The findings presented consist of a theory, including a basic social
process and relevant concepts and categories.
If you merely aim to give a qualitative description of the views of family
caregivers about facilitators and barriers to contacting GPs, you might use
content analysis and present the themes and subthemes you found.
The next article in this Series on qualitative research, Part 3, will focus
on sampling, data collection, and analysis [19]. In the final article, Part 4, we
address two overarching themes: trustworthiness and publishing [20].
ACKNOWLEDGEMENTS
The authors thank the following junior researchers who have been
participating for the last few years in the so-called ‘Think tank on qualitative
research’ project, a collaborative project between Zuyd University of Applied
Sciences and Maastricht University, for their pertinent questions: Erica
Baarends, Jerome van Dongen, Jolanda Friesen-Storms, Steffy Lenzen,
Ankie Hoefnagels, Barbara Piskur, Claudia van Putten-Gamel, Wilma
Savelberg, Steffy Stans, and Anita Stevens. The authors are grateful to Isabel
van Helmond, Joyce Molenaar and Darcy Ummels for proofreading our
manuscripts and providing valuable feedback from the ‘novice perspective’.
REFERENCES
1. Moser A, Korstjens I. Series: Practical guidance to qualitative research. Part 1: Introduction. Eur J Gen Pract. 2017;23:271–273.
2. Polit DF, Beck CT. Nursing research: Generating and assessing evidence for nursing practice. 10th ed. Philadelphia (PA): Lippincott, Williams & Wilkins; 2017.
3. Cooke A, Smith D, Booth A. Beyond PICO: the SPIDER tool for qualitative evidence synthesis. Qual Health Res. 2012;22:1435–1443.
4. Atkinson P, Coffey A, Delamont S, et al. Handbook of ethnography. Thousand Oaks (CA): Sage; 2001.
5. Smith JA, Flowers P, Larkin M. Interpretative phenomenological analysis: theory, method and research. London (UK): Sage; 2010.
6. Charmaz K. Constructing grounded theory. 2nd ed. Thousand Oaks (CA): Sage; 2014.
7. Creswell JW. Qualitative research design: Choosing among five approaches. 3rd ed. Los Angeles (CA): Sage; 2013.
8. Yin R. Case study research: design and methods. 5th ed. Thousand Oaks (CA): Sage; 2014.
9. ten Have P. Doing conversation analysis. 2nd ed. London (UK): Sage; 2007.
10. Riessman CK. Narrative methods for the human sciences. Thousand Oaks (CA): Sage; 2008.
11. Fleming V, Gaidys U, Robb Y. Hermeneutic research in nursing: developing a Gadamerian-based research method. Nurs Inq. 2003;10:113–120.
12. Lundy KS. Historical research. In: Munhall PL, ed. Nursing research: a qualitative perspective. 5th ed. (pp. 381–398). Sudbury (MA): Jones & Bartlett; 2012.
13. Koch T, Kralik D. Participatory action research in health care. Oxford (UK): Blackwell; 2006.
14. Minkler M, Wallerstein N, editors. Community-based participatory research for health. San Francisco (CA): Jossey-Bass Publishers; 2003.
15. Dant T. Critical social theory: Culture, society and critique. London (UK): Sage; 2004.
16. Hesse-Biber S, editor. Feminist research practice: a primer. Thousand Oaks (CA): Sage; 2014.
Chapter 3
SERIES: PRACTICAL GUIDANCE TO QUALITATIVE RESEARCH. PART 3: SAMPLING, DATA COLLECTION AND ANALYSIS
ABSTRACT
In the course of our supervisory work over the years, we have noticed that
qualitative research tends to evoke a lot of questions and worries, so-called
frequently asked questions (FAQs). This series of four articles intends to
provide novice researchers with practical guidance for conducting high-
Citation: (APA): Moser, A., & Korstjens, I. (2018). Series: Practical guidance to qualitative research. Part 3: Sampling, data collection and analysis. European Journal of General Practice, 24(1), 9-18. (11 pages)
Copyright: © 2018 The Author(s). Published by Informa UK Limited, trading as Taylor
& Francis Group. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).
INTRODUCTION
This article is the third paper in a series of four articles aiming to provide
practical guidance to qualitative research. In an introductory paper, we
have described the objective, nature and outline of the Series [1]. Part 2 of
the series focused on context, research questions and design of qualitative
research [2]. In this paper, Part 3, we address frequently asked questions
(FAQs) about sampling, data collection and analysis.
SAMPLING
Box 1. Sampling strategies in qualitative research. Based on Polit & Beck [3].
Purposive sampling: Selection of participants based on the researchers’ judgement about what potential participants will be most informative.
Criterion sampling: Selection of participants who meet pre-determined criteria of importance.
Theoretical sampling: Selection of participants based on the emerging findings to ensure adequate representation of theoretical concepts.
Convenience sampling: Selection of participants who are easily available.
Snowball sampling: Selection of participants through referrals by previously selected participants or persons who have access to potential participants.
Maximum variation sampling: Selection of participants based on a wide range of variation in backgrounds.
Extreme case sampling: Purposeful selection of the most unusual cases.
Typical case sampling: Selection of the most typical or average participants.
Confirming and disconfirming sampling: Confirming and disconfirming cases sampling supports checking or challenging emerging trends or patterns in the data.
DATA COLLECTION
play an insider role, just as you do in your own work setting. This role
might be appropriate when studying persons who are difficult to access. The
second type is ‘active participation’. You have gained access to a particular
setting and observed the group under study. You can move around at will
and can observe in detail and depth and in different situations. The third role
is ‘moderate participation’. You do not actually work in the setting you wish
to study but are located there as a researcher. You might adopt this role when
you are not affiliated to the care setting you wish to study. The fourth role
is that of the ‘complete observer’, in which you merely observe (bystander
role) and do not participate in the setting at all. However, you cannot perform
any observations without access to the care setting. Such access might be
easily obtained when you collect data by observations in your own primary
care setting. In some cases, you might observe other care settings, which are
relevant to primary care, for instance observing the discharge procedure for
vulnerable elderly people from hospital to primary care.
your reflections on a piece of paper. After the observations, the field notes
need to be worked out and transcribed immediately to be able to include
detailed descriptions.
ANALYSIS
Can I wait with my analysis until all data have been collected?
You cannot wait with the analysis, because an iterative approach and
emerging design are at the heart of qualitative research. This involves
a process whereby you move back and forth between sampling, data
collection and data analysis to accumulate rich data and interesting findings.
The principle is that what emerges from data analysis will shape subsequent
sampling decisions. Immediately after the very first observation, interview
or focus group discussion, you have to start the analysis and prepare your
field notes.
approaches has different schools of thought, which may also have integrated
the analytical methods from other schools (Box 4). When you opt for a
particular approach, it is best to use a handbook describing its analytical
methods, as it is better to use one approach consistently than to ‘mix up’
different schools.
In general, qualitative analysis begins with organizing data. Large
amounts of data need to be stored in smaller and manageable units, which
can be retrieved and reviewed easily. To obtain a sense of the whole,
analysis starts with reading and rereading the data, looking at themes,
emotions and the unexpected, taking into account the overall picture. You
immerse yourself in the data. The most widely used procedure is to develop
an inductive coding scheme based on actual data [11]. This is a process of
open coding, creating categories and abstraction. In most cases, you do not
start with a predefined coding scheme. You describe what is going on in
the data. You ask yourself, what is this? What does it stand for? What else
is like this? What is this distinct from? Based on this close examination of
what emerges from the data you make as many labels as needed. Then, you
make a coding sheet, in which you collect the labels and, based on your
interpretation, cluster them in preliminary categories. The next step is to
order similar or dissimilar categories into broader higher order categories.
Each category is named using content-characteristic words. Then, you use
abstraction by formulating a general description of the phenomenon under
study: subcategories with similar events and information are grouped
together as categories and categories are grouped as main categories. During
the analysis process, you identify ‘missing analytical information’ and you
continue data collection. You reread, recode, re-analyse and re-collect data
until your findings provide breadth and depth.
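As a purely illustrative aside (not part of the original article), the clustering and abstraction steps described above can be pictured as a simple hierarchy running from open codes to preliminary categories to main categories. The sketch below records such a hierarchy with invented labels; the labels carry no analytic meaning and only show how a coding sheet and its abstraction might be kept in a structured form.

    # Invented open codes clustered into preliminary categories (the coding sheet).
    coding_sheet = {
        "checking glucose daily": "monitoring the illness",
        "adjusting insulin when eating out": "managing treatment in daily life",
        "asking the GP before changes": "relying on professionals",
        "family reminds me of appointments": "support from others",
    }

    # Abstraction: preliminary categories grouped into broader main categories.
    main_categories = {
        "monitoring the illness": "self-management",
        "managing treatment in daily life": "self-management",
        "relying on professionals": "shared care",
        "support from others": "shared care",
    }

    def abstraction(open_code):
        """Trace one open code up the hierarchy to its preliminary and main category."""
        preliminary = coding_sheet[open_code]
        return preliminary, main_categories[preliminary]

    print(abstraction("adjusting insulin when eating out"))
    # ('managing treatment in daily life', 'self-management')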
Throughout the qualitative study, you reflect on what you see or do not
see in the data. It is common to write ‘analytic memos’ [3], write-ups or mini-
analyses about what you think you are learning during the course of your
study, from designing to publishing. They can be a few sentences or pages,
whatever is needed to reflect upon: open codes, categories, concepts, and
patterns that might be emerging in the data. Memos can contain summaries
of major findings and comments and reflections on particular aspects.
In ethnography, analysis begins from the moment that the researcher
sets foot in the field. The analysis involves continually looking for patterns
in the behaviours and thoughts of the participants in everyday life, in order
to obtain an understanding of the culture under study. When comparing
one pattern with another and analysing many patterns simultaneously, you
may use maps, flow charts, organizational charts and matrices to illustrate
the comparisons graphically. The outcome of an ethnographic study is a
narrative description of a culture.
In phenomenology, analysis aims to describe and interpret the meaning
of an experience, often by identifying essential subordinate and major
themes. You search for common themes featuring within an interview and
across interviews, sometimes involving the study participants or other
experts in the analysis process. The outcome of a phenomenological study
is a detailed description of themes that capture the essential meaning of a
‘lived’ experience.
Grounded theory generates a theory that explains how a basic social
problem that emerged from the data is processed in a social setting.
Grounded theory uses the ‘constant comparison’ method, which involves
comparing elements that are present in one data source (e.g., an interview)
with elements in another source, to identify commonalities. The steps in
the analysis are known as open, axial and selective coding. Throughout the
analysis, you document your ideas about the data in methodological and
theoretical memos. The outcome of a grounded theory study is a theory.
Descriptive generic qualitative research is defined as research designed
to produce a low inference description of a phenomenon [12]. Although
Sandelowski maintains that all research involves interpretation, she has
also suggested that qualitative description attempts to minimize inferences
made in order to remain ‘closer’ to the original data [12]. Descriptive
generic qualitative research often applies content analysis. Descriptive
content analysis studies are not based on a specific qualitative tradition
and are varied in their methods of analysis. The analysis of the content
aims to identify themes, and patterns within and among these themes. An
inductive content analysis [11] involves breaking down the data into smaller
units, coding and naming the units according to the content they present,
and grouping the coded material based on shared concepts. They can be
represented by clustering in treelike diagrams. A deductive content analysis
[11] uses a theory, theoretical framework or conceptual model to analyse
the data by operationalizing them in a coding matrix. An inductive content
analysis might use several techniques from grounded theory, such as open
and axial coding and constant comparison. However, note that your findings
are merely a summary of categories, not a grounded theory.
Analysis software can support you in managing your data, for example by helping to store, annotate and retrieve texts, to locate words, phrases and segments of data, to name and label, to sort and organize, to identify data units, to prepare diagrams and to extract quotes. Still, as a researcher you do the analytical work yourself, by looking at what is in the data, making decisions about assigning codes, and identifying categories, concepts and patterns. The computer-assisted qualitative data analysis (CAQDAS) website provides support for making informed choices about analytical software and courses: http://www.surrey.ac.uk/sociology/research/researchcentres/caqdas/support/choosing. See Box 5 for further reading on qualitative analysis.
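As a purely illustrative aside (not part of the original article, and not tied to any particular CAQDAS package), the sketch below shows the kind of bookkeeping such software automates, using invented codes and fragments: storing coded segments, retrieving every segment assigned a given code, and locating segments that contain a word or phrase. Real packages add annotation, memoing and visualization on top of this kind of indexing, but the analytic decisions about what a code means remain the researcher's.

    from collections import defaultdict

    class CodedSegments:
        """A minimal in-memory store for text segments and the open codes assigned to them."""

        def __init__(self):
            self.segments = {}                 # segment id -> text
            self.codes = defaultdict(set)      # code label -> set of segment ids

        def add(self, segment_id, text, codes):
            self.segments[segment_id] = text
            for code in codes:
                self.codes[code].add(segment_id)

        def by_code(self, code):
            """Retrieve all segments labelled with a given open code."""
            return [self.segments[i] for i in sorted(self.codes[code])]

        def locate(self, phrase):
            """Locate segments containing a word or phrase (case-insensitive)."""
            phrase = phrase.lower()
            return [t for t in self.segments.values() if phrase in t.lower()]

    # Hypothetical example: two interview fragments with invented codes.
    store = CodedSegments()
    store.add("int01_p03", "I always check with my GP before changing the dose.",
              codes=["relying on professionals"])
    store.add("int02_p11", "I adjust the dose myself when I eat out.",
              codes=["self-management", "daily life"])
    print(store.by_code("self-management"))
    print(store.locate("dose"))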
The next and final article in this series, Part 4, will focus on trustworthiness
and publishing qualitative research [13].
ACKNOWLEDGEMENTS
The authors thank the following junior researchers who have been
participating for the last few years in the so-called ‘Think tank on qualitative
research’ project, a collaborative project between Zuyd University of Applied
Sciences and Maastricht University, for their pertinent questions: Erica
Baarends, Jerome van Dongen, Jolanda Friesen-Storms, Steffy Lenzen,
Ankie Hoefnagels, Barbara Piskur, Claudia van Putten-Gamel, Wilma
Savelberg, Steffy Stans, and Anita Stevens. The authors are grateful to Isabel
van Helmond, Joyce Molenaar and Darcy Ummels for proofreading our
manuscripts and providing valuable feedback from the ‘novice perspective’.
REFERENCES
1. Moser A, Korstjens I. Series: Practical guidance to qualitative research. Part 1: Introduction. Eur J Gen Pract. 2017;23:271–273.
2. Korstjens I, Moser A. Series: Practical guidance to qualitative research. Part 2: Context, research questions and designs. Eur J Gen Pract. 2017;23:274–279.
3. Polit DF, Beck CT. Nursing research: Generating and assessing evidence for nursing practice. 10th ed. Philadelphia (PA): Lippincott, Williams & Wilkins; 2017.
4. Moser A, van der Bruggen H, Widdershoven G. Competency in shaping one's life: Autonomy of people with type 2 diabetes mellitus in a nurse-led, shared-care setting; a qualitative study. Int J Nurs Stud. 2006;43:417–427.
5. Moser A, Korstjens I, van der Weijden T, et al. Patient's decision making in selecting a hospital for elective orthopaedic surgery. J Eval Clin Pract. 2010;16:1262–1268.
6. Bonevski B, Randell M, Paul C, et al. Reaching the hard-to-reach: a systematic review of strategies for improving health and medical research with socially disadvantaged groups. BMC Med Res Methodol. 2014;14:42.
7. Brinkmann S, Kvale S. Interviews: Learning the craft of qualitative research interviewing. 3rd ed. London (UK): Sage; 2014.
8. Krueger R, Casey M. Focus groups: A practical guide for applied research. Thousand Oaks (CA): Sage; 2015.
9. Kallio H, Pietilä AM, Johnson M, et al. Systematic methodological review: developing a framework for a qualitative semi-structured interview guide. J Adv Nurs. 2016;72:2954–2965.
10. Salmons J. Qualitative online interviews. 2nd ed. London (UK): Sage; 2015.
11. Elo S, Kyngäs H. The qualitative content analysis process. J Adv Nurs. 2008;62:107–115.
12. Sandelowski M. Whatever happened to qualitative description? Res Nurs Health. 2000;23:334–340.
13. Korstjens I, Moser A. Series: Practical guidance to qualitative research. Part 4: Trustworthiness and publishing. Eur J Gen Pract. 2018;24. DOI: 10.1080/13814788.2017.1375092
Chapter 4
SERIES: PRACTICAL GUIDANCE TO QUALITATIVE RESEARCH. PART 4: TRUSTWORTHINESS AND PUBLISHING
ABSTRACT
In the course of our supervisory work over the years we have noticed that
qualitative research tends to evoke a lot of questions and worries, so-called
frequently asked questions (FAQs). This series of four articles intends to
provide novice researchers with practical guidance for conducting high-
Citation: (APA): Korstjens, I., & Moser, A. (2018). Series: Practical guidance to qualitative research. Part 4: Trustworthiness and publishing. European Journal of General Practice, 24(1), 120-124. (6 pages)
Copyright: © 2018 The Author(s). Published by Informa UK Limited, trading as Taylor
& Francis Group. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).
INTRODUCTION
This article is the fourth and last in a series of four articles aiming to provide
practical guidance for qualitative research. In an introductory paper, we
have described the objective, nature and outline of the series [1]. Part 2 of
the series focused on context, research questions and design of qualitative
research [2], whereas Part 3 concerned sampling, data collection and analysis
[3]. In this paper, Part 4, we address frequently asked questions (FAQs) about
two overarching themes: trustworthiness and publishing.
TRUSTWORTHINESS
Credibility: The confidence that can be placed in the truth of the research findings. Credibility establishes whether the research findings represent plausible information drawn from the participants’ original data and is a correct interpretation of the participants’ original views.
Transferability: The degree to which the results of qualitative research can be transferred to other contexts or settings with other respondents. The researcher facilitates the transferability judgment by a potential user through thick description.
Dependability: The stability of findings over time. Dependability involves participants’ evaluation of the findings, interpretation and recommendations of the study such that all are supported by the data as received from participants of the study.
Confirmability: The degree to which the findings of the research study could be confirmed by other researchers. Confirmability is concerned with establishing that data and interpretations of the findings are not figments of the inquirer’s imagination, but clearly derived from the data.
Reflexivity: The process of critical self-reflection about oneself as researcher (own biases, preferences, preconceptions), and the research relationship (relationship to the respondent, and how the relationship affects participant’s answers to questions).
Dependability and confirmability (Audit trail): Transparently describing the research steps taken from the start of a research project to the development and reporting of the findings. The records of the research path are kept throughout the study.
Reflexivity (Diary): Examining one’s own conceptual lens, explicit and implicit assumptions, preconceptions and values, and how these affect research decisions in all phases of qualitative studies.
also reviewed the analysis, i.e. the descriptive, axial and selective codes, to
see whether they followed from the data (raw data, analysis notes, coding notes, process notes, and report) and were grounded in the data. The auditor who
performed the dependability and confirmability audit was not part of the
research team but an expert in grounded theory. The audit report was shared
with all members of the research team.
PUBLISHING
participants in tables or running text, and you are likely to use boxes to
present your interview guide or questioning route, or an overview of the
main findings in categories, subcategories and themes. Most of your article
is running text, providing a balanced presentation. You provide a thick
description of the participants and the context, transparently describe and
reflect on your methods, and do justice to the richness of your qualitative
findings in reporting, interpreting and discussing them. Thus, the Methods
and Findings sections will be much longer than in a quantitative paper.
The difference between reporting quantitative and qualitative research
becomes most visible in the Results section. Quantitative articles have a
strict division between the Results section, which presents the evidence,
and the Discussion section. In contrast, the Findings section in qualitative
papers consists mostly of synthesis and interpretation, often with links to
empirical data. Quantitative and qualitative researchers alike, however,
need to be concise in presenting the main findings to answer the research
question, and avoid distractions. Therefore, you need to make choices to
provide a comprehensive and balanced representation of your findings. Your
main findings may consist, for example, of interpretations, relationships
and themes, and your Findings section might include the development of a
theory or model, or integration with earlier research or theory. You present
evidence to substantiate your analytic findings. You use quotes or citations
in the text, or field notes, text excerpts or photographs in boxes to illustrate
and visualize the variety and richness of the findings.
Before you start preparing your article, it is wise to examine first the
journal of your choice. You need to check its guidelines for authors and
recommended sources for reference style, ethics, etc., as well as recently
accepted qualitative manuscripts. More and more journals also refer to
quality criteria lists for reporting qualitative research, and ask you to upload
the checklist with your submission. Two of these checklists are available
at http://www.equator-network.org/reporting-guidelines.
enables you to find out how open these journals are to publishing qualitative
research and accepting articles with different designs, structures and lengths.
If you are unsure whether the journal of your choice would accept qualitative
research, you might contact the Editor in Chief. Lastly, you might look in
your top three journals for qualitative articles, and try to decide how your
manuscript would fit in. The author guidelines and examples of manuscripts
will support you during your writing, and your top three offers alternatives
in case you need to turn to another journal.
ACKNOWLEDGEMENTS
The authors wish to thank the following junior researchers who have been
participating for the last few years in the so-called ‘Think tank on qualitative
research’ project, a collaborative project between Zuyd University of Applied
Sciences and Maastricht University, for their pertinent questions: Erica
Baarends, Jerome van Dongen, Jolanda Friesen-Storms, Steffy Lenzen,
Ankie Hoefnagels, Barbara Piskur, Claudia van Putten-Gamel, Wilma
Savelberg, Steffy Stans, and Anita Stevens. The authors are grateful to Isabel
van Helmond, Joyce Molenaar and Darcy Ummels for proofreading our
manuscripts and providing valuable feedback from the ‘novice perspective’.
REFERENCES
1. Moser A, Korstjens I. Series: Practical guidance to qualitative research. Part 1: Introduction. Eur J Gen Pract. 2017;23:271–273.
2. Korstjens I, Moser A. Series: Practical guidance to qualitative research. Part 2: Context, research questions and designs. Eur J Gen Pract. 2017;23:274–279.
3. Moser A, Korstjens I. Series: Practical guidance to qualitative research. Part 3: Sampling, data collection and analysis. Eur J Gen Pract. 2018;24. DOI: 10.1080/13814788.2017.1375091
4. Lincoln YS, Guba EG. Naturalistic inquiry. California: Sage Publications; 1985.
5. Tracy SJ. Qualitative quality: eight ‘big-tent’ criteria for excellent qualitative research. Qual Inq. 2010;16:837–851.
6. Moser A, van der Bruggen H, Widdershoven G, et al. Self-management of type 2 diabetes mellitus: a qualitative investigation from the perspective of participants in a nurse-led, shared-care programme in the Netherlands. BMC Public Health. 2008;8:91.
7. Sim J, Sharp K. A critical appraisal of the role of triangulation in nursing research. Int J Nurs Stud. 1998;35:23–31.
8. Mauthner NS, Doucet A. Reflexive accounts and accounts of reflexivity in qualitative data analysis. Sociology. 2003;37:413–431.
9. O’Brien BC, Harris IB, Beckman TJ, et al. Standards for reporting qualitative research: a synthesis of recommendations. Acad Med. 2014;89:1245–1251.
10. Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int J Qual Health Care. 2007;19:349–357.
Chapter 5
PARTICIPANT OBSERVATION AS A DATA COLLECTION METHOD
Barbara B. Kawulich
University of West Georgia, Educational Leadership and Professional Studies Department, 1601 Maple Street, Room 153, Education Annex, Carrollton, GA 30118, USA
ABSTRACT
Observation, particularly participant observation, has been used in a
variety of disciplines as a tool for collecting data about people, processes,
and cultures in qualitative research. This paper provides a look at various
definitions of participant observation, the history of its use, the purposes
for which it is used, the stances of the observer, and when, what, and how
to observe. Information on keeping field notes and writing them up is also
discussed, along with some exercises for teaching observation techniques to
researchers-in-training.
INTRODUCTION
Participant observation, for many years, has been a hallmark of both
anthropological and sociological studies. In recent years, the field of
education has seen an increase in the number of qualitative studies that include
participant observation as a way to collect information. Qualitative methods
of data collection, such as interviewing, observation, and document analysis,
have been included under the umbrella term of “ethnographic methods” in
recent years. The purpose of this paper is to discuss observation, particularly
participant observation, as a tool for collecting data in qualitative research
studies. Aspects of observation discussed herein include various definitions
of participant observation, some history of its use, the purposes for which
such observation is used, the stances or roles of the observer, and additional
information about when, what, and how to observe. Further information is
provided to address keeping field notes and their use in writing up the final
story. [1]
DEFINITIONS
MARSHALL and ROSSMAN (1989) define observation as “the systematic
description of events, behaviors, and artifacts in the social setting chosen
for study” (p.79). Observations enable the researcher to describe existing
situations using the five senses, providing a “written photograph” of the
situation under study (ERLANDSON, HARRIS, SKIPPER, & ALLEN,
1993). DeMUNCK and SOBO (1998) describe participant observation as
the primary method used by anthropologists doing fieldwork. Fieldwork
involves “active looking, improving memory, informal interviewing, writing
detailed field notes, and perhaps most importantly, patience” (DeWALT
& DeWALT, 2002, p.vii). Participant observation is the process enabling
researchers to learn about the activities of the people under study in the
natural setting through observing and participating in those activities. It
provides the context for development of sampling guidelines and interview
guides (DeWALT & DeWALT, 2002). SCHENSUL, SCHENSUL, and
LeCOMPTE (1999) define participant observation as “the process of
learning through exposure to or involvement in the day-to-day or routine
activities of participants in the researcher setting” (p.91). [2]
method for over a century. As DeWALT and DeWALT (2002) relate it,
one of the first instances of its use involved the work of Frank Hamilton
CUSHING, who spent four and a half years as a participant observer with the
Zuni Pueblo people around 1879 in a study for the Smithsonian Institution’s
Bureau of Ethnology. During this time, CUSHING learned the language,
participated in the customs, was adopted by a pueblo, and was initiated into
the priesthood. Because he did not publish extensively about this culture, he
was criticized as having gone native, meaning that he had lost his objectivity
and, therefore, his ability to write analytically about the culture. My own
experience conducting research in indigenous communities, which began
about ten years ago with my own ethnographic doctoral dissertation on
Muscogee (Creek) women’s perceptions of work (KAWULICH, 1998)
and has continued in the years since (i.e., KAWULICH, 2004), leads me to
believe that, while this may have been the case, it is also possible that he
held the Zuni people in such high esteem that he felt it impolitic or irreverent
to do so. In my own research, I have been hesitant to write about religious
ceremonies or other aspects of indigenous culture that I have observed,
for example, for fear of relating information that my participants or other
community members might feel should not be shared. When I first began
conducting my ethnographic study of the Muscogee culture, I was made
aware of several incidents in which researchers were perceived to have
taken information they had obtained through interviews or observations and
had published their findings without permission of the Creek people or done
so without giving proper credit to the participants who had shared their lives
with the researchers. [5]
A short time later, in 1888, Beatrice Potter WEBB studied poor
neighborhoods during the day and returned to her privileged lifestyle
at night. She took a job as a rent collector to interact with the people in
buildings and offices and took a job as a seamstress in a sweatshop to better
understand their lives. Then, in the early 1920s, MALINOWSKI studied
and wrote about his participation and observation of the Trobriands, a
study BERNARD (1998) calls one of the most cited early discussions of
anthropological data collection methods. Around the same time, Margaret
MEAD studied the lives of adolescent Samoan girls. MEAD’s approach
to data collection differed from that of her mentor, anthropologist Franz
BOAS, who emphasized the use of historical texts and materials to document
disappearing native cultures. Instead, MEAD participated in the living culture
to record their cultural activities, focusing on specific activities, rather than
participating in the activities of the culture overall as did MALINOWSKI.
Limitations of Observation
Several researchers have noted the limitations involved with using
observations as a tool for data collection. For example, DeWALT and
DeWALT (2002) note that male and female researchers have access to
different information, as they have access to different people, settings,
and bodies of knowledge. Participant observation is conducted by a biased
human who serves as the instrument for data collection; the researcher must
understand how his/her gender, sexuality, ethnicity, class, and theoretical
approach may affect observation, analysis, and interpretation. [16]
SCHENSUL, SCHENSUL, and LeCOMPTE (1999) refer to
participation as meaning almost total immersion in an unfamiliar culture
to study others’ lives through the researcher’s participation as a full-time
resident or member, though they point out that most observers are not full
participants in community life. There are a number of things that affect
whether the researcher is accepted in the community, including one’s
appearance, ethnicity, age, gender, and class, for example. Another factor
they mention that may inhibit one’s acceptance relates to what they call the
structural characteristics—that is, those mores that exist in the community
regarding interaction and behavior (p.93). Some of the reasons they mention
for a researcher’s not being included in activities include a lack of trust, the
community’s discomfort with having an outsider there, potential danger to
either the community or the researcher, and the community’s lack of funds to
further support the researcher in the research. Some of the ways the researcher
might be excluded include the community members’ use of a language that
is unfamiliar to the researcher, their changing from one language to another
that is not understood by the researcher, their changing the subject when the
researcher arrives, their refusal to answer certain questions, their moving
away from the researcher to talk out of ear shot, or their failure to invite the
researcher to social events. [17]
SCHENSUL, SCHENSUL, and LeCOMPTE further point out that all
researchers should expect to experience a feeling of having been excluded
at some point in the research process, particularly in the beginning. The
important thing, they note, is for the researcher to recognize what that
exclusion means to the research process and that, after the researcher has
been in the community for a while, the community is likely to have accepted
the researcher to some degree. [18]
Another limitation involved in conducting observations is noted
by DeWALT, DeWALT, and WAYLAND (1998). The researcher must
determine to what extent he/she will participate in the lives of the participants
and whether to intervene in a situation. Another potential limitation they
mention is that of researcher bias. They note that, unless ethnographers
use other methods than just participant observation, there is likelihood that
they will fail to report the negative aspects of the cultural members. They
encourage the novice researcher to practice reflexivity at the beginning
of one’s research to help him/her understand the biases he/she has that
may interfere with correct interpretation of what is observed. Researcher
bias is one of the aspects of qualitative research that has led to the view
that qualitative research is subjective, rather than objective. According to
RATNER (2002), some qualitative researchers believe that one cannot be
both objective and subjective, while others believe that the two can coexist,
that one’s subjectivity can facilitate understanding the world of others. He
notes that, when one reflects on one’s biases, he/she can then recognize
those biases that may distort understanding and replace them with those that
help him/her to be more objective. In this way, he suggests, the researcher is
being respectful of the participants by using a variety of methods to ensure
that what he/she thinks is being said, in fact, matches the understanding of
the participant. BREUER and ROTH (2003) use a variety of methods for
knowledge production, including, for example, positioning or various points
of view, different frames of reference, such as spatial or temporal relativity,
perceptual schemata based on experience, and interaction with the social
context—understanding that any interaction changes the observed object.
Ethics
A primary consideration in any research study is to conduct the research
in an ethical manner, letting the community know that one’s purpose for
observing is to document their activities. While there may be instances
where covert observation methods might be appropriate, these situations are
few and are suspect. DeWALT, DeWALT, and WAYLAND (1998) advise
the researcher to take some of the field notes publicly to reinforce that what
the researcher is doing is collecting data for research purposes. When the
researcher meets community members for the first time, he/she should
be sure to inform them of the purpose for being there, sharing sufficient
information with them about the research topic that their questions about the
research and the researcher’s presence there are put to rest. This means that
one is constantly introducing oneself as a researcher. [31]
Another ethical responsibility is to preserve the anonymity of the
participants in the final write-up and in field notes to prevent their
identification, should the field notes be subpoenaed for inspection. Individual
identities must be described in ways that community members will not be
able to identify the participants. Several years ago, when I submitted an
article for publication, one of the reviewers provided feedback that it would
be helpful to the reader if I described the participants as, for example, “a
35 year old divorced mother of three, who worked at Wal-Mart.” This level
of detail was not a feasible option for me in providing a description of
individual participants, as it would have been easy for the local community
members to identify these participants from such specific detail; this was a
small community where everyone knew everyone else, and they would have
known who the woman was. Instead, I only provided broad descriptions that
lacked specific details, such as “a woman in her thirties who worked in the
retail industry.” [32]
DeWALT, DeWALT, and WAYLAND also point out that there is an
ethical concern regarding the relationships established by the researcher
when conducting participant observation; the researcher needs to develop
close relationships, yet those relationships are difficult to maintain, when
the researcher returns to his/her home at a distant location. It is typical
for researchers who spend an extended period of time in a community to
establish friendships or other relationships, some of which may extend over a
lifetime; others are transient and extend only for the duration of the research
study. Particularly when conducting cross-cultural research, it is necessary
to have an understanding of cultural norms that exist. As MARSHALL and
BATTEN (2004) note, one must address issues, such as potential exploitation
and inaccuracy of findings, or other actions which may cause damage to the
community. They suggest that the researcher take a participatory approach
to research by including community members in the research process,
beginning with obtaining culturally appropriate permission to conduct
research and ensuring that the research addresses issues of importance to the
community. They further suggest that the research findings be shared with
the community to ensure accuracy of findings. In my own ongoing research
projects with the Muscogee (Creek) people, I have maintained relationships
with many of the people, including tribal leaders, tribal administrators, and
council members, and have shared the findings with selected tribal members
to check my findings. Further, I have given them copies of my work for their
library. I, too, have found that, by taking a participatory approach to my
research with them, I have been asked to participate in studies that they wish
to have conducted. [33]
be people who are respected by other cultural members and who are viewed
to be neutral, to enable the researcher to meet informants in all of the various
factions found in the culture. [36]
The researcher also should become familiar with the setting and social
organization of the culture. This may involve mapping out the setting or
developing social networks to help the researcher understand the situation.
These activities also are useful for enabling the researcher to know what to
observe and from whom to gather information. [37]
“Hanging out” is the process through which the researcher gains trust
and establishes rapport with participants (BERNARD, 1994). DeMUNCK
and SOBO (1998) state that, “only through hanging out do a majority of
villagers get an opportunity to watch, meet, and get to know you outside your
‘professional’ role” (p.41). This process of hanging out involves meeting and
conversing with people to develop relationships over an extended period
of time. There are three stages to the hanging out process, moving from a
position of formal, ignorant intruder to welcome, knowledgeable intimate
(DeMUNCK & SOBO). The first stage is the stage at which the researcher
is a stranger who is learning the social rules and language, making herself/
himself known to the community, so they will begin to teach her/him how
to behave appropriately in that culture. In the second stage, one begins to
merge with the crowd and stand out less as an intruder, what DeMUNCK
and SOBO call the “acquaintance” stage. During this stage, the language
becomes more familiar to the researcher, but he/she still may not be fluent
in its use. The third stage they mention is called the “intimate” stage, during
which the researcher has established relationships with cultural participants
to the extent that he/she no longer has to think about what he/she says, but
is as comfortable with the interaction as the participants are with her/him
being there. There is more to participant observation than just hanging out.
It sometimes involves the researcher’s working with and participating in
everyday activities beside participants in their daily lives. It also involves
taking field notes of observations and interpretations. Included in this
fieldwork is persistent observation and intermittent questioning to gain
clarification of meaning of activities. [38]
Rapport is built over time; it involves establishing a trusting relationship
with the community, so that the cultural members feel secure in sharing
sensitive information with the researcher to the extent that they feel assured
that the information gathered and reported will be presented accurately and
dependably. Rapport-building involves active listening, showing respect and
KUTSCHE suggests that the researcher visit the setting under study at
different times of the day to see how it is used differently at different times
of the day/night. He/she should describe without judgment and avoid using
meaningless adjectives, such as “older” (older than what/whom?) or “pretty”
(as compared to what/whom?); use adjectives that help to describe the
various aspects of the setting meaningfully (what is it that makes the house
inviting?). When one succeeds in avoiding judgment, he/she is practicing
cultural relativism. This mapping process uses only one of the five senses—
vision. “Human events happen in particular places, weathers, times, and so
forth. If you are intrigued, you will be pleased to know that what you are
doing is a subdiscipline of anthropology called cultural ecology” (p.16). It
involves looking at the interaction of the participants with the environment.
STEWARD (1955, as cited in KUTSCHE, 1998), a student of KROEBER
(1939, as cited in KUTSCHE, 1998), who wrote about Native American
adaptations to North American environments, developed a theory called
“multilinear evolution” in which he described how cultural traditions evolve
related to specific environments.
“Cultural systems are not just rules for behavior, ways of surviving,
or straitjackets to constrict free expression ... All cultures, no matter how
simple or sophisticated, are also rhythms, music, architecture, the dances of
living. ... To look at culture as style is to look at ritual” (p.49). [58]
KUTSCHE refers to ritual as being the symbolic representation of the
sentiments in a situation, where the situation involves person, place, time,
conception, thing, or occasion. Some of the examples of cultural rituals
KUTSCHE presents for analysis include rites of deference or rites of
passage. Ritual and habit are different, KUTSCHE explains, in that habits
have no symbolic expression or meaning (such as tying one’s shoes in the
same way each time). [59]
In mapping out the setting being observed, SCHENSUL, SCHENSUL,
and LeCOMPTE (1999) suggest the following be included:
• a count of attendees, including such demographics as age, gender,
and race;
• a physical map of the setting and description of the physical
surroundings;
• a portrayal of where participants are positioned over time;
• a description of the activities being observed, detailing activities
of interest. [60]
They indicate that counting, census taking, and mapping are important
ways to help the researcher gain a better understanding of the social setting
in the early stages of participation, particularly when the researcher is not
fluent in the language and has few key informants in the community. [61]
Social differences they mention that are readily observed include
differences among individuals, families, or groups by educational level,
type of employment, and income. Things to look for include the cultural
members’ manner of dress and decorative accoutrements, leisure activities,
speech patterns, place of residence and choice of transportation. They also
add that one might look for differences in housing structure or payment
structure for goods or services. [62]
Field notes are the primary way of capturing the data that is collected from
participant observations. Notes taken to capture this data include records
of what is observed, including informal conversations with participants,
records of activities and ceremonies, during which the researcher is unable
to question participants about their activities, and journal notes that are kept
on a daily basis. DeWALT, DeWALT, and WAYLAND describe field notes
as both data and analysis, as the notes provide an accurate description of
what is observed and are the product of the observation process. As they
note, observations are not data unless they are recorded into field notes. [63]
DeMUNCK and SOBO (1998) advocate using two notebooks for
keeping field notes, one with questions to be answered, the other with more
personal observations that may not fit the topics covered in the first notebook.
They do this to alleviate the clutter of extraneous information that can occur when taking field notes. Notes in the first notebook should include jottings, maps, diagrams, interview notes, and observations. In the second notebook, they
suggest keeping memos, casual “mullings, questions, comments, quirky
notes, and diary type entries” (p.45). One can find information in the notes
easily by indexing and cross-referencing information from both notebooks
by noting on index cards such information as “conflicts, gender, jokes,
religion, marriage, kinship, men’s activities, women’s activities, and so on”
(p.45). They summarize each day’s notes and index them by notebook, page
number, and a short identifying description. [64]
The feelings, thoughts, and suppositions of the researcher may be noted
separately. SCHENSUL, SCHENSUL, and LeCOMPTE (1999) note that
good field notes:
• use exact quotes when possible;
• use pseudonyms to protect confidentiality;
and take notes as they take pictures to help them keep the photos organized
in the right sequence. Several students have indicated that this was a fun
exercise in which their children, who were the participants in the activity,
were delighted to be involved; they also noted that this provided them with
a pictographic recollection of a part of their children’s lives that would be
a keepsake. One student recorded her 6 year old daughter’s first formal tea
party, for example. [77]
Direct Observation—In this instance, students are asked to find a
setting they wish to observe in which they will be able to observe without
interruption and in which they will not be participating. For some specified
length of time (about 15 to 30 minutes), they are asked to record everything
they can take in through their senses about that setting and the interactions
contained therein for the duration of the time period, again recording on one
side of the paper their field notes from observation and on the other side their
thoughts, feelings, and ideas about what is happening. Part of the lesson here
is that, when researchers are recording aspects of the observation, whether
it be the physical characteristics of the setting or interactions between
participants, they are unable to both observe and record. This exercise is
also good practice for getting them to write detailed notes about what is or
is not happening, about the physical surroundings, and about interactions,
particularly conversations and the nonverbal behaviors that go along with
those conversations. [78]
Participant Observation—Students are asked to participate in some
activity that takes at least 2 hours, during which they are not allowed to
take any notes. Having a few friends or family members over for dinner is
a good example of a situation where they must participate without taking
notes. In this situation, the students must periodically review what they want
to remember. They are instructed to remember as much as possible, then
record their recollections in as much detail as they can remember as soon as
possible after the activity ends. Students are cautioned not to talk to anyone
or drink too much, so their recollections will be unaltered. The lesson here
is that they must consciously try to remember bits of conversation and other
details in chronological order. [79]
When comparing their field notes from direct observation to participant
observation, the students may find that their notes from direct observation
(without participation) are more detailed and lengthy than with participant
observation; however, through participation, there is more involvement in
the activities under study, so there is likely to be better interpretation of
what happened and why. They also may find that participant observation
lends itself better to recollecting information at a later time than direct
observation. [80]
SUMMARY
Participant observation involves the researcher taking part in a variety of activities over an extended period of time, enabling him/her to observe the cultural members in their daily lives and to participate in their activities in order to develop a better understanding of those behaviors and activities. The
process of conducting this type of field work involves gaining entry into
the community, selecting gatekeepers and key informants, participating in
as many different activities as are allowable by the community members,
clarifying one’s findings through member checks, formal interviews, and
informal conversations, and keeping organized, structured field notes
to facilitate the development of a narrative that explains various cultural
aspects to the reader. Participant observation is used as a mainstay in field
work in a variety of disciplines, and, as such, has proven to be a beneficial
tool for producing studies that provide accurate representation of a culture.
This paper, while not wholly inclusive of all that has been written about this
type of field work, presents an overview of what is known about
it, including its various definitions, history, and purposes, the stances of the
researcher, and information about how to conduct observations in the field.
[81]
Notes
1) Validity is a term typically associated with quantitative research;
however, when viewed in terms of its meaning of reflecting what
is purported to be measured/observed, its use is appropriate.
Validity in this instance may refer to context validity, face validity
or trustworthiness as described by LINCOLN and GUBA (1994).
2) Many years after MEAD studied the Samoan girls, FREEMAN
replicated MEAD’s study and derived different interpretations.
FREEMAN’s study suggested that MEAD’s informants had
misled her by telling her what they wanted her to believe, rather
than what was truthful about their activities.
REFERENCES
1. Adler, Patricia A. & Adler, Peter (1987). Membership roles in field
research. Newbury Park: Sage.
2. Adler, Patricia A. & Adler, Peter (1994). Observation techniques.
In Norman K. Denzin & Yvonna S. Lincoln (Eds.), Handbook of
qualitative research (pp.377-392). Thousand Oaks, CA: Sage.
3. Agar, Michael H. (1980). The professional stranger: an informal
introduction to ethnography. San Diego: Academic Press.
4. Angrosino, Michael V. & Mays dePerez, Kimberly A. (2000). Rethinking
observation: From method to context. In Norman K. Denzin & Yvonna
S. Lincoln (Eds.), Handbook of Qualitative Research (second edition,
pp.673-702), Thousand Oaks, CA: Sage.
5. Bernard, H. Russell (1994). Research methods in anthropology:
qualitative and quantitative approaches (second edition). Walnut
Creek, CA: AltaMira Press.
6. Bernard, H. Russell (Ed.) (1998). Handbook of methods in cultural
anthropology. Walnut Creek: AltaMira Press.
7. Breuer, Franz & Roth, Wolff-Michael (2003, May). Subjectivity and
reflexivity in the social sciences: epistemic windows and methodical
consequences [30 paragraphs]. Forum Qualitative Sozialforschung /
Forum: Qualitative Social Research [On-line Journal], 4(2), Art.25.
Available at http://www.qualitative-research.net/fqs-texte/2-03/2-
03intro-3-e.htm [April 5, 2005].
8. deMunck, Victor C. & Sobo, Elisa J. (Eds) (1998). Using methods in
the field: a practical introduction and casebook. Walnut Creek, CA:
AltaMira Press.
9. DeWalt, Kathleen M. & DeWalt, Billie R. (1998). Participant
observation. In H. Russell Bernard (Ed.), Handbook of methods in
cultural anthropology (pp.259-300). Walnut Creek: AltaMira Press.
10. DeWalt, Kathleen M. & DeWalt, Billie R. (2002). Participant
observation: a guide for fieldworkers. Walnut Creek, CA: AltaMira
Press.
11. Ellis, Carolyn (2003, May). Grave tending: with mom at the cemetery [8
paragraphs]. Forum Qualitative Sozialforschung / Forum: Qualitative
Social research [On-line Journal], 4(2), Art.28. Available at http://
www.qualitative-research.net/fqs-texte/2-03/2-03ellis-e.htm [April 5,
2005].
12. Erlandson, David A.; Harris, Edward L.; Skipper, Barbara L. & Allen,
Steve D. (1993). Doing naturalistic inquiry: a guide to methods.
Newbury Park, CA: Sage.
13. Fine, Gary A. (2003). Towards a peopled ethnography: Developing theory from group life. Ethnography, 4(1), 41-60.
14. Gaitan, Alfredo (2000, November). Exploring alternative forms
of writing ethnography. Review Essay: Carolyn Ellis and Arthur
Bochner (Eds.) (1996). Composing ethnography: Alternative forms of
qualitative writing [9 paragraphs]. Forum Qualitative Sozialforschung
/ Forum: Qualitative Social Research [On-line Journal], 1(3), Art.42.
Available at: http://www.qualitative-research.net/fqs-texte/3-00/3-
00review-gaitan-e.htm [April 5, 2005].
15. Gans, Herbert J. (1999). Participant observation in the era of
“ethnography.” Journal of Contemporary Ethnography, 28(5), 540-
548.
16. Geertz, Clifford (1973). Thick description: Towards an interpretive
theory of culture. In Clifford Geertz (Ed.), The interpretation of
cultures (pp.3-32). New York: Basic Books.
17. Glantz, Jeffrey & Sullivan, Susan (2000). Supervision in practice: 3
Steps to improving teaching and learning. Corwin Press, Inc.
18. Glickman, Carl D.; Gordon, Stephen P. & Ross-Gordon, Jovita
(1998). Supervision of instruction (fourth edition). Boston: Allyn &
Bacon.
19. Gold, Raymond L. (1958). Roles in sociological field observations. Social
Forces, 36, 217-223.
20. Holman Jones, Stacy (2004, September). Building connections in
qualitative research. Carolyn Ellis and Art Bochner in conversation
with Stacy Holman Jones [113 paragraphs]. Forum Qualitative
Sozialforschung / Forum: Qualitative Social Research [On-line
Journal], 5(3), Art.28. Available at http://www.qualitative-research.
net/fqs-texte/3-04/04-3-28-e.htm [April 5, 2005].
21. Johnson, Allen & Sackett, Ross (1998). Direct systematic observation
of behavior. In H. Russell Bernard (Ed.), Handbook of methods in
cultural anthropology (pp.301-332). Walnut Creek: AltaMira Press.
22. Kawulich, Barbara B. (1998). Muscogee (Creek) women’s perceptions
of work (Unpublished doctoral dissertation, Georgia State University).
Chapter 6

ATTITUDES TOWARDS PARTICIPATION IN A PASSIVE DATA COLLECTION EXPERIMENT

Budapest, Hungary
3 Department of Network and Data Science, Central European University, Quellenstraße 51, 1100 Vienna, Austria
4 Faculty of Social Sciences, Eötvös Loránd University of Sciences, Pázmány Péter Sétány 1/A, 1117 Budapest, Hungary
ABSTRACT
In this paper, we present the results of an exploratory study conducted
in Hungary using a factorial design-based online survey to explore the
Citation: (APA): Ságvári, B., Gulyás, A., & Koltai, J. (2021). Attitudes towards Par-
ticipation in a Passive Data Collection Experiment. Sensors, 21(18), 6085. (18 pages)
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is
an open access article distributed under the terms and conditions of the Creative Com-
mons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
INTRODUCTION
Smartphone technologies combined with the improvement of cloud-based
research architecture offer great opportunities in the social sciences. The most
common methodology in the social sciences is still the use of surveys and
other approaches that require the active participation of research subjects.
However, there are some areas that are best researched not through surveys,
but rather by observing individuals’ behaviour in a continuous social
experiment. Mobile technologies make it possible to observe behaviour on a
new level by using raw data of various kinds collected by our most common
everyday companion: our smartphone. Moreover, since smartphones shape
our daily lives thanks to various actions available through countless apps, it
is logical to consider them as a platform for actual research.
There have been numerous research projects that have relied on
collecting participants’ mobile sensor and app usage data, but the biggest
concern has been the willingness to share this data. Privacy and trust
BACKGROUND
- in this case, the researching institution. The participants must give their
explicit consent for their data to be collected and transferred from their
device to a location unknown to them. Similarly, the researching institution
must ensure proper handling of the data and is responsible to the participants
for the security of their data.
Several studies have found that people are generally reluctant to share
their data when it comes to some form of passive data collection [21,22,23],
mostly due to privacy concerns. However, people who frequently use
their smartphones are less likely to have such concerns [23]. Over the past
decade, the amount of data collected by various organizations has increased
dramatically. This includes companies with whom users share their data
with their consent [24], although users are probably unaware of how much data they are sharing and exactly how it is exploited for commercial purposes.
Several studies have found that people are much more likely to share
data when they are actively engaged in the process (e.g., sending surveys,
taking photos, etc.) than when they passively share sensor data [15,23].
This lower participation rate is influenced by numerous factors, so people’s
willingness to share data is itself an interesting research question.
Incentives
Another way to improve willingness to participate (WTP) is to provide monetary incentives for
participation. Haas et al. focused their analysis on different types of
incentives paid at different points in an experiment [33]. Incentives can be
given in different time frames for different activities of the participants. In
terms of time frame, it is common to offer incentives for installing a research application or at the end of the survey, but it is also possible to offer recurring
incentives. Another option for incentives is to offer them based on “tiers” of
how much data participants are willing to provide.
In their study, Haas et al. also examined the impact of incentives on
the installation and rewarded sharing of various sensor data. There was a
positive effect of initial incentives, but interestingly, they did not find the
expected positive effect of incentives on granting access to more data-sharing
functions. Another interesting finding was that a higher overall incentive did
not increase participants’ willingness to have the application installed over a
longer experimental period.
In addition to these findings, their overall conclusion was that incentives improve participation in much the same way as in regular survey studies. The
results of Keusch et al. also support this finding [21].
Other Factors
Keusch et al. [21] found that a shorter experimental period (one month as
opposed to six months) and monetary incentives increased willingness to
participate in a study. As another incentive, Struminskaya et al. [32] found
that genuine interest in the research topic (e.g., participants receiving feedback on research findings) is also a positive factor for an increased level of participation.
Finally, participants’ limited ability to use devices was also found to be a
factor in the study by Wenz et al. They found that individuals who rated their
own usage abilities as below average (below 3 on a 5-point scale) showed
a significantly lower willingness to participate, especially in passive data
collection tasks [23]. On the other hand, those who reported advanced phone
use skills were much more willing to participate in such tasks.
Although not necessarily related to age, Mulder and de Bruijne found in
another study that willingness to participate decreased dramatically after age
50 [15]. These results indicate that usability is important when designing a
research application.
As these results show, there are many details to analyse when designing
an experiment that relies on passive data collection. Some of the studies used
surveys to uncover various latent characteristics that influence willingness
to participate, while others deployed a working research application to gather practical usage information.
Given that many studies reported low WTP scores, we concluded that it
is very important to conduct a preliminary study before elaborating the final
design of such an experiment. Therefore, the goal of this work is to figure
out how we can implement a research tool that motivates participation in the study while still collecting a useful amount of information.
Research Questions
Because the focus of our study is exploratory in nature, we did not formulate
explicit research hypotheses, but designed our models and the survey to be
able to answer the following questions:
• Q1.
What is the general level of WTP in a passive data collection study?
In order to have a single benchmark and to provide a comparison with similar studies, we asked a simple question about whether or not respondents would be willing to participate in a study built on smartphone-based passive data collection.
• Q2.
What features of the research design would motivate people to participate
in the study?
We included several questions in our survey that address key features
of the study: the type of institute conducting the experiment, type of data
collected, length of the study, monetary incentives, and control over data
collection. We wanted to know which of these features should be emphasized
to maximize WTP.
• Q3.
What kind of demographic attributes influence WTP?
As mentioned in previous studies, age may be an important factor for
participation in our study, but we also considered other characteristics, such
as gender, education, type of settlement, and geographic region of residence.
• Q4.
What is the role of trust-, skills-, and privacy-related contextual factors in WTP?
As previous results suggest, trust, previous (negative) experiences and
privacy concerns might be key issues in how people react to various data
collection techniques. We used composite indicators to measure the effect
of interpersonal and institutional trust, smartphone skills and usage, and
general concerns over active and passive data collection methods on WTP.
Six dimensions were varied in the vignettes with the following values:
• The organizer of the research: (1) decision-makers, (2) a private
company, (3) scientific research institute.
• Data collected: (1) spatial movement, (2) mobile usage, (3)
communication habits, (4) spatial movement & mobile usage, (5)
spatial movement & communication habits, (6) mobile usage &
communication habits, (7) all three.
• Length of the research: (1) one month, (2) six months.
• Incentive: (1) HUF 5000 after installing the application, (2)
HUF 5000 after the completion of the study, (3) HUF 5000 after
installing the application and HUF 5000 after the completion of
the study.
• Interruption and control: (1) user cannot interrupt the data
collection, (2) user can temporarily interrupt the data collection,
(3) user can temporarily interrupt the data collection and review
the data and authorize its transfer.
Following Jasso, the creation of the vignettes proceeded as follows [35]:
First, we created a “universe” in which all combinations of the dimensions
described above were present, which included 378 different situations. From
these 378 situations, we randomly selected 150 and assigned them, also
randomly, to 15 different vignette blocks, which we call decks. Here, each
deck included 10 different vignettes and an additional control vignette to
test the internal validity of the experiment. The content of this last vignette
was the same as a randomly selected vignette from the first nine items
previously evaluated. The results show a high degree (64%) of consistency
between responses to the same two vignettes, suggesting a satisfactory level
of internal validity (see Appendix B for details on the analysis of this test).
In this manner, each respondent completed one randomly assigned deck
with 10 + 1 vignettes. In total, 11,000 vignettes were answered by 1000
participants. (Data from the 11th vignette were excluded from the analysis).
The descriptive statistics of the vignette dimensions are presented in Table
2.
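To make the construction above concrete, the following Python sketch reproduces the same procedure with shorthand English labels standing in for the actual (Hungarian) vignette wording: it enumerates the full universe of combinations from the dimension levels listed above, draws 150 situations, splits them into 15 decks of 10, and appends to each deck a control vignette that repeats one of its first nine items.

```python
# Sketch of the factorial vignette construction described above
# (hypothetical labels; the published study used its own wording and block design).
import itertools
import random

dimensions = {
    "organizer": ["decision-makers", "private company", "research institute"],
    "data": ["movement", "usage", "communication",
             "movement+usage", "movement+communication",
             "usage+communication", "all three"],
    "length": ["one month", "six months"],
    "incentive": ["5000 at install", "5000 at completion", "5000 + 5000"],
    "control": ["no interruption", "can interrupt", "interrupt + review data"],
}

# Full "universe": 3 * 7 * 2 * 3 * 3 = 378 combinations.
universe = [dict(zip(dimensions, combo))
            for combo in itertools.product(*dimensions.values())]
assert len(universe) == 378

random.seed(42)                                         # reproducible illustration
sample = random.sample(universe, 150)                   # draw 150 situations at random
decks = [sample[i:i + 10] for i in range(0, 150, 10)]   # 15 decks of 10 vignettes

# Each deck gets one control vignette repeating a randomly chosen item from its
# first nine, which enables the within-respondent consistency check reported above.
for deck in decks:
    deck.append(random.choice(deck[:9]))

print(len(decks), [len(d) for d in decks][:3])          # 15 decks, 11 vignettes each
```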
the Hungarian internet users. But one in two of them seems to be open to passive data collection as well (Q1).
The vignettes used in the survey were designed to uncover the internal motives and factors that shape the level of willingness. In the first step of this analysis, we built two intercept-only models, in which the dependent variable was the vignette outcome and there were no independent variables, only controls for the level of decks and the level of respondents. Based on the estimated covariance parameters, we could determine the share of variance explained by the different levels. The variance component models
revealed that 77.6 percent of the total variance in the vignette outcome is
explained by respondent characteristics and 1.4 percent is explained by the
deck of vignettes. Thus, the effect of the decks (the design of the vignette
study) is quite small.
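A minimal sketch of such an intercept-only (variance component) model is shown below, using statsmodels and simulated stand-in data rather than the study's data; for brevity only the respondent level is modelled, since the deck level explains very little of the variance here, and all names are hypothetical.

```python
# Intercept-only multilevel model and variance decomposition on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in dataset: 1000 respondents x 10 vignettes each, with a respondent-level
# random effect so the variance decomposition has something to recover.
rng = np.random.default_rng(0)
n_resp, n_vign = 1000, 10
respondent_effect = rng.normal(0, 2.0, n_resp)          # between-respondent spread
df = pd.DataFrame({
    "respondent_id": np.repeat(np.arange(n_resp), n_vign),
    "rating": np.repeat(respondent_effect, n_vign)
              + rng.normal(0, 1.0, n_resp * n_vign),     # within-respondent noise
})

null_fit = smf.mixedlm("rating ~ 1", df, groups=df["respondent_id"]).fit(reml=False)

var_between = null_fit.cov_re.iloc[0, 0]   # variance between respondents
var_within = null_fit.scale                # residual (within-respondent) variance
icc = var_between / (var_between + var_within)
print(f"Share of variance attributable to respondents: {icc:.1%}")  # roughly 80% here
```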
We then created three multilevel regression models (Table 3). In the first
one (Model 1), we included only the independent variables at the vignette
level. Then, in a second step (Model 2), we added the socio-demographic
characteristics of the respondents, as we assumed that they influence the
respondents’ willingness to participate. In a third step (Model 3), we also
included composite indices of respondent-level attitudinal variables. Table
3 shows the results of the three models.
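The stepwise structure of these models can be written out as nested model formulas; the variable names below are hypothetical stand-ins for the predictors listed in Table 3 and the appendix tables, and each formula would be fitted with the same multilevel setup sketched above and compared on fit statistics such as AIC and BIC.

```python
# The three nested specifications described above, written as model formulas
# (column names are hypothetical illustrations, not the survey's actual variable names).
vignette = "C(organizer) + C(data_type) + C(length) + C(incentive) + C(control)"
demographics = "age + C(gender) + C(education) + C(settlement) + C(region)"
attitudes = ("smartphone_activities + personal_trust + institutional_trust"
             " + minutes_online + minutes_smartphone"
             " + worrying_active + worrying_passive")

formulas = {
    "Model 1": f"rating ~ {vignette}",
    "Model 2": f"rating ~ {vignette} + {demographics}",
    "Model 3": f"rating ~ {vignette} + {demographics} + {attitudes}",
}

# Each model would then be fitted as, e.g.:
#   smf.mixedlm(formulas["Model 3"], df, groups=df["respondent_id"]).fit(reml=False)
for name, formula in formulas.items():
    print(name, "->", formula)
```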
Table 3 (fragment): Multilevel regression models of willingness to participate (coefficients with standard errors in parentheses; * marks statistically significant estimates)

Region                                    Model 2          Model 3
Northern Great Plain                      0.02 (0.45)      0.46 (0.48)
Southern Great Plain                      0.44 (0.46)      0.62 (0.48)
Southern Transdanubia                     0.28 (0.48)      0.35 (0.52)
Central Transdanubia                      0.72 (0.47)      0.84 (0.50)

Respondent-level attitude indices                          Model 3
Smartphone activities (+: multiple)                        0.10 * (0.04)
Personal trust (+: high)                                    0.07 (0.05)
Institutional trust (+: high)                               0.12 (0.07)
Time spent online on an average day (minutes)               0.00 (0.00)
Time spent using their smartphone on an average day (minutes)   0.00 (0.00)
Number of active data collection methods mentioned as rather worrying    −0.06 (0.09)
Number of passive data collection methods mentioned as rather worrying   −0.20 * (0.04)

                                          Model 1          Model 2          Model 3
AIC                                       44,997.2         44,977.2         37,066.5
BIC                                       45,011.8         44,991.8         37,080.7
Observations                              10,000           10,000           10,000
would be more likely to participate in a study if they were paid twice instead of once, by about 0.44 points. The chance of participating is highest if the user can suspend the data collection at any time and view the collected data when needed. The two other options, no suspension and suspension without control over the data, showed a lower chance of participation (by an average of 0.59 and 0.13 points, respectively). Interestingly, there were no significant differences according to the purpose of the data collection and thus the type of data collected. Compared to the reference category, in which all three types of data are requested, none of the other types of data collection showed a significantly lower or higher level of participation (Q2).
We included respondents’ sociodemographic variables in Model 2; doing so did not substantially change the effects of the vignette dimensions. Interestingly, none of the sociodemographic characteristics has a significant effect on participation, with the exception of age: in accordance with previous research, older individuals are less likely to participate, the probability of participation decreasing by 0.04 points with each additional year (Q3).
In Model 3, we added respondents’ attitudinal indices to the model. Their addition did not substantially change the effects of the other variables compared to the previous models. Of the attitudinal variables, none of the trust indices appears to have a significant effect; however, smartphone use and concerns about passive data collection do change the likelihood of participation: the more activities someone performs on their smartphone and the more time they spend using it, the more likely they are to participate in such a study, and the more types of passive data collection someone has concerns about, the less likely they are to participate (Q4).
CONCLUSIONS
With this study, we aimed to continue the series of analyses examining
users’ attitudes toward passive smartphone data collection based on sensors and other device information.
Overall, our results are consistent with findings of previous research: We
found evidence that a more trusted survey organiser/client, shorter duration
of data collection, multiple incentives, and control over data collection can
significantly influence willingness to participate. The results also show that
apart from age (as a major determinant of digital technology use and attitudes towards digital technologies), demographic characteristics alone do not play an important role. This finding might be biased by the general characteristics of the online panel we used for the survey, but it may nevertheless be important information for future studies that aim for representativeness of the online (smartphone user) population.
Contrary to our preliminary expectations, trust in people and institutions
alone does not seem to have a notable effect. This is especially noteworthy
given that Hungarian society generally has a lower level of personal and institutional trust than Western and Northern European countries. However, general attitudes toward technology, the complexity
and intensity of smartphone use, and general concerns about passive data
collection may be critical in determining who is willing to participate in
future research.
Asking questions about people’s future behaviour in hypothetical situations has obvious limitations. In our case, this means there is a good chance that we would get different results if we asked people to download an existing, ready-to-test app that both actively and passively collects real, personal data from them. We were mainly interested in the feelings, fears and expectations that shape people’s future actions, and we suggest that our results provide valid insights.
It should also be mentioned that in this research we focused mostly on
the dimensions analysed in previous studies and included them in our own
analysis. Of course, there are many other important factors that can influence
the willingness of users to participate. Our aim was therefore not to provide
a complete picture, but to gather important aspects that could enrich our
collective knowledge on smartphone-based passive data collection and
inform our own application development process.
ACKNOWLEDGMENTS
We thank János Klenovszki, Bálint Markos and Norbert Sárközi at NRC
Ltd. for the professional support they provided us in conducting the online
survey. The authors also wish to thank anonymous reviewers for feedback
on the manuscript.
Table A1: Demographic composition of the sample (percentages and respondent counts)

Gender
    Male                      42.9    48.2    429
    Female                    57.1    51.8    571
Age
    18–29                     19.8    22.8    198
    30–39                     21.3    20.1    213
    40–49                     23.5    24.2    235
    50–59                     17.6    15.2    176
    60–69                     14.0    13.9    140
    70+                        3.8     3.9     38
Education
    Primary                   24.7    35.6    247
    Secondary                 42.3    39.0    423
    Tertiary                  33.0    25.4    330
Type of settlement
    Budapest (capital)        20.7    21.2    207
    Towns                     54.3    52.1    543
    Villages                  25.0    26.7    250
Region
    Central Hungary           32.4    34.6    324
    Northern Hungary          11.1    10.4    111
    Northern Great Plain      14.8    13.8    148
    Southern Great Plain      14.0    11.7    140
    Southern Transdanubia      8.9     8.8     89
Table A2: Activities for which the respondent uses their smartphone

1. Browsing websites
2. Writing/reading emails
3. Taking photos and videos
4. Viewing content from social networking sites (e.g., texts, images, videos on Facebook, Instagram, Twitter, etc.)
5. Posting content to social media sites (e.g., sharing text, images, videos on Facebook, Instagram, Twitter)
6. Online shopping (e.g., tickets, books, clothes, technical articles)
7. Online banking (e.g., account balance inquiries, transfers)
8. Installing new apps (e.g., via Google Play or the App Store)
9. Using apps that use the device’s location (e.g., Google Maps, Foursquare)
10. Connecting devices to your device via Bluetooth (e.g., smart watch, pedometer)
11. Gaming
12. Listening to music, watching videos
13. Recording training data (e.g., while running, number of steps per day, etc.)
14. Reading and editing files related to work and study
15. Voice assistant services (Google Assistant, Apple Siri, Amazon Alexa, etc.)
Table A3: The three items in the questionnaire measuring personal trust
1. In general, what would you say? Can most people be trusted, or can one not be careful enough in human relationships? Place your opinion on a scale where “0” means we cannot be careful enough and “10” means that most people are trustworthy.
2. Do you think that most people would try to take advantage of you if they had the opportunity, or try to be fair? Place your opinion on a scale where “0” means that most people would try to take advantage and “10” means that most people would try to be fair.
3. Do you think people tend to care only about themselves, or are they generally
helpful? Place your opinion on a scale where “0” means people care more about
themselves and “10” means people tend to be helpful.
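For illustration, a composite personal-trust index could be built from the three 0–10 items above. The study reports composite indicators without detailing the aggregation, so the simple row-wise mean used below, and the column names, are only assumptions.

```python
# Minimal sketch of a composite personal-trust index (hypothetical column names
# and made-up example responses; the aggregation rule is an assumption).
import pandas as pd

items = pd.DataFrame({
    "trust_people":   [7, 3, 5],   # item 1: most people can be trusted
    "trust_fairness": [8, 4, 6],   # item 2: most people try to be fair
    "trust_helpful":  [6, 2, 5],   # item 3: people are generally helpful
})

# One simple composite: the mean of the three items for each respondent.
items["personal_trust_index"] = items.mean(axis=1)
print(items)
```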
Table A4: The list of institutions for which we asked respondents how much they trust them

1. Hungarian Parliament
2. Hungarian legal system
3. Politicians
4. Police
5. Scientists
6. Online stores
7. Large Internet companies (Apple, Google, Facebook, Microsoft, etc.)
8. Online news portals
Figure A1: Comparison of responses for original vignettes vs. control vignette.
AUTHOR CONTRIBUTIONS
Conceptualization and methodology: B.S., J.K. and A.G.; formal analysis,
J.K., B.S. and A.G.; writing—original draft preparation, review and editing,
A.G., J.K. and B.S. All authors have read and agreed to the published version
of the manuscript.
REFERENCES
1. De Bruijne M., Wijnant A. Comparing Survey Results Obtained
via Mobile Devices and Computers: An Experiment with a Mobile
Web Survey on a Heterogeneous Group of Mobile Devices Versus a
Computer-Assisted Web Survey. Soc. Sci. Comput. Rev. 2013;31:482–
504. doi: 10.1177/0894439313483976.
2. De Bruijne M., Wijnant A. Mobile Response in Web Panels. Soc. Sci.
Comput. Rev. 2014;32:728–742. doi: 10.1177/0894439314525918.
3. Couper M.P., Antoun C., Mavletova A. Total Survey Error in Practice.
John Wiley & Sons, Ltd.; Hoboken, NJ, USA: 2017. Mobile Web
Surveys; pp. 133–154.
4. Couper M.P. New Developments in Survey Data Collection. Annu. Rev.
Sociol. 2017;43:121–145. doi: 10.1146/annurev-soc-060116-053613.
5. Brenner P.S., DeLamater J. Lies, Damned Lies, and Survey Self-
Reports? Identity as a Cause of Measurement Bias. Soc. Psychol. Q.
2016;79:333–354. doi: 10.1177/0190272516628298.
6. Brenner P.S., DeLamater J. Measurement Directiveness as a Cause
of Response Bias: Evidence From Two Survey Experiments. Sociol.
Methods Res. 2014;45:348–371. doi: 10.1177/0049124114558630.
7. Palczyńska M., Rynko M. ICT Skills Measurement in Social Surveys:
Can We Trust Self-Reports? Qual. Quant. 2021;55:917–943. doi:
10.1007/s11135-020-01031-4.
8. Tourangeau R., Rips L.J., Rasinski K. The Psychology of Survey
Response. Cambridge University Press; Cambridge, UK: 2000.
9. Link M.W., Murphy J., Schober M.F., Buskirk T.D., Childs J.H.,
Tesfaye C.L. Mobile Technologies for Conducting, Augmenting and
Potentially Replacing Surveys: Report of the AAPOR Task Force on
Emerging Technologies in Public Opinion Research. Public Opin. Q.
2014;78:779–787. doi: 10.1093/poq/nfu054.
10. Karsai M., Perra N., Vespignani A. Time Varying Networks and
the Weakness of Strong Ties. Sci. Rep. 2015;4:4001. doi: 10.1038/
srep04001.
11. Onnela J.-P., Saramäki J., Hyvönen J., Szabó G., Lazer D., Kaski
K., Kertész J., Barabási A.-L. Structure and Tie Strengths in Mobile
Communication Networks. Proc. Natl. Acad. Sci. USA. 2007;104:7332–
7336. doi: 10.1073/pnas.0610245104.
12. Palmer J.R.B., Espenshade T.J., Bartumeus F., Chung C.Y., Ozgencil
N.E., Li K. New Approaches to Human Mobility: Using Mobile Phones
for Demographic Research. Demography. 2013;50:1105–1128. doi:
10.1007/s13524-012-0175-z.
13. Miritello G., Moro E., Lara R., Martínez-López R., Belchamber
J., Roberts S.G.B., Dunbar R.I.M. Time as a Limited Resource:
Communication Strategy in Mobile Phone Networks. Soc. Netw.
2013;35:89–95. doi: 10.1016/j.socnet.2013.01.003.
14. Kreuter F., Presser S., Tourangeau R. Social Desirability Bias in CATI,
IVR, and Web Surveys: The Effects of Mode and Question Sensitivity.
Public Opin. Q. 2008;72:847–865. doi: 10.1093/poq/nfn063.
15. Mulder J., de Bruijne M. Willingness of Online Respondents to
Participate in Alternative Modes of Data Collection. Surv. Pract.
2019;12:8356. doi: 10.29115/SP-2019-0001.
16. Scherpenzeel A. Data Collection in a Probability-Based Internet
Panel: How the LISS Panel Was Built and How It Can Be Used. BMS
Bull. Sociol. Methodol./Bull. Méthodol. Sociol. 2011;109:56–61. doi:
10.1177/0759106310387713.
17. Kołakowska A., Szwoch W., Szwoch M. A Review of Emotion
Recognition Methods Based on Data Acquired via Smartphone
Sensors. Sensors. 2020;20:6367. doi: 10.3390/s20216367.
18. Kreuter F., Haas G.-C., Keusch F., Bähr S., Trappmann M. Collecting
Survey and Smartphone Sensor Data With an App: Opportunities and
Challenges Around Privacy and Informed Consent. Soc. Sci. Comput.
Rev. 2020;38:533–549. doi: 10.1177/0894439318816389.
19. Struminskaya B., Lugtig P., Keusch F., Höhne J.K. Augmenting
Surveys With Data From Sensors and Apps: Opportunities and
Challenges. Soc. Sci. Comput. Rev. 2020:089443932097995. doi:
10.1177/0894439320979951.
20. Younis E.M.G., Kanjo E., Chamberlain A. Designing and Evaluating
Mobile Self-Reporting Techniques: Crowdsourcing for Citizen
Science. Pers. Ubiquitous Comput. 2019;23:329–338. doi: 10.1007/
s00779-019-01207-2.
21. Keusch F., Struminskaya B., Antoun C., Couper M.P., Kreuter F.
Willingness to Participate in Passive Mobile Data Collection. Public
Opin. Q. 2019;83:210–235. doi: 10.1093/poq/nfz007.
22. Revilla M., Toninelli D., Ochoa C., Loewe G. Do Online Access
Panels Need to Adapt Surveys for Mobile Devices? Internet Res.
2016;26:1209–1227. doi: 10.1108/IntR-02-2015-0032.
23. Wenz A., Jäckle A., Couper M.P. Willingness to Use Mobile
Technologies for Data Collection in a Probability Household Panel.
Surv. Res. Methods. 2019;13:1–22. doi: 10.18148/SRM/2019.
V1I1.7298.
24. Van Dijck J. Datafication, Dataism and Dataveillance: Big Data between
Scientific Paradigm and Ideology. Surveill. Soc. 2014;12:197–208. doi:
10.24908/ss.v12i2.4776.
25. Bricka S., Zmud J., Wolf J., Freedman J. Household Travel Surveys
with GPS An Experiment. Transp. Res. Rec. J. Transp. Res. Board.
2009;2105:51–56. doi: 10.3141/2105-07.
26. Biler S., Šenk P., Winklerová L. Willingness of Individuals to
Participate in a Travel Behavior Survey Using GPS Devices. Proceedings of the NTTS 2013; Brussels, Belgium. 5–7
March 2013; pp. 1015–1023.
27. Toepoel V., Lugtig P. What Happens If You Offer a Mobile Option
to Your Web Panel? Evidence From a Probability-Based Panel
of Internet Users. Soc. Sci. Comput. Rev. 2014;32:544–560. doi:
10.1177/0894439313510482.
28. Pinter R. Willingness of Online Access Panel Members to Participate
in Smartphone Application-Based Research. Mob. Res. Methods.
2015:141–156. doi: 10.5334/bar.i.
29. Revilla M., Ochoa C., Loewe G. Using Passive Data from a Meter to
Complement Survey Data in Order to Study Online Behavior. Soc. Sci.
Comput. Rev. 2017;35:521–536. doi: 10.1177/0894439316638457.
30. Scherpenzeel A. Mixing Online Panel Data Collection with Innovative
Methods. In: Eifler S., Faulbaum F., editors. Methodische Probleme
von Mixed-Mode-Ansätzen in der Umfrageforschung. Springer
Fachmedien; Wiesbaden, Germany: 2017. pp. 27–49. Schriftenreihe
der ASI—Arbeitsgemeinschaft Sozialwissenschaftlicher Institute.
31. Cabalquinto E., Hutchins B. “It Should Allow Me to Opt in or Opt
out”: Investigating Smartphone Use and the Contending Attitudes of
Commuters towards Geolocation Data Collection. Telemat. Inform.
2020;51:101403. doi: 10.1016/j.tele.2020.101403.
32. Struminskaya B., Toepoel V., Lugtig P., Haan M., Luiten A., Schouten
B. Understanding Willingness to Share Smartphone-Sensor Data.
Public Opin. Q. 2021;84:725–759. doi: 10.1093/poq/nfaa044.
33. Haas G., Kreuter F., Keusch F., Trappmann M., Bähr S. Effects of
Incentives in Smartphone Data Collection. In: Hill C.A., Biemer
P.P., Buskirk T.D., Japec L., Kirchner A., Kolenikov S., Lyberg L.E.,
editors. Big Data Meets Survey Science. Wiley; Hoboken, NJ, USA:
2020. pp. 387–414.
34. Hox J.J., Kreft I.G.G., Hermkens P.L.J. The Analysis of
Factorial Surveys. Sociol. Methods Res. 1991;19:493–510. doi:
10.1177/0049124191019004003.
35. Jasso G. Factorial Survey Methods for Studying Beliefs and
Judgments. Sociol. Methods Res. 2006;34:334–423. doi:
10.1177/0049124105283121.
36. Auspurg K., Hinz T. Multifactorial Experiments in Surveys: Conjoint
Analysis, Choice Experiments, and Factorial Surveys. In: Keuschnigg
M., Wolbring T., editors. Experimente in den Sozialwissenschaften.
Nomos; Baden-Baden, Germany: 2015. pp. 291–315. Soziale Welt
Sonderband.
37. Wallander L. 25 Years of Factorial Surveys in Sociology: A Review. Soc.
Sci. Res. 2009;38:505–520. doi: 10.1016/j.ssresearch.2009.03.004.
Chapter 7

AN INTEGRATIVE REVIEW ON METHODOLOGICAL CONSIDERATIONS IN MENTAL HEALTH RESEARCH – DESIGN, SAMPLING, DATA COLLECTION PROCEDURE AND QUALITY ASSURANCE

Faculty of Health and Medicine, School of Nursing and Midwifery, University of Newcastle, Callaghan, Australia
Faculty of Business and Economics, Macquarie University, North Ryde, Australia
ABSTRACT
Background
Several typologies and guidelines are available to address the methodological
and practical considerations required in mental health research. However,
Citation: (APA): Badu, E., O’Brien, A. P., & Mitchell, R. (2019). An integrative
review on methodological considerations in mental health research–design, sampling,
data collection procedure and quality assurance. Archives of Public Health, 77(1), 1-15.
(15 pages)
Copyright: © This is an open-access article distributed under the terms of a Creative
Commons Attribution License (https://creativecommons.org/licenses/by/4.0/).
Methods
A search of the published literature was conducted using EMBASE, Medline,
PsycINFO, CINAHL, Web of Science, and Scopus. The search was limited
to papers published in English for the timeframe 2000–2018. Using pre-
defined inclusion and exclusion criteria, three reviewers independently
screened the retrieved papers. A data extraction form was used to extract
data from the included papers.
Results
Of 27 papers meeting the inclusion criteria, 13 focused on qualitative research,
8 mixed methods and 6 papers focused on quantitative methodology. A total
of 14 papers targeted global mental health research, with 2 papers each
describing studies in Germany, Sweden and China. The review identified
several methodological considerations relating to study design, methods,
data collection, and quality assurance. Methodological issues regarding the
study design included assembling team members, familiarisation and sharing
information on the topic, and seeking the contribution of team members.
Methodological considerations to facilitate data collection involved
adequate preparation prior to fieldwork, appropriateness and adequacy of
the sampling and data collection approach, selection of consumers, the
social or cultural context, practical and organisational skills; and ethical and
sensitivity issues.
Conclusion
The evidence confirms that studies on methodological considerations in
conducting mental health research largely focus on qualitative studies in
a transcultural setting, as well as recommendations derived from multi-
site surveys. Mental health research should adequately consider the
methodological issues around study design, sampling, data collection
procedures and quality assurance in order to maintain the quality of data
collection.
Keywords: Mental health, Methodological approach, Mixed methods,
Sampling, Data collection
BACKGROUND
In the past decades, considerable attention has been paid to research methods that facilitate studies in various academic fields, such as public health, education, the humanities, and the behavioural and social sciences [1–4]. These research methodologies have generally focused on the two major research pillars known as quantitative and qualitative research. In recent years, researchers conducting mental health research appear to be employing either qualitative and quantitative research methods separately, or mixed methods approaches to triangulate and validate findings [5, 6].
A combination of study designs has been utilised to answer research
questions associated with mental health services and consumer outcomes
[7, 8]. Study designs in the public health and clinical domains, for example,
have largely focused on observational studies (non-interventional) and
experimental research (interventional) [1, 3, 9]. Observational design in
non-interventional research requires the investigator to simply observe,
record, classify, count and analyse the data [1, 2, 10]. This design is different
from the observational approaches used in social science research, which
may involve observing (participant and non-participant) phenomena in the
fieldwork [1]. Furthermore, the observational study has been categorized
into five types, namely cross-sectional design, case-control studies, cohort
studies, case report and case series studies [1–3, 9–11]. The cross-sectional
design is used to measure the occurrence of a condition at a single point in time and is sometimes referred to as a prevalence study. This approach is relatively quick and easy but does not permit a distinction between cause and effect [1]. Conversely, the case-control design examines
the relationship between an attribute and a disease by comparing those with
and without the disease [1, 2, 12]. In addition, the case-control design is
usually retrospective and aims to identify predictors of a particular outcome.
This type of design is relevant when investigating rare or chronic diseases
which may result from long-term exposure to particular risk factors [10].
Cohort studies measure the relationship between exposure to a factor and
the probability of the occurrence of a disease [1, 10]. In a case series design,
medical records are reviewed for exposure to determinants of disease and
outcomes. More importantly, case series and case reports are often used as
preliminary research to provide information on key clinical issues [12].
The interventional study design describes a research approach that
applies clinical care to evaluate treatment effects on outcomes [13]. Several
previous studies have explained the various forms of experimental study
design used in public health and clinical research [14, 15]. In particular,
can inform clinicians and academia about the gaps in the literature related to
methodological considerations.
METHODS
Methodology
An integrative review was conducted to synthesise the available evidence on
mental health research methodological considerations. To guide the review,
the World Health Organization (WHO) definition of mental health has been
utilised. The WHO defines mental health as: “a state of well-being, in which
the individual realises his or her own potentials, ability to cope with the
normal stresses of life, functionality and work productivity, as well as the
ability to contribute effectively in community life” [20]. The integrative
review enabled the simultaneous inclusion of diverse methodologies (i.e.,
experimental and non-experimental research) and varied perspectives to
fully understand a phenomenon of concern [21, 22]. The review also uses
diverse data sources to develop a holistic understanding of methodological
considerations in mental health research. The methodology employed
involves five stages: 1) problem identification (ensuring that the research
question and purpose are clearly defined); 2) literature search (incorporating
a comprehensive search strategy); 3) data evaluation; 4) data analysis
(data reduction, display, comparison and conclusions) and; 5) presentation
(synthesising findings in a model or theory and describing the implications
for practice, policy and further research) [21].
Inclusion Criteria
The integrative review focused on methodological issues in mental health
research. This included core areas such as study design and methods,
particularly qualitative, quantitative or both. The review targeted papers
that addressed study design, sampling, data collection procedures, quality
assurance and the data analysis process. More specifically, the included
papers addressed methodological issues on empirical studies in mental
health research. The methodological issues in this context are not limited to
a particular mental illness. Studies that met the inclusion criteria were peer-
reviewed articles published in the English Language, from January 2000 to
July 2018.
Exclusion Criteria
Articles were excluded if they dealt purely with general health services or the clinical effectiveness of a particular intervention with no connection to mental health research. Articles were also excluded when they addressed non-methodological issues. Other general exclusion criteria were book chapters,
conference abstracts, papers that present opinion, editorials, commentaries
and clinical case reviews.
Data Synthesis
Content analysis was used to synthesise the extracted data. The content analysis process involved several stages: noting patterns and themes, seeing plausibility, clustering, counting, making contrasts and comparisons, discerning common and unusual patterns, subsuming particulars into the general, noting relations between variables, finding intervening factors and building a logical chain of evidence [21] (see Table 2).
Table 2 (partial). Themes, sub-themes, number of papers (Na) and corresponding references.

Theme: Data collection in mental health research
- Approaches for collecting qualitative data: Na = 9; references [28, 41, 30, 31, 44, 47, 19, 40, 34]
- Consideration for data collection: Na = 6; references [32, 37, 31, 41, 49, 47]
- Preparing for data collection: Na = 8; references [25, 33, 34, 35, 39, 41, 49, 30]

Theme: Quality assurance procedures
- Seeking informed consent: Na = 7; references [25, 26, 33, 35, 37, 39, 47]
- Procedure for ensuring quality control (quantitative): Na = 5; references [49, 25, 39, 33, 38]
- Procedure for ensuring quality control (qualitative): Na = 4; references [32, 37, 46, 19]

Na = number of papers
RESULTS
Study Characteristics
The integrative review identified a total of 491 records from all databases,
after which 19 duplicates were removed. Out of this, 472 titles and abstracts
were assessed for eligibility, after which 439 articles were excluded. Articles
not meeting the inclusion criteria were excluded. Specifically, papers
excluded were those that did not address methodological issues as well as
papers addressing methodological consideration in other disciplines. A total
of 33 full-text articles were assessed – 9 articles were further excluded,
whilst an additional 3 articles were identified from reference lists. Overall,
27 articles were included in the final synthesis (see Fig. 1). Of the total
included papers, 12 contained qualitative research, 9 were mixed methods
(both qualitative and quantitative) and 6 papers focused on quantitative data.
Additionally, a total of 14 papers targeted global mental health research, with 2 papers each describing studies in Germany, Sweden and China. The papers
addressed different methodological issues, such as study design, methods,
data collection, and analysis as well as quality assurance (see Table 3).
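As a quick check, the study-selection arithmetic reported above can be traced with a few lines of code. The sketch below is illustrative only (the variable names are ours, not the authors'); the counts are taken directly from the text.

    # Screening flow reported in the Results (counts taken from the text).
    records_identified = 491
    duplicates_removed = 19
    screened = records_identified - duplicates_removed            # 472 titles/abstracts
    excluded_on_title_abstract = 439
    full_text_assessed = screened - excluded_on_title_abstract    # 33 full-text articles
    excluded_at_full_text = 9
    added_from_reference_lists = 3
    included = full_text_assessed - excluded_at_full_text + added_from_reference_lists
    print(screened, full_text_assessed, included)                 # 472 33 27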
that the sequential design is a process where the data collection and analysis of one component (e.g., quantitative) takes place after the data collection and analysis of the other component (e.g., qualitative). Herein, the data
collection and analysis of one component (e.g. qualitative) may depend on
the outcomes of the other component (e.g. quantitative) [43, 48]. An earlier
review suggested that the majority of contemporary studies in mental health
research use a sequential design, with qualitative methods, more often
preceding quantitative methods [18].
Alternatively, the concurrent design collects and analyses data of
both components (e.g. quantitative and qualitative) simultaneously and
independently. Palinkas, Horwitz [42] recommend that one component is
used as secondary to the other component, or that both components are
assigned equal priority. Such a mixed methods approach aims to provide a
depth of understanding afforded by qualitative methods, with the breadth of
understanding offered by the quantitative data to elaborate on the findings
of one component or seek convergence through triangulation of the results.
Schoonenboom and Johnson [48] recommended the use of capital letters for
one component and lower case letters for another component in the same
design to indicate that one component is primary and the other is secondary
or supplemental.
Three studies highlighted several factors that need to be considered when
conducting mixed methods design in mental health research [18, 19, 45].
Accordingly, these factors include developing familiarity with the topic
under investigation based on experience, willingness to share information
on the topic [19], establishing early collaboration, willingness to negotiate
emerging problems, seeking the contribution of team members, and soliciting
third-party assistance to resolve any emerging problems [45]. Additionally,
Palinkas, Horwitz [18] recommended that mixed methods in the context of mental health research are mostly applied in studies that assess needs of services, examine existing services, develop new or adapt existing services, evaluate services in randomised controlled trials, and examine service implementation.
Sampling Consideration
Four studies in this section highlighted some of the sampling considerations
in mental health research [30–32, 46]. Generally, mental health research
should consider the appropriateness and adequacy of sampling approach by
applying attributes such as shared social, or cultural experiences, or shared
concern related to the study [32], diversity and variety of participants [31],
practical and organisational skills, as well as ethical and sensitivity issues
[46]. Robinson [46] further suggested that sampling can be homogenous or
heterogeneous depending on the research questions for the study. Achieving
homogeneity in sampling should employ a variety of parameters, which include demographic, geographical, physical, psychological, or life history homogeneity [46]. Additionally, applying homogeneity in sampling can be
influenced by theoretical and practical factors. Alternatively, some samples
are intentionally selected based on heterogeneous factors [46].
confidence and trust between the researcher and consumers [31, 37]. This
is a significant prerequisite, as it can sensitise and normalise the research
process and aims with the participants prior to discussing their personal
mental health issues. Similarly, some studies added that the researcher can
gain the confidence of service providers who manage consumers of mental
health services [41, 47], seek ethical approval from the relevant committee(s)
[41, 47], meet and greet the consumers of mental health services before
data collection, and arrange a mutually acceptable venue for the groups and
possibly supply transport [41].
Two studies further suggested that the cultural and social differences of
the participants need consideration [26, 31]. These factors could influence
the perception and interpretation of ethical issues in the research situation.
Additionally, two studies recommended the use of standardised
assessment instruments for mental health research that involve quantitative
data collection [33, 49]. A recent survey suggested that measures to
standardise the data collection approach can convert self-completion
instruments to interviewer-completion instruments [49]. The interviewer
can then read the items of the instruments to respondents and record their
responses. The study further suggested the need to collect demographic and
behavioural information about the participant(s).
status and checks across variables” [25, 33, 49]. For example, Alonso,
Angermeyer [25] advocate that various checks are used to verify completion
of the interview, and consistency across instruments against the standard
procedure.
DISCUSSION
The integrative review was conducted to synthesise evidence into
recommended methodological considerations when conducting mental
health research. The evidence from the review has been discussed according
to five major themes: 1) mixed methods study in mental health research;
2) qualitative study in mental health research; 3) sampling in mental
January 2000 to July 2018 could have missed useful articles published in
other languages and those published prior to 2000. The review did not assess
the methodological quality of included papers using a critical appraisal tool,
however, the combination of clearly articulated search methods, consultation
with the research librarian, and reviewing articles with methodological
experts in mental health research helped to address the limitations.
CONCLUSION
The review identified several methodological issues that need critical
attention when conducting mental health research. The evidence confirms
that studies addressing methodological considerations in conducting mental health research largely focus on qualitative studies in a transcultural
setting, in addition to lessons from multi-site surveys in mental health
research. Specifically, the methodological issues related to the study design,
sampling, data collection processes and quality assurance are critical to the
research design chosen for any particular study. The review highlighted
that researchers conducting mental health research can establish early
collaboration, familiarise themselves with the topic, share information on the
topic, negotiate to resolve any emerging problems and seek the contribution
of clinical (or researcher) team members on the ground. In addition, the
recruitment of consumers of mental health services should consider the
appropriateness and adequacy of sampling approaches, diversity and variety
of consumers of services, their social or cultural experiences, practical and
organisational skills, as well as ethical and sensitivity issues.
The evidence confirms that in an attempt to effectively recruit and collect
data from consumers of mental health services, there is the need to build
confidence and trust between the researcher and consumers; and to gain the
confidence of mental health service providers. Furthermore, seeking ethical
approval from the relevant committee, meeting with consumers of services
before data collection, arranging a mutually acceptable venue for the groups,
and providing transport services, are all further important considerations. The
review findings establish that researchers conducting mental health research
should consider several quality assurance issues, such as adequate training prior to data collection, seeking informed consent from consumers of mental health services, pre-testing of tools, minimising non-response rates, and monitoring of the data collection process. More specifically, quality assurance for qualitative data can be achieved by applying the principles of credibility, dependability, transferability, reflexivity, and confirmability.
ACKNOWLEDGEMENTS
The authors wish to thank the University of Newcastle Graduate Research
and the School of Nursing and Midwifery, for the Doctoral Scholarship
offered to the lead author. The authors are also grateful to Ms. Debbie Booth, the Librarian, for supporting the literature search.
AUTHORS’ CONTRIBUTIONS
EB, APO’B, and RM conceptualized the study. EB conducted the data
extraction, APO’B, and RM, conducted the second review of the extracted
data. EB, working closely with APO’B and RM performed the content
analysis and drafted the manuscript. EB, APO’B, and RM, reviewed and
made inputs into the intellectual content and agreed on its submission for
publication. All authors read and approved the final manuscript.
REFERENCES
1. National Ethics Advisory Committee. Ethical guidelines for
intervention studies: revised edition. Wellington (New Zealand):
Ministry of Health. 2012.
2. Mann C. Observational research methods. Research design II: cohort,
cross sectional, and case-control studies. Emerg Med J. 2003;20(1):54–
60. doi: 10.1136/emj.20.1.54.
3. DiPietro NA. Methods in epidemiology: observational study designs.
Pharmacotherapy: The Journal of Human Pharmacology and Drug
Therapy. 2010;30(10):973–984. doi: 10.1592/phco.30.10.973.
4. Hong NQ, Pluyr P, Fabregues S, Bartlett G, Boardman F, Cargo M,
et al. Mixed Methods Appraisal Tool (MMAT). Canada.: Intellectual
Property Office, Canada; 2018.
5. Creswell JW, Creswell JD. Research design: qualitative, quantitative, and mixed methods approaches. Sage Publications; 2017.
6. Wisdom J, Creswell JW. Mixed methods: integrating quantitative and
qualitative data collection and analysis while studying patient-centered
medical home models. Rockville: Agency for Healthcare Research and
Quality; 2013.
7. Bonita R, Beaglehole R, Kjellström T. Basic epidemiology: World
Health Organization. 2006.
8. Centers for Disease Control Prevention [CDC]. Principles of
epidemiology in public health practice: an introduction to applied
epidemiology and biostatistics. Atlanta, GA: US Dept. of Health and
Human Services, Centers for Disease Control and Prevention (CDC),
Office of Workforce and Career Development; 2012.
9. Parab S, Bhalerao S. Study designs. International journal of Ayurveda
research. 2010;1(2):128. doi: 10.4103/0974-7788.64406.
10. Yang W, Zilov A, Soewondo P, Bech OM, Sekkal F, Home PD.
Observational studies: going beyond the boundaries of randomized
controlled trials. Diabetes Res Clin Pract. 2010;88:S3–S9. doi:
10.1016/S0168-8227(10)70002-4.
11. Department of Family Medicine (McGill University). Mixed Methods
Appraisal Tool (MMAT) – Version 2011 Canada: McGill University;
2011 [Available from: http://mixedmethodsappraisaltoolpublic.
pbworks.com/w/file/fetch/84371689/MMAT%202011%20
criteria%20and%20tutorial%202011-06-29updated2014.08.21.pdf.
Chapter 8
WIKI SURVEYS: OPEN AND QUANTIFIABLE SOCIAL DATA COLLECTION
ABSTRACT
In the social sciences, there is a longstanding tension between data collection
methods that facilitate quantification and those that are open to unanticipated
information. Advances in technology now enable new, hybrid methods that
combine some of the benefits of both approaches. Drawing inspiration from online information aggregation systems like Wikipedia and from traditional survey research, we propose a new class of research instruments called wiki surveys.
Citation (APA): Salganik, M. J., & Levy, K. E. (2015). Wiki surveys: Open and quantifiable social data collection. PLoS ONE, 10(5), e0123483. (17 pages)
Copyright: © 2015 Salganik, Levy. This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/)
INTRODUCTION
In the social sciences, there is a longstanding tension between data
collection methods that facilitate quantification and those that are open
to unanticipated information. For example, one can contrast a traditional
public opinion survey based on a series of pre-written questions and
answers with an interview in which respondents are free to speak in their
own words. The tension between these approaches derives, in part, from
the strengths of each: open approaches (e.g., interviews) enable us to learn
new and unexpected information, while closed approaches (e.g., surveys)
tend to be more cost-effective and easier to analyze. Fortunately, advances
in technology now enable new, hybrid approaches that combine the benefits
of each. Drawing inspiration both from online information aggregation
systems like Wikipedia and from traditional survey research, we propose
a new class of research instruments called wiki surveys. Just as Wikipedia
grows and improves over time based on contributions from participants, we
envision an evolving survey driven by contributions from respondents.
Although the tension between open and closed approaches to data
collection is currently most evident in disagreements between proponents
of quantitative and qualitative methods, the trade-off between open and
closed survey questions was also particularly contentious in the early days
of survey research [1–3]. Although closed survey questions, in which
respondents choose from a series of pre-written answer choices, have come
to dominate the field, this is not because they have been proven superior for
measurement. Rather, the dominance of closed questions is largely based
on practical considerations: having a fixed set of responses dramatically
simplifies data analysis [4].
WIKI SURVEYS
Online information aggregation projects, of which Wikipedia is an exemplar,
can inspire new directions in survey research. These projects, which are built from crowdsourced, user-generated content, tend to share certain properties that, we argue, wiki surveys should also have: they should be greedy, collaborative, and adaptive.
Greediness
Traditional surveys attempt to collect a fixed amount of information
from each respondent; respondents who want to contribute less than one
questionnaire’s worth of information are considered problematic, and
respondents who want to contribute more are prohibited from doing so. This
contrasts sharply with successful information aggregation projects on the
Internet, which collect as much or as little information as each participant is
willing to provide. Such a structure typically results in highly unequal levels
of contribution: when contributors are plotted in rank order, the distributions
tend to show a small number of heavy contributors—the “fat head”—and
a large number of light contributors—the “long tail” [21, 22] (Fig 1). For
example, the number of edits to Wikipedia per editor roughly follows a power-law distribution with an exponent of 2 [22]. If Wikipedia were to allow
10 and only 10 edits per editor—akin to a survey that requires respondents
to complete one and only one form—it would exclude about 95% of the
edits contributed. As such, traditional surveys potentially leave enormous
amounts of information from the “fat head” and “long tail” uncollected.
Wiki surveys, then, should be greedy in the sense that they should capture as
much or as little information as a respondent is willing to provide.
Figure 1: These systems can handle both heavy contributors ("the fat head"), shown on the left side of the plot, and light contributors ("the long tail"), shown on the right side of the plot. Traditional survey methods utilize information from neither the "fat head" nor the "long tail" and thus leave huge amounts of information uncollected.
https://doi.org/10.1371/journal.pone.0123483.g001
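To make the cost of a fixed quota concrete, the short simulation below draws per-editor edit counts from a heavy-tailed (Zipf) distribution with an exponent of 2, as the text describes for Wikipedia, and measures how many edits a 10-edit cap would discard. This is a rough illustration under assumed parameters, not a reproduction of the Wikipedia data; the exact share discarded depends on the sample and on the tail of the distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical per-editor edit counts drawn from a Zipf (discrete power-law)
    # distribution with exponent 2, as described in the text for Wikipedia editors.
    edits = rng.zipf(a=2.0, size=100_000)

    total_edits = edits.sum()
    kept_if_capped = np.minimum(edits, 10).sum()   # allow at most 10 edits per editor

    print(f"Share of edits discarded by a 10-edit cap: {1 - kept_if_capped / total_edits:.1%}")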
Collaborativeness
In traditional surveys, the questions and answer choices are typically written
by researchers rather than respondents. In contrast, wiki surveys should be
collaborative in that they are open to new information contributed directly
by respondents that may not have been anticipated by the researcher, as often
happens during an interview. Crucially, unlike a traditional “other” box in a
survey, this new information would then be presented to future respondents
for evaluation. In this way, a wiki survey bears some resemblance to a focus
group in which participants can respond to the contributions of others [23,
24]. Thus, just as a community collaboratively writes and edits Wikipedia,
the content of a wiki survey should be partially created by its respondents.
This approach to collaborative survey construction resembles some forms
of survey pre-testing [25]. However, rather than thinking of pre-testing
as a phase distinct from the actual data collection, in wiki surveys the
collaboration process continues throughout data collection.
Adaptivity
Traditional surveys are static: survey questions, their order, and their possible
answers are determined before data collection begins and do not evolve as
more is learned about the parameters of interest. This static approach, while
easier to implement, does not maximize the amount that can be learned from
each respondent. Wiki surveys, therefore, should be adaptive in the sense that
the instrument is continually optimized to elicit the most useful information,
given what is already known. In other words, while collaborativeness
involves being open to new information, adaptivity involves using the
information that has already been gathered more efficiently. In the context
of wiki surveys, adaptivity is particularly important given that respondents
can provide different amounts of information (due to greediness) and that
some answer choices are newer than others (due to collaborativeness).
Like greediness and collaborativeness, adaptivity increases the complexity of data collection and analysis.
Data Collection
In order to collect pairwise wiki survey data, we created the free and open-
source website All Our Ideas (www.allourideas.org), which enables anyone
to create their own pairwise wiki survey. To date, about 6,000 pairwise
wiki surveys have been created that include about 300,000 items and 7
million responses. By providing this service online, we are able to collect
a tremendous amount of data about how pairwise wiki surveys work in
practice, and our steady stream of users provides a natural testbed for further
methodological research.
The data collection process in a pairwise wiki survey is illustrated by
a project conducted by the New York City Mayor’s Office of Long-Term
Planning and Sustainability in order to integrate residents’ ideas into PlaNYC
2030, New York’s citywide sustainability plan. The City has typically held
public meetings and small focus groups to obtain feedback from the public.
By using a pairwise wiki survey, the Mayor’s Office sought to broaden the
dialogue to include input from residents who do not traditionally attend
public meetings. To begin the process, the Mayor’s Office generated a list of
25 ideas based on their previous outreach (e.g., “Require all big buildings to
make certain energy efficiency upgrades,” “Teach kids about green issues as
part of school curriculum”).
Using these 25 ideas as “seeds,” the Mayor’s Office created a pairwise
wiki survey with the question “Which do you think is a better idea for creating
a greener, greater New York City?” Respondents were presented with a pair
of ideas (e.g., “Open schoolyards across the city as public playgrounds” and
“Increase targeted tree plantings in neighborhoods with high asthma rates”),
and asked to choose between them (see Fig 2). After choosing, respondents
were immediately presented with another randomly selected pair of ideas.
Respondents were able to continue contributing information about their
preferences for as long as they wished by either voting or choosing “I can’t
decide.” Crucially, at any point, respondents were able to contribute their
own ideas, which—pending approval by the wiki survey creator—became
part of the pool of ideas to be presented to others. Respondents were also
able to view the popularity of the ideas at any time, making the process
transparent. However, by decoupling the processes of voting and viewing
the results—which occur on distinct screens (see Fig 2)—the site prevents
a respondent from having immediate information about the opinions of
others when she responds, which minimizes the risk of social influence and
information cascades [43, 45–48].
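The data generated by this process are simple: a growing pool of items and a stream of pairwise votes. The toy sketch below is our own illustration, not the All Our Ideas codebase; it shows one way to represent a pairwise wiki survey, including respondent-contributed ideas that enter the pool only after approval by the survey creator.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class PairwiseWikiSurvey:
        """Toy representation of a pairwise wiki survey: a pool of items
        plus votes recorded as (respondent, winner, loser) triples."""
        items: list = field(default_factory=list)
        votes: list = field(default_factory=list)

        def next_pair(self):
            # Present a randomly selected pair of active ideas.
            return random.sample(self.items, 2)

        def record_vote(self, respondent, winner, loser):
            self.votes.append((respondent, winner, loser))

        def contribute_idea(self, idea, approved=True):
            # New ideas join the pool only after the survey creator approves them.
            if approved:
                self.items.append(idea)

    survey = PairwiseWikiSurvey(items=[
        "Open schoolyards across the city as public playgrounds",
        "Increase targeted tree plantings in neighborhoods with high asthma rates",
    ])
    left, right = survey.next_pair()
    survey.record_vote("respondent-1", winner=left, loser=right)
    survey.contribute_idea("Plug ships into the electricity grid so they don't idle in port")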
Data Analysis
Given this data collection process, we analyze data from a pairwise wiki
survey in two main steps (Fig 3). First, we use responses to estimate the
opinion matrix Θ that includes an estimate of how much each respondent
values each item. Next, we summarize the opinion matrix to produce a
score for each item that estimates the probability that it will beat a randomly
chosen item for a randomly chosen respondent. Because this analysis is
modular, either step—estimation or summarization—could be improved
independently.
More formally, the estimation step posits an opinion matrix Θ = [θ_{j,k}], which has one row for each respondent and one column for each item, where θ_{j,k} is the amount that respondent j values item k (or, more generally, the amount that respondent j believes item k answers the question being asked). In the New York City example described above, θ_{j,k} could be the amount that a specific respondent values the idea “Open schoolyards across the city as public playgrounds.”
Three features of the response data complicate the process of estimating
the opinion matrix Θ. First, because the wiki survey is greedy, we have
an unequal number of responses from each respondent. Second, because
the wiki survey is collaborative, there are some items that can never be
presented to some respondents. For example, if respondent j contributed
an item, then none of the previous respondents could have seen that item.
Collectively, the greediness and the collaborativeness mean that in practice
we often have to estimate a respondent’s value for an item that she has never
encountered. The third problem is that responses are in the form of pairwise
comparisons, which means that we can only observe a respondent’s relative
preference between two items, not her absolute feeling about either item.
In order to address these three challenges, we propose a statistical model
that assumes that respondents’ responses reflect their relative preferences
between items (i.e., the Thurstone-Mosteller model [41, 49, 50]) and that
the distribution of preferences across respondents for each item follows
a normal distribution. Given these assumptions and weakly informative
priors, we can perform Bayesian inference to estimate the θj, k’s that are
most consistent with the responses that we observe and the assumptions
that we have made. One important feature of this modeling strategy is that
for those who contribute many responses, we can better estimate their row
in the opinion matrix, and for those who contribute fewer responses, we
have to rely more on the pooling of information from other respondents
(i.e., imputation). The specific functional forms that we assume result in
the following posterior distribution, which resembles a hierarchical probit
model:
(1)
where X is an appropriately constructed design matrix, Y is an appropriately constructed outcome vector, μ = (μ_1, …, μ_K) represents the mean appeal of each item, and μ_0 = (μ_0[1], …, μ_0[K]) and τ_0^2 = (τ_0[1]^2, …, τ_0[K]^2) are parameters of the priors for the mean appeal of each item (μ).
This statistical model is just one of many possible approaches to
estimating the opinion matrix from the response data, and we hope that future
research will develop improved approaches. We fully derive the model,
discuss situations in which our modeling assumptions might not hold, and
describe the Gibbs sampling approach that we use to make repeated draws
from the posterior distribution. Computer code to make these draws was
written in R [51] and utilized the following packages: plyr [52], multicore
[53], bigmemory [54], truncnorm [55], testthat [56], Matrix [57], and
matrixStats [58].
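As a rough, non-Bayesian stand-in for the estimation and summarization steps, the sketch below scores each item by the share of the comparisons it appeared in that it won. This only illustrates the summarization idea (under the Thurstone-Mosteller view, the probability that respondent j prefers item k over item i is Φ(θ_{j,k} - θ_{j,i})); it does not implement the hierarchical probit model or the Gibbs sampler used in the chapter.

    from collections import defaultdict

    def empirical_scores(votes, items):
        """Crude item scores: for each item, the percentage of the pairwise
        comparisons it appeared in that it won. Items never shown get None."""
        wins = defaultdict(int)
        appearances = defaultdict(int)
        for _respondent, winner, loser in votes:
            wins[winner] += 1
            appearances[winner] += 1
            appearances[loser] += 1
        return {item: (100 * wins[item] / appearances[item]
                       if appearances[item] else None)
                for item in items}

    # Small worked example with three items and three hypothetical votes.
    votes = [
        ("respondent-1", "idea A", "idea B"),
        ("respondent-2", "idea A", "idea C"),
        ("respondent-2", "idea C", "idea B"),
    ]
    print(empirical_scores(votes, ["idea A", "idea B", "idea C"]))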
CASE STUDIES
To show how pairwise wiki surveys operate in practice, in this section we
describe two case studies in which the All Our Ideas platform was used
for collecting and prioritizing community ideas for policymaking: New
York City’s PlaNYC 2030 and the Organisation for Economic Co-operation
and Development (OECD)’s “Raise Your Hand” initiative. As described
previously, the New York City Mayor’s Office conducted a wiki survey in
order to integrate residents’ ideas into the 2011 update to the City’s long-
term sustainability plan. The wiki survey asked residents to contribute their
ideas about how to create “a greener, greater New York City” and to vote
on the ideas of others. The OECD’s wiki survey was created in preparation
for an Education Ministerial Meeting and an Education Policy Forum on
“Investing in Skills for the 21st Century.” The OECD sought to bring fresh
ideas from the public to these events in a democratic, transparent, and
bottom-up way by seeking input from education stakeholders located around
the globe. To accomplish these goals, the OECD created a wiki survey to
allow respondents to contribute and vote on ideas about “the most important
action we need to take in education today.”
We assisted the New York City Mayor’s Office and the OECD in the
process of setting up their wiki surveys, and spoke with officials of both
institutions multiple times over the course of survey administration. We
also conducted qualitative interviews with officials from both groups at the
conclusion of survey data collection in order to better understand how the
wiki surveys worked in practice, contextualize the results, and get a better
sense of whether the use of a wiki survey enabled the groups to obtain
information that might have been difficult to obtain via other data collection
methods. Unfortunately, logistical considerations prevented either group
from using a probabilistic sampling design. Therefore, we can only draw
inferences about respondents, who should not be considered a random
sample from some larger population. However, wiki surveys can be used in
conjunction with probabilistic sampling designs, and we will return to the
issue of sampling in the Discussion.
Quantitative Results
The pairwise wiki surveys conducted by the New York City Mayor’s Office
and the OECD had similar patterns of respondent participation. In the
PlaNYC wiki survey, 1,436 respondents contributed 31,893 responses, and
in the OECD wiki survey 1,668 respondents contributed 28,852 responses.
Further, respondents contributed a substantial number of new ideas (464 for
PlaNYC, and 534 for OECD). Of these contributed ideas, those that the wiki
survey creators deemed inappropriate or duplicative were not activated. In
the end, the number of ideas under consideration was dramatically expanded.
For PlaNYC the number of active ideas in the wiki survey increased from
25 to 269, a 10-fold increase, and for the OECD from 60 to 285, a 5-fold
increase (Fig 4).
Figure 4: Cumulative number of activated ideas for PlaNYC [A] and OECD
[B].
The PlaNYC wiki survey ran from October 7, 2010 to January 30, 2011.
The OECD wiki survey ran from September 15, 2010 to October 15, 2010.
In both cases the pool of ideas grew over time as respondents contributed to
the wiki survey. PlaNYC had 25 seed ideas and 464 user-contributed ideas,
244 of which the Mayor’s Office activated. The OECD had 60 seed ideas
(6 of which it deactivated during the course of the survey), and 534 user-
contributed ideas, 231 of which it activated. In both cases, ideas that were
deemed inappropriate or duplicative were not activated.
https://doi.org/10.1371/journal.pone.0123483.g004
Within each survey, the level of respondent contribution varied widely,
in terms of both number of responses and number of ideas contributed, as
we expected given the greedy nature of the wiki survey. In both cases, the
distributions of both responses and contributed ideas contained “fat heads”
and “long tails” (see Fig 5). If the wiki surveys captured only a fixed amount
of information per respondent—as opposed to capturing all levels of effort—a
significant amount of information would have been lost. For instance, if
we only accepted the first 10 responses per respondent and discarded all
respondents with fewer than 10 responses, approximately 75% of the
responses in each survey would have been discarded. Further, if we were
to limit the number of ideas contributed to one per respondent, as is typical
in surveys with one and only one “other box,” we would have excluded a
significant number of new ideas: nearly half of the user-contributed ideas in
the PlaNYC survey and about 40% in the OECD survey.
Figure 6: Ten highest-scoring ideas for PlanNYC [A] and OECD [B].
Ideas that were contributed by respondents are printed in a bold/italic font
and marked by closed circles; seed ideas are printed in a standard font and
marked by open circles. In the case of PlaNYC, 8 of the 10 highest-scoring ideas were contributed by respondents.
Qualitative Results
Because user-contributed ideas that score well are likely to be of interest—
in fact, they highlight the value of the collaborativeness of wiki surveys—
we sought to understand more about these items by conducting interviews
with the creators of the PlaNYC and OECD wiki surveys. Based on these
interviews, as well as interviews with six other wiki survey creators, we
identified two general categories of high-scoring user-contributed ideas:
novel information—that is, substantively new ideas that were not anticipated
by the wiki survey creators—and alternative framings—that is, new and
resonant ways of expressing existing ideas.
Some high-scoring user-contributed ideas contained information that
was novel to the wiki survey creator. For example, in the PlaNYC context,
the Mayor’s Office reported that user-contributed ideas were sometimes
able to bridge multiple policy arenas (or “silos”) that might have been
more difficult connections to make for office staff working within a specific
arena. For instance, consider the high-scoring user-contributed idea “plug
ships into electricity grid so they don’t idle in port—reducing emissions
equivalent to 12000 cars per ship.” The Mayor’s Office suggested that
staff may not have prioritized such an idea internally (it did not appear on
the Mayor’s Office’s list of seed ideas), even though the idea’s high score
suggested public support for this policy goal: “[T]his relates to two areas.
So plugging ships into electricity grid, so that’s one, in terms of energy and
sourcing energy. And it relates to freight. [Question: Okay, which are two
separate silos?] Correct, so freight is something that we’re looking closer at.
… And emissions, reducing emissions, is something that’s an overall goal
of the plan. … So this has a lot of value to it for us to learn from” (interview
with Ibrahim Abdul-Matin, New York City Mayor’s Office, December 12,
2010).
Other user-contributed ideas suggested alternative framings for existing
ideas. For instance, the creators of the OECD wiki survey noted that high-
scoring, user-contributed ideas like “Teach to think, not to regurgitate”
“wouldn’t be formulated in such a way [by the OECD]. … [I]t’s very
un-OECD-speak, which we liked” (interview with Julie Harris, OECD,
February 3, 2011). More generally, OECD staff noted that “what for me has
been most interesting is that … those top priorities [are] very much couched
in the language of principles[. …] It’s sort of constitutional language”
(interview with Joanne Caddy, OECD, February 15, 2011). PlaNYC’s wiki
survey creators also described the importance of user-contributed ideas
being expressed in unexpected ways. The top-scoring idea in PlaNYC’s wiki
survey, contributed by a respondent, was “Keep NYC’s drinking water clean
by banning fracking in NYC’s watershed”; Mayor’s Office staff indicated
that the office would have used more general language about protecting the watershed.
DISCUSSION
In this paper we propose a new class of data collection instruments called
wiki surveys. By combining insights from traditional survey research and
projects such as Wikipedia, we propose three general principles that all wiki
surveys should satisfy: they should be greedy, collaborative, and adaptive.
Designing an instrument that satisfies those three criteria introduces a
number of challenges for data collection and data analysis, which we attempt
to resolve in the form of a pairwise wiki survey. Through two case studies
we show that pairwise wiki surveys can enable data collection that would
be difficult with other methods. Moving beyond these proof-of-concept case
studies to a fuller understanding of the strengths and weaknesses of pairwise
wiki surveys, in particular, and wiki surveys, in general, will require
substantial additional research.
One next step for improving our understanding of the measurement
properties of pairwise wiki surveys would be additional studies to assess
the consistency and validity of responses. Consistency could be assessed
by measuring the extent to which respondents provide identical responses
to the same pair and provide transitive responses to a series of pairs.
Assessing validity would be more difficult, however, because wiki surveys
tend to measure subjective states, such as attitudes, for which gold-standard
measures rarely exist [61]. Despite the inherent difficulty of validating
measures of subjective states, there are several approaches that could lead
to increased confidence in the validity of pairwise wiki surveys [62]. First,
studies could be done to assess discriminant validity by measuring the
extent to which groups of respondents who are thought to have different
preferences produce different wiki survey results. Second, construct validity
could be assessed by measuring the extent to which responses for items
that we believe to be similar are in fact similar. Third, studies could assess
ACKNOWLEDGMENTS
We thank Peter Lubell-Doughtie, Adam Sanders, Pius Uzamere, Dhruv
Kapadia, Chap Ambrose, Calvin Lee, Dmitri Garbuzov, Brian Tubergen,
Peter Green, and Luke Baker for outstanding web development; we thank
Nadia Heninger, Bill Zeller, Bambi Tsui, Dhwani Shah, Gary Fine, Mark
Newman, Dennis Feehan, Sophia Li, Lauren Senesac, Devah Pager, Paul
DiMaggio, Adam Slez, Scott Lynch, David Rothschild, and Ceren Budak
for valuable suggestions; and we thank Josh Weinstein for his critical role
in the genesis of this project. Further, we thank Ibrahim Abdul-Matin and
colleagues at the New York City Mayor’s Office and Joanne Caddy, Julie
Harris, and Cassandra Davis at the Organisation for Economic Co-operation
and Development. This paper represents the views of its authors and not the
users or funders of www.allourideas.org.
AUTHOR CONTRIBUTIONS
Conceived and designed the experiments: MJS KECL. Performed the
experiments: MJS KECL. Analyzed the data: MJS KECL. Wrote the paper:
MJS KECL.
REFERENCES
1. Lazarsfeld PF. The Controversy Over Detailed Interviews—An Offer
for Negotiation. Public Opinion Quarterly. 1944;8(1):38–60.
2. Converse JM. Strong arguments and weak evidence: The open/closed
questioning controversy of the 1940s. Public Opinion Quarterly.
1984;48(1):267–282.
3. Converse JM. Survey research in the United States: Roots and
emergence 1890–1960. New Brunswick: Transaction Publishers; 2009.
4. Schuman H. Method and meaning in polls and surveys. Cambridge:
Harvard University Press; 2008.
5. Schuman H, Presser S. The Open and Closed Question. American
Sociological Review. 1979 Oct;44(5):692–712.
6. Schuman H, Scott J. Problems in the Use of Survey Questions to
Measure Public Opinion. Science. 1987 May;236(4804):957–959.
pmid:17812751
7. Presser S. Measurement Issues in the Study of Social Change. Social
Forces. 1990 Mar;68(3):856–868.
8. Roberts ME, Stewart BM, Tingley D, Lucas C, Leder-Luis J, Gadarian
SK, et al. Structural Topic Models for Open-Ended Survey Responses.
American Journal of Political Science. 2014 Oct;58(4):1064–1082.
9. Krosnick JA. Survey Research. Annual Review of Psychology. 1999
Feb;50(1):537–567. pmid:15012463
10. Mitofsky WJ. Presidential Address: Methods and Standards: A
Challenge for Change. Public Opinion Quarterly. 1989;53(3):446–453.
11. Dillman DA. Presidential Address: Navigating the Rapids of Change:
Some Observations on Survey Methodology in the Early Twenty-First
Century. Public Opinion Quarterly. 2002 Oct;66(3):473–494.
12. Couper MP. Designing effective web surveys. Cambridge, UK:
Cambridge University Press; 2008.
13. Couper MP, Miller PV. Web Survey Methods: Introduction. Public
Opinion Quarterly. 2009 Jan;72(5):831–835.
14. Couper MP. The Future of Modes of Data Collection. Public Opinion
Quarterly. 2011;75(5):889–908.
15. Groves RM. Three Eras of Survey Research. Public Opinion Quarterly.
2011;75(5):861–871.
41. Thurstone LL. The Method of Paired Comparisons for Social Values.
Journal of Abnormal and Social Psychology. 1927;21(4):384–400.
42. Hacker S, von Ahn L. Matchin: Eliciting User Preferences with an
Online Game. Proceedings of the 27th international conference on
Human factors in computing systems. 2009;p. 1207–1216.
43. Salganik MJ, Watts DJ. Web-based Experiments for the Study of
Collective Social Dynamics in Cultural Markets. Topics in Cognitive
Science. 2009 Jul;1(3):439–468. pmid:25164996
44. Goel S, Mason W, Watts DJ. Real and perceived attitude agreement
in social networks. Journal of Personality and Social Psychology.
2010;99(4):611–621. pmid:20731500
45. Salganik MJ, Dodds PS, Watts DJ. Experimental Study of Inequality
and Unpredictability in an Artificial Cultural Market. Science. 2006
Feb;311(5762):854–856. pmid:16469928
46. Zhu H, Huberman B, Luon Y. To switch or not to switch: understanding
social influence in online choices. In: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems. CHI’12. New
York, NY, USA: ACM; 2012. p. 2257–2266.
47. Muchnik L, Aral S, Taylor SJ. Social Influence Bias: A Randomized
Experiment. Science. 2013 Aug;341(6146):647–651. pmid:23929980
48. van de Rijt A, Kang SM, Restivo M, Patil A. Field experiments of
success-breeds-success dynamics. Proceedings of the National
Academy of Sciences. 2014 May;111(19):6934–6939.
49. Mosteller F. Remarks on the Method of Paired Comparisons: I. The
Least Squares Solution Assuming Equal Standard Deviations and
Equal Correlations. Psychometrika. 1951 Mar;16:3–9.
50. Stern H. A Continuum of Paired Comparisons Models. Biometrika.
1990 Jun;77(2):265–273.
51. R Core Team. R: A Language and Environment for Statistical Computing; 2014. R Foundation for Statistical Computing, Vienna, Austria. Available from: http://www.R-project.org/.
52. Wickham H. The Split-Apply-Combine Strategy for Data Analysis.
Journal of Statistical Software. 2011;40(1):1–29.
53. Urbanek S. multicore: Parallel Processing of R Code on Machines with Multiple Cores or CPUs; 2011. R package version 0.1-5.
54. Kane MJ, Emerson JW. bigmemory: Manage Massive Matrices With Shared Memory and Memory-Mapped Files; 2011. R package version 4.2.11.
55. Trautmann H, Steuer D, Mersmann O, Bornkamp B. truncnorm: Truncated Normal Distribution; 2011. R package version 1.0-5.
56. Wickham H. testthat: Get Started with Testing. The R Journal.
2011;3(1):5–10.
57. Bates D, Maechler M. Matrix: Sparse and Dense Matrix Classes and Methods; 2011. R package version 0.999375-50.
58. Bengtsson H. matrixStats: Methods that apply to rows and columns of a matrix; 2013. R package version 0.8.14.
59. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis.
2nd ed. Boca Raton: Chapman and Hall/CRC; 2003.
60. Girotra K, Terwiesch C, Ulrich KT. Idea Generation and the Quality of
the Best Idea. Management Science. 2010 Apr;56(4):591–605.
61. Turner CF, Martin E. Surveying subjective phenomena. New York:
Russell Sage Foundation; 1984.
62. Fowler FJ. Improving survey questions: design and evaluation.
Thousand Oaks: Sage; 1995.
63. Lindley DV. On a Measure of the Information Provided by an Experiment.
The Annals of Mathematical Statistics. 1956 Dec;27(4):986–1005.
64. Glickman ME, Jensen ST. Adaptive paired comparison design. Journal
of Statistical Planning and Inference. 2005 Jan;127(1–2):279–293.
65. Pfeiffer T, Gao XA, Chen Y, Mao A, Rand DG. Adaptive Polling for
Information Aggregation. In: Twenty-Sixth AAAI Conference on
Artificial Intelligence; 2012.
66. von Ahn L, Dabbish L. Designing Games with a Purpose.
Communications of the ACM. 2008;51(8):58–67.
67. Mao A, Soufiani HA, Chen Y, Parkes DC. Capturing Cognitive Aspects of Human Judgment; 2013. arXiv:1311.0251. Available from: http://arxiv.org/abs/1311.0251.
68. Conrad FG, Schober MF. Envisioning the survey interview of the future. Hoboken, NJ: Wiley-Interscience; 2008.
69. Baker R, Blumberg SJ, Brick JM, Couper MP, Courtright M, Dennis
JM, et al. Research Synthesis: AAPOR Report on Online Panels. Public
Opinion Quarterly. 2010 Dec;74(4):711–781.
70. Brick JM. The Future of Survey Sampling. Public Opinion Quarterly.
2011 Dec;75(5):872–888.
71. Sullivan JL, Piereson J, Marcus GE. An Alternative Conceptualization
of Political Tolerance: Illusory Increases 1950s-1970s. American
Political Science Review. 1979;73(3):781–794.
72. Gal D, Rucker DD. Answering the Unasked Question: Response
Substitution in Consumer Surveys. Journal of Marketing Research.
2011 Feb;48:185–195.
Chapter 9
TOWARDS A STANDARD SAMPLING METHODOLOGY ON ONLINE SOCIAL NETWORKS: COLLECTING GLOBAL TRENDS ON TWITTER
Circuito Maestro Mario de la Cueva S/N, Ciudad Universitaria, Ciudad de México, 04510, México
SENSEable City Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue
ABSTRACT
One of the most significant current challenges in large-scale online social networks is to establish a concise and coherent method to collect and summarize data. Sampling the content of an Online Social Network (OSN) plays an important role as a knowledge discovery tool.
Current sampling methods must cope with the lack of a full sampling frame, i.e., a constraint imposed by limited data access. In addition, another key aspect to take into account is the huge amount of data generated by users of social networking services such as Twitter, which is perhaps the most influential microblogging service, producing approximately 500 million tweets per day. In this context, because the size of Twitter is difficult to measure, analyzing the entire network is infeasible and sampling is unavoidable.
In addition, we believe there is a clear need to develop a new methodology to collect information on social networks (social mining). In this regard, this paper introduces a set of random strategies that can be considered a reliable alternative for gathering global trends on Twitter. It is important to note that this research aims to present some initial ideas about how suitable random walks are for extracting information or global trends.
The main purpose of this study is to propose a suitable methodology to carry out an efficient collection process via three random strategies: Brownian, Illusion and Reservoir. These random strategies are applied through a Metropolis-Hastings Random Walk (MHRW). We show that interesting insights can be obtained by sampling emerging global trends on Twitter. The study also provides descriptive statistics and a graphical description of the preliminary experiments.
INTRODUCTION
In recent years, there has been increasing interest in the exploration of Online Social Networks (OSNs). Mining social signals can provide quick knowledge of a real-world event (Roy and Zeng 2014). More recently,
areas of social network analysis are now expanding to different disciplines,
not only in data mining studies but also in computational social science
(user behavior), social media analytics and complex systems. Thus, the
availability of unprecedented amounts of data about human interactions
from different social networks opens the possibility of using this information
to leverage knowledge about the diversity of social behavior and the activity
of individuals (Lu and Brelsford 2014; Piña-García and Gu 2013b; 2015;
Thapen and Ghanem 2013; Weng et al. 2012). The focus of social data
analysis is essentially the content that is being produced by users. The data
produced in social networks are rich, diverse and abundant, which makes
them a relevant source for data science (Ferrara et al. 2014; Kurka et al.
2015; Weikum et al. 2011).
The main challenge faced by many experiments in data science is the
lack of a standard methodology to collect and analyze data sets. Thus, the
main obstacles that data scientists face are as follows:
• They do not know what sort of data they need,
• They do not know how much data they need,
• They do not know what critical questions they should be asking, and
• They do not know whether the data are private, i.e., whether the data could be considered illegal to use in some contexts.
Social media platforms have increasingly replaced other means of
communication, such as telephone and emails (Phan and Airoldi 2015).
Thus, the rising interest in digital media and social interactions mediated
by online technologies is boosting the research outputs in an emerging field
that is multidisciplinary by nature: it brings together computer scientists,
sociologists, physicists, and researchers from a wide range of other
disciplines (González-Bailón et al. 2014).
Twitter can be considered the most studied OSN (Kurka et al. 2015). This
social media platform provides an efficient and effective communication
medium for one-on-one interactions and broadcast calls (e.g., for assistance
or dissemination and access to useful information) (Phan and Airoldi 2015).
In this regard, we consider Twitter as a suitable large-scale social network
to be explored (Kwak et al. 2010).
Twitter is the most famous microblogging website in the social media
space, where users post messages that are limited to 140 characters. In
addition, users can follow other accounts they find interesting. Posts are
called “tweets”. Unlike the case with other social networks, the relationship
does not have to be mutual (Golbeck 2013). It should be noted that Twitter
produces approximately 500 million tweets per day, with 271 million regular
users (Serfass and Sherman 2015). Therefore, Twitter has been a valuable
tool to track and identify patterns of mobility and activity, especially using
geolocated tweets. Geolocated tweets typically use the Global Positioning
System (GPS) tracking capability installed on mobile devices when enabled
by the user to give his or her precise location (latitude and longitude).
In this research we adopt a strategy to identify “trending topics”
generated in real-time on Twitter (Weng and Menczer 2015). The main goal
of this research is to extract emergent topics and identify their relevance on
Twitter. This manuscript is an exploratory data analysis based on topical
interest. It is important to note that Twitter has emerged as an important
platform to observe relatively informal communication.
This manuscript provides the basic steps that were followed to collect
systematically a set of trending topics. These topics or global trends were gathered and filtered according to their geographic distribution and topical interest.
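A minimal sketch of such a systematic collection loop is given below. It is a hypothetical illustration: fetch_trends stands in for whatever client the researcher uses to query the Twitter trends endpoint for a place identifier (WOEID) and is not a function provided by any specific library.

    import time

    def collect_trending_topics(fetch_trends, woeids, rounds, pause_seconds=300):
        """Poll a caller-supplied `fetch_trends(woeid)` callable for each place,
        tagging every topic with its place identifier and collection time."""
        records = []
        for _ in range(rounds):
            for woeid in woeids:
                for topic in fetch_trends(woeid):
                    records.append({"woeid": woeid, "topic": topic, "timestamp": time.time()})
            time.sleep(pause_seconds)
        return records

    # Example with a stubbed client (no network access needed); WOEID 1 = worldwide.
    fake_client = lambda woeid: ["#example"]
    print(collect_trending_topics(fake_client, woeids=[1], rounds=1, pause_seconds=0))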
In addition, we present a statistical and a descriptive analysis of the main
features obtained from our collected dataset. Finally, our central hypothesis
in this work is that in order to advance our understanding of social interaction,
it is necessary to propose a reliable methodology to collect, analyze and
visualize collected data from OSNs.
Contributions
A central hypothesis in this study is that, in order to advance our quantitative understanding of social interaction, it is not possible to get by with incomplete data. It becomes necessary to obtain representative data. Therefore, the aim
of this study is to propose an algorithm to discover and collect emerging
global trends on Twitter. Specifically, our contributions in this study are as
follows:
• This paper provides a series of random strategies (Brownian,
Illusion and Reservoir) based on random walk models to sample
small but relevant parts of information produced on Twitter.
• This research is intended to determine the extent to which
random walks can be combined by using an alternative version of
a Metropolis-Hastings algorithm.
RELATED WORK
A considerable amount of literature has been published on using graph
sampling techniques on large-scale OSNs. These studies are rapidly growing
in the scientific community, showing that sampling methods are essential
for practical estimation of OSN properties. These properties include, for
example: user age distribution, net activity, net connectivity and node
degree. Studies on social science show the importance of graph sampling
techniques, e.g., (Caci et al. 2012; Fire and Puzis 2012; Lee et al. 2006;
Mislove et al. 2006; Scott 2011).
Online social networks such as Facebook are among the biggest social services in the world. Therefore, they may be seen as a large-scale source from which to collect data with the aim of obtaining a representative sample or characterizing the whole network structure (Bhattacharyya et al. 2011; Caci et al. 2011; Ferri et al. 2012; Ugander et al. 2011). Recent evidence suggests that efficient random-walk-inspired techniques have been successfully used to sample large-scale social networks, in particular Facebook (Gjoka et al. 2010; 2011a). However, despite their relative success on Facebook, these specific sampling strategies have not been tested on different social networking services such as Twitter.
A number of researchers have pointed out that statistical approaches such as random walks can be used to improve and speed up the process of sampling. This can be done by considering different randomized algorithms which are able to cope with large datasets. Recently, the Metropolis-Hastings Random Walk algorithm has been tested on Facebook and Last.fm (a music website with 30 million active users), showing significant results for unbiased sampling of users (Gjoka et al. 2011a, 2011b; Kurant et al. 2011).
Similarly, some studies based on supervised random walks use
information from the network structure to guide a random walk
on the graph, e.g., on the Facebook social graph (Backstrom and Leskovec
2011). In addition, other studies analyze the same random
walk technique from the Markov Chain Monte Carlo (MCMC)
perspective, i.e., the Metropolis-Hastings random walk (MHRW), which is
mainly used to produce uniform samples (Bar-Yossef and Gurevich 2008).
An alternative Metropolis-Hastings random walk using a spiral proposal
distribution is presented in (Piña-García and Gu 2013b). The authors
examined whether it was possible to alter the behavior of the MHRW by using
spirals as a proposal distribution instead of the classic Gaussian distribution.
They observed that the spiral inspired approach was able to adapt itself
correctly to a Metropolis-Hastings random walk.
The studies presented thus far provide evidence that there is a
growing interest in the use of rapid sampling models and a clear need for
data extraction tools on Facebook (Bhattacharyya et al. 2011; Caci et al.
2011; Ferri et al. 2012; Ugander et al. 2011). However, Twitter has recently
received special attention from researchers interested in uncovering
global topics, well known as "memes" and "hashtags"2 (Hawelka
et al. 2014; Kallus 2014; Mitchell et al. 2013; Takhteyev et al. 2012; Thapen
and Ghanem 2013).
Recently, there has been an increasing amount of literature on data
collection via Twitter. Preliminary work on information diffusion was
presented in (Weng et al. 2013b), where the authors examined the mechanisms
behind human interactions through an unprecedented amount of data (a social
observatory). They also argued that information diffusion affects network
evolution.
An important analysis of the geography of Twitter networks was
presented in (Takhteyev et al. 2012). In this case, the authors showed that
distance matters on Twitter, both at short and at longer ranges, and they
argued that distance considerably constrains ties. The authors highlighted
the importance of Twitter for data collection due to its popularity
and international reach. They also suggested that ties at distances of
up to 1000 km are more frequent than would be expected if the ties
were formed randomly.
In a large longitudinal study carried out in (Hawelka et al. 2014), the
authors found global patterns of human mobility based on data extracted
from Twitter. A dataset of almost a billion tweets recorded in 2012 was
used to estimate volumes of international travelers. The authors argue that
Twitter is a viable source for understanding and quantifying global mobility patterns.
Furthermore, a detailed investigation of the correlations between real-
time expressions of individuals and a wide range of emotional, geographic,
demographic and health characteristics was conducted in (Mitchell et al.
2013). Results showed how social media may potentially be used to estimate
real-time levels of, and changes in, population-level measures (Haralabopoulos
and Anagnostopoulos 2014; Leskovec et al. 2008). The findings in (Mitchell
et al. 2013) were supported by a large dataset of over 10 million geo-
tagged tweets gathered from 373 urban areas in the United States during
the calendar year 2011.
PROBLEM DEFINITION
As the digital world grows, it generates an enormous amount of data every
second, challenging us to find new methods to efficiently extract and sample
information. Data science is a relatively new field of study that draws on
concepts from data analysis and data mining. We are experiencing a digital
revolution in which collecting data has become an everyday task for data
scientists. In this regard, scientific production based on data science has grown
sharply in the last few years, and this trend is likely to continue. At this point,
it is possible to observe many proposed methodologies for collecting and
analyzing information. However, none of these approaches
RANDOM STRATEGIES
This section examines the three random strategies that are incorporated into
the alternative version of the MHRW. These random strategies are intended to
be used heuristically as an internal picker for a candidate node (hereinafter
referred to as ϱ). The set of random strategies is composed as follows: a
Brownian walk (normal distribution), a spiral-inspired walk (Illusion) and a
Reservoir sampling method. It is important to note that the Brownian case
will be used as the baseline against which the other random strategies are
compared.
The main idea of the Metropolis-Hastings algorithm is to provide a
number of random samples from a given distribution. Thus, our proposed
version of the MHRW is able to sample a candidate node ϱ, which is directly
obtained from q(y|x) = {Brownian, Illusion, Reservoir}.
Brownian Walk
The traditional approach used to sample through the MHRW is based on
the normal distribution. In this regard, we have developed a Brownian walk
that presents a normal distribution. It is important to note that in most cases
the Brownian walk is related to a continuous time process. However, in
this research it has been considered a discretized version of this strategy.
Technically speaking in this model, the candidate node ϱ will be computed
according to the Java language command: Math.random().
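The following is a minimal sketch, not the authors' code, of how a discretized Brownian-style picker for the candidate node ϱ could look in Java. The paper only states that ϱ is computed with Math.random(); the Gaussian-step variant, the class and method names, and the step size are illustrative assumptions.

import java.util.List;
import java.util.Random;

public class BrownianPicker {
    private static final Random RNG = new Random();

    // Uniform pick over the list of trend nodes, as Math.random() suggests.
    static String pickUniform(List<String> trendNodes) {
        int index = (int) (Math.random() * trendNodes.size());
        return trendNodes.get(index);
    }

    // Discretized Brownian step: current index plus a Gaussian increment
    // (assumed variant; the step size of 3.0 is arbitrary).
    static String pickBrownianStep(List<String> trendNodes, int currentIndex) {
        int step = (int) Math.round(RNG.nextGaussian() * 3.0);
        int next = Math.floorMod(currentIndex + step, trendNodes.size());
        return trendNodes.get(next);
    }
}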
Illusion Spiral
In this research, we have considered a spiral-inspired approach in the form of
an Illusion spiral.3 This spiral has an interesting geometric shape, tracing a
sequence of points on a plane such that they are equitably and economically
spaced (see Fig. 1). The spiral model is produced by the following expression:
z ← az + bz/|z|. (1)
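As a rough illustration of Eq. (1), the sketch below iterates z ← az + bz/|z| on the complex plane. The paper gives no parameter values here; taking a as a complex constant of modulus slightly below 1 (so each step rotates and contracts) and b as a small positive real is purely an assumption for illustration; see (Davis 1993) for the actual Illusion-spiral parameterization.

public class IllusionSpiral {
    public static void main(String[] args) {
        // a = 0.97 * exp(i * 0.2): assumed rotation-and-contraction factor
        double aRe = 0.97 * Math.cos(0.2);
        double aIm = 0.97 * Math.sin(0.2);
        double b = 0.05;               // assumed radial push along z/|z|
        double re = 1.0, im = 0.0;     // starting point z0 = 1

        for (int k = 1; k <= 20; k++) {
            double mod = Math.hypot(re, im);                    // |z|
            double newRe = aRe * re - aIm * im + b * re / mod;  // Re(a*z + b*z/|z|)
            double newIm = aRe * im + aIm * re + b * im / mod;  // Im(a*z + b*z/|z|)
            re = newRe;
            im = newIm;
            System.out.printf("z_%d = %.4f %+.4fi%n", k, re, im);
        }
    }
}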
Reservoir Sampling
A reservoir sampling can be seen as an algorithm that consists in selecting
a random sample of size n, from a file containing N records, in which
the value of N is not known to the algorithm. According to (Vitter 1985),
the first step of any reservoir algorithm is to put the first n records into a
“reservoir”. The rest of the records are processed sequentially. Thus, the
number of items to select (k) is smaller than the size of the source array S(i).
Algorithm 1 provides an overview of the steps carried out by the reservoir
sampling process.
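For concreteness, here is a minimal sketch of classic reservoir sampling (Vitter's Algorithm R) that matches the description above: the first n records fill the reservoir, and each later record i replaces a random reservoir slot with probability n/i. This is an illustration, not a reproduction of the paper's Algorithm 1; the class and method names are assumed.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class ReservoirSampler {
    public static <T> List<T> sample(Iterator<T> stream, int n, Random rng) {
        List<T> reservoir = new ArrayList<>(n);
        int seen = 0;
        while (stream.hasNext()) {
            T record = stream.next();
            seen++;
            if (reservoir.size() < n) {
                reservoir.add(record);        // fill the reservoir first
            } else {
                int j = rng.nextInt(seen);    // uniform index in [0, seen)
                if (j < n) {
                    reservoir.set(j, record); // keep record with probability n/seen
                }
            }
        }
        return reservoir;
    }
}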
Figure 2: In summary, all the global trends retrieved from Twitter are poten-
tial nodes. Subsequently, all the trends that were collected from the servers are
drawn through the random walks provided by q(y|x).
The key idea of this alternative version of the MHRW algorithm is to
generate a number of independent samples from a given random generator.
Thus, it is necessary to sample a candidate node ϱ from q(y|x) = {Brownian,
Illusion, Reservoir}. The candidate node is accepted if and only
if this node belongs to the graph G. The steps of this method are outlined in
Algorithm 2.
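The sketch below shows one way the acceptance step just described could be organized: a candidate ϱ is drawn from one of the three random strategies q(y|x) and accepted if and only if it belongs to G. This is not the paper's Algorithm 2; the CandidatePicker interface and the representation of G as a node set are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class AlternativeMhrw {

    // One of the three random strategies (Brownian, Illusion, Reservoir), viewed abstractly.
    interface CandidatePicker {
        String pick(List<String> trendNodes, Random rng);
    }

    static List<String> run(List<String> trendNodes, Set<String> graphNodes,
                            CandidatePicker q, int draws, Random rng) {
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < draws; i++) {
            String rho = q.pick(trendNodes, rng);  // candidate from q(y|x)
            if (graphNodes.contains(rho)) {        // accept iff rho is already a node of G
                samples.add(rho);
            }
        }
        return samples;
    }
}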
Pre-processing
For the estimation of trends concentration, a list of countries with publicly
available trends was requested from Twitter. Countries are identified by
means of a specific WOEID. The term WOEID refers to a service that allows
to look up the unique identifier called the “Where on Earth ID” (see http://
developer.yahoo.com/geo/geoplanet/). Figure 3 illustrates a map of the
geographical locations of these countries. In addition, a full list of retrieved
countries can be found in Table 1.
Figure 3: The map shows the geographical location of the countries around the
globe that had more activity on Twitter according to a set of empirical trials.
Table 1: Retrieved countries with the most activity on Twitter during our
empirical trials
List of countries
Argentina Australia
Belgium Brazil
Canada Chile
Colombia Dom. Republic
Ecuador France
Germany Greece
Guatemala India
Indonesia Ireland
Italy Japan
Kenya Korea
Malaysia Mexico
Netherlands New Zealand
Nigeria Norway
Pakistan Peru
Philippines Poland
Portugal Russia
Singapore South Africa
Spain Sweden
U. Arab Emirates Turkey
Ukraine United Kingdom
United States Venezuela
Algorithm 2 interacts with Twitter via its public API as the primary way
to retrieve data. Once all the information has been retrieved, a random
sampling is performed across the global trends using Algorithm 2. Collected
samples are stored in an output data file and depicted on a visual interface.
Figure 4 shows how this process works on Twitter.
Figure 4: The diagram shows the content extraction tool, or social explorer,
using an API to establish a connection; a random sampling is then carried out
to collect global trends in real time from Twitter. Finally, it generates an output
information file and depicts the results on the visual interface.
A sample is chosen according to the following eligibility criteria (initial
conditions): 1) the number of countries and 2) a minimum number of users
following a global trend. In this study, the initial conditions consisted of 15
countries and 10 users. Therefore, a maximum of 150 (15 × 10) trending
topics per independent run were available to be gathered. However, because
a trending topic or a user can be counted multiple times, which makes the
measurement hard to interpret, all duplicate trends and duplicate users were
removed from the sample. After filtering out all duplicates, a data structure
containing a set of unique records can be built.
In summary, the steps to generate the data are as follows (an illustrative
sketch of this procedure is given after the list):
• Collect a list of WOEIDs by searching for countries with publicly
available trends, then randomly select a set of W=15 unique
countries.
• For each country c∈W, acquire a list of the top ten trending
topics (TT) and add each trending topic TT as a node to the graph G.
Then, set the minimum number of users following this trend to
Fr=10.
• For each TT, get a list of users linked to the corresponding trending
topic, e.g. Fr(TT), and add each of them as a node to G.
• Create an edge [TT, Fr(TT)] and add it to G.
• Save the graph G.
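The following is a minimal sketch of the data-generation steps listed above. The TrendSource interface is a stand-in for the Twitter API calls (it is not real client code), the simple node and edge collections stand in for the graph G, and the way Fr is applied per trend is an assumption; all names here are illustrative.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class TrendGraphBuilder {

    // Hypothetical abstraction over the Twitter endpoints used by the social explorer.
    interface TrendSource {
        List<Long> countriesWithTrends();             // WOEIDs with publicly available trends
        List<String> topTrendingTopics(long woeid);   // top ten trending topics per country
        List<String> followers(String trendingTopic); // users linked to a trending topic
    }

    static void build(TrendSource source, int w, int fr,
                      Set<String> nodes, List<String[]> edges) {
        // 1) Randomly select W unique countries with publicly available trends.
        List<Long> woeids = new ArrayList<>(source.countriesWithTrends());
        Collections.shuffle(woeids);
        for (long woeid : woeids.subList(0, Math.min(w, woeids.size()))) {
            // 2) Add each top trending topic TT as a node of G.
            for (String tt : source.topTrendingTopics(woeid)) {
                nodes.add(tt);
                // 3) Attach users following the trend (the paper uses Fr=10 as the
                //    minimum number of followers per trend) and 4) create edges [TT, Fr(TT)].
                List<String> users = source.followers(tt);
                for (String user : users.subList(0, Math.min(fr, users.size()))) {
                    nodes.add(user);
                    edges.add(new String[]{tt, user});
                }
            }
        }
        // 5) The caller can then serialize the node and edge collections, e.g. to GML.
    }
}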
RESULTS
In order to assess the performance of the social explorer, a sample of publicly
available trends was collected, this random sample contains tweets posted
from December 17 to December 20 2013, between 16:30 and 22:30 GMT
(time window). This sample consisted of 3,325 trending topics generated by
225,102 unique users that emerged during the observed time window.
It is important to note that, in this case, the extracted tweets were not restricted
to English. This feature provides a different framework with respect to
previous studies in which only English tweets were collected, e.g., (Weng et
al. 2012, 2013a). One advantage of this multilingual feature is that it avoids
a bias toward information posted in English.
To replicate the sampling process, a series of 10 independent walks was
performed for each of the three random strategies: q(y|x) = Brownian,
Illusion, Reservoir (30 runs in total). Two different output files were then
stored for further analysis: a .dat file and a .gml file. The first contains
information such as the total number of trending topics, total number of unique
followers, number of iterations, total number of sampled trends, a full list
of the collected trends, number of nodes, number of edges, node degree per
trending topic, memory usage, total number of duplicates and the elapsed
time during the sampling process. The second output
file is a GML (Graph Modeling Language) formatted file, which
describes a graph obtained by the social explorer. This file is used to build
and evaluate graphically each of the samples.
Figure 5a compares a cumulative analysis of the number of trends. This
plot may be divided into three main criteria: the number of trends retrieved from
the Twitter service (collected), the number of trends after removing all duplicates
(filtered) and the number of sampled trends collected by each random generator
(sampled). Similarly, means with respect to the number of sampled trends
are shown in Fig. 5b. What is interesting in these data is that the sampled
trends represent the core information used to evaluate how the three models
behaved in terms of data collection. It should be highlighted that the number
of collected trends depends exclusively on the Twitter service. Likewise,
the filtering process was carried out as a data cleaning step.
Figure 5: a Plot divided into three main criteria: number of trends retrieved
from Twitter (collected); number of trends after removing all duplicates (fil-
tered) and number of sampled trends collected by each random generator (sam-
pled). b Means corresponding to the average of sampled trends.
Owing to the natural tendency of the social explorer to move toward
the same node many times, which is induced by each of the random
strategies, a considerable number of duplicate trends is added to the output
sequence. This makes it possible to compare the results in terms of the number of
duplicate trends generated during the observation time window (see Fig. 6a).
Likewise, Fig. 6b compares the number of unique followers obtained
from each random generator q(y|x).
Figure 6: Plots generated during the observation time window: a the number
of duplicate trends for q(y|x). b the number of unique followers presented with
logarithmic scale for the y-axis.
Figure 7: Group of plots of the percentage of accuracy plotted versus the num-
ber of trials. The measures are computed based on the percentage of sampled
trends and on the percentage of duplicate trends generated by each random
generator.
Memory Consumption
This section examines the estimated memory usage employed by each
proposed model. The results obtained from the preliminary analysis of
memory consumption can be compared in Table 4, this table compares the
average memory consumption in Megabytes (MB) and the total of memory
used across 10 independent runs. In this regard, there were no significant
differences between the amount of MB used for each random generator.
Because the experiments were run using custom software written in Java,
memory consumption was assessed in megabytes (MB). The basic computer
hardware information is as follows: Processor: Intel(R) Core(TM)2 Duo CPU
at 3.33 GHz. Installed memory (RAM): 4.00 GB. System type: 64-bit operating
system. The application was run on Windows 7 Enterprise edition. Figure 8
presents a cumulative memory usage plot. This plot is presented as a stacked
bar chart that shows the sum of the memory consumption across 10 independent
walks per model. From these data, it can be seen that there were no significant
differences between the sampling methods used as random strategies.
Figure 8: Stacked bar chart displaying the sum of the memory consumption
split in 10 independent runs per random generator.
whose sizes are proportional to the number of edges incident to the trending
node, i.e., the degree of the node. Essentially, both the graphs and the word
clouds show the same information.
Convergence Monitoring
Part of the aim of this research is to identify convergence during the sampling
process. Therefore, a convergence analysis was prepared according to the
procedure used by the Geweke to evaluate the accuracy of sampling-based
approaches (Geweke 1991; Lee et al. 2006). This Geweke diagnostics is a
standard Z-score which consists in taking two non-overlapping parts of the
Markov chain and compares the means of both parts, using a difference of
means test to see if the two parts of the chain are from the same distribution
(null hypothesis).
This diagnostic represents a test of whether the sample of draws has
attained an equilibrium state based on the first 10 % of the sample of draws,
versus the last 50 % of the sample of draws. If the Markov chain of draws
has reached an equilibrium state, it would be expected to obtain roughly
equal averages from these two splits of the sample (Lesage 1999). MATLAB
functions that were used to implement these estimations can be found at
http://www.spatial-econometrics.com/gibbs/contents.html.
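As a rough illustration of the diagnostic just described, the sketch below compares the mean of the first 10 % of a chain with the mean of its last 50 % and forms a Z-score from the difference. For simplicity the standard errors are estimated from plain sample variances; the MATLAB toolbox cited above uses spectral-density estimates instead, so this is a simplification and not the exact implementation used in the study. Per the text, convergence can be declared when most Z-values fall in the [-1, 1] interval.

public class GewekeDiagnostic {

    static double zScore(double[] chain) {
        int nA = (int) Math.floor(0.1 * chain.length);                        // first 10 %
        int startB = chain.length - (int) Math.floor(0.5 * chain.length);     // last 50 %
        double meanA = mean(chain, 0, nA);
        double meanB = mean(chain, startB, chain.length);
        double seA = variance(chain, 0, nA, meanA) / nA;                      // naive standard errors
        double seB = variance(chain, startB, chain.length, meanB) / (chain.length - startB);
        return (meanA - meanB) / Math.sqrt(seA + seB);
    }

    private static double mean(double[] x, int from, int to) {
        double s = 0;
        for (int i = from; i < to; i++) s += x[i];
        return s / (to - from);
    }

    private static double variance(double[] x, int from, int to, double mean) {
        double s = 0;
        for (int i = from; i < to; i++) s += (x[i] - mean) * (x[i] - mean);
        return s / (to - from - 1);
    }
}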
Figure 10 provides trace plots for the node degree property (the number
of users that follow a particular trend). These plots present the Z-score value
against the number of iterations. Therefore, using the Geweke diagnostic it
is possible to carry out the convergence analysis for the Brownian walk, the
Illusion spiral and the Reservoir sampling. The number of draws was fixed
at 1100, with a burn-in process discarding the first 100. Thus, in accordance
with (Gjoka et al. 2011a), we declare convergence when most values fall
in the [–1, 1] interval. Additionally, we plot an average line using 30 points
on the x-axis. Finally, as can be seen in Fig. 10, our convergence analysis
suggests that our sample draws have attained an equilibrium state, showing
that the means of the values converge rapidly in the sequence.
Figure 10: Plots of the resulting Z-scores against the number of iterations for
the metric of node degree (number of users that follow a particular trend). Hori-
zontal lines at Z=±1 are added to the plots to indicate the convergence interval.
LIMITATIONS
One advantage of this approach is the multilingual feature which avoids
a bias in terms of the information posted in English. However, there are
certain drawbacks associated with the use of different languages e.g., lack
of knowledge of the language and the misinterpretation of the statements.
On the other hand, this research does not take into account that the social
explorer is not able to distinguish between Twitterbots4 and real users on
Twitter. Therefore, all the estimates include Twitterbots causing an over
estimation in the results. These data must be interpreted with caution since
all the information collected from this study is mainly based on the Twitter
response service.
CONCLUSIONS
This paper has explained the central importance of defining a standard
sampling methodology applicable to cases where the social network
information flow is readily available. The main purpose of the current study
was to assess a low computational cost method for sampling emerging
global trends on Twitter.
Endnotes
1. A word, phrase or topic that is tagged at a greater rate than other tags is said
to be a trending topic.
2. A hashtag is a word or metadata tag prefixed with the hash symbol (#).
3. See (Davis 1993) for a full description of the Illusion spiral.
4. A Twitterbot is a program used to produce automated posts on the Twitter
microblogging service, or to automatically follow Twitter users.
ACKNOWLEDGEMENTS
This work has been supported in part by “Programa de Apoyo a Proyectos
de Investigación e Innovación Tecnológica” (grant no. PAPIIT IA301016).
Carlos Gershenson was partially supported by SNI membership 47907. J.
Mario Siqueiros-García was partially supported by SNI membership 54027.
We also acknowledge the support of projects 212802, 221341, 260021 and
222220 of CONACyT. Carlos Piña-García acknowledges UNAM for a post-
doctoral fellowship.
AUTHORS’ CONTRIBUTIONS
The content extraction tool was programmed by C. A. Piña-García. All
authors helped to write the literature review and to collect data. C. A.
Piña-García wrote the majority of the paper with assistance from Carlos
Gershenson and Siqueiros-García. All authors read and approved the final
manuscript.
REFERENCES
1. Backstrom, L, Leskovec J (2011) Supervised random walks: predicting
and recommending links in social networks In: Proceedings of the
fourth ACM international conference on web search and data mining,
635–644.. ACM.
2. Bar-Yossef Z, Gurevich M. Random sampling from a search engine’s
index. J ACM (JACM) 2008;55(5):24. doi: 10.1145/1411509.1411514.
3. Bhattacharyya P, Garg A, Wu SF. Analysis of user keyword similarity
in online social networks. Soc Netw Anal Mining. 2011;1(3):143–158.
doi: 10.1007/s13278-010-0006-4.
4. Caci, B, Cardaci M, Tabacchi ME (2011) Facebook as a small world: a
topological hypothesis. Soc Netw Anal Mining: 1–5.
5. Caci B, Cardaci M, Tabacchi ME. Facebook as a small world: a
topological hypothesis. Soc Netw Anal Mining. 2012;2(2):163–167.
doi: 10.1007/s13278-011-0042-8.
Davis P. Spirals: From Theodorus to Chaos. Wellesley, MA: AK
Peters; 1993.
7. Ferrara E, De Meo P, Fiumara G, Baumgartner R. Web data extraction,
applications and techniques: a survey. Knowl Based Syst. 2014;70:301–
323. doi: 10.1016/j.knosys.2014.07.007.
8. Ferri F, Grifoni P, Guzzo T. New forms of social and professional
digital relationships: the case of facebook. Soc Netw Anal Mining.
2012;2(2):121–137. doi: 10.1007/s13278-011-0038-4.
9. Fire, M, Puzis R (2012) Organization mining using online social
networks. Netw Spat Econ: 1–34. Springer.
10. Geweke J. Evaluating the accuracy of sampling-based approaches
to the calculation of posterior moments. MN, USA: Federal Reserve
Bank of Minneapolis, Research Department Minneapolis; 1991.
11. Gjoka M, Kurant M, Butts CT, Markopoulou A. Proceedings of IEEE
INFOCOM ’10. San Diego, CA: IEEE; 2010. Walking in Facebook: a
case study of unbiased sampling of OSNs; pp. 1–9.
12. Gjoka M, Kurant M, Butts CT, Markopoulou A. Practical
recommendations on crawling online social networks. Selected
Areas Commun IEEE J. 2011;29(9):1872–1892. doi: 10.1109/
JSAC.2011.111011.
Zurich, Switzerland
2 Department of Communication and Media Research, University of Zurich, Zurich,
Switzerland
3 Department of Psychology, University of Zurich, Zurich, Switzerland
BACKGROUND
Mobile data collection with smartphones—which belongs to the
methodological family of ambulatory assessment, ecological momentary
assessment, and experience sampling—is a method for assessing and
Citation: (APA): Seifert, A., Hofer, M., & Allemand, M. (2018). Mobile data collec-
tion: smart, but not (yet) smart enough. Frontiers in neuroscience, 12, 971. (4 pages)
Copyright: © 2018 Seifert, Hofer and Allemand. This is an open-access article distrib-
uted under the terms of the Creative Commons Attribution License (CC BY): http://
creativecommons.org/licenses/by/4.0/
2013), and machine learning (e.g., Bleidorn and Hopwood, 2018), are
required. As a result, an interdisciplinary research approach involving
researchers interested in collecting data with smartphones and experts
familiar with those forms of data collection, management, and analysis
is crucial. Such endeavors should be supported by funding organizations
and academic career programs, enabling the full potential of mobile data
collection with smartphones to be achieved.
As a fourth challenging area, we identify the contextual information
that can be collected with smartphone sensor data (i.e., passive data), as
researchers have to consider the different forms, intervals, and amounts
of sensor data (e.g., GPS data, app use, and accelerometer data). When
collecting passive data continuously over multiple days, researchers need
to consider more than just the data itself; they must also be able to interpret
what the measurements indicate and convert the data into psychologically
meaningful variables, such as sociability or mobility patterns (e.g., Mehl
et al., 2006; Harari et al., 2016). Although this task is fundamental to the
research, it often requires new skills of researchers and new approaches
within the technology—approaches that ideally automatically aggregate
passive smartphone-sensor-based data. For example, when collecting sound
files containing conversation, it would be very helpful to automatically
detect the spoken words of a target person (e.g., Mehl et al., 2001), detect
contextual information (e.g., Lu et al., 2012), or interpret GPS data in
terms of mobility patterns (e.g., Ryder et al., 2009). For such requirements,
preliminary solutions do exist (e.g., Barry et al., 2006; White et al., 2011),
but much more development and validation work is needed before we can
achieve automatic, preprocessed, and validated smartphone-sensor data that
can be combined with other types of data collection.
The fifth challenging area relates to the smartphone device itself. Mobile
data collection with smartphones requires more technical preparation and
greater technical confidence and skills, on the side of both the researcher
and participant, than is required in classic paper-and-pencil studies. Daily
technical hassles such as malfunctioning software and hardware, low
smartphone batteries, and operating systems crashing during ongoing
studies cost time and resources. Therefore, we highly recommend including
an explicit time buffer and anticipating a higher than usual drop-out rate
in smartphone studies to compensate for potential technical problems and
challenges (for more information on technical issues, please see Mehl and
Conner, 2012; Miller, 2012; Harari et al., 2016). Although the technical side
of mobile data collection with smartphones is likely to become more reliable
over time, more validation studies are required in this area and more ready-
made valid apps are needed. When using smartphones for data collection
within specific population groups, it is also important to consider the unique
needs of the target group. For example, when working with older adults, it
can be helpful to reflect participants’ potential lack of smartphone skills by
adapting briefings on smartphone/app use (Seifert et al., 2017).
The final, though certainly not least important, challenging area is that
of data security and ethical issues. Collecting mobile data has revived past
concerns about data protection and the ethical use of data. Using mobile
devices for data collection, including tracking behavior and lifestyle
patterns, introduces a unique dimension to individual participant protection.
When collecting intensive profiles of individuals, which is the main research
method within mobile data collection with smartphones, anonymization is
nearly impossible. Therefore, traceable real-life data requires an intensive
consideration of ethical and legal approval, the safeguarding of participant
privacy, and the establishment of data security and data privacy (Harari et
al., 2016; Marelli and Testa, 2018). As an example, Beierle et al. (2018b)
conceived a privacy model for mobile data collection apps. Zook et al.
(2017) present ten simple rules for responsible big data research, concluding
that ethical and data protection issues should not prevent research but that it
is vital to ensure “that the work is sound, accurate, and maximizes the good
while minimizing harm” (Zook et al., 2017, p. 8). When using participants’
own smartphones, it is also important that researchers acquire participants’
consent to share self-recorded data with researchers (Gustarini et al., 2016).
In a quantitative population survey among persons over 50 years of age,
Seifert et al. (2018) found that more than half of this demographic
group is willing to share self-recorded data with researchers, regardless of
participants’ age, gender, education, technology affinity, or perceived health.
The sharing and use of participants’ own self-recorded data may require
new models of participant involvement, with the goal of creating a trusted
relationship between the data providers and researchers working with the
data (Beierle et al., 2018b; Seifert et al., 2018).
CONCLUSIONS
Mobile data collection with smartphones offers unique and innovative
opportunities for studying human beings and processes in real life and real
time. This approach offers researchers the opportunity to collect real-time
reports of participants in their natural environment and within their individual
dynamics and life contexts with the help of a regular smartphone. However,
the approach also brings many challenges that provide interesting avenues
for future developments. To date, mobile data collection with smartphones
is already very smart, but we see the potential for even smarter mobile data
collection in the future.
AUTHOR CONTRIBUTIONS
All authors worked on this paper from conception to final approval and
share the same opinion.
ACKNOWLEDGMENTS
MH thanks the Swiss National Science Foundation for funding (No.
PY00PI_17485/1). MA thanks the UZH Digital Society Initiative (DSI)
from the University of Zurich and the Swiss National Science Foundation
(No. 162724) for funding.
REFERENCES
1. Albert M. V., Kording K., Herrmann M., Jayaraman A. (2012). Fall
classification by machine learning using mobile phones. PLOS ONE
7:e36556. 10.1371/journal.pone.0036556
2. Allemand M., Mehl M. R. (2017). Personality assessment in daily life:
a roadmap for future personality development research, in Personality
Development Across the Lifespan, ed. J. Specht (London: Elsevier
Academic Press; ), 437–454.
3. Aschwanden D., Luchetti M., Allemand M. (2018). Are open and
neurotic behaviors related to cognitive behaviors in daily life of older
adults? J. Pers. 10.1111/jopy.12409.
4. Asparouhov T., Hamaker E. L., Muthén B. (2017). Dynamic
latent class analysis. Struct. Equ. Modeling 24, 257–269.
10.1080/10705511.2016.1253479
5. Barry S. J., Dane A. D., Morice A. H., Walmsley A. D. (2006). The
automatic recognition and counting of cough. Cough 2:8. 10.1186/1745-
9974-2-8
6. Beierle F., Tran V. T., Allemand M., Neff P., Schlee W., Probst T.,
et al. (2018a). TYDR: Track your daily routine. Android app for
tracking smartphone sensor and usage data, in Proceedings of the
5th International Conference on Mobile Software Engineering and
Systems - MOBILESoft ‘18; 2018 May 27–28 (Gothenburg: ACM
Press; ), 72–75.
7. Beierle F., Tran V. T., Allemand M., Neff P., Schlee W., Probst T., et
al. (2018b). Context data categories and privacy model for mobile
data collection apps. Procedia Comput. Sci. 134, 18–25. 10.1016/j.
procs.2018.07.139
8. Bleidorn W., Hopwood C. J. (2018). Using machine learning to
advance personality assessment and theory. Pers. Soc. Psychol. Rev.
1:1088868318772990 10.1177/1088868318772990
9. Bolger N., Laurenceau J. P. (2013). Intensive Longitudinal Methods:
An Introduction to Diary and Experience Sampling Research. New
York, NY: Guilford Press.
10. Cartwright J. (2016). Technology: Smartphone science. Nature 531,
669–671. 10.1038/nj7596-669a
11. Conner T. S., Tennen H., Fleeson W., Barrett L. F. (2009). Experience
sampling methods: a modern idiographic approach to personality
Citation: (APA): Bond, D. M., Hammond, J., Shand, A. W., & Nassar, N. (2020). Com-
paring a mobile phone automated system with a paper and email data collection sys-
tem: Substudy within a randomized controlled trial. JMIR mHealth and uHealth, 8(8),
e15284. (13 pages)
Copyright: © This is an open-access article distributed under the terms of the Creative
Commons Attribution License (https://creativecommons.org/licenses/by/4.0/).
ABSTRACT
Background
Traditional data collection methods using paper and email are increasingly
being replaced by data collection using mobile phones, although there is
limited evidence evaluating the impact of mobile phone technology as part
of an automated research management system on data collection and health
outcomes.
Objective
The aim of this study is to compare a web-based mobile phone automated
system (MPAS) with a more traditional delivery and data collection
system combining paper and email data collection (PEDC) in a cohort of
breastfeeding women.
Methods
We conducted a substudy of a randomized controlled trial in Sydney,
Australia, which included women with uncomplicated term births who
intended to breastfeed. Women were recruited within 72 hours of giving
birth. A quasi-randomized number of women were recruited using the PEDC
system, and the remainder were recruited using the MPAS. The outcomes
assessed included the effectiveness of data collection, impact on study
outcomes, response rate, acceptability, and cost analysis between the MPAS
and PEDC methods.
Results
Women were recruited between April 2015 and December 2016. The analysis
included 555 women: 471 using the MPAS and 84 using the PEDC. There
were no differences in clinical outcomes between the 2 groups. At the end of
the 8-week treatment phase, the MPAS group showed an increased response
rate compared with the PEDC group (56% vs 37%; P<.001), which was also
seen at the 2-, 6-, and 12-month follow-ups. At the 2-month follow-up, the
MPAS participants also showed an increased rate of self-reported treatment
compliance (70% vs 56%; P<.001) and a higher recommendation rate for
future use (95% vs 64%; P<.001) as compared with the PEDC group. The
cost analysis between the 2 groups was comparable.
Conclusions
MPAS is an effective and acceptable method for improving the overall
management, treatment compliance, and methodological quality of clinical
research to ensure the validity and reliability of findings.
INTRODUCTION
Background
Participant engagement and response is a vital aspect of any clinical research
study. Many research studies are costly, labor intensive, and potentially
compromised because of the difficulties associated with patient compliance,
engagement, incomplete data collection, and inadequate follow-up [1-3]. The
method and type of data collection system utilized to recruit participants and
collect data throughout the study is important to ensure the quality, reliability,
and validity of data collection. In addition, it must be cost-effective and
acceptable to participants, funding organizations, and researchers [4-6].
Paper-based data collection in research studies is gradually being
replaced or used in conjunction with electronic data collection systems
[7], primarily in the form of emails containing links to web-based surveys.
Comparison of these two methods has been well documented [8-11].
In recent years, mobile phone technology has been increasingly used
to promote health-related behavioral change and self-management of
care via the use of apps and automated SMS text messages. Studies have
shown effective changes in psychological and physical symptoms [12-
14] as well as specific pregnancy and breastfeeding outcomes [15,16] by
sending individually tailored text messages to participants. However, a
Cochrane review specifically looking at mobile phone apps as a method
of data delivery for self-administered questionnaires found that none of
the included trials in the review reported data accuracy or response rates
[17]. Furthermore, a review of studies utilizing mobile phones for data
collection showed that they were based on very small sample sizes, collected
intermittent data (as opposed to daily), or had limited longitudinal data
collection (maximum 9 months) [18-21]. There is also limited assessment
of mobile phone technology as part of a web-based automated system,
Objectives
The primary aims of this study were to compare a web-based research
management system utilizing mobile phone technology with a traditional
delivery and data collection system using a combination of paper- and email-
based methods on clinical research outcomes and to assess the acceptability
and effectiveness of use, including cost analysis.
METHODS
Design
We conducted a prespecified substudy as part of the APProve (CAn
Probiotics ImProve Breastfeeding Outcomes?) trial to compare a mobile
phone automated system (MPAS) with a paper and email data collection
(PEDC) system. APProve was a double-blind randomized controlled trial
(RCT) evaluating the effectiveness of an oral probiotic versus a placebo
for preventing mastitis in breastfeeding women. It was conducted between
April 2015 and December 2016 in 3 maternity hospitals in Sydney, Australia.
Detailed methods have been published previously [25]. Briefly, it involved
the evaluation of a probiotic versus a placebo taken daily for 8 weeks for
the prevention of mastitis, which was assessed using short daily and slightly
longer weekly questionnaires during the first 8 weeks following birth and
longer follow-up questionnaires at 2, 6, and 12 months.
The MPAS was a data delivery and collection system that combined
treatment randomization, SMS delivery to participants, electronic data
collection, and data management. It was developed by the study team with
the aid of an eResearch (electronic research) company, which developed
the system based on our prospective design specifications. The system
integrated 2 established software services, SMS delivery and a web-based
survey tool, which were then linked to a secure web-based data management
system. The MPAS sent automated text messages to the participants’
mobile phones with links to self-administered web-based surveys. Each
survey link was embedded with the participant’s unique identifier, enabling
comparison across multiple surveys. A maximum of 2 automated reminders
were integrated into the system if a participant did not respond after 3 days.
The MPAS was pilot tested by 17 members of the research department,
with feedback and suggestions integrated into the system before study
commencement.
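The reminder rule described above (a survey link embedding the participant's unique identifier, and at most two automated reminders if no response has been recorded after three days) can be sketched as follows. This is not the actual MPAS implementation; the class, method and parameter names, and the way the link is composed, are illustrative assumptions.

import java.time.LocalDate;

public class SurveyReminderRule {

    static final int MAX_REMINDERS = 2;        // at most 2 automated reminders
    static final int DAYS_BEFORE_REMINDER = 3; // sent if no response after 3 days

    // Decides whether a reminder SMS is due on a given day.
    static boolean reminderDue(LocalDate lastSentDate, LocalDate today,
                               boolean responded, int remindersAlreadySent) {
        if (responded || remindersAlreadySent >= MAX_REMINDERS) {
            return false;
        }
        return !today.isBefore(lastSentDate.plusDays(DAYS_BEFORE_REMINDER));
    }

    // Builds a survey link embedding the participant's unique identifier
    // (hypothetical URL scheme, for illustration only).
    static String surveyLink(String baseUrl, String participantId, String surveyId) {
        return baseUrl + "?survey=" + surveyId + "&pid=" + participantId;
    }
}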
The PEDC included a combination of an 8-week calendar diary provided
to participants at the time of trial entry and emailed links to weekly and
follow-up surveys. The calendar diaries were identified with the participant
study number at the time of treatment randomization, and the start date was
manually entered. The A4-size calendar was preserved with a waterproof
coating, allowing for daily entries by pen. Participants were encouraged to
hang the calendar in a prominent place at home. PEDC users were supplied
with a stamped, addressed envelope to post the calendar back to the trial
coordinating center at the end of the treatment phase.
The study was approved by the Northern Sydney Local Health
District Human Research Ethics Committee, approval number HREC/14/
HAWKE/358, and registered with the Australian New Zealand Clinical
Trials Registry, registration number ACTRN12615000923561. Written
informed consent was obtained from all participants.
Data Collection
Baseline sociodemographic, clinical, and birth characteristics collected in this
study are shown in Table 1. All daily, weekly, and follow-up questionnaires
were identical for the 2 groups.
Characteristic | MPAS (n=526) | PEDCb (n=94) | t test (df)c | Chi-square (df)c | P value
Maternal
Maternal age (years), mean (SD) | 33.4 (4.9) | 33.5 (4.0) | 0.06 (618) | N/Ad | .95
Born in Australia, n (%) | 256 (48.7) | 56 (59.6) | N/A | 3.8 (1) | .05
First baby, n (%) | 312 (59.3) | 44 (46.8) | N/A | 5.1 (1) | .02
Allocated to probiotic, n (%) | 265 (50.4) | 46 (48.9) | N/A | 0.1 (1) | .80
Caesarean section, n (%) | 163 (31.0) | 25 (26.6) | N/A | 0.7 (1) | .39
Birthweight (grams), mean (SD) | 3421 (458.1) | 3456 (451.6) | 0.69 (618) | N/A | .49
b PEDC: paper and email data collection.
c Test statistics using Pearson chi-square test for categorical variables and
2-tailed, independent sample t test for continuous variables with their
respective df are presented.
d N/A: not applicable.
e College, university, or vocational training after high school.
For the MPAS group, each study site was provided with an electronic
tablet with internet connectivity to enable the research assistant to enter the
participants’ details, conduct treatment randomization, and enter baseline
and hospital data directly into the web-based data management system.
All research assistants were trained in the use of the MPAS and given
individualized password-protected access to the website, which could be
accessed by phone, tablet, or computer. Only deidentified data were entered
into the database and linked to an individual study number generated
automatically at randomization. The only paper-based data for this cohort
included a signed patient information and consent form and a trial entry
form containing the participants’ contact details. Once randomized, the study
number generated by the MPAS was written in the trial entry form to allow
for reidentification, if required. An audit trail was integrated into the MPAS
to log all SMS messages sent and surveys completed. Daily and weekly
outcome data for the APProve trial for the first 8 weeks (56 days) following
birth were collected via self-completed questionnaires using automated
weblinks sent directly via SMS to the participant’s mobile phone. Before
the follow-up questionnaires at 2, 6, and 12 months (63, 180, and 360 days),
participants were sent an automated link asking for their preferred method
of receiving the questionnaires, with SMS, email, or post as options. On the
basis of the response, the MPAS would either send the participant an SMS
link to the relevant survey or alert the trial coordinator by an automated
email of the preference for an emailed or a postal questionnaire.
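A minimal sketch of the routing rule just described follows: depending on the participant's stated preference, the system either issues the SMS survey link itself or alerts the trial coordinator to send an email or postal questionnaire. The enum, method names and returned messages are illustrative assumptions, not the MPAS code.

public class FollowUpRouter {

    enum Preference { SMS, EMAIL, POST }

    // Returns the action the automated system should take for a follow-up questionnaire.
    static String route(Preference preference, String participantId, String surveyLink) {
        switch (preference) {
            case SMS:
                // The system sends the SMS link directly.
                return "SEND_SMS to participant " + participantId + ": " + surveyLink;
            case EMAIL:
            case POST:
            default:
                // The system alerts the trial coordinator by automated email.
                return "ALERT_COORDINATOR: participant " + participantId
                        + " prefers a " + preference + " questionnaire";
        }
    }
}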
For the PEDC participants, baseline and hospital data were collected
on paper-based data forms and then entered into the web-based system at
the trial coordinating center. Once randomized to their allocated treatment,
participants were given a calendar diary by the research assistant to record
daily outcomes for 8 weeks. Weekly outcome data for the first 8 weeks
and follow-up questionnaires at 2, 6, and 12 months were collected by an
emailed weblink to a web-based survey sent by the clinical trial coordinator
(Figure 1).
Figure 1: Flow diagram comparing the mobile phone automated system with
paper and email data collection. MPAS: mobile phone automated system;
PEDC: paper and email data collection.
Outcomes
Outcomes evaluating participant acceptability, treatment compliance, and
effectiveness of data collection comparing the MPAS with the PEDC were
assessed in the 2-month follow-up questionnaire. Data were collected on the
ease of participation in the trial and the ease of remembering to take the study
treatment every day (both rated from 0 [very difficult] to 5 [very easy]), self-
reported compliance with taking the allocated treatment (compliance was
defined as having taken the product for ≥42 of 56 days, semicompliance as
having taken the product for 15-41 of 56 days, and noncompliance as having
taken the product for ≤14 of 56 days), whether the method of data collection
was helpful in reminding the participant to take the treatment (ranked from
0 [not helpful at all] to 5 [very helpful]), recommendation of the allocated
method of data collection for future studies, and the preference for how the
participant wanted to receive the follow-up questionnaires (SMS, email, or
post). The effectiveness of data collection was defined as the frequency of
completing the questionnaires at all time points.
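The self-reported compliance categories defined above (compliant: ≥42 of 56 days; semicompliant: 15-41 of 56 days; noncompliant: ≤14 of 56 days) amount to a simple classification rule, sketched below. The class, enum and method names are illustrative and not taken from the trial software.

public class ComplianceClassifier {

    enum Compliance { COMPLIANT, SEMICOMPLIANT, NONCOMPLIANT }

    static Compliance classify(int daysProductTaken) {
        if (daysProductTaken >= 42) return Compliance.COMPLIANT;      // >= 42 of 56 days
        if (daysProductTaken >= 15) return Compliance.SEMICOMPLIANT;  // 15-41 of 56 days
        return Compliance.NONCOMPLIANT;                               // <= 14 of 56 days
    }
}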
We also assessed whether the data collection method had any impact
on the clinical trial outcomes. Clinical outcomes were collected during
the daily, weekly, and 2-month surveys. They included mastitis, maternal
infection, and breastfeeding status up to 2 months after birth. The mastitis
outcome measure was based on self-reported symptoms related to breast
infection or a clinical diagnosis of mastitis by a care provider [26].
Satisfaction with using their assigned method of data collection (MPAS
or PEDC) was assessed by using open-ended free text questions to elicit
written comments pertaining to what the participants liked the most and the
least about their assigned method of data collection and what suggestions
could be provided for future use. In addition, satisfaction with the method of
data collection was elicited from the MPAS users and responses ranked from
0 (did not like at all) to 5 (really liked it). This response was subgrouped into
2 categories: satisfied (4-5) and less satisfied (0-3).
The cost analysis of utilizing the MPAS compared with the PEDC was
also performed. Costs included those associated with the initial development
and ongoing usage of each system and personnel time associated with trial
participant survey collection and follow-up. A web-based time tracking
report was generated weekly to determine the average time required for
creating and sending emails and manual data entry from paper survey
collection.
Statistical Analysis
Baseline sociodemographic, clinical, and birth characteristics were compared
between the 2 groups. Categorical data were summarized using percentages,
and the differences in the characteristics between the 2 groups were assessed
using a chi-square test. Continuous outcomes with a normal distribution
were summarized using mean and SD, and the characteristics between the
2 groups were compared using t tests. Data with a nonnormal distribution
were summarized using medians, and the groups were compared using
nonparametric Wilcoxon tests. Satisfaction with the MPAS was analyzed
by maternal sociodemographic characteristics and treatment compliance.
Written responses were thematically assessed by 2 authors and an external
researcher, who each independently coded the data, followed by group
RESULTS
Participant Characteristics
Of 620 women, 526 women were quasi-randomized to the MPAS group
and 94 women to the PEDC group. There were no differences between the
groups except that a higher percentage of women in the MPAS group gave
birth to their first baby (P=.02; Table 1). After loss to follow-up of 10.5%
(55/526) participants in the MPAS group and 11% (10/94) in the PEDC group,
secondary outcomes were analyzed for 555 women. We found no difference
in the trial outcomes between the 2 data collection groups (Table 2). There
was also no difference in the ease of use between the MPAS and PEDC
groups. However, a higher proportion of participants using the MPAS were
compliant with taking the study treatment (331/471, 70.3% vs 47/84, 56%;
P<.001), were more likely to rate their method of data collection as being a
helpful reminder to record their symptoms (median 4.37 vs 2.63; P<.001),
and were more likely to recommend their assigned method for future use
(330/349, 94.6% vs 36/56, 64%; P<.001). There was little difference among
the characteristics of the women who were lost to follow-up compared with
those for whom we had follow-up data, except that at 2 months postpartum,
the former were less likely to be tertiary educated (45/65, 69% vs 472/555,
85.0%; P=.001).
Table 2: Impact and acceptability of the mobile phone automated system compared
with the paper and email data collection system

Outcome | MPAS | PEDCb | t test (df)c | Chi-square (df)c | Estimate (95% CI) | P value
Any breastfeeding at 2 months, n (%) | 443e (94.5) | 77 (91.7) | N/A | 0.1 (1) | 1.55 (0.65 to 3.69) | .32
Exclusive breastfeeding at 2 months, n (%) | 385f (82.3) | 67 (79.8) | N/A | 0.3 (1) | 1.18 (0.66 to 2.11) | .58
Ease of participation (0-5, 5=very easy), mean (SD) | 3.76 (1.31) | 3.57 (1.40) | −1.02 (428) | N/A | 0.19 (−0.56 to 0.18) | .31
Ease of remembering to take product (independent of method; 0-5, 5=very easy), mean (SD) | 3.21 (1.43) | 2.95 (1.50) | −1.3 (427) | N/A | 0.21 (−0.66 to 0.14) | .21
Compliant with treatment, n (%) | | | N/A | 15.8 (2) | N/A | <.001
  Compliant (≥42 of 56 days) | 331 (70.3) | 47 (56.0) | N/A | N/A | N/A | N/A
  Semicompliant (15-41 of 56 days) | 87 (18.5) | 14 (16.7) | N/A | N/A | N/A | N/A
  Noncompliant (≤14 of 56 days) | 53 (11.3) | 23 (27.4) | N/A | N/A | N/A | N/A
Helpful reminder (data collection; 0-5, 5=very helpful), mean (SD) | 4.37 (1.19) | 2.63 (1.85) | −9.3 (403) | N/A | 0.19 (−2.11 to −1.38) | <.001
Recommend for future, n (%) | 330 (94.6)g | 36 (64.3)h | N/A | 50.8 (1) | 0.19 (−2.11 to −1.38) | <.001

b PEDC: paper and email data collection.
c Test statistics using Pearson chi-square for categorical variables and 2-tailed,
independent sample t test for continuous variables with their respective df
are presented.
d N/A: not applicable.
e N=469.
f N=468.
g N=349.
h N=56.
Among the MPAS users, satisfaction was high with a mean score of
4.49 out of 5 (SD 1.0). There was no difference in satisfaction scores among
maternal characteristics. There was a difference in satisfaction related to
compliance, with participants most compliant with treatment being the most
satisfied with the use of the MPAS (P<.001; Figure 3). Nearly half of the
participants preferred to receive the questionnaires by either SMS (135/289,
46.7%) or email (139/289, 48.0%) at 2 months; however, the preference
for SMS increased to 60% for both the 6- and 12-month questionnaires
(142/241, 58.9% and 135/224, 60.2%, respectively). Very few women opted
to receive questionnaires by post (<5%).
Figure 3: Treatment compliance and satisfaction for the mobile phone auto-
mated system (n=555).
Responses to open-ended questions in the 2-month questionnaires
were received from 74.1% (349/471) MPAS participants and 67% (56/84)
PEDC participants. The themes identified were related to the factors that
the participants liked most and liked least about their method of data
collection as outlined in Table 3. Most of the MPAS participants stated that
the MPAS was easy, convenient, quick, accessible, and efficient to use. In
particular, many commented that web-based questionnaires were easy to
Table 3: Qualitative analyses of the likes and dislikes of mobile phone auto-
mated system users compared with paper and email data collection system users

Participant factors related to method of data collection | MPASa (n=349), n (%) | PEDCb (n=56), n (%)
Liked the most
  Ease of use | 325 (93.1) | 7 (12.5)
  Good reminder to take treatment | 75 (21.5) | 10 (17.8)
Liked the least
  Nothing | 168 (48.1) | 10 (17.8)
  Time consuming | 24 (6.9) | 12 (21.4)
  Functionality issues | 77 (22.1) | 10 (17.8)
  Difficult to remember to complete survey | 16 (4.6) | 14 (25.0)

a MPAS: mobile phone automated system.
b PEDC: paper and email data collection.
Suggestions for future use by the MPAS participants included allowing
users to select the time of day to receive the SMS and to opt in or out of
reminder messages, limiting the number of questions on the questionnaire
to minimize scrolling, diversifying the content of each SMS for improved
interest, and improving the functionality to allow the questionnaires to be
completed later if interrupted. Many of the PEDC participants recommended
the use of SMS or a web-based app for data collection (Textbox 1).
Textbox 1. Participants’ comments about the mobile phone automated system
compared with the paper and email data collection system.
Mobile phone automated system
• “I found using my phone to complete the surveys great as I could
do it easily when feeding my daughter.”
Cost Analysis
Cost analysis between the 2 groups showed a comparable per-person cost,
with the MPAS costing on average Aus $10 (US $7.21) more (Tables 4 and
5).
b Labor is calculated at Aus $50 (US $36.04) per hour.
c Emails are calculated at 5 min per email.
DISCUSSION
Principal Findings
This study demonstrates that an MPAS is an effective and acceptable tool
for improving study delivery and data collection within a randomized
trial as compared with a more traditional system. We have shown that the
mobile phone system improved treatment compliance and response rates,
demonstrated greater user satisfaction, is comparable in cost to PEDC, and
does not impact study outcomes.
of recall bias [23]. Reducing the burden and time of data collection on
the research assistant was significant, along with issues associated with
patient confidentiality and storage of physical case report forms [23,29].
The advantage of integrating the MPAS via a web-based platform ensured
access across mobile phone platforms and enabled accessibility to a large
and diverse population, especially for those living in rural, remote, or
disadvantaged areas or where mobility is restricted [31,32]. In addition,
staff sick leave and absences were less of an issue because of the automated
nature of the system, leading to increased flexibility of the research team,
which is important when managing research studies on small budgets in
small teams.
participants. However, we were able to resolve many of the issues and make
slight modifications to the software over time. This did not negatively impact
the response rates. A final limitation was that no assessment of participant
time was included in the cost analysis. This was not included as it was not
anticipated that there would have been a discernible difference in time cost
between the 2 groups. Posting the diaries and logging on to the computer
for the weekly questionnaires may have elicited more time from the PEDC
participants, but this would have been negligible.
Conclusions
Despite the increasing growth of web-based clinical trial management
systems, there has been little or no evaluation of these systems against
traditional methods of trial management systems. Since the commencement
of our trial, there have been improvements in the quality and availability
of electronic data collection systems. For example, REDCap (Research
Electronic Data Capture) is a secure web application for building and
managing web-based surveys and databases, specifically for research studies
and operations [37]. The system offers an easy-to-use and secure method of
flexible yet robust data collection, which is free to researchers affiliated with
universities. Using such a system would have decreased the costs associated
with the development of the web-based survey tool we utilized as well as
eliminated many of the functionality issues we experienced to reduce future
research costs.
Future research should focus on how to maximize the effect of mobile
phone technology, such as implementing strategies to improve long-term
engagement with participants by simplifying questionnaires, optimizing
the number of text messages, and personalizing the content and timing of
messages.
Although we evaluated MPAS in a perinatal population, the use
of mobile phone technology provides the opportunity to facilitate and
improve the quality and effectiveness of clinical research studies; enhance
patient interaction; and improve clinical research across a wide range
of methodologies, disciplines, and health care settings. Integration and
evaluation of mobile phone research management systems that are cost-
effective, efficient, and acceptable to both researchers and patients is
essential, given the increasing use of mobile phone technology [24] and
high costs of undertaking research. We have shown that the use of an
integrated MPAS is an effective and acceptable method for improving the
ACKNOWLEDGMENTS
Funding was provided by the Ramsay Research and Teaching Fund of Royal
North Shore Hospital and the Kolling Institute of Medical Research. NN
was supported by Australian National Health and Medical Research Council
Career Development (APP1067066) and DB by a University of Sydney
Postgraduate Award. In-kind support was provided by Intersect Australia
Ltd for research support and development of the MPAS. The funders of
the study had no role in the study design, data collection, data analysis,
data interpretation, or writing of the report. No payment was received for
writing this paper by pharmaceutical companies or other agencies. The
corresponding author had full access to all the data in the study and had the
final responsibility for the decision to submit for publication. The authors
would like to thank the research coordinators and midwives of Royal North
Shore Hospital, Royal Prince Alfred Hospital, and Royal Hospital for
Women for their assistance in trial recruitment and data collection and Ms
Andrea Pattinson for her assistance in the qualitative review of responses.
The authors also gratefully acknowledge the contribution of the women who
participated in this trial.
REFERENCES
1. Stone AA, Shiffman S, Schwartz JE, Broderick JE, Hufford MR.
Patient compliance with paper and electronic diaries. Control Clin
Trials. 2003 Apr;24(2):182–99. doi: 10.1016/s0197-2456(02)00320-3.
2. Wood AM, White IR, Thompson SG. Are missing outcome data
adequately handled? A review of published randomized controlled
trials in major medical journals. Clin Trials. 2004;1(4):368–76. doi:
10.1191/1740774504cn032oa.
3. Jüni P, Altman D, Egger M. Systematic reviews in health care:
assessing the quality of controlled clinical trials. Br Med J. 2001 Jul
7;323(7303):42–6. doi: 10.1136/bmj.323.7303.42. http://europepmc.
org/abstract/MED/11440947.
4. Sibbald B, Roland M. Understanding controlled trials. Why
are randomised controlled trials important? Br Med J. 1998 Jan
17;316(7126):201. doi: 10.1136/bmj.316.7126.201. http://europepmc.
org/abstract/MED/9468688.
5. Sanson-Fisher RW, Bonevski B, Green LW, D’Este C. Limitations of
the randomized controlled trial in evaluating population-based health
interventions. Am J Prev Med. 2007 Aug;33(2):155–61. doi: 10.1016/j.
amepre.2007.04.007.
6. Whitford H, Donnan P, Symon A, Kellett G, Monteith-Hodge E,
Rauchhaus P, Wyatt J. Evaluating the reliability, validity, acceptability,
and practicality of SMS text messaging as a tool to collect research
data: results from the feeding your baby project. J Am Med Inform
Assoc. 2012;19(5):744–9. doi: 10.1136/amiajnl-2011-000785. http://
europepmc.org/abstract/MED/22539081.
7. Nahm ML, Pieper CF, Cunningham MM. Quantifying data quality
for clinical trials using electronic data capture. PLoS One. 2008 Aug
25;3(8):e3049. doi: 10.1371/journal.pone.0003049. http://dx.plos.
org/10.1371/journal.pone.0003049.
8. Fitzgerald D, Hockey R, Jones M, Mishra G, Waller M, Dobson A. Use
of online or paper surveys by Australian women: longitudinal study
of users, devices, and cohort retention. J Med Internet Res. 2019 Mar
14;21(3):e10672. doi: 10.2196/10672. https://www.jmir.org/2019/3/
e10672/
9. Chen L, Chapman JL, Yee BJ, Wong KK, Grunstein RR, Marshall
NS, Miller CB. Agreement between electronic and paper Epworth
Xiang Huang
Guangdong University of Finance and Economics, College of entrepreneurship education,
Guangzhou 510320, China
ABSTRACT
The application of big data not only brings great convenience, but also
creates social problems such as big data-enabled price discrimination
against existing customers, information leakage, and so on, which seriously
affect customers' willingness to participate and their satisfaction with the
enterprise. How to collect customer information in a way that improves
customers' willingness to participate is therefore an urgent topic to be
addressed.
Citation: (APA): Huang, X. (2021, April). Big Data Collection and Object Participa-
tion Willingness: An Analytical Framework from the Perspective of Value Balance. In
Journal of Physics: Conference Series (Vol. 1881, No. 3, p. 032016). IOP Publishing.
Copyright: © Content from this work may be used under the terms of the Creative
Commons Attribution 3.0 licence (http://creativecommons.org/licenses/by/3.0).
collection methods to bridge the "digital gap" between the big data subject
and the big data technology object.
From the perspective of the value balance of the big data object, this paper
attempts to construct a model that can provide useful insight and reference
for the research and practice of big data applications.
① Procedural necessity
If the big data collection process is simply a required step for the big data
object to carry out other economic or social activities, then, in order to
achieve its goal, the object can only accept the data input requirements,
whether it is willing or not. Conversely, if the data collection process is not
required for the object to carry out other activities, or demands actions
beyond what the procedure needs, the object may be unwilling to participate
in the data collection process. For example, when shopping on an e-commerce
platform, consumers must enter the necessary identity information and the
details of the purchased item in order to complete the transaction, whether
they are willing or not, so they have to accept that the data are collected.
However, if the platform requires personal information that has nothing to
do with the transaction, such as height and weight, or asks for additional
information on income, purchase frequency, and other consumption
intentions, consumers are often reluctant to participate.
② Activity value
If the big data collection process is a required step for the big data object to
carry out other economic or social activities, and the results of those
activities bring great utility and satisfaction to the object, this will motivate
the object to complete the process. The stronger the willingness to complete
these economic or social activities, the stronger the motivation to participate
in the big data collection process.
③ Information sensitivity
If the big data object regards the required data as private information, it is
often reluctant to participate; otherwise, it is easier to persuade it to
participate in data collection. For example, requiring big data objects to
input information such as marital status, family income level, or sexual
orientation may make them very vigilant or even resentful.
④ Process complexity
If the big data object is required to input a great deal of information, if the
input process is complex and cumbersome, or if the information to be
entered requires a certain knowledge and ability to identify and judge, and
therefore costs considerable time and energy, its willingness and enthusiasm
to participate may be greatly reduced.
⑤ Data security
Even if the information that big data objects are required to input is not
sensitive or private, their willingness and enthusiasm to participate may be
greatly reduced if they believe that big data subjects may abuse the
information when analyzing and using it, or worry that big data subjects
may leak the information when storing it, eventually damaging their rights
and interests.
⑥ Data value
If the big data object believes that its participation in the big data collection
process will bring value not only to the big data subject but also to itself,
creating a win-win or even multi-win situation, then it will have more
willingness and enthusiasm to participate. The higher the potential value of
the data to big data objects, the stronger their motivation to participate in
the big data collection process.
When the big data object has a high evaluation of the activity value but a
low evaluation of the data value, the object will think that the data collection
process has little to do with its own value and that participating mainly
helps the big data subject create value, so it is a "dedication" type of data
collection process. When the big data object has a low evaluation of the
activity value but a high evaluation of the data value, the object's
participation is mainly attracted by the interests promised by the big data
subject or by the future value of the big data application, so it is an
"inducement" type of data collection process. When the object's evaluation
of both the activity value and the data value is low, the object has no
motivation of its own to participate and is mainly induced by the big data
subject through other means, so it is a "fishing" type of data collection
process.
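Read as a two-by-two framework, the passage above classifies a collection process by whether the big data object rates the activity value and the data value as high or low. The following minimal sketch only illustrates that classification logic; the numeric threshold, the function name, and the label used for the high/high case (which the recovered text does not describe) are assumptions, not part of the original framework.

```python
def classify_collection_process(activity_value: float, data_value: float,
                                threshold: float = 0.5) -> str:
    """Illustrative 2x2 classification of a big data collection process.

    activity_value: the object's evaluation of the activity's value (0..1)
    data_value:     the object's evaluation of the data's value (0..1)
    The threshold and the label for the high/high quadrant are assumptions.
    """
    high_activity = activity_value >= threshold
    high_data = data_value >= threshold
    if high_activity and not high_data:
        return "dedication"   # participation mainly creates value for the subject
    if not high_activity and high_data:
        return "inducement"   # participation attracted by promised or future value
    if not high_activity and not high_data:
        return "fishing"      # participation induced by the subject through other means
    return "win-win"          # assumed label: both activity and data value rated high


print(classify_collection_process(0.8, 0.2))  # -> "dedication"
```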
Chapter 13
RESEARCH ON COMPUTER SIMULATION BIG DATA INTELLIGENT COLLECTION AND ANALYSIS SYSTEM
Hongying Liu
Department of Computer Science and Engineering, Guangzhou College of Technology and
Business, Guangzhou 510850, China
ABSTRACT
A characteristic of big data is that individual data items are no longer
isolated; the data and their underlying mechanisms have complex
associations that make all of the data an indivisible whole. The dynamic
generation and disappearance of data change the original relationships and
affect the overall characteristics of the data. This feature of big data makes the subject-
Citation: (APA): Liu, H. (2021, March). Research on Computer Simulation Big Data
Intelligent Collection and Analysis System. In Journal of Physics: Conference Series
(Vol. 1802, No. 3, p. 032052). IOP Publishing. (7 pages)
Copyright: © Content from this work may be used under the terms of the Creative
Commons Attribution 3.0 licence (http://creativecommons.org/licenses/by/3.0).
INTRODUCTION
The emergence of technologies such as cloud computing, mobile
communications, and big data has promoted the rapid development of
various application software, which is now widely used in logistics and
warehousing, power communications, and smart tourism. The database is
the most critical part of application software in cloud computing; it is both
the starting point and the end point of the application's data processing.
Therefore, database design is an important factor affecting the use and
popularization of application software. At present, as the number of user
visits gradually increases, the scale of big data handled by the database has
also become huge [1]. The traditional database design model is prone to
overload, which is not conducive to improving the efficiency of database
extraction, so a new intelligent storage method needs to be added.
At present, many experts and scholars are conducting research on
multi-resolution collection of network big data. For example, the
multi-resolution collection method of network data based on linear
regression applies linear regression analysis to construct a model of the
sensing data that preserves its characteristics, so that the node only
transmits the parameter information of the regression model.
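As a rough illustration of this idea (not the cited method itself), the sketch below fits a least-squares line to one window of sensor readings so that only two parameters need to be transmitted, and reconstructs an approximation at the receiver; the window length and the synthetic signal are illustrative choices.

```python
import numpy as np

def compress_window(readings):
    """Fit y = a*t + b to one window of sensor readings and return (a, b).

    Transmitting only (a, b) instead of the raw samples reduces the amount
    of data sent by the node, at the cost of an approximation error.
    """
    t = np.arange(len(readings))
    a, b = np.polyfit(t, readings, deg=1)
    return a, b

def reconstruct_window(a, b, length):
    """Rebuild an approximate window from the transmitted parameters."""
    t = np.arange(length)
    return a * t + b

# Example: a slowly drifting temperature signal with noise.
rng = np.random.default_rng(0)
window = 20.0 + 0.05 * np.arange(64) + rng.normal(0, 0.1, 64)
a, b = compress_window(window)            # 2 values transmitted instead of 64
approx = reconstruct_window(a, b, len(window))
print(np.max(np.abs(window - approx)))    # worst-case reconstruction error
```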
Related Theorems
Assume that, for the same target, sensors i and j provide initial state estimates $\hat{x}_0^i$ and $\hat{x}_0^j$ with covariance matrices $P_0^i$ and $P_0^j$, respectively. The dynamic equation of the target motion is

$x_{k+1} = F_k x_k + w_k$    (1)

where the process noise $w_k$ is zero-mean white noise with covariance matrix $Q_k$. The measurement equations of the two sensors are

$z_k^i = H_k^i x_k + v_k^i$, $z_k^j = H_k^j x_k + v_k^j$    (2)

where the measurement noises $v_k^i$ and $v_k^j$ are zero-mean white noise with covariance matrices $R_k^i$ and $R_k^j$ (3), and each sensor's local estimation-error covariance follows its own filter recursion (4). The cross-covariance between the two local estimation errors is given by Equation (5); it can be seen from this formula that the cross-correlation is caused by the a priori estimation process noise $Q_{k-1}$ and the measurement noise.

The concept of consistency: suppose that the true state of the target is $x$, the sensor's estimate of the target state is $\hat{x}$, the estimated error covariance is $P$, and the true error covariance is $\tilde{P}$. The estimate is said to be consistent if the estimated covariance satisfies $P \geq \tilde{P}$ (6).

Under such conditions, making full use of this correlation information can improve the fusion accuracy; the algorithm flow is given in Equation (7). Formulas (8) and (9) determine the fusion weights used in the fusion step. The literature gives a method for estimating the bound on the correlation coefficient, and the cross-covariance matrix can be estimated from Equation (5).
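The exact fusion weights referred to in Equations (7)–(9) are not recoverable from the extracted text, so the sketch below shows one standard consistency-preserving fusion rule, covariance intersection, purely as an illustration of how two local estimates can be combined when their cross-covariance is unknown; the grid search over the weight w is an implementation choice, not the chapter's algorithm.

```python
import numpy as np

def covariance_intersection(x_i, P_i, x_j, P_j, n_grid=101):
    """Fuse two consistent estimates (x_i, P_i) and (x_j, P_j) by covariance
    intersection: P^-1 = w*P_i^-1 + (1-w)*P_j^-1, choosing w to minimise
    trace(P). The fused estimate stays consistent (P >= true error
    covariance) for any unknown cross-correlation between the sensors."""
    Pi_inv, Pj_inv = np.linalg.inv(P_i), np.linalg.inv(P_j)
    best = None
    for w in np.linspace(0.0, 1.0, n_grid):
        P = np.linalg.inv(w * Pi_inv + (1.0 - w) * Pj_inv)
        if best is None or np.trace(P) < best[0]:
            x = P @ (w * Pi_inv @ x_i + (1.0 - w) * Pj_inv @ x_j)
            best = (np.trace(P), x, P)
    return best[1], best[2]

# Example: two sensors track the same 2-D state with complementary accuracy.
x_i, P_i = np.array([1.0, 0.0]), np.diag([1.0, 4.0])
x_j, P_j = np.array([1.2, -0.1]), np.diag([4.0, 1.0])
x_fused, P_fused = covariance_intersection(x_i, P_i, x_j, P_j)
print(x_fused, np.trace(P_fused))
```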
Experimental Analysis
The proposed Java3D-based multi-resolution acquisition method for network
big data was compared with the optical fibre network communication data
resolution acquisition method and with the network data resolution
acquisition method based on a linked-list structure, in terms of the
completion time of multi-resolution acquisition of network data. The
completion time is measured in seconds (s). The experimental results are
shown in Figure 2, in which A represents the proposed method, B represents
the data resolution acquisition method based on optical fibre network
communication, and C represents the method based on the linked-list
structure.
CONCLUSION
To address the problems of traditional multi-resolution acquisition of
network big data, a Java3D-based multi-resolution acquisition method for
network big data is proposed. The method achieves a shorter completion
time for multi-resolution acquisition of network big data, lower cost, and
higher acquisition accuracy; it shows good application performance and can
be widely used in various fields.
REFERENCES
1. Wang, L., & Wang, G. Big data in cyber-physical systems, digital
manufacturing and industry 4.0. International Journal of Engineering
and Manufacturing (IJEM), 6(4) (2016) 1-8.
2. Zhu, L., Yu, F. R., Wang, Y., Ning, B., & Tang, T. Big data analytics
in intelligent transportation systems: A survey. IEEE Transactions on
Intelligent Transportation Systems, 20(1) (2018) 383-398.
3. Jung, D., Tran Tuan, V., Dai Tran, Q., Park, M., & Park, S. Conceptual
Framework of an Intelligent Decision Support System for Smart City
Disaster Management. Applied Sciences, 10(2) (2020) 666-675.
4. Zhong, R. Y., Xu, C., Chen, C., & Huang, G. Q. Big data analytics
for physical internet-based intelligent manufacturing shop floors.
International journal of production research, 55(9) (2017) 2610-2621.
5. Sumalee, A., & Ho, H. W. Smarter and more connected: Future
intelligent transportation system. IATSS Research, 42(2) (2018) 67-71.
6. Zheng, X., Chen, W., Wang, P., Shen, D., Chen, S., Wang, X., ... &
Yang, L. Big data for social transportation. IEEE Transactions on
Intelligent Transportation Systems, 17(3) (2015) 620-630.
7. Chih-Lin, I., Sun, Q., Liu, Z., Zhang, S., & Han, S. The big-data-driven
intelligent wireless network: architecture, use cases, solutions, and
future trends. IEEE vehicular technology magazine, 12(4) (2017) 20-
29.
Chapter 14
DEVELOPMENT OF A MOBILE APPLICATION FOR SMART CLINICAL TRIAL SUBJECT DATA COLLECTION AND MANAGEMENT
Hyeongju Ryu 1, Meihua Piao 2, Heejin Kim 3, Wooseok Yang 3 and Kyung Hwan Kim 4
1 Biomedical Research Institute, Seoul National University Hospital, Seoul 03080, Korea
2 Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
3 Clinical Trials Center, Seoul National University Hospital, Seoul 03080, Korea
4 Department of Thoracic and Cardiovascular Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul 03080, Korea
Citation: (APA): Ryu, H., Piao, M., Kim, H., Yang, W., & Kim, K. H. (2022). Devel-
opment of a Mobile Application for Smart Clinical Trial Subject Data Collection and
Management. Applied Sciences, 12(7), 3343.(12 pages)
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is
an open access article distributed under the terms and conditions of the Creative Com-
mons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
ABSTRACT
Wearable devices and digital health technologies have enabled the
exchange of urgent clinical trial information. We developed an application
to improve the functioning of decentralized clinical trials and performed
a heuristic evaluation to reflect the user demands of existing clinical trial
workers. The waterfall model of the software life cycle was used to guide
the development. Focus group interviews (N = 7) were conducted to reflect
the needs of clinical research professionals, and Wizard of Oz prototyping
was performed to ensure high usability and completeness. Unit tests and
heuristic evaluation (N = 11) were used. Thematic analysis was performed
using the focus group interview data. Based on this analysis, the main
menu was designed to include health management, laboratory test results,
medications, concomitant medications, adverse reactions, questionnaires,
meals, and My Alarm. Through role-playing, the functions and configuration
of the prototype were adjusted and enhanced, and a heuristic evaluation
was performed. None of the heuristic evaluation items indicated critical
usability errors, suggesting that the revised prototype application can be
practically applied to clinical trials. The application is expected to increase
the efficiency of clinical trial management, and the development process
introduced in this study will be helpful for researchers developing similar
applications in the future.
INTRODUCTION
Clinical trials are essential to study the efficacy and risks associated
with drugs; however, they are costly and time-consuming, occasionally
requiring years for completion. A recent study found that the average cost of
development of a novel drug from drug discovery to the marketing approval
of a product is between $2 billion and $3 billion and can take anywhere from
12 to 18 years, with clinical trials being the most costly and time-consuming
phases of the entire process [1]. The burden of such conventional processes
for whole-drug development has become even more challenging with the
coronavirus disease 2019 (COVID-19) pandemic, especially because of
the difficulty in recruiting and retaining clinical trial participants [2,3,4].
Because of the limitations imposed by COVID-19, such as self-isolation,
site closures, and travel restrictions, as of October 2021, more than 2100
clinical trials have been reported to be explicitly suspended [5,6,7].
Clinical research professionals (CRPs) from the Seoul National University Hospital Clinical Trial Center
were recruited to verify their application needs. Focus group interviews
(FGIs) were conducted by dividing the seven recruited volunteers into two
groups: research doctors and clinical research coordinators. The ideal sample
size for a focus group interview varies according to the literature [19]. In
this study, the sample size was selected to ensure less than 10 people per
group and more than two groups per concept, in accordance with previously
described criteria [20]. The purpose of the application to be developed was
explained to the interviewees, and the needs of the planned application
were collected based on clinical trial situations and application functions.
All seven CRPs (three doctors and four research coordinators) participated
in the FGIs. Through structured open-ended questions, requirements such
as essential needs and functions to be included in the application were
investigated using the FGIs. A qualitative thematic analysis was used for
data analysis. The data were analyzed by grouping the collected data into
similar concepts and then categorizing them [21].
The design of an application system needs to be developed to fully adopt
the practical needs of clinical trials; therefore, a stepwise approach to polish
the system was applied. The function and structure of the application were
designed considering the needs of CRPs collected through the FGIs. The
main menu of the application was designed, followed by the construction
of the information architecture and wireframes. The user interface was
modified to reflect the opinions of the CRPs from the FGIs.
Wizard of Oz (WOZ) prototyping was performed with the CRPs, using the
user interface of the application, to improve the user experience. WOZ
prototyping is a way of testing usability through role-playing with mock-up
software before actual development. Accordingly, role-playing to increase
usability was carried out by clinical trial experts and the team of software
engineers responsible for developing the application. Based on the results of
WOZ prototyping, the usefulness and efficiency of the application user
interface were confirmed before actual production. Unit tests were
performed to check for programming errors and usability problems in the
initial version of the prototype application, and a revised version of the
prototype was produced based on the unit test results [22,23].
To test all of the functions in the revised version of the prototype
application, task scenarios were developed as a heuristic evaluation for a
total of 48 tasks in two detailed scenarios. Step-by-step, each scenario was
designed to accomplish tasks, including login, the input of adverse reactions
RESULTS
Table 1: Collected needs for the real-time clinical trial monitoring system
items such as weight, blood pressure, daily steps, and blood sugar level are
displayed in a single row for easy recognition. Each item can also have a
separate graph display with the most recent data. The value of each item
can be manually entered by the user, and data is linked with the Samsung
Health app, so when using a wearable device or another measuring device
that works with the Samsung Health app, it can also be entered through the
device. The user interface suggested for the trial version application was
developed by adopting the CRPs’ feedback to remove unnecessary text and
medical terms to make the screen less complicated and to include pictograms
for easy understanding for non-professional trial participants.
5. Symptom record. Symptom record: cough, stuffy nose, sore throat, fatigue, headache, fever, loss of smell, loss of taste, etc. (corresponding to symptoms of COVID-19) are reported. Health record: blood pressure and ECG data are input through an external device (wearable device). Blood pressure: data from all devices linked to Samsung Health can be entered. ECG: real-time input through the VP-100 (a device certified by the Korea Food and Drug Administration).
6. Daily to-do. The user's medication, nutritional, and health measurement record items that must be entered each day are presented. The status changes from to-do to done when the user completes that task.
ECG data are transmitted through three paths. Raw ECG data are
transmitted in the order mentioned above, whereas the arrhythmia detection
algorithm runs in the batch server and transmits the result to the API server.
Finally, for ECG streaming, the data are sent to the external cloud, and the
CRP accesses the cloud from the internal network to check the streaming
data.
Table 3: Results of the heuristic evaluation of the clinical trial monitoring application
DISCUSSION
In this study, to develop a real-time monitoring application, a stepwise
approach was applied to improve usability, starting with an analysis of
needs. The initial FGIs for needs analysis identified requests for “side
effects and adverse reaction identification services”, “concomitant drug
identification capabilities”, and “remote feedback functions”. The primary
CRPs provide, and users often do not give active and negative feedback to
the CRPs in the clinical environment.
For interoperability and security, the creation of applications with Fast
Healthcare Interoperability Resources (FHIR) as a standard was considered
[36,37]. However, because of security reasons as well as practical difficulties
in recruiting a technician with FHIR-based production experience,
interoperability with the hospital network was not implemented.
Various solutions are also required for processing ECG data. For real-time
ECG streaming, the limited server resources and bandwidth with the DMZ
server caused delays in streaming. This problem was solved by transmission
using a cloud server. However, this approach introduced a security issue
because the cloud server was not located in the internal network. To solve
the security issue, the Amazon Web Services Cloud, which is known for
its relatively stable security among cloud services, was used. Only the
function of viewing the ECG graph transmitted by the cloud was performed
in the internal network, effectively blocking other data connections between
networks. The arrhythmia detection algorithm also encountered resource
issues. This problem was solved by physically separating one DMZ server
into two logical servers, configuring the batch server, and processing the
algorithm in the batch server. Thus, problems that occurred during the actual
development process were solved within the limited available resources.
Heuristic evaluation was used to confirm the direction of improvement.
In the heuristic evaluation, various categories of tests were planned, although
the actual results indicated the importance of improving the overall usability
of the application on the basis of individual errors rather than classifying
the errors by category. For example, if an error occurred because the button
on the screen was hidden, some participants reported it as a design error,
whereas others considered it a configuration error. In this regard, developers
should be careful about developing applications on a subjective basis without
fully reflecting the needs of users. The most common complaint identified
in the heuristic evaluation was that the input window and text were too
small. Thus, the evaluation suggested that animations or design elements
used to improve aesthetics may not benefit end-users who frequently use the
application. However, the positive responses to the screen composition and
other aesthetic aspects indicated the importance of identifying a compromise
for these aspects. It was considered that the CRP group newly participating
in application development would evaluate the application from a different
angle than the CRP group that continued to participate in application
development.
CONCLUSIONS
In this study, we developed an application to address the difficulties
associated with subject management in traditional clinical trials. After the
development of the application, a heuristic evaluation was performed to
reflect the user demands of existing clinical trial workers. These evaluations
made it possible to confirm various consistencies in the application functions
and user interface. Unlike other studies, this study explains the researcher-
AUTHOR CONTRIBUTIONS
Conceptualization, H.R.; resources, H.R.; methodology, H.R. and M.P.;
validation, H.K.; data curation, M.P. and H.R.; writing—original draft
preparation, H.R.; writing—review and editing, H.R. and M.P.; visualization,
H.R. and W.Y.; supervision, K.H.K.; project administration, K.H.K.; funding
acquisition, K.H.K. All authors have read and agreed to the published version
of the manuscript.
REFERENCES
1. Moore, T.J.; Zhang, H.; Anderson, G.; Alexander, G.C. Estimated
Costs of Pivotal Trials for Novel Therapeutic Agents Approved by the
US Food and Drug Administration, 2015–2016. JAMA Intern. Med.
2018, 178, 1451–1457.
2. Fischer, S.M.; Kline, D.M.; Min, S.J.; Okuyama, S.; Fink, R.M. Apoyo
Con Carino: Strategies to Promote Recruiting, Enrolling, and Retaining
Latinos in a Cancer Clinical Trial. J. Natl. Compr. Canc. Netw. 2017,
15, 1392–1399.
3. Fogel, D.B. Factors Associated with Clinical Trials That Fail and
Opportunities for Improving the Likelihood of Success: A Review.
Contemp. Clin. Trials Commun. 2018, 11, 156–164.
4. Soares, R.R.; Parikh, D.; Shields, C.N.; Peck, T.; Gopal, A.; Sharpe,
J.; Yonekawa, Y. Geographic Access Disparities to Clinical Trials in
Diabetic Eye Disease in the United States. Ophthalmol. Retina 2021,
5, 879–887.
5. Carlisle, B.G. Clinical Trials Stopped by COVID-19 [Internet]. The
Grey Literature. 2020. Available online: https://covid19.bgcarlisle.
com/ (accessed on 1 January 2022).
6. Asaad, M.; Habibullah, N.K.; Butler, C.E. The Impact of COVID-19
on Clinical Trials. Ann. Surg. 2020, 272, e222–e223.
7. Hamzelou, J. World in Lockdown. New Sci. 2020, 245, 7.
8. Apostolaros, M.; Babaian, D.; Corneli, A.; Forrest, A.; Hamre, G.;
Hewett, J.; Podolsky, L.; Popat, V.; Randall, P. Legal, Regulatory, and
Practical Issues to Consider When Adopting Decentralized Clinical
Trials: Recommendations from the Clinical Trials Transformation
Initiative. Ther. Innov. Regul. Sci. 2020, 54, 779–787.
9. Hashiguchi, T.C.O. Bringing Health Care to the Patient: An Overview
of the Use of Telemedicine in OECD Countries; OECD Health Working
Papers, No. 116; OECD Publishing: Paris, France, 2020.
10. Weinstein, R.S.; Lopez, A.M.; Joseph, B.A.; Erps, K.A.; Holcomb, M.;
Barker, G.P.; Krupinski, E.A. Telemedicine, Telehealth, and Mobile
Health Applications That Work: Opportunities and Barriers. Am. J.
Med. 2014, 127, 183–187.
11. Won, J.H.; Lee, H. Can the COVID-19 Pandemic Disrupt the Current
Drug Development Practices? Int. J. Mol. Sci. 2021, 22, 5457.
284 Advanced Techniques for Collecting Statistical Data
12. Little, R.J.; D’Agostino, R.; Cohen, M.L.; Dickersin, K.; Emerson,
S.S.; Farrar, J.T.; Frangakis, C.; Hogan, J.W.; Molenberghs, G.;
Murphy, S.A.; et al. The Prevention and Treatment of Missing Data in
Clinical Trials. N. Engl. J. Med. 2012, 367, 1355–1360.
13. Inan, O.T.; Tenaerts, P.; Prindiville, S.A.; Reynolds, H.R.; Dizon,
D.S.; Cooper-Arnold, K.; Turakhia, M.; Pletcher, M.J.; Preston, K.L.;
Krumholz, H.M.; et al. Digitizing Clinical Trials. NPJ Digit. Med.
2020, 3, 101.
14. Kario, K.; Tomitani, N.; Kanegae, H.; Yasui, N.; Nishizawa, M.;
Fujiwara, T.; Shigezumi, T.; Nagai, R.; Harada, H. Development of
a New ICT-Based Multisensor Blood Pressure Monitoring System
for Use in Hemodynamic Biomarker-Initiated Anticipation Medicine
for Cardiovascular Disease: The National IMPACT Program Project.
Prog. Cardiovasc. Dis. 2017, 60, 435–449.
15. Korea National Enterprise for Clinical Trials [Internet]. The Grey
Literature. 18 May 2021. Available online: https://www.konect.or.kr/
kr/contents/datainfo_data_01_tab03/view.do (accessed on 15 March
2022).
16. Levy, H. Reducing the Data Burden for Clinical Investigators. Appl.
Clin. Trials. 2017, 26, 17.
17. Roger, S.P.; Bruce, R.M. Software Engineering: A Practitioner’s
Approach; McGraw-Hill Education: New York, NY, USA, 2015.
18. Pressman, R.S. Software Engineering: A Practitioner’s Approach;
Palgrave MacMillan: London, UK, 2005.
19. Carlsen, B.; Glenton, C. What about N? A Methodological Study
of Sample-Size Reporting in Focus Group Studies. BMC Med. Res.
Methodol. 2011, 11, 26.
20. Krueger, R.A. Focus Groups: A Practical Guide for Applied Research;
Sage Publications: Thousand Oaks, CA, USA, 2014.
21. Hsieh, H.F.; Shannon, S.E. Three Approaches to Qualitative Content
Analysis. Qual. Health Res. 2005, 15, 1277–1288.
22. Green, P.; Wei-Haas, L. The Rapid Development of User Interfaces:
Experience with the Wizard of Oz Method. Proc. Hum. Factors Soc.
Annu. Meet. 1985, 29, 470–474.
23. Pettersson, J.S.; Wik, M. The Longevity of General Purpose Wizard-
of-Oz Tools. In Proceedings of the Annual Meeting of the Australian
1 U. Minho and INESC TEC, Braga, Portugal
2 Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
3 IMDEA Networks Institute, Madrid, Spain
Citation: (APA): Baquero, C., Casari, P., Fernandez Anta, A., García-García, A., Frey,
D., Garcia-Agundez, A., ... & Sanchez, I. (2021). The CoronaSurveys system for CO-
VID-19 incidence data collection and processing. Frontiers in Computer Science, 3,
641237. (10 pages)
Copyright: © 2021 Baquero, Casari, Fernandez Anta, García-García, Frey, Garcia-
Agundez, Georgiou, Girault, Ortega, Goessens, Hernández-Roig, Nicolaou, Stavrakis,
Ojo, Roberts and Sanchez. This is an open-access article distributed under the terms
of the Creative Commons Attribution License (CC BY): http://creativecommons.org/
licenses/by/4.0/
4 Inria Rennes, Rennes, France
5 Multimedia Communications Lab, TU Darmstadt, Darmstadt, Germany
6 Department of Computer Science, University of Cyprus, Nicosia, Cyprus
7 Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, United States
8 Consulting, Rennes, France
9 Department of Statistics, UC3M & UC3M-Santander Big Data Institute, Getafe, Spain
10 Algolysis Ltd, Nicosia, Cyprus
11 IMDEA Networks Institute and UC3M, Madrid, Spain
12 Skyhaven Media, Liverpool, United Kingdom
13 InqBarna, Barcelona, Spain
INTRODUCTION
During the current coronavirus pandemic, monitoring the evolution of
COVID-19 cases is of utmost importance for the authorities to make
informed policy decisions (e.g., lock-downs), and to raise awareness in the
general public for taking appropriate public health measures.
DATA COLLECTION
The data collection subsystem consists of 1) a user-centered web and mobile
front-end interface, providing a straightforward and intuitive access to the
surveys, and 2) a data collection back-end enabling response aggregation in
a consistent and structured format to facilitate post-processing.
Figure 2: Snapshots of the Coronasurveys app. It shows the main app screen
(left), the information about the project shown when accessing the survey (cen-
ter), and the survey questions (right).
To preserve user engagement, minimize participant fatigue, and ensure
a steady flow of responses, we initially designed a minimal survey consisting
of two simple questions:
• How many people do you know personally in this geographical
area? Include only those whose health status you are likely to
be aware of (The geographical area was previously selected, see
Figure 2.)
• How many of those were diagnosed with or have symptoms of
COVID-19?
We denote the reply to the first question as the Reach, $r_i$, and the reply to
the second question as the Number of Cases, $c_i$. In this way, the aggregated
value provides a rough estimate of the incidence of COVID-19. The
simplicity of the survey, together with the increased interest of people in the
initial stages of the pandemic, led to successful initial survey deployments
(e.g., 200 responses per week in Spain, 800 responses on the first day in
Cyprus, and more than 1,000 in Ukraine). Despite their simplicity, these
two questions were sufficient for producing rough preliminary estimates of
the cumulative incidence of COVID-19 in several countries, in a period in
which testing was scarce.
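As a minimal sketch of how such responses can be turned into a rough estimate, the snippet below pools the reported cases and reach into the simple ratio sum(c_i)/sum(r_i); the exact estimator, outlier handling, and correction factors used by CoronaSurveys are not reproduced here, so this is only an illustration of the idea.

```python
def rough_incidence(responses):
    """responses: list of (reach r_i, cases c_i) pairs from individual surveys.

    Returns the pooled ratio sum(c_i) / sum(r_i), a rough estimate of the
    fraction of the surveyed population that has been a COVID-19 case."""
    total_reach = sum(r for r, _ in responses)
    total_cases = sum(c for _, c in responses)
    return total_cases / total_reach if total_reach else float("nan")

# Example: three survey responses.
print(rough_incidence([(100, 2), (50, 1), (150, 4)]))  # ~0.023 -> about 2.3%
```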
Data Aggregation
The back-end data collection engine was designed to provide seamless
aggregation of the data in a consistent and structured format. Timeliness,
consistency, and proper dissemination of the data were the three main pillars
of the aggregation process. CoronaSurveys updates its estimates daily to
provide a comparison with the estimates of officially confirmed cases,
which are also updated once per day. This daily aggregation also serves as a
privacy preserving measure, as we discuss in the next section.
During aggregation, survey responses are classified by country and
stored in individual files named as CC-aggregate.csv, where CC is the two
letter ISO code of the country. Each row in the file corresponds to a single
response and is composed of the elements that appear in Table 1: the date
of the response, the country for which the response reports, the country ISO
code, the region in the country for which the response reports (if any), the
region ISO code, the language used to fill the survey, the answers to the
survey questions (Var1, …, VarN), a cookie that anonymously
identifies a participant, and a campaign field that can be used to identify
responses that correspond to specific survey dissemination campaigns. The
aggregated data is then provided to the estimation engine and published in
an online public repository (GCGImdea/coronasurveys, 2020).
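The following sketch illustrates how one of these per-country files could be loaded and summarized; the exact CSV column headers are not given in the text, so the names used below (such as "date") are assumptions.

```python
import pandas as pd

# Hypothetical column names: the text lists the fields stored per response
# (date, country, ISO codes, region, language, answers, cookie, campaign)
# but not the exact CSV headers, so "date" here is an assumption.
df = pd.read_csv("ES-aggregate.csv")          # one file per country, e.g. Spain
df["date"] = pd.to_datetime(df["date"])

# Responses received per day, e.g. to monitor participation levels.
daily_responses = df.groupby(df["date"].dt.date).size()
print(daily_responses.tail())
```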
User Privacy
Ensuring anonymity and privacy is important to minimize reservations
from participants on filling the survey. Ideally, we would like to acquire as
much relevant data as possible (e.g., geolocation), but this is at odds with
anonymity and is likely to lead to fewer responses. CoronaSurveys implements
four anonymity strategies:
DATA ANALYSIS
Based on the aggregated, anonymous data, CoronaSurveys employs several
methods to produce estimates of the number of COVID-19 cases in all
geographical areas for which sufficient data are available, comparing these
estimates with those provided by the official authorities. The estimation
methods are:
• cCFR-based: This method is based on estimating the corrected case fatality ratio (cCFR) from the official numbers of cumulative cases and fatalities, taking into consideration an estimation of the approximate number of cases with known outcomes. It is also assumed that a reliable value of the traditional case fatality ratio (CFR*) is available (we use CFR* = 1.38%, with a 95% confidence interval of 1.23%–1.53%, as described in Verity et al. (2020)). Then, the number of cases is estimated by multiplying the official figure of cumulative cases in a region D by the ratio cCFR(D)/CFR*, where cCFR(D) is the cCFR estimated for D.
• cCFR-fatalities: This method divides the official number of fatalities on a given day d by CFR*, and assigns the resulting number of cases to day d − P, where P is the median number of days from symptom onset to death. We use P = 13, following the values reported by the Centers for Disease Control and Prevention (Centers for Disease Control and Prevention, 2021a). A short worked example of this method is sketched after this list.
• UMD-Symptom-Survey: This method uses the responses to direct questions about symptoms from the University of Maryland COVID-19 World Survey (Fan et al., 2020) to estimate active cases. In particular, it counts the number of responses that declare fever, and cough or difficulty breathing. This survey collects more than 100,000 individual responses daily.
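As a worked illustration of the cCFR-fatalities method in the list above, the sketch below shifts an official fatality series back by P = 13 days and divides it by CFR* = 1.38%; it ignores the confidence interval on CFR* and any smoothing or corrections the CoronaSurveys pipeline may apply, so it is only a simplified reading of the method.

```python
CFR_STAR = 0.0138   # traditional case fatality ratio, 1.38%
P_DAYS = 13         # median days from symptom onset to death

def ccfr_fatalities_estimate(daily_deaths):
    """daily_deaths: list of official fatality counts, one entry per day.

    Deaths reported on day d are attributed to cases with onset on day
    d - P_DAYS, scaled by 1 / CFR*; returns estimated daily new cases."""
    estimated_cases = [0.0] * len(daily_deaths)
    for d, deaths in enumerate(daily_deaths):
        onset_day = d - P_DAYS
        if onset_day >= 0:
            estimated_cases[onset_day] = deaths / CFR_STAR
    return estimated_cases

# Example: 5 deaths reported on day 20 imply roughly 362 new cases on day 7.
deaths = [0] * 30
deaths[20] = 5
print(ccfr_fatalities_estimate(deaths)[7])   # 5 / 0.0138 ≈ 362.3
```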
DATA VISUALIZATION
Finally, converting the computed data to meaningful visualizations is
essential to observe trends, insights, and behaviors from our data, as well as
to communicate our outcomes to a wider audience. Our visualization engine
employs the Grafana (Grafana Labs, 2018) framework, which enables the
creation of interactive plots of various types. We can group our plots into
three categories, based on the information they provide:
• CoronaSurveys participation statistics
• Global-scale visualizations
• Local-scale visualizations
To better map the effects of the pandemic and to capture a holistic view of
its impact, we present the computed estimates in both global and
countrywide (local) visualizations. Global visualizations aim to expose the
distribution of the pandemic around the globe and to identify areas with
higher infection rates. Countrywide visualizations aim to pinpoint the
estimated magnitude of the problem compared with officially reported cases.
Global-Scale Visualizations
Our goal for the global visualizations is twofold: 1) to provide a snapshot
of the pandemic based on the latest computed estimates and 2) to provide
a comparative plot exposing the progress of the virus in multiple countries.
A map is one of the most intuitive ways to present an instance of the
data on a global scale. Therefore, Figure 4 presents a map visualization that
includes the estimates of the percentage of cumulative cases (infected) per
country based on the cCFR algorithm (ccfr-based). Bubble points can capture
the magnitude of a value by adjusting their color based on a predefined color
scale, and their radius relative to the maximum and minimum values on the
298 Advanced Techniques for Collecting Statistical Data
map. On the top left of the figure there are visible drop-down menus to select
other estimators and metrics.
Local-Scale Visualizations
For local-scale visualization, we display the evolution in the number of active
cases, new daily cases, and contagious cases (see Figure 6), estimated with
some of the methods described above. To estimate the number of active and
contagious cases when only daily cases are available (e.g., from confirmed
data), we assume that cases are active and contagious for 18 and 12 days,
respectively (Centers for Disease Control and Prevention, 2021a; Centers
for Disease Control and Prevention, 2021b). Observe in Figure 6 that the
ratios of active cases estimated on the last day (April 26th, 2021) from the
responses to direct symptom questions (blue line, 4.31%) and from the
indirect questions using NSUM (purple line, 2.87%) are one order of
magnitude larger than those obtained with the official number of cases
(0.33%) and the official number of fatalities (0.31%). (The reason for the
difference between the blue and the purple lines is currently under
evaluation.)
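A minimal sketch of the windowing assumption described above, given only a daily new-case series: each reported case is counted as active for 18 days and as contagious for 12 days after it is reported.

```python
def rolling_sum(daily_cases, window):
    """Number of cases reported within the last `window` days, per day."""
    out = []
    for d in range(len(daily_cases)):
        out.append(sum(daily_cases[max(0, d - window + 1): d + 1]))
    return out

daily_cases = [10, 12, 8, 15, 20, 18, 25, 30, 22, 19, 17, 21,
               24, 28, 26, 23, 20, 18, 16, 15]
active = rolling_sum(daily_cases, 18)       # cases assumed active for 18 days
contagious = rolling_sum(daily_cases, 12)   # cases assumed contagious for 12 days
print(active[-1], contagious[-1])
```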
RESULTS
To test the feasibility of using CoronaSurveys to provide accurate estimates
of the number of cases, we conducted a comparison between our estimates
and the results of massive serology testing in Spain, a study conducted by
Pollan et al. (Pollán et al., 2020). In this study (García-Agundez et al., 2021),
we calculated the correlation between our estimates and the serology results
across all regions (autonomous communities) of Spain in the timeframe
of the serology study. The serology study recruited n = 61,075 participants,
which represents 0.1787% ± 0.0984% of the regional population. In
contrast, CoronaSurveys data provides information on n = 67,199 people
through indirect reporting, or 0.1827% ± 0.0701% of the regional
population. This resulted in a Pearson R-squared correlation of 0.89. In
addition, we observed that CoronaSurveys systematically underestimates
the number of cases.
CONCLUSION
In this article, we present the system architecture and estimation methods
of CoronaSurveys, which uses open surveys to monitor the progress of the
COVID-19 pandemic. Our graphical estimations require large amounts
of data from active participants, but provide insightful depictions of the
progress of the pandemic in different regions, offering an estimation of the
cumulative and active number of cases in different geographical areas.
The most important challenge and limitation of CoronaSurveys is the
number of survey responses. In this sense, the dissemination of our graphical
estimations is important to maximize user engagement and retention. For
this reason, in the future we aim to include a forecast of the number of cases
and fatalities based on recent data for different geographical areas, in order
to empower the dissemination of our graphical visualizations and with it
increase user recruitment.
In addition, our outlier detection methods are heuristic and could,
in the future, be improved to be more resilient to malicious responses.
CoronaSurveys is a work in progress, and features such as monitoring the
number of responses per day could be implemented to detect certain types
of malicious attacks to which open online surveys may be subjected.
Our first evaluation, comparing the results of CoronaSurveys with a
serology study in Spain provided excellent results, supporting open surveys
and indirect reporting as potential sources of information to track pandemics,
although further comparisons in different regions are required. An interesting
topic of discussion would be the minimum number of responses required to
provide reasonably accurate estimates, as an increasing number of replies will
balance out individual inaccuracies of over- or underestimation and improve
the functionality of our outlier detection methods, following the “wisdom of
the crowd” phenomenon. Naturally, the minimum number of responses will
depend on factors such as population dispersion and cultural differences on
AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct, and intellectual contribu-
tion to the work and approved it for publication.
REFERENCES
1. Bernard, H. R., Hallett, T., Iovita, A., Johnsen, E. C., Lyerla, R.,
McCarty, C., et al. (2010). Counting Hard-To-Count Populations: the
Network Scale-Up Method for Public Health. Sex. Transm. infections
86 (Suppl. 2), ii11–ii15. doi:10.1136/sti.2010.044446
2. Centers for Disease Control and Prevention (2021a). Covid-19
Pandemic Planning Scenarios. Available at: https://www.cdc.gov/
coronavirus/2019-ncov/hcp/planning-scenarios.html (Accessed
December 12, 2020).
3. Centers for Disease Control and Prevention (2021b). Clinical Questions
about Covid-19: Questions and Answers. Available at: https://www.
cdc.gov/coronavirus/2019-ncov/hcp/faq.html (Accessed 05 09, 2021).
4. Fan, J., Yao, L., Stewart, K., Kommareddy, A. R., Bradford, A., Chiu,
S., et al. (2020). Covid-19 World Symptom Survey Data Api. Available
at: https://covidmap.umd.edu/api.html (Accessed May 28, 2021).
5. García-Agundez, A., Ojo, O., Hernández-Roig, H. A., Baquero,
C., Frey, D., Georgiou, C., et al. (2021). Estimating the COVID-19
Prevalence in Spain with Indirect Reporting via Open Surveys. Front.
Public Health 9. Available at: https://www.medrxiv.org/content/10.11
01/2021.01.29.20248125v1 (Accessed May 28, 2021).
6. GCGImdea/coronasurveys (2020). Coronasurveys Data Repository.
Available at: https://github.com/GCGImdea/coronasurveys (Accessed
November 5, 2020).
7. Grafana Labs (2018). Grafana Documentation. Available at: https://
grafana.com/docs/ (Accessed May 28, 2021).
8. Institute for Health Metrics and Evaluation (2021). Covid-19 Results
Briefing in India. Available at: http://www.healthdata.org/sites/default/
files/files/Projects/COVID/2021/163_briefing_India_9.pdf (Accessed
May 03, 2021).
9. LimeSurvey Project Team/Carsten Schmitz (2012). LimeSurvey: An
Open Source Survey Tool. Hamburg, Germany: LimeSurvey Project.
10. Maxmen, A. (2020). How Much Is Coronavirus Spreading under the
Radar? Nature 10. doi:10.1038/d41586-020-00760-8. Available at:
https://www.nature.com/articles/d41586-020-00760-8
11. Oliver, N., Barber, X., Roomp, K., and Roomp, K. (2020). Assessing
the Impact of the Covid-19 Pandemic in Spain: Large-Scale, Online,
Citation: (APA): Phatak, A. A., Wieland, F. G., Vempala, K., Volkmar, F., & Memmert,
D. (2021). Artificial Intelligence Based Body Sensor Network Framework—Narrative
Review: Proposing an End-to-End Framework using Wearable Sensors, Real-Time
Location Systems and Artificial Intelligence/Machine Learning Algorithms for Data
Collection, Data Mining and Knowledge Discovery in Sports and Healthcare. Sports
Medicine-Open, 7(1), 1-15. (15 pages)
Copyright: © Open Access. This article is licensed under a Creative Commons Attribu-
tion 4.0 International License (http://creativecommons.org/licenses/by/4.0/)
ABSTRACT
With the rising amount of data in the sports and health sectors, a plethora
of applications using big data mining have become possible. Multiple
frameworks have been proposed to mine, store, preprocess, and analyze
physiological vitals data using artificial intelligence and machine learning
algorithms. Comparatively, less research has been done to collect potentially
high volume, high-quality ‘big data’ in an organized, time-synchronized,
and holistic manner to solve similar problems in multiple fields. Although a
large number of data collection devices exist in the form of sensors, they
are either highly specialized, univariate, and fragmented in nature, or
confined to a lab setting. The current study aims to propose the artificial
intelligence-based body sensor network framework (AIBSNF), a framework
for the strategic use of body sensor networks (BSN) that combines a
real-time location system (RTLS) with wearable biosensors to collect
multivariate, low-noise, and high-fidelity data.
location and physiological vitals data, which allows artificial intelligence
and machine learning (AI/ML)-based time series analysis. The study gives
a brief overview of wearable sensor technology, RTLS, and provides use
cases of AI/ML algorithms in the field of sensor fusion. The study also
elaborates sample scenarios using a specific sensor network consisting of
pressure sensors (insoles), accelerometers, gyroscopes, ECG, EMG, and
RTLS position detectors for particular applications in the field of health
care and sports. The AIBSNF may provide a solid blueprint for conducting
research and development, forming a smooth end-to-end pipeline from
data collection using BSN, RTLS and final stage analytics based on AI/ML
algorithms.
Key Points
• A large number of wearable sensor technologies have given
rise to big data collection possibilities in the fields of sport and
healthcare.
• The emergence of body sensor networks, real-time location systems,
and multi-sensor data fusion algorithms shows great potential for
application in a wide range of industries.
• The proposed AIBSNF framework has potential to provide a
solid blueprint for exploiting these rising technologies for end-
to-end application from data collection to knowledge discovery
across industries.
INTRODUCTION
In recent years, the application of big data, its acquisition, and analysis
using AI/ML algorithms have been applied in sports and healthcare
diagnostics [5]. It has resulted in improvements in the identification of
critical information and is being used in decision-making processes [5]. The
nature of these fields is such that certain physiological signs that signify
sports performance are also good indicators of mental and physical health
[5]. The physiological information and movement patterns required to
investigate athletic performance in sports, such as heart activity, recovery,
muscular strength coordination, balance, etc., have considerable overlap
with general health indicators. Considering this overlap, the data collection
tools for these physiological indices can potentially be used to analyze both
sports performance and available health predictors [9].
Figure 1 outlines the scope of the present review: the fields of application,
the technologies used for data collection, and the post-data-collection
analysis for knowledge discovery across this broad set of fields.
Wearable Biosensors
The rise of wearable sensors as tools for data collection seems to be ideal for
gathering physiological and vital data. Hence, wearable sensors have become
popular in medical, entertainment, security, and commercial areas [21]. A
recent review published in ‘Nature Biotechnology’ elaborated the rising
interest in wearable biosensor technology in academics, performance, and the
health industry [22]. Wearables show great potential to provide continuous,
real-time physiological data using dynamic non-invasive measurements of
biochemical and physiological markers. So far, these sensors have been
used for gathering precise, high-fidelity strategic data, which facilitates a
whole host of applications, with the military, precision medicine, and the
fitness industry at the forefront [23–25]. Their fidelity and precision vary
with the specifications of their particular use cases.
Advances in electronics, printing, non-invasive data collection, and
monitoring technology, have given rise to durable, unobtrusive, non-
invasive wearable clothing as an electronics platform capable of sensing
human locomotion and vital physiological signals [21, 25–29]. Such media
and miniaturization of sensors provide unprecedented capacity to gather a
wide range of data in many scenarios. By choosing a specific set of sensors
strategically located at different human body locations, there is potential to
collect precise data for solving interesting problems. Table 1 shows a non-
exhaustive list of non-invasive sensor technology that can potentially be
used in various combinations to gather physiological data for applications
in a wide range of disciplines.
• 3D Light Detection and Ranging (LiDAR). Application: precision vehicle localization. Accuracy: 0.01 to 0.2 m @ LoS. Range: ~200 m @ LoS, both outdoor and indoor. Frequency: ~200 THz [49]
• Wireless Fidelity (Wi-Fi). Application: indoor and outdoor positioning for smartphones. Accuracy: 1–3 m @ NLoS. Range: < 200 m outdoor and < 60 m indoor under Wi-Fi covered distance. Frequency: 2.4 to 5 GHz [42]
• Ultrasound. Application: indoor location. Accuracy: up to 0.01 m @ LoS and ~0.02 m @ NLoS. Range: up to 10 m @ LoS indoor. Frequency: 1–20 MHz [46, 50]
• Bluetooth. Application: real-time indoor positioning. Accuracy: typically between 2 and 5 m, but can reach 0.77 m using different signal processing algorithms. Range: up to 2 m @ NLoS. Frequency: 2 MHz of width in the 2.4 GHz band [40, 51]
• Ultra-Wide Band (UWB). Application: tracking and position detection in sports. Accuracy: between 0.08 and 0.2 m @ LoS. Range: 40–80 m. Frequency: 3.1 to 10.6 GHz [42, 46, 52]
• Computer Vision. Application: tracking of the ball in sports such as tennis and cricket. Accuracy: up to 0.05–0.1 m @ 340 fps. Range: N/A. Frequency: N/A [40, 53]
• Computer Vision. Application: tracking the path length of multiple objects. Accuracy: up to 8.5% error, and under 1 m for marker-based solutions. Range: N/A. Frequency: N/A
• Global Positioning System (GPS). Application: measuring real-time movement of soccer players in a test situation. Accuracy: up to 1.31 m/s error while measuring velocity and 6.05% error when measuring position @ NLoS. Range: > 100 km, outdoor and indoor. Frequency: 1575.42 MHz and 1227.6 MHz [54]
• Global Navigation Satellite System. Application: smartphone location. Accuracy: up to a few centimeters but unstable. Range: > 100 km outdoor. Frequency: 1–2 GHz [55]
The fields of healthcare and sports have, to date, used RTLS and wearable
technologies separately for solving specific problems [43, 45, 54]. The scope
of the application has been limited thus far, but there seems to be a massive
potential for using RTLS in combination with wearable sensor technology.
Previous research has proposed and implemented frameworks in health care
and sports using these technologies separately. Still, there seems to be a lack
of integration of these two technologies.
Table 4: List of chosen sensors and their possible placement based on previous applications and prior studies conducted
• RTLS tag. Measures: position of the person in question, from a set reference point. Placement: center of mass. Applications: tracking of patients, medical staff, and medical assets in a hospital [73]; physical load, real-time position data acquisition, tactical analysis in team sports, coaching and strategy development [7, 74]
• Force plate. Measures: pressure distribution on individual feet. Placement: feet (as insoles). Applications: gait, biofeedback interventions in stroke patients to improve balance and mobility [69]; gait, analysis of athletes for technique and performance optimization [10]
SPECIFIC APPLICATIONS
Human experts can identify a tackle when they see it. This is primarily due to the physical interaction of the two players, which is unique to the tackle itself. When this information is digitized using the BSN, the physics of the ball and the biomechanics, combined with the location data of all three parties involved, can be recorded. These data can be used for automatic sports-specific event and lower limb movement detection [77, 78]. A time-series clustering and classification algorithm can potentially identify all sets of tackles automatically. Another advantage of tracking ECG, EMG, and RTLS data is that the physical load on the cardiovascular system and on individual muscles can also be monitored. When done on an ongoing basis, there is potential to avoid injury, assess the preparedness of an athlete, and find new correlations in a host of technical and tactical components of the game in real time [7, 74].
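The exact detection pipeline is not specified here, so the following is only a minimal sketch of the general idea of windowing synchronized BSN and RTLS streams and classifying the windows; the column names, window length, summary-statistic features, and the choice of a random forest are illustrative assumptions rather than the framework's actual design.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def make_windows(df, columns, window=100, step=50):
    """Slice the multivariate stream into fixed-length windows and summarise
    each window with simple statistics (mean, std, min, max per channel)."""
    feats, starts = [], []
    for start in range(0, len(df) - window, step):
        seg = df[columns].iloc[start:start + window]
        feats.append(np.concatenate([seg.mean(), seg.std(), seg.min(), seg.max()]))
        starts.append(start)
    return np.array(feats), starts

# Assumed layout: one row per sample of the fused, time-synchronous stream
# (accelerometer magnitude, EMG envelope, heart rate, x/y position), plus a
# manually annotated column marking whether a tackle occurred in that sample.
# df = pd.read_csv("fused_bsn_rtls.csv")   # hypothetical fused export
# X, starts = make_windows(df, ["acc_mag", "emg", "hr", "pos_x", "pos_y"])
# y = np.array([df["tackle"].iloc[s:s + 100].max() for s in starts])
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
# clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# print("window-level accuracy:", clf.score(X_te, y_te))
```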
The same methodology can potentially be applied across multiple individual and team sports to identify a wide variety of events. Automatic event identification has been proposed in previous studies, but further research is warranted due to the low reliability and validity of existing approaches [37, 77]. The current BSN plus time series analysis framework could prove invaluable for multiple applications in sports, including but not limited to technique analysis, coaching, self and opponent analysis, tactical analysis, talent identification, player selection, and recruitment [7, 74, 79]. Furthermore, broadcasting agencies can use such data to provide visualizations and real-time information breakdowns for live sporting events. This may help enrich the ordinary viewer's experience, providing them with in-depth information from an expert's perspective. Health performance tracking is another up-and-coming field due to the rise of high-quality, low-cost sensors. AIBSNF can potentially be used to build biofeedback mechanisms for continuous health tracking in a whole host of applications, such as sleep and recovery tracking, personalized training programs for strength, mobility, and cardiovascular endurance, and even ergonomic posture feedback in workplaces [56].
GENERAL APPLICATIONS
CONCLUSION
The fields of RTLS and wearable biosensors are rapidly developing. There
has been tremendous progress in improving accuracy, validity, and reliability
in the sport and healthcare industries over the past decade [22, 40]. It is safe
to assume that they will continue to improve with the further layering of AI/
ML techniques. AIBSNF seems to be ideally positioned to take advantage
of the improvements in all these fields. It has the potential to impact a wide
range of research and development activities in multiple industries. Due
to the rapid pace of this development, numerous technological challenges
exist. Identifying the right sensors and mashing them up successfully at appropriate sample rates for time-synchronous data gathering is possible but challenging. Data protection at each level of collection, use of the right algorithms, availability of computational power, and data science expertise are all needed to implement such technology successfully on a commercial scale.
The fields of sports and healthcare seem to be ideal areas where the proposed
mashup technologies can be of significant benefit. AIBSNF provides a high-
level understanding of how these technologies can be combined to develop
applications in multiple fields. However, there still exists a whole host of technical challenges specific to the application domain. Further research and
development are required for the successful application of AIBSNF in the
highlighted industries.
ACKNOWLEDGEMENTS
The authors of the paper would like to acknowledge Ms Maithili Phatak for her contribution to the artwork in the current paper. The authors would also
like to acknowledge the contributions of the management staff and language
correction team of the Institute of Exercise Training and Sport Informatics
at the German Sports University, Cologne.
AUTHORS’ CONTRIBUTIONS
AP: Primary author, writing and overall research, F-GW: Writing and
reviewing of the scenarios and the proposed framework, KV: Researching
and writing time series algorithms section, FV: Discussion and conclusion
writing, DM: Supervisor, contributions in introduction, sports applications
and framework development. All authors read and approved the final
manuscript.
REFERENCES
1. Harari YN. Homo Deus: a brief history of tomorrow. Random House; 2016.
2. Rajšp A, Fister I. A systematic literature review of intelligent data
analysis methods for smart sport training. Appl Sci. 2020;10:3013. doi:
10.3390/app10093013.
3. Roy R, Paul A, Bhimjyani P, Dey N, Ganguly D, Das AK, et al. A
short review on applications of big data analytics. In: Mandal JK,
Bhattacharya D, et al., editors. Emerg technol model graph. Singapore:
Springer; 2020. pp. 265–278.
4. Claudino JG, Cardoso Filho CA, Boullosa D, Lima-Alves A, Carrion
GR, GianonI RL da S, et al. The role of veracity on the load monitoring
of professional soccer players: a systematic review in the face of the
big data era. Appl Sci. 2021;11:6479.
5. Cottle M, Hoover W, Kanwal S, Kohn M, Strome T, Treister NW.
Transforming health care through big data: strategies for leveraging
big data in the health care industry. Inst. Heal. Technol. Transform. -
iHT. 2013.
6. MacLennan T. Moneyball: The Art of Winning an Unfair Game. J Pop
Cult. 2005;
7. Rein R, Memmert D. Big data and tactical analysis in elite soccer:
future challenges and opportunities for sports science. Springerplus.
2016;5:1–13. doi: 10.1186/s40064-016-3108-2.
8. Raghupathi W. Data Mining in Health Care. [Internet]. 1st ed. Healthc.
Informatics Improv. Effic. Product. Taylor & Francis; 2010. https://
www.taylorfrancis.com/books/e/9780429131059
9. Claudino JG, Capanema D de O, de Souza TV, Serrão JC, Machado
Pereira AC, Nassis GP. Current approaches to the use of artificial
intelligence for injury risk assessment and performance prediction in
team sports: a systematic review. Sports Med Open Sports Med Open;
2019. p. 1–12.
10. Taborri J, Keogh J, Kos A, Santuz A, Umek A, Urbanczyk C, et al. Sport
biomechanics applications using inertial, force, and EMG sensors: a
literature overview. Appl Bionics Biomech. 2020;2020.
11. Vijayakumar V, Nedunchezhian R. A study on video data mining. Int J
Multimed Inf Retr. 2012;1:153–172. doi: 10.1007/s13735-012-0016-2.
23. Jeong IC, Bychkov D, Searson PC. Wearable devices for precision
medicine and health state monitoring. IEEE Trans Biomed Eng IEEE.
2019;66:1242–1258. doi: 10.1109/TBME.2018.2871638.
24. Shi H, Zhao H, Liu Y, Gao W, Dou SC. Systematic analysis of a
military wearable device based on a multi-level fusion framework:
research directions. Sensors (Switzerland) 2019;19:2651. doi: 10.3390/
s19122651.
25. Seshadri DR, Li RT, Voos JE, Rowbottom JR, Alfes CM, Zorman
CA, et al. Wearable sensors for monitoring the physiological and
biochemical profile of the athlete. NPJ Digit Med. 2019;2:1–16. doi:
10.1038/s41746-019-0150-9.
26. Homayounfar SZ, Andrew TL. Wearable sensors for monitoring
human motion: a review on mechanisms, materials, and challenges.
SLAS Technol. 2020;25:9–24.
27. Zhou H, Zhang Y, Qiu Y, Wu H, Qin W, Liao Y, et al. Stretchable
piezoelectric energy harvesters and self-powered sensors for wearable
and implantable devices. Biosens Bioelectron. 2020;168:112569. doi:
10.1016/j.bios.2020.112569.
28. Dinh T, Nguyen T, Phan HP, Nguyen NT, Dao DV, Bell J. Stretchable
respiration sensors: Advanced designs and multifunctional platforms
for wearable physiological monitoring. Biosens Bioelectron.
2020;166:112460. doi: 10.1016/j.bios.2020.112460.
29. Heo JS, Eom J, Kim YH, Park SK. Recent progress of textile-based
wearable electronics: a comprehensive review of materials, devices,
and applications. Small. 2018;14:1–16. doi: 10.1002/smll.201703034.
30. Moran DS, Mendal L. Core temperature measurement: methods and
current insights. Sport. Med. 2002.
31. Rice P, Upasham S, Jagannath B, Manuel R, Pali M, Prasad S. CortiWatch:
watch-based cortisol tracker. Futur Sci OA. 2019;5:FSO416.
32. Wen W, Tomoi D, Yamakawa H, Hamasaki S, Takakusaki K, An Q, et
al. Continuous estimation of stress using physiological signals during
a car race. Psychology. 2017;6:978–86. https://www.researchgate.net/
publication/317012834_Continuous_Estimation_of_Stress_Using_
Physiological_Signals_during_a_Car_Race
33. Chu M, Nguyen T, Pandey V, Zhou Y, Pham HN, Bar-Yoseph R, et
al. Respiration rate and volume measurements using wearable strain
90. Lan KC, Litscher G, Hung TH. Traditional chinese medicine pulse
diagnosis on a smartphone using skin impedance at acupoints: a
feasibility study. Sensors (Switzerland) 2020;20:1–14. doi: 10.3390/
s21010001.
91. Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science. 2018;362:1140–4.
92. Crosby V, Wireless G. Body area networks for healthcare: a survey.
Int J Ad hoc Sens Ubiquitous Comput. 2012;3:1–26. doi: 10.5121/
ijasuc.2012.3301.
93. Mathur A, Gupta CP. Big data challenges and issues: a review. Lect.
Notes Data Eng. Commun. Technol. Springer; 2020. 10.1007/978-3-
030-24643-3_53
94. Kluge EHW. Artificial intelligence in healthcare: ethical
considerations. Healthc Manag Forum. 2020;33:47–49. doi:
10.1177/0840470419850438.
95. Gómez-González E, Gomez E, Márquez-Rivas J, Guerrero-Claro
M, Fernández-Lizaranzu I, Relimpio-López MI, et al. Artificial
intelligence in medicine and healthcare: a review and classification of
current and near-future applications and their ethical and social Impact.
2020. http://arxiv.org/abs/2001.09778
Chapter 17
DAViS: A UNIFIED SOLUTION FOR DATA COLLECTION, ANALYZATION, AND VISUALIZATION IN REAL-TIME STOCK MARKET PREDICTION
ABSTRACT
The explosion of online information with the recent advent of digital
technology in information processing, information storing, information
sharing, natural language processing, and text mining techniques has
enabled stock investors to uncover market movement and volatility from
Citation: (APA): Tuarob, S., Wettayakorn, P., Phetchai, P., Traivijitkhun, S., Lim, S.,
Noraset, T., & Thaipisutikul, T. (2021). DAViS: a unified solution for data collection,
analyzation, and visualization in real-time stock market prediction. Financial Innova-
tion, 7(1), 1-32. (32 pages)
Copyright: © Open Access. This article is licensed under a Creative Commons Attribu-
tion 4.0 International License (http://creativecommons.org/licenses/by/4.0/)
INTRODUCTION
Stock market prediction has become a prominent research topic for
both researchers and investors due to its important role in the economy and
obvious financial benefits. There is an urgent need to uncover the stock
market’s future behavior in order to avoid investment risks while achieving
the best profit margins for investments. Nevertheless, stock market decision-
making is difficult due to the stock market’s complex behavior and unstable
nature. Accurate prediction is even more challenging considering the need
to forecast the local stock market in different countries (Wu et al. 2019;
Selvamuthu et al. 2019; Gopinathan and Durai 2019) since there are unique
cultures, different norms, and diverse heterogeneous sources that can affect
investors’ decision-making processes. Therefore, we take the Thai stock
RELATED LITERATURE
Multiple techniques have been proposed to analyze the various phenomena
in financial markets (Wen et al. 2019; Kou et al. 2021). The overarching
goal of this research is to implement a computational model that derives
the relationship between contextual information and related stocks in
the financial market. We can divide the traditional models into two main
approaches based on the type of information they are focused on: technical
data or fundamental data.
Technical Analysis makes predictions about future stock prices based on time-series numerical data, such as opening and closing prices and trade volume. The main purpose of this approach is to find trading patterns that can be exploited for future predictions. For example, Nayak et al. (2015) and Alhassan et al. (2014) discovered complicated stock patterns by utilizing the auto-regressive (AR) model, linearity, and stationary time series.
Nguyen et al. (2015), and Hagenau et al. (2013) predicted future stock prices
from historical data. Zhong and Enke (2019a) presented comprehensive big
data analytics based on 60 financial and economic features. They utilized
DNNs and traditional artificial neural networks (ANNs) along with the
principal component analysis (PCA) method to predict the daily direction
of future stock market index returns. Stoean et al. (2019) exploited deep-
learning methods with a heuristic-based strategy for trading simulations and
stock prediction. Nti et al. (2020) used an ensemble support vector machine
to boost stock prediction performance. However, the nature of stock price
prediction is highly volatile and non-stationary. Therefore, only utilizing
the numerical price data with technical analysis is inadequate to discover
dynamic market trends. In contrast, fundamental analysis integrates
information from outside market historical data such as news, social media,
and business reports as additional inputs for stock predictive models. For
example, Bollen et al. (2011) and Mao et al. (2011) proposed techniques
that mine opinions from social media for improved stock prediction. Vu et
al. (2012) first used a keyword-based algorithm to analyze and categorize
Twitter messages as positive, negative, and neutral. Then all features along
with historical prices were used to train a Decision Tree (C4.5) classifier to
predict the direction of future prices. Schumaker et al. (2012) investigated
the correlation between the sentiment of financial news articles and stock
movements. Later, Li et al. (2014) constructed sentiment vectors with the
Harvard psychological dictionary and used them to train a Support Vector
Machine (SVM) classifier to predict the daily open and closing prices. Jin
et al. (2013) presented Forex-Foreteller (FF), a currency trend model using
news articles as well as historical prices and currency exchange values. The
system used sentiment analysis and LDA (Blei et al. 2003a) to obtain a topical
distribution of each article. Akhtar et al. (2017) and Araque et al. (2017)
proposed ensemble model construction to enhance sentiment analysis. Such
methods are based on the work of Cheng et al. (2012), who examined whether
ensemble methods could outperform the base learning algorithms, each of
which learns from previous price information (as a time series). Afzali and
Kumar (2019) integrated a company’s textual information to improve stock
prediction performance. Lim and Tucker (2019) quantified the sentiment
in a financial market and social media to enhance performance in many
financial applications. Chattupan and Netisopakul (2015) used word-pair
features (i.e., a keyword and polarity word) to conduct a news sentiment
classification based on three sentiments: positive, negative, and neutral. In
addition, Lertsuksakda et al. (2014) used the hourglass of emotions which
is an improvement over Camras (1981)’s wheel of emotions—comprising
eight emotional dimensions, namely, joy, trust, fear, surprise, sadness,
disgust, anger, and anticipation—which has been utilized for many emotion-
inspired predictive tasks. While there have been many efforts to enhance the
performance of stock price prediction, few studies have provided an end-
to-end framework to collect, analyze, and visualize stock insights in a real-
time system. Our work differs from the existing studies since we leverage
both technical and fundamental data from online news, social networks, and
discussion boards to support investors’ decision-making processes. Details
on our proposed model are provided in the next section.
PRELIMINARY
In this section, we present the notations used throughout this paper. We
denote the sets of stock companies, technical data analysis, and fundamental
data analysis as S, T, and F, where the sizes of these sets are |S|, |T|, and
|F|, respectively. The technical data analysis utilizes technical information
such as the price-earnings ratio, market capitalization, and volume. These
types of data can be kept in a tabular format of real numbers. Investors
can conveniently gather this information from many stock-price reporting sources, such as the Stock Exchange of Thailand (SET) (Note 2), Yahoo Finance (Note 3), and Stock Radars (Note 4). On the other hand, fundamental data
analysis involves monitoring primarily three basic factors (i.e., economic,
industrial, and organizational performance) that can affect stock prices.
Such analysis requires examining both quantitative and qualitative data.
While it is not convenient to represent qualitative data, often distilled from
news articles, in a defined structural format, such insight can be helpful to
investors and therefore cannot be neglected.
Definition 1
Technical Data Analysis Time Series (T): Each stock company \(s \in S\) has historical stock prices sorted by time in chronological order. We define a company's historical stock prices as the time-ordered sequence of its prices over the past l days up to time t, where t is the current timestamp of the most recent stock price belonging to company s and l denotes the number of historical days used as the time lag.
Definition 2
Fundamental Data Analysis Time Series (F): Three types of fundamental data are used in this study: financial news information, discussion board information, and social media information. For each company, we collect a set of financial news articles, a set of social media posts, and a set of discussion board posts. For all the fundamental (F) data, we sort each of these sets by time in chronological order. We then define a company's historical contextual text data input as the time-ordered sequence of contextual text data over the past l days up to time t, where t is the current timestamp of the most recent contextual text data belonging to company s and l denotes the number of historical days used as the time lag.
Definition 3
Stock Data Time Series Input: This research focuses on time-series data; that is, historical prices along with contextual information are used as the input to the proposed stock predictive model. We therefore combine the l days of historical data from T and F and construct the horizontal input data to the model, so that decisions to purchase or sell can be made on a daily basis. For the lag observations, the data of the past three days (t−2, t−1, and t) is used to supervise the model. Such lag settings were also used by Bollen et al. (2011) to predict the Dow Jones Industrial Average (DJIA).
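To make the lagged-input construction in Definitions 1–3 concrete, the sketch below builds one flat input row per prediction day from the past l = 3 days of technical and fundamental features; the column layout and the daily alignment by date index are assumptions for illustration only.

```python
import pandas as pd

def build_lagged_input(technical: pd.DataFrame, fundamental: pd.DataFrame, lag: int = 3):
    """Concatenate the past `lag` days of technical and fundamental features
    into one flat (horizontal) input row per prediction day."""
    merged = technical.join(fundamental, how="left").fillna(0.0)  # align by date index
    frames = []
    for k in range(lag):
        shifted = merged.shift(k)
        shifted.columns = [f"{c}_t-{k}" for c in merged.columns]
        frames.append(shifted)
    return pd.concat(frames, axis=1).dropna()

# technical:   daily rows indexed by date (e.g. close, volume)
# fundamental: daily rows indexed by date (e.g. mean news sentiment, post count)
```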
stock prices are visualized with real-time and useful information such as
the sentiment of financial news and discussion board posts with respect
to a particular stock. In addition, the most related news articles and top
relevant topics are ranked and displayed on our end-to-end framework as
supplementary insights to support decision-making in a dynamic stock
market investment.
Data Collection
In our research, four types of contextual information are investigated for
their predictability of stock prices.
Historical Price Information: Technical information can be used for mathematical calculations with various variables. In our research, we gather information on stock prices using an application programming interface (API) to download historical stock data from the SiamChart website (Note 5), with a focus on seven attributes: date, opening price, highest price, lowest price, closing price, adjusted closing price, and trading volume.
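The SiamChart API itself is not documented here, so the snippet below simply assumes the downloaded history has already been saved as a CSV containing the seven attributes listed above; the file name and column labels are placeholders.

```python
import pandas as pd

# Hypothetical export of the downloaded history; column names are assumptions.
prices = pd.read_csv(
    "set_stock_history.csv",
    parse_dates=["date"],
    usecols=["date", "open", "high", "low", "close", "adj_close", "volume"],
).set_index("date").sort_index()

daily_return = prices["close"].pct_change()  # simple derived technical feature
```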
Financial News Information: Financial news often reports important events that may directly and/or indirectly affect a company's stock price. Publicly available news articles from reliable news sources such as Kaohoon (Note 6) and Money Channel (Note 7) are routinely crawled. To minimize assumptions about a news article's metadata, only common attributes such as the news ID, textual header, content, timestamp, and news source are parsed and stored. A news article is mapped to corresponding companies by detecting the presence of stock symbols in the news content; it is a common protocol for financial news sources to include related stock symbols in the corresponding news articles.
Social Media Information: Investors often express their opinions on social networks. In this research, Twitter messages (or tweets) are used as social media information. The open-source Get Old Tweets tool (Note 8) is used to collect public tweets. To allow the methodology to be generalized to other social media platforms, only common social media information such as textual content and timestamp is extracted and stored. User-identifiable information such as usernames and mentions is removed before storing and further processing.
Discussion Board Information: Discussion boards are used to exchange opinions on a company's situation, which may or may not be related to stock prices. A discussion thread comprises the main post and a sequence of comments. Such information could be used to infer the current sentiment toward a particular company. The Pantip discussion forum (Note 9) is used in our research due to its public availability and popularity among Thai investors. Based on our observations, the messages and discussed topics usually contain or are related to facts and company news that could be indicators for stock price movements. Furthermore, the overall sentiment expressed by users also indicates the situation of the mentioned companies and, subsequently, their stock movements. For our study, only public discussion threads that
Data Preprocessing
In this section, the techniques used in data pre-processing are explained.
These techniques can be divided into three steps.
Input-Data Extraction: This phase refers to the process of extracting useful content from the crawled HTML pages using an HTML parser. With the help of the Python BeautifulSoup4 library (Note 10), a document object model (DOM) traversal is used to extract the necessary information by defining the ID, class, or tag that the content belongs to. Following this, the timestamp is extracted to show when the article was released, which could help to visualize the connection between the data and the prices in a storyline format.
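A minimal sketch of this kind of DOM traversal with BeautifulSoup4 is shown below; the HTML structure and class names are invented placeholders, not those of any particular news site.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <h1 class="article-title">Bank posts record profit</h1>
  <div class="article-body">Quarterly earnings beat expectations ...</div>
  <span class="publish-date">2020-01-15 09:30</span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Select content by the class the element belongs to (class names are placeholders).
title = soup.find("h1", class_="article-title").get_text(strip=True)
body = soup.find("div", class_="article-body").get_text(strip=True)
timestamp = soup.find("span", class_="publish-date").get_text(strip=True)
print(title, "|", timestamp)
```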
Tokenization and Stop Word Removal: Tokenization, or word segmentation, is one of the first processes in traditional natural language processing (NLP). While effective tokenization tools are available for standard languages such as English, most algorithms for tokenizing Thai text are still under investigation (Tuarob and Mitrpanont 2017; Noraset et al. 2021). In our work, the Thai word segmentation open-source model developed by the National Electronics and Computer Technology Center (NECTEC), namely LexTo (Thai Lexeme Tokenizer) (Note 11), is used to tokenize the text. LexTo is a dictionary-based tokenizer that implements the longest matching algorithm. A textual document is mapped to corresponding companies using stock symbol detection. Information pertaining to each company is also extracted in this step.
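LexTo is a separate NECTEC tool, so rather than guessing its interface, the sketch below only illustrates the greedy longest-matching idea it is described as implementing: at each position, consume the longest dictionary word that matches, otherwise emit a single character. The toy dictionary is a placeholder.

```python
def longest_match_tokenize(text: str, dictionary: set, max_len: int = 20) -> list:
    """Greedy dictionary-based word segmentation (longest matching)."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # try the longest candidate first
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:  # unknown character: emit it on its own
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# toy example with an English-like placeholder dictionary
print(longest_match_tokenize("stockmarketnews", {"stock", "market", "news", "stockmarket"}))
# -> ['stockmarket', 'news']
```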
Text Vectorization Using TF-IDF: To use machine learning for text analysis, textual information needs to be converted into a machine-readable format, since raw text data cannot be fed straight into a machine learning algorithm. Specifically, each document must be represented by a fixed-length vector of real numbers. This process is often referred to as vectorization. Textual representation is performed by transforming the tokenized words in each document into a bag-of-words representation in which each term is one feature of the document vector. The bag-of-words approach is the de facto standard of text analysis research due to its simplicity and capacity to produce a vectorized representation of the text. Each term t is given a term frequency-inverse document frequency (TF-IDF) (Manning et al. 2009) score with respect to the document d, defined as:
\( \mathrm{tf}(t, d) = f_{t,d} \)   (1)
\( \mathrm{idf}(t) = \log \frac{n}{\mathrm{df}(t)} \)   (2)
\( \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) \)   (3)
where \( f_{t,d} \) is the number of occurrences of word t in article d, \( \mathrm{df}(t) \) defines the document frequency of the term t (the number of documents containing t), and n is the total number of documents.
Such terms are deemed to be both representative and meaningful. After performing the TF-IDF weighting, a document can then be represented by a vector of weighted terms. We use Python's scikit-learn library (Note 12) to vectorize textual documents.
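A minimal sketch of this vectorization step with scikit-learn's TfidfVectorizer is shown below; the toy documents are placeholders, and in the real pipeline the LexTo-segmented Thai tokens would be supplied instead of whitespace-split English words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "bank posts record quarterly profit",
    "profit warning hits retail stocks",
    "retail sales recover after quarter",
]

# For Thai text, a custom tokenizer (e.g. one wrapping LexTo output) could be
# passed via the `tokenizer` argument; default whitespace splitting is used here.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)  # documents x terms sparse matrix

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```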
Figure 7: Illustration of the engineered features after applying PCA and Ward
clustering to reduce the feature space.
As a result of feature engineering, the feature space is reduced to 40 dimensions derived from the integration of two dimensionality reduction techniques (the first 20 dimensions come from principal component analysis (PCA), and the other 20 dimensions come from Ward hierarchical agglomerative clustering), as illustrated in Fig. 7.
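One possible reading of this 20 + 20 design, sketched with scikit-learn (PCA plus Ward-linkage feature agglomeration, concatenated): the component counts follow the text, while the input matrix and all other settings are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))   # placeholder stand-in for the dense TF-IDF matrix

pca = PCA(n_components=20)
ward = FeatureAgglomeration(n_clusters=20, linkage="ward")

X_pca = pca.fit_transform(X)      # first 20 engineered dimensions
X_ward = ward.fit_transform(X)    # other 20 dimensions (per-cluster pooled features)

X_40 = np.hstack([X_pca, X_ward]) # 40-dimensional engineered feature space
print(X_40.shape)                 # (500, 40)
```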
(4)
(5)
(6)
where score(d) is the score of document d; sentiment is the weight of its
sentiment classes; informativeness is the weight of its category classes; date
is document release time, formatted as ‘yyyymmdd’; N(s|d) is the number of
stocks s related to a document d; \(\beta\) is the bias factor, which is a pre-
defined weight scheme added to compute the inverse relation of N(s|d), and
is set to 5.0 by default.
(7)
For the implementation, we apply the TwittDict algorithm proposed by
Tuarob et al. (2015), which is an extension of Latent Dirichlet Allocation
(LDA) (Blei et al. 2003b) that extracts emerging social-oriented key phrase
semantics from Twitter messages. Such key phrases extracted from a corpus of news articles are ranked based on their prevalence probability. Top key phrases are used to generate a tag cloud that captures the current topics.
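TwittDict itself is not reproduced here; as a stand-in, the sketch below runs plain LDA (Blei et al. 2003) over a small bag-of-words corpus and lists the top terms per topic, which is the baseline that TwittDict extends with social-oriented key-phrase extraction. The corpus and parameter values are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "central bank raises policy rate",
    "energy stocks rally on oil price",
    "bank loan growth slows this quarter",
    "oil price jump lifts energy sector",
]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}:", ", ".join(top))
```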
EXPERIMENTAL SETUP
Dataset Statistics
Table 2: Dataset statistics, including number of news articles, forum posts, and
tweets for each stock
Evaluation Metrics
In this section, the performance metrics used to evaluate the predictions, in terms of both the magnitude of the error and directional accuracy, are described as follows (a short computation sketch is given after the list):
• Mean Absolute Percentage Error (MAPE) is an error-based
measurement that calculates the absolute error by percentage
with respect to the actual value.
• Directional Accuracy (DA) provides a measurement of prediction
direction accuracy. The predicted values can be considered
positive or negative directions.
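A minimal sketch of how these two metrics can be computed; the example values are placeholders, and the convention of comparing the signs of predicted and actual day-over-day changes for DA is an assumption.

```python
import numpy as np

def mape(actual, predicted) -> float:
    """Mean Absolute Percentage Error, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

def directional_accuracy(actual, predicted) -> float:
    """Share of days on which the predicted price change has the same sign
    as the actual price change, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    true_dir = np.sign(np.diff(actual))
    pred_dir = np.sign(np.diff(predicted))
    return float(np.mean(true_dir == pred_dir) * 100.0)

actual = [10.0, 10.2, 10.1, 10.4]
predicted = [10.1, 10.3, 10.0, 10.5]
print(mape(actual, predicted), directional_accuracy(actual, predicted))
```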
Table 3: Example distribution of the training set (train), validation set (valid),
and testing set (test) from the dataset set of news articles
EXPERIMENTAL RESULT
In this section, the experiments are conducted to answer the following research questions:
•	RQ1: What is the proper feature engineering method to use in DAViS-C for dimensionality reduction?
•	RQ2: What is the proper size of dimension decomposition used in DAViS-A’s decomposition process?
•	RQ3: What are the proper time lags (l) to use for stock prediction in DAViS-A?
•	RQ4: Does the proposed ensemble machine learning with contextual text data in DAViS-A-w-c outperform the one without contextual text data in DAViS-A-wo-c?
•	RQ5: How do the different types of contextual text data affect stock prediction performance?
•	RQ6: How well does the DAViS-V classification task perform on financial sentiment analysis and news informative analysis?
•	RQ7: How well does DAViS-V perform in the document scoring task?
•	RQ8: How well does DAViS-V perform in the topic modeling task?
•	RQ9: Can our proposed ensemble machine learning approach in DAViS-A provide interpretable results to stock investors?
•	RQ10: Can our end-to-end DAViS framework provide useful insights for investors to make real-time decisions on stock investments?
data used in DAViS-A-w-c includes all text from news articles, social media
messages, and discussion board posts. Next, we analyze the performance
comparison between the ensemble stock machine learning prediction with and without contextual text data, denoted as DAViS-A-w-c and DAViS-A-wo-c, respectively. Table 5 shows that DAViS-A-w-c could outperform all
base estimators in terms of error-based performance metrics by yielding a
MAPE of 0.93% and a DA of 54.36%. Statistical tests shown in Table 6
confirm that the performance of our proposed ensemble stacking estimator is
statistically significantly different from that of the other baseline estimators,
especially in terms of DA. We also observed that including contextual text
data in DAViS-A-w-c could improve the stock prediction performance by
large margins.
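A minimal sketch of an ensemble stacking regressor in the spirit described, using the base learners listed later in this chapter (decision tree, random forest, k-NN, Bayesian ridge, AdaBoost, and gradient boosting; XGBoost is omitted here because it requires the separate xgboost package). The final estimator, synthetic data, and hyperparameters are placeholders rather than the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 40))                               # e.g. the 40-d engineered features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=400)   # placeholder target

base_learners = [
    ("dt", DecisionTreeRegressor(random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("knn", KNeighborsRegressor()),
    ("bay", BayesianRidge()),
    ("ada", AdaBoostRegressor(random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
]

stack = StackingRegressor(estimators=base_learners, final_estimator=Ridge())

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
stack.fit(X_tr, y_tr)
print("R^2 on held-out data:", round(stack.score(X_te, y_te), 3))
```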
Table 6: Comparison of the p-values from the Student’s paired t-test between
the proposed ensemble stacking and other baseline estimators (DAViS-A-w-c),
with \(\alpha\) = 0.05
Figure 13: The number of data points in each class in the dataset.
number of related stocks. The results of news articles scoring and ranking
are listed in Table 10.
Table 10: Ranked documents based on our proposed document scoring tech-
nique
Table 11: Notable key phrases of a selected company using TwittDict’s topic
modeling technique, where a key phrase is denoted as a mixture of topics, fre-
quency is its occurrence in the company’s corpus; score defines the relevance of
its key phrase corresponding to the company
As seen in Table 11, most of the key phrases might not convey sufficient information. This might be because an imbalanced distribution of news articles is generated each day, as illustrated in Fig. 11. Thus, the topic modeling could be misled by the high volume of the market news category.
phrases do not provide meaningful messages to investors, it is undeniable
that there would be potential benefits if this topic modeling approach can
discover emerging insightful information early. Therefore, a possible
improvement would be to equip the system with the ability to automatically
perform document filtering and extract valuable topics.
Here DT, RF, KNN, BAY, AdaBoost (Ada), GB, and XGB are used as
base learners
news articles, discussion boards, and social media, is extracted and digested
using machine-learning techniques to gain insight into stock markets. As
discussed in the prototype model of DAViS, we proposed an interpretable
ensemble stacking of diversified machine-learning-based estimators in
combination with an engineered textual transformation using the PCA
and Ward hierarchical features to predict the next day’s stock prices. The
use of textual analysis with a topic modeling-based technique is applied
to extract useful information such as sentiment, informativeness, and key
phrases. Finally, we described how documents are scored and ranked based
on different variables in our system. Future studies could further develop the
system to include even more contextual knowledge and discover predictive
signals that could be deployed in an innovative algorithmic trading system.
Integrating the prediction into a trading strategy and comparing it with
existing ones could also further expand the practicality of our proposed
methods.
Notes
1. https://www.set.or.th/.
2. https://www.set.or.th/.
3. https://finance.yahoo.com/.
4. https://www.stockradars.co/.
5. http://www.siamchart.com.
6. https://www.kaohoon.com.
7. http://www.moneychannel.co.th.
8. https://github.com/Jefferson-Henrique/GetOldTweets-python.
9. https://www.pantip.com.
10. https://pypi.python.org/pypi/beautifulsoup4.
11. www.sansarn.com/lexto.
12. http://scikit-learn.org.
ACKNOWLEDGEMENTS
This research project is supported by Mahidol University (Grant No. MU-
MiniRC02/2564). We also appreciate the partial computing resources
from Grant No. RSA6280105, funded by Thailand Science Research and
Innovation (TSRI), (formerly known as the Thailand Research Fund (TRF)),
and the National Research Council of Thailand (NRCT).
REFERENCES
1. Afzali M, Kumar S (2019) Text document clustering: issues and
challenges. In 2019 International conference on machine learning, big
data, cloud and parallel computing (COMITCon). IEEE, pp 263–268
2. Akhtar MS, Gupta D, Ekbal A, Bhattacharyya P (2017) Feature
selection and ensemble construction: a two-step method for aspect
based sentiment analysis. Knowl Based Syst 125(Supplement C):116–
135 (ISSN 0950-7051)
3. Alhassan J, Abdullahi M, Lawal J (2014) Application of artificial
neural network to stock forecasting-comparison with ses and arima. J
Comput Model 4(2):179–190
4. Araque O, Corcuera-Platas I, Sánchez-Rada JF, Iglesias CA (2017)
Enhancing deep learning sentiment analysis with ensemble techniques
in social applications. Exp Syst Appl 77(Supplement C):236–246
(ISSN 0957-4174)
5. Blei DM, Ng AY, Jordan MI (2003a) Latent dirichlet allocation. J Mach
Learn Res 3(Jan):993–1022
6. Blei DM, Ng AY, Jordan MI (2003b) Latent dirichlet allocation. J Mach
Learn Res 3(Jan):993–1022
7. Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock
market. J Comput Sci 2(1):1–8 (ISSN 1877-7503)
8. Bomfim AN (2003) Pre-announcement effects, news effects, and
volatility: monetary policy and the stock market. J Bank Finance
27:133–151
9. Camras L (1981) Emotion: theory, research and experience. Am J
Psychol 94(2):370–372 (ISSN 00029556)
10. Chattupan A, Netisopakul P (2015) Thai stock news sentiment
classification using wordpair features. In: The 29th Pacific Asia
conference on language, information and computation, pp 188–195
11. Cheng C, Xu W, Wang J (2012) A comparison of ensemble methods in
financial market prediction. In: 2012 Fifth international joint conference
on computational sciences and optimization. IEEE, pp 755–759
12. Colas F, Brazdil P (2006) Comparison of svm and some older
classification algorithms in text classification tasks. In IFIP international
conference on artificial intelligence in theory and practice. Springer, pp
169–178
INDEX

A
aviation 307
Abnormal big data 259
accelerometers 92
accurate data 269
Accurate prediction 338
Active data collection 92
activity value 248, 252, 254, 255
agriculture 307
algorithms 172
ambulatory assessment 209
analysis notes 47
AOA (angle of arrival) 312
application programming interface (API) 275, 349
application software 258
artificial intelligence 306, 327
artificial intelligence-based body sensor network framework (AIBSNF) 306
artificial neural networks (ANN) 309
astronomy 307
audio sensors 92
automated research management system 222

B
back-end data collection engine 292
banking 307
Bayesian Ridge regression (BAY) 354
behavioural data 90, 92
big data 247, 248, 249, 250, 251, 252, 253, 254, 255, 256
biowearables 269
Bluetooth radios 92
body sensor networks (BSN) 306
breastfeeding 222, 223, 224, 229, 231, 234, 235, 237, 239, 243, 245

C
case-control studies 123, 150
case report 123
case series studies 123
Clinical research professionals (CRPs) 269
clinical trial 223, 227, 229, 238, 240, 241
cloud-based research 90
cloud computing 258, 259