FOUNDATIONAL STUDIES FOR MEASURING THE IMPACT, PREVALENCE, AND PATTERNS OF PUBLICLY SHARING BIOMEDICAL RESEARCH DATA

A. Specific Aims
Many initiatives encourage research data sharing in hopes of increasing research efficiency and quality, but the effectiveness of these early initiatives is not well understood. Sharing and reusing scientific datasets have many potential benefits: in addition to providing detail for original analyses, raw data can be used to explore related or new hypotheses, particularly when combined with other publicly available data sets. Real data is indispensable when investigating and developing study methods, analysis techniques, and software implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and patient population resources by avoiding duplicate data collection. Eager to encourage the realization of such benefits, funders, publishers, societies, and individual research groups have developed tools, resources, and policies to encourage investigators to make their data publicly available. Despite these investments of time and money, we do not yet understand the rewards, prevalence or patterns of data sharing and reuse, the effectiveness of initiatives, or the costs, benefits, and impact of repurposing biomedical research data.

Studies examining current data sharing behavior would be useful in three ways. First, an estimate of the prevalence with which data is shared, either voluntarily or under mandate, would provide a valuable baseline for assessing future adoption and continued intervention. Second, analyses of current behavior will likely identify subfields (perhaps research areas with a particular disease or organism focus, or those in well-funded research groups) with relatively high prevalence of data sharing; digging into these can illuminate valuable best practices. Third, the same analyses will likely reveal subareas in which researchers rarely share their research datasets. Future research could focus on these challenging areas, to understand their unique obstacles for data sharing and refine future initiatives accordingly.

You cannot manage what you do not measure: understanding the rewards, prevalence, and patterns of data sharing and withholding will facilitate effective refinement of data sharing initiatives to better address real-world needs. The long-term goal of this research is to accelerate research progress by increasing effective data reuse through informed improvement of data sharing and reuse tools and policies. The objective of this proposal is to examine the feasibility of evaluating data sharing behavior based on examination of the biomedical literature. The central hypothesis of this proposal is: Analysis of the impact, prevalence, and patterns with which investigators share and withhold gene expression microarray research data can uncover rewards, best practices, and opportunities for increased adoption of data sharing. To evaluate the central hypothesis, I will perform the following specific aims:

Aim 1: Does sharing have benefit for those who share? I will investigate the association between sharing raw microarray data and subsequent citation rate of published studies. I will use datasets generated by a small, relatively homogeneous set of cancer gene expression microarray clinical trials. Multivariate analysis will be used to statistically control for potential confounders. The results of Aim 1 provide motivation for Aim 2 and preliminary work for Aim 3. Note: This study has already been completed.

Aim 2: Can sharing and withholding be systematically measured? Because the manual methods used to conduct Aim 1 do not scale to larger analyses, I will investigate automatic methods for measuring data sharing and withholding behavior. First, articles that generate gene expression microarray data will be identified using NLP on full-text research articles. Second, to assess whether the authors of these data-generating studies share or withhold their data, I will investigate using database submission citation links as evidence of data sharing. The results of Aim 2 will be used to generate a dataset for use in Aim 3.

Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior? First, I will apply the classification systems described in Aim 2 to a wide spectrum of the biomedical literature to identify articles that have generated gene expression microarray data and, subsequently, which of the articles that generated data also shared it. Then, for each of the articles, I will collect and analyze features related to the authors, their institutional and funding environment, the study itself, and the publishing mechanism. I will use univariate and multivariate statistics to investigate which of these features are associated with dataset sharing. Finally, I will use exploratory factor analysis to derive a model that could be used to explain data sharing decisions based on my measured variables.

This proposal describes a new, exploratory, and innovative research project that could radically impact the adoption of data sharing in biomedical research. Expected contributions for this proposal include (a) an assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing, (b) a large, publicly available dataset associating microarray study publications with data sharing status, and (c) tools and methods for continued research in this area. This developmental work will provide a strong foundation for refining initiatives to efficiently and effectively encourage data sharing.
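The univariate step of Aim 3 could look something like the following minimal sketch. It is an illustration only, not the analysis plan itself: it assumes a single binary feature (for example, whether a journal has a data sharing policy) cross-tabulated against sharing status, and it uses an odds ratio with a Woolf-style confidence interval as one possible measure of association. The function name, argument names, and counts are hypothetical.

```python
import math


def odds_ratio(shared_with, withheld_with, shared_without, withheld_without):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    Rows: articles with / without a hypothetical binary feature.
    Columns: articles that shared / withheld their dataset.
    """
    a, b, c, d = shared_with, withheld_with, shared_without, withheld_without
    if min(a, b, c, d) == 0:
        # Continuity correction so the ratio and its log stay defined
        # when a cell is empty (an assumption of this sketch).
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio), Woolf's method.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - 1.96 * se)
    upper = math.exp(math.log(ratio) + 1.96 * se)
    return ratio, lower, upper


# Made-up counts, for illustration only.
ratio, lower, upper = odds_ratio(30, 10, 20, 40)
print(f"OR = {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A confidence interval that excludes 1 would suggest an association worth carrying into the multivariate analysis; it would not by itself establish that the feature causes sharing.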

B. Background and Significance
Widespread adoption of the Internet now allows research results to be shared more readily than ever before. This is true not only for published research reports, but also for the raw research data points that underlie the reports. Investigators who collect and analyze data can submit their datasets to online databases, post them on websites, and include them as electronic supplemental information, thereby making the data easy to examine and reuse by other researchers.

Reusing research data has many benefits for the scientific community. New research hypotheses can be tested more quickly and inexpensively when duplicate data collection is reduced. Data can be aggregated to study otherwise-intractable issues, and a more diverse set of scientists can become involved when analysis is opened beyond those who collected the original data. Ethically, it has long been considered a tenet of scientific behavior to share results, thereby allowing close examination of research conclusions and facilitating others to build directly on previous work. The ethical position is even stronger when the research has been funded by public money, or the data are donated by patients and so should be used to advance science to the greatest extent permitted by the donors.

However, while the general research community benefits from shared data, much of the burden for sharing the data falls to the study investigator. Unfortunately, these advantages only indirectly benefit the stakeholders who bear most of the costs for sharing their datasets: the primary data-producing investigators. A major cost is time: the data have to be formatted, documented, and released. Further, it is sometimes complicated to decide where to best publish data, since supplementary information and laboratory sites are transient. Beyond a time investment, releasing data can induce fear. There is a possibility that the original conclusions may be challenged by a re-analysis, whether due to possible errors in the original study, a misunderstanding or misinterpretation of the data[26], or simply more refined analysis methods. Future data miners might discover additional relationships in the data, some of which could disrupt the planned research agenda of the original investigators. Investigators may fear they will be deluged with requests for assistance, or need to spend time reviewing and possibly rebutting future re-analyses. They might feel that sharing data decreases their own competitive advantage, whether future publishing opportunities, information trade-in-kind offers with other labs, or potentially profit-making intellectual property. Finally, it can be complicated to release data. If not well managed, data can become disorganized and lost. Some informed consent agreements may not obviously cover subsequent uses of data. De-identification can be complex. Study sponsors, particularly from industry, may not agree to release raw detailed information. Data sources may be copyrighted such that the data subsets cannot be freely shared.

Recognizing that these disincentives make the establishment of a voluntary data sharing culture unlikely without policy guidance, many initiatives actively encourage or require that investigators make their raw data available for other researchers. There is a well-known adage: you cannot manage what you do not measure. For those with a goal of promoting responsible data sharing, it would be helpful to evaluate the effectiveness of requirements, recommendations, and tools. When data sharing is voluntary, insights could be gained by learning which datasets are shared, on what topics, by whom, and in what locations. When policies make data sharing mandatory, monitoring is useful to understand compliance and unexpected consequences.

Unfortunately, it is difficult to monitor data sharing because data can be shared in so many different ways. Previous assessments of data sharing have included manual curation and investigator self-reporting. These methods are only able to identify instances of data sharing and data withholding in a limited number of cases, and therefore are unable to support inquiry into patterns of data sharing behavior. This proposal addresses three phases of research critical to a full evaluation of data sharing behavior: Aim 1: Does sharing have benefit for those who share? Aim 2: Can sharing and withholding be systematically measured? Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?

B.1 Significance of this proposal
This proposal describes a new, exploratory, and innovative research project that could radically impact the adoption of data sharing in biomedical research. This work directly supports NIH strategic initiatives. This work will provide a foundation for evaluating the effectiveness of the NIH’s data sharing policies. Further, the current proposal will contribute to the NLM’s goals of “contributing to comprehensive strategies for preservation of biomedical information in the US and worldwide” and “developing linked databases for discovering relationships between clinical data, genetic information, and environmental factors”. This work would also be relevant for the NCRR in its work to “facilitate information sharing among biomedical researchers” as part of its current strategic plan. Progress will also be of significant value to a broad cross-section of disciplines, including:

• Funders, policy makers and thought leaders. Although some results of this analysis may be intuitive (a stronger journal data sharing policy results in more data sharing, or shared data permits reuse and thus supports a higher citation rate), these relationships have not yet been demonstrated. Concrete evidence, whether supporting or contradictory, will be of value to a wide spectrum of policy makers and thought leaders. For example, funders can improve the impact and efficiency of their investments through applying lessons learned from communities with high data sharing adoption to those with room for improvement.

• Database, software, and data standard developers. The usage patterns of those who share data provide critical requirement specification feedback for developing and refining databases, software, and standards to support data sharing and reuse. Learning who does not currently share data can provide insight into failings of current tools and opportunities for improvements.

• Biomedical informatics community. Informatics involves evaluation of the generation, use, and value of information resources; this research addresses this topic from a novel perspective. The biomedical informatics field will also benefit by exposure to methods it does not commonly apply. For example, my plans to apply NLP techniques to the biomedical literature through full-text portals could have wide applicability for information retrieval. Finally, the general biomedical informatics community will benefit if and when this research leads to initiatives that increase the rate of data sharing.

• Information science and digital library community. Data use behavior and resource usage metrics are active research topics in information science and digital library research. Several ongoing projects are investigating data use in the social sciences, but there has been little recent investigation or measurement of research sharing in the biomedical arena. Furthermore, most of the information science studies have used survey approaches; my emphasis on measured variables will provide a diverse perspective.

• Open Science community. Grassroots movements to increase openness and transparency in science will benefit from rigorous, quantitative assessments of current data sharing behavior.

• Primary Investigators. Last but not least, I expect that this research will help inspire investigators to share their data and help inform the creation of tools that help them. As data sharing is evaluated and policies and incentives improved, hopefully investigators will become more apt to share and reuse study data and thus maximize its usefulness to society.


Expected contributions, taking the form of papers and associated datasets, include:

1. an assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing
2. a publicly available dataset associating microarray study publications with data sharing status
3. a generalizable approach for developing practical, real-world natural language tools for information retrieval and extraction within a wide selection of biomedical literature
4. preliminary models of data sharing behavior

Although I plan to limit this study to one datatype, which allows an in-depth analysis of many specific facets of data sharing and reuse, I believe the approach and many of the results will be generalizable across domains. As further support for the significance of this work, our preliminary work has been enthusiastically welcomed by peer reviewers; reviews have frequently declared the research to be “relevant and timely.” I believe this developmental work will provide a strong foundation for refining initiatives to efficiently and effectively encourage data sharing.

B.2 The potential benefits of data sharing
Sharing information facilitates science. Reusing previously-collected data in new studies allows these valuable resources to contribute far beyond their original analysis. In addition to being used to confirm original results, raw data can be used to explore related or new hypotheses, particularly when combined with other publicly available data sets. Real data is indispensable when investigating and developing study methods, analysis techniques, and software implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and patient population resources by avoiding duplicate data collection. Believing that these benefits outweigh the costs of sharing research data, many initiatives actively encourage investigators to make their data available. Some journals require the submission of detailed biomedical data to publicly available databases as a condition of publication. Since 2003, the NIH has required a data sharing plan for all large funding grants and has more recently introduced stronger requirements for genome-wide association studies; other funders have similar policies. Several government whitepapers and high-profile editorials call for responsible data sharing and reuse; large-scale collaborative science is providing the opportunity to share datasets within and outside of the original research projects; and tools, standards, and databases are developed and maintained to facilitate data sharing and reuse.

B.3 Current data sharing practice
As highlighted above, sharing research data has many potential benefits to society. Although sharing of data has always been an aspiration of the scientific enterprise, it has only been common in a few subdisciplines. Forces are now converging to make it an achievable and everyday practice.

Forces in support of increased data sharing
Datasets are larger than they have ever been, and larger than any single team of scientists can analyze exhaustively. The ubiquitous sharing and reuse of DNA sequences in Genbank has clearly demonstrated the power of openly shared data. Other high-throughput hypothesis-generating datasets, such as genome-wide association studies, gene expression microarrays, proteomics mass spectra, and brain images allow data to be repurposed to answer multiple research questions. Extensive datasets are also generated within the clinical setting, particularly through the adoption of electronic health records. Stakeholders have begun to develop recommendations and guidelines for the complex ethical, legal, and technical issues surrounding the reuse and sharing of health data beyond primary healthcare.

Research is increasingly performed within networks of multi-disciplinary teams. The NIH Roadmap and other initiatives have recognized that significant scientific progress requires collaboration. Collaborations develop and adopt frameworks, standards, tools, and policies to share data among investigators. This work can facilitate sharing their data beyond the boundaries of the original research partners.

Today’s collaborative science on large datasets is performed within an extremely tight biomedical funding environment. Many funding agencies have instituted data-sharing policies, hoping to accelerate scientific progress while avoiding the cost of duplicative collection efforts. The NIH Data Sharing Policy, adopted in 2003, requires a data sharing plan for all research grants over $500K. The NIH stipulates additional requirements for specific domains. For example, all funded genome-wide association studies (GWAS) are now expected to share their data in the centralized NCBI database, dbGaP. Complementing and extending these funding agency requirements, many biomedical journals require or recommend that data be shared as a condition of publication. Some journals delineate the responsibilities in detail and include procedures for addressing data sharing noncompliance.

Open, centralized databases such as Genbank, Uniprot, and the Gene Expression Omnibus have evolved into de facto homes for specific types of data. Standards for minimum data inclusion and data formats have been developed for many types of datasets. The challenge of integrating datasets has spurred research progress on ontologies and semantic description methods. Projects such as NCBI’s Entrez database suite, the Semantic Web for Life Sciences, the National Center for Biomedical Ontology’s Bioportal framework, and caBIG provide visions for the future of research when data is more universally available and interoperable.

Data sharing and integration are being actively pursued outside of biomedical research, in other scientific fields (physics, astronomy, environmental science) and also by the general public. Several websites encourage uploading and visualizing all sorts of data: the “Tasty Data Goodies” at Swivel (http://www.swivel.com) and IBM’s Many Eyes (http://www.many-eyes.com) are popular examples. Widespread adoption of Web 2.0 technologies, including blogging, tagging, wikis, and mashups, suggests that our next generation of scientists will expect and embrace a world of research remixes.

Finally, I note the complementary forces of open access and pre-print publications, open notebook science projects, open source code, Creative Commons copyright licenses (http://creativecommons.org/) for many kinds of original content (including data), and two recent public access policies.
The NIH Public Access Policy will require all NIH-funded investigators to submit their peer-reviewed manuscripts to PubMed Central to ensure public access, beginning in April 2008. In February 2008, the faculty of Harvard University voted to make all faculty scholarly publications freely available in an online open-access repository, the first such resolution by a university in the United States. While these policies do not apply to data beyond that provided within the manuscripts, they clearly demonstrate a political will to support sharing research results “to help advance science and improve human health” (http://publicaccess.nih.gov) and “promote free and open access to significant, ongoing research”.

Forces opposing data sharing
While many forces are converging to enhance our ability to share data, there are significant social, organizational, technical and legislative factors that may impede it. Investigators may restrict access to data to maximize the professional and economic benefit that they accrue from data they generate, even though they also gain advantage by accessing data produced by others. A recent review of genomic data sharing highlighted the complexity of stakeholder interests both for and against data sharing, beyond those of the original investigators. Study subjects may have personal interests in privacy and confidentiality that exceed their personal interests in contributing to new methods of detecting and treating disease. Academic health centers may view data sharing as a threat to intellectual property, with the potential to impede spin-offs and start-ups that bring revenue and act as incubators for future research. Industrial sponsorship may hinder plans for sharing data. Changes in the regulatory environment make the sharing of data more complex, and may necessitate more stringent oversight to ensure compliance and minimize risk. Finally, limitations imposed by specific technologies undermine the ability of a uniform approach to generalize across different data types and regulatory requirements. It is often difficult to effectively incent and mandate data sharing. Mandates are often controversial, while requests and unenforced mandates are often ignored. The effect of funder policies like the NIH Data Sharing Policy has not been systematically studied, but anecdotal evidence suggests that many researchers view funder policies as optional, since data sharing plans are not considered as part of scientific evaluation, and the mandate is only for a plan, not the sharing itself.
[Personal communication with Jenny Tucker, pending permission to include] I believe that a critical element in balancing these opposing forces is a better understanding of current data sharing behavior, patterns, and predictors to be used for communicating and refining sharing best-practices.

B.4 Related research on data sharing behavior
A few investigations into data sharing behavior and attitudes have initiated work in this area. Findings and outstanding challenges are highlighted below.

Measuring and modeling data sharing behavior
Most measurements of data sharing prevalence have manually searched for shared datasets across a subset of journals, or systematically contacted authors to ask for shared datasets. These studies have found that data sharing levels are high (but less than 100%) in a few cases, but overall prevalence is low. For example, Ochsner et al. found that despite the maturity of gene expression microarray data sharing infrastructure and the multitude of funder and journal mandates, overall data sharing across 20 journals in 2007 was about 50%. These analyses have not correlated their prevalence findings with other variables to detect patterns. Multivariate analyses have relied upon surveyed attitudes and intentions (described below), rather than measured characteristics.

Measuring and modeling data sharing attitudes and intentions
The largest body of knowledge about motivations and predictors for biomedical data sharing and withholding comes from Campbell and co-authors. They surveyed researchers, asking whether they had ever requested data and been denied, or had themselves denied other researchers access to data. Results indicated that participation in relationships with industry, mentors’ discouragement of data sharing, negative past experience with data sharing, and male gender were associated with data withholding. In another survey, among geneticists who said they intentionally withheld data related to their published work, 80% said it was too much effort to share the data, 64% said they withheld data to protect the ability of a junior team member to publish, and 53% withheld data to protect their own publishing opportunities. Occasionally, the administrators of centralized data servers publish feedback surveys of their users. As an example, Ventura reports a survey of researchers who submitted and reviewed microarray studies in the Physiological Genomics journal after its mandatory data submission policy had been in place for two years. Almost all (92%) authors said that they believed depositing microarray data was of value to the scientific community and about half (55%) were aware of other researchers reusing data from the database. In related research, the information science and management of information systems communities have developed several models of knowledge sharing. These models often use either case studies or opinions and attitudes gathered through validated survey instruments. Studied domains include knowledge sharing within an organization, volunteering knowledge in open social networks, physician knowledge sharing in hospitals, participation in open source projects, academic contributions to institutional archives, and other related activities.

Identifying instances of data sharing
While surveys have provided insight into sharing and reuse behavior, other issues are best examined by studying the demonstrated behavior of scientists. Unfortunately, observed measurement of data behavior is difficult because of the complexity in identifying all episodes of data sharing and reuse. Although indications of sharing and reuse usually exist within a published research report, the descriptions are in unstructured free text and thus complex to extract. Most studies of data sharing to date have used a manual review to identify shared datasets. One automated approach for identifying data sharing behavior is to follow the “primary citation” field of database submission entries. Unfortunately, this is imperfect, since these references are often missing when data is submitted prior to study publication. Populating the submission citation fields retrospectively requires intensive manual effort, as demonstrated by the recent Protein Data Bank remediation project, and thus is not usually performed. No effective way exists to automatically retrieve and index data housed on personal or lab websites or journal supplementary information. Related research has examined the degree to which data remains available after it has been shared. Multiple studies underscore the transience of supplementary information, website URLs, and corresponding author email addresses.
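The "primary citation" approach described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the record layout, the field name `primary_citation_pmid`, and the identifiers are hypothetical and do not reflect the actual export format of any database.

```python
def label_sharing(article_pmids, submission_records):
    """Label each article by whether any submission record cites it.

    submission_records: iterable of dicts with a hypothetical
    'primary_citation_pmid' field, which may be missing or None when
    data was submitted before the study was published.
    """
    cited = {
        rec["primary_citation_pmid"]
        for rec in submission_records
        if rec.get("primary_citation_pmid")
    }
    return {pmid: pmid in cited for pmid in article_pmids}


# Made-up identifiers, for illustration only.
records = [
    {"accession": "GSE0001", "primary_citation_pmid": "111"},
    {"accession": "GSE0002", "primary_citation_pmid": None},
]
print(label_sharing(["111", "222"], records))
```

Because citation fields are often left empty, a False label here means only that no link was found. It is not evidence that the data was withheld.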

Evaluating the impact of data sharing policies
Despite many funder and journal policies requesting and requiring data sharing, the impact of these policies has only been measured in small and disparate studies. McCain manually categorized the journal “Instruction to Author” statements in 1995. A more recent manual review of gene sequence papers found that, despite requirements, up to 15% of articles did not submit their datasets to Genbank. Analyses of reproducibility in the political science literature suggest that only actively enforced journal policies are effective. Studying the impact of data sharing policies is difficult because policies are often confounded with other variables. If, for example, impact factor is positively correlated with a strong journal data sharing policy as well as a large research impact, it is difficult to distinguish the direction of causation. Evaluating data sharing policies would ideally involve a randomized controlled trial, but unfortunately this is impractical. In related work, evaluations have been done to estimate the impact of reporting guidelines.

Estimating the costs and benefits of data sharing
Estimating the costs and benefits of data sharing would be challenging even with a comprehensive dataset of occurrences. A complete evaluation would require comparing projects that shared with other similar projects that did not, across a wide variety of variables including person-hours-till-completion, total project cost, received citations and their impact, the number and impact of future publications, promotion, success in future grant proposals, and general recognition and respect in the field. Pienta is currently investigating these questions with respect to social science research data and publications. Zimmerman has studied the ways in which ecologists find and validate datasets to overcome the personal costs and risks of data reuse. Examining variables for their benefits on research impact is a common theme within the field of bibliometrics. Research impact is usually approximated by citation metrics, despite their recognized limitations.

Related research fields
Evaluation of data sharing and reuse behavior is related to a number of other active research fields: code reusability in software engineering, motivation in open source projects, online knowledge sharing communities, corporate knowledge sharing, tools for collaboration, evaluating research output, the sociological study of altruism, information retrieval, usage metrics, data standards, the semantic web, open access, and open notebook science.

B.5 Related research applications of methods

Citation analysis for adoption and impact of open science
Citation analysis has been used to assess several aspects of the adoption and impact of open science, particularly literature open access. Eysenbach found that authors who chose to make their articles open access in the Proceedings of the National Academy of Sciences received more citations within the first year after publication, and Wren correlated journal impact factor with the adoption rate of author-shared reprints, among many other such studies. Other researchers have used citations to see how scientists use each other’s work and the relative impact of various study designs. Many authors study factors that underlie citation rate; these highlight important factors to include as potential confounders whenever performing a detailed citation analysis of a new variable. Ongoing research attempts to identify the best way to represent various issues such as author ambiguity, author productivity, institutional environment, journal impact factor, and clean and comprehensive citation data. Finally, several researchers have proposed methods for citations of data to make studying the issue of reuse easier in the long run, and examined the extent to which citations are an accurate proxy for peer ratings of quality.

Natural language processing of the biomedical literature
Natural language processing of the biomedical literature is traditionally organized into information retrieval, entity recognition, information extraction, hypothesis generation, and heterogeneous integration. Most work has been on abstracts, because they are free, easy to obtain, and in a standardized format from PubMed. Unfortunately, a great deal of information resides only in article full text. The TREC Genomics 2006/2007 tasks opened up a selection of free text for Information Retrieval research, and the Open Access subset at PubMed Central is another homogeneous, free, easy subset to obtain. Consequently, more research is beginning to focus on full-text. Most research has focused on the needs of biologists or curators, but there are starting to be some investigations into automated techniques to help find articles for review based on the text, identification of the relationships between citing and cited articles, and analysis of the methods section to enumerate the diversity of wet lab method use. Techniques vary depending on the task, but stemming, synonyms, and n-grams are a mainstay. Query expansion to include all query aspects has also been shown to help. The availability of full text articles in PMC, Google Scholar, and other portals is spurring new approaches. Finally, NLP techniques applied to clinical text might be informative. For example, Melton et al. also face the issue of identifying records based on snippets of full text, though in their case it is adverse reactions in clinical discharge summaries.

Regression and factor analysis for deriving and evaluating models of sharing behavior
Most models of sharing behavior are based on established surveys, and thus are evaluated using confirmatory analysis. However, a few research projects instead use linear regression. Siemsen et al. compare a regression model to one derived from constraining factor analysis. Finally, several studies involve exploratory factor analysis.

C. Preliminary Studies
In this section, I describe my preliminary results leading to this proposal. They include pilot work for Aims 2 (Section C.1) and 3 (Sections C.2, C.3, C.6), and a few publications that illustrate future application of the results (Sections C.4, C.5).

C.1 Identifying data sharing in the biomedical literature
In anticipation of Aim 2, I investigated the feasibility of identifying statements of data sharing from full text research articles. A pilot NLP system has been developed and validated for identifying data sharing from statements within article full text. Using regular expression patterns and machine learning algorithms on open access biomedical literature published in 2006, our system identified 61% of articles with shared datasets (recall) with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). These results demonstrate the feasibility of using an NLP approach to automatically identify instances of data sharing from biomedical full text research articles.

I extended this work to investigate the feasibility of retrieval through established query interfaces. In this study, we explored the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central. We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI’s Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central. The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision).

I conclude that such approaches allow identification of articles with shared data sets with promising levels of precision and recall. However, I suspect that for the goal of this proposal, identifying data sharing through database links may be sufficient and preferable. Nonetheless, the experience gained in developing these classifiers will be valuable in developing NLP classifiers to identify dataset creation as part of Aim 2.
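As a rough illustration of the regular-expression component of such a system, the Python sketch below flags text that looks like a statement of dataset deposit and scores the flags against manual labels. The patterns, function names, and scoring helper are illustrative assumptions; they are not the patterns or code used in the pilot system.

```python
import re

# Illustrative patterns only (assumptions, not the pilot system's rules).
# GEO accessions look like GSE1234 or GDS1234; ArrayExpress accessions
# look like E-MEXP-123.
SHARING_PATTERNS = [
    re.compile(r"\bG(?:SE|DS)\d+\b"),
    re.compile(r"\bE-[A-Z]{4}-\d+\b"),
    re.compile(
        r"\b(?:deposited|submitted)\b.{0,80}"
        r"\b(?:GEO|Gene Expression Omnibus|ArrayExpress)\b",
        re.IGNORECASE | re.DOTALL,
    ),
]


def mentions_data_sharing(full_text):
    """Return True if any illustrative pattern matches the text."""
    return any(p.search(full_text) for p in SHARING_PATTERNS)


def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Patterns keyed to accession numbers favor precision; looser cue phrases would trade precision for recall, which mirrors the two classifier variants described above.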

C.2 Preliminary analysis of prevalence and patterns of microarray data sharing
I have conducted preliminary work in assessing the prevalence and patterns of data sharing. This work confirms the feasibility of our approach and suggests some interesting findings. However, as preliminary work it has a major limitation: the article cohort was not filtered to include only articles that create data, and thus the results may be biased by reuse studies. Our current proposal addresses this limitation in Aim 2, and proposes a wider and deeper analysis in Aim 3.

We assessed the prevalence and patterns of dataset sharing, using only links from within the GEO or ArrayExpress databases. Of 2503 articles about gene expression microarrays, we found that 440 (18%) had primary-citation data source links from a major microarray database, suggesting that the authors of these papers shared their microarray data. Interestingly, studies with free full text at PubMed were twice as likely (OR=2.1) to be linked as a data source as those without free full text, as illustrated in Figure 1. Studies with human data were less likely to have a link (OR=0.8) than studies with only non-human data. The proportion of articles identified as a data source has increased over time: the odds of a data-source link were 2.5 times greater for studies published in 2006 than for those published in 2002. As might be expected, studies with the fewest funding sources had the fewest data-sharing links: only 28 (6%) of the 433 studies with no funding source were listed within the databases. In contrast, studies funded by the NIH, the US government, or a non-US government source had data-sharing links in 282 of 1556 cases (18%), while studies funded by two or more of these mechanisms were listed in the databases in 130 of 514 cases (25%).
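The funding comparison can be restated as crude odds ratios. The sketch below is a minimal illustration using only the counts quoted in the text; the function name is hypothetical, and the resulting values are unadjusted, so they are not directly comparable to the odds ratios reported for full text, species, or publication year.

```python
def odds_ratio(linked_a, total_a, linked_b, total_b):
    """Crude odds of a data-source link in group A relative to group B."""
    odds_a = linked_a / (total_a - linked_a)
    odds_b = linked_b / (total_b - linked_b)
    return odds_a / odds_b


# Counts quoted in the text: linked articles and cohort size per funding group.
one_source_vs_none = odds_ratio(282, 1556, 28, 433)
multi_source_vs_none = odds_ratio(130, 514, 28, 433)
```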

Figure 1: Preliminary data sharing patterns. Preliminary Indications of Data-Sharing Patterns with 95% confidence intervals. From .

C.3 A review of journal policies for sharing research data
To confirm the feasibility of extracting journal data sharing policies from “Instruction to Author” statements, I conducted a review of journal policies for sharing microarray data. The results and methods from this study will be directly useful in Aim 3.

We examined the relationship between data sharing behavior and the strictness of a journal’s data sharing policy. As expected, we found that journals with the strongest data sharing policies had the highest proportion of papers with shared datasets. As seen in Figure 2, the journals with no data sharing policy, a weak policy, and a strong policy had a median data sharing prevalence of 8%, 20%, and 25% respectively. However, this study lacked a method of determining which articles were data producing, so these proportions should be interpreted relative to each other rather than to a theoretical maximum of 100%.

Figure 2: Relative data sharing prevalence by journal policy strength. A boxplot of the relative data-sharing prevalence for various journals, grouped by the strength of the journal’s data-sharing policy. For each group, the heavy line indicates the median, the box encompasses the interquartile range (IQR, 25th to 75th percentiles), the whiskers extend to data points within 1.5xIQR from the box, and the notches approximate the 95% confidence interval of the median. From

C.4 Recommendations for best practice initiatives and incentives
In collaboration with others in the Department of Biomedical Informatics and the caBIG DSIC working group, I recently published a paper highlighting ways in which Academic Health Centers can and should refine their initiatives and incentives for data sharing: Piwowar HA, Becich MJ, Bilofsky H, Crowley RS, on behalf of the caBIG Data Sharing and Intellectual Capital Workspace (2008) Towards a Data Sharing Culture: Recommendations for Leadership from Academic Health Centers. PLoS Med 5(9): e183 doi:10.1371/journal.pmed.0050183 This paper demonstrates there is an audience for discussion on data sharing policy; the results of the current proposal will serve to direct and strengthen such perspective pieces in the future.

C.5 Preliminary analysis and vision for the evaluation of data reuse
I have conducted preliminary work in evaluating data reuse. Although studying reuse is outside the scope of the current proposal, the methods and results from this study will inspire and facilitate future work in this innovative and important domain.

D. Research Design and Methods
The central hypothesis of this proposal is: Analysis of the impact, prevalence, and patterns with which investigators share and withhold gene expression microarray research data can uncover rewards, best-practices, and opportunities for increased adoption of data sharing.

To evaluate the central hypothesis, I will perform the following specific aims:
Aim 1: Does sharing have benefit for those who share?
Aim 2: Can sharing and withholding be systematically measured?
Aim 3: How often is data shared? What predicts sharing? How can we model data sharing behavior?

The purpose of this proposal is not to assess all data sharing behavior in biomedical research, but rather to explore three aspects of such an evaluation: (1) measure whether sharing data is associated with a citation benefit within a small cohort of clinical trials, (2) enable larger-scale future analyses by developing a system to automatically identify instances of dataset production and sharing, and (3) analyze the instances of dataset production and sharing for patterns associated with sharing behavior. These results will provide a strong foundation for future data sharing evaluations.

To enhance my chances of success in this proposal, I will limit the scope of research as follows:
• I will consider these questions within the context of gene expression microarray data. Microarray data provides a useful environment for investigation: despite being valuable for reuse, costly to collect, and mature in data standards and repository frameworks, it is often but not yet universally shared.
• Shared data will be defined as datasets that have been submitted to major, centralized databases. This excludes data shared upon request, included as supplementary information, submitted to small databases, or posted to a lab webpage: finding these resources is beyond the scope of the current project.
• Citation count will serve as a proxy for research impact.
• Studies will be limited to those indexed within PubMed, with English full text that can be queried through a centralized portal (see the Data Sets section below for discussion).
• Analysis of data sharing predictors will be limited to variables that can be automatically derived from PubMed, article full text, and other database sources. I hope to consider features that require more manual interpretation and integration in the future.

The methods I will use to complete Aim 1 are fairly simple, and described in Section D1. Figure 3 illustrates the method I will use for Aims 2 and 3; details follow in sections D0, D2 and D3.

Figure 3: Method overview for Aims 2 and 3

D.0 Data Sets
This section describes the datasets I will assemble and use in the course of this project. The use of these datasets will be explained in more detail in sections D1-D3.

Aim 1
The dataset used for Aim 1 consists of all cancer gene expression microarray articles identified in the 2003 systematic review by Ntzani and Ioannidis. Data sharing status was found through manual investigation of the research articles, predominant gene expression databases, and Google.

Aim 2 and 3
I would ideally measure prevalence and patterns within a comprehensive set of articles that generated microarray data, manually annotated with data sharing status. Unfortunately, the method used for Aim 1 is not feasible here: no systematic review covers all published microarray articles, and the manual approach for identifying data sharing that was used in Aim 1 is too time consuming. Instead, I propose to develop, evaluate, and use automated methods to create a large annotated corpus on data sharing behavior.

Reference standard for annotations
Fortuitously, a recently published letter to the editor provides a useful independent reference standard of annotations on microarray dataset creation and sharing. The authors, Ochsner et al., manually reviewed all eligible articles published in 20 journals in 2007, and annotated each article for whether it produced original gene expression microarray data, and whether there was evidence that its authors shared this data in a database or on a website. Ochsner et al. found almost 400 eligible studies, of which almost 200 had evidence of shared microarray data. Ochsner et al. made their own review dataset available: their initial query, plus the PubMed IDs for the 400 articles that they considered to have generated microarray data, and links to all identified microarray datasets. I propose to use this dataset as a reference standard for evaluating the performance of automated annotation. Specifically, I propose to assemble a large set of gene expression microarray articles by querying the full text of research articles for indications that the study produced gene expression microarray data, and verify the precision and recall of this automatic identification using the Ochsner annotations. I will then use a combination of database links and full-text queries to automatically identify the data sharing status of each article, and again use the Ochsner study to ensure that the automatic identification of data sharing status is of sufficient accuracy.

Full text
Access to the full text of a research article is needed for both of our annotations: whether a study has performed a particular wet lab experiment, and whether the authors declare that they shared their research data. Queries of abstracts or MeSH terms, for example, have inadequate recall, retrieving only about 30% and 60%, respectively, of all articles known to have gene expression data deposited in GEO (Table 1).
In contrast, a full text filter retrieves 96% of all articles known to have data deposited in GEO. Admittedly, the precision is likely extremely low for the simple query presented in Table 1, but a more refined query should hopefully be able to maintain relatively high recall while improving precision.

Table 1: Poor recall of abstract and MeSH filters for identifying papers with shared microarray data

Literature subset: PMC articles published in 2007 linked from GEO datasets
PubMed Central query: pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 550
Recall: reference

Literature subset: a filter of abstracts and titles
PubMed Central query: (gene[Title] OR gene[Abstract]) AND (expression[title] OR expression[abstract]) AND ((microarrays[title] OR microarrays[abstract]) OR (microarray[title] OR microarray[abstract])) AND pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 175
Recall: 175/550 = 32%

Literature subset: a MeSH filter
PubMed Central query: ("microarray analysis"[mesh] OR "gene expression profiling"[mesh]) AND pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 335
Recall: 335/550 = 61%

Literature subset: a full-text filter
PubMed Central query: gene[text] AND expression[text] AND (microarrays[text] OR microarray[text]) AND pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 526
Recall: 526/550 = 96%
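The recall column of Table 1 follows directly from the article counts, because every filter query is restricted to the GEO-linked reference set. A minimal sketch of that arithmetic, with hypothetical variable and function names:

```python
# Reference set: PMC articles published in 2007 linked from GEO datasets.
REFERENCE_COUNT = 550

# Articles from the reference set that each filter retrieves (Table 1).
retrieved = {
    "abstract and title filter": 175,
    "MeSH filter": 335,
    "full-text filter": 526,
}


def recall_percent(n_retrieved, n_reference=REFERENCE_COUNT):
    """Recall as a whole-number percentage of the reference set."""
    return round(100 * n_retrieved / n_reference)


recalls = {name: recall_percent(n) for name, n in retrieved.items()}
```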

Full text retrieval access through portals
Two options exist for querying full text: I either need to download an extensive collection of full text articles and execute my queries locally, or I can issue online queries through one of several “full text query portal” environments. Unfortunately, downloading computable full text articles is complex, time consuming, and difficult to maintain due to licensing restrictions and a lack of access standards. In contrast, issuing a full text query online is simple and scalable, and the growth of Highwire Press and the flood of NIH research into PubMed Central suggest that online query is a promising approach for the future. Consequently, I propose to query full text through online portals. This choice will have several impacts on query development and results.

First, it will limit recall to articles that are indexed in the full text query portals. I will combine the query results from PubMed Central, Highwire Press, and Scirus: they have adequate coverage, clean data, an interface that allows sufficient filtering, and an output format that can be aggregated with reasonable manual effort. An analysis of relevant articles in PubMed and datasets in the Gene Expression Omnibus suggests that these portals have access to more than 85% of the articles licensed to the University of Pittsburgh library (see Table 5 in the Appendix).

Second, online portals offer a limited interface for queries. Only traditional Boolean queries are accepted, most portals have a fixed stop-word list, and n-gram “phrase” support appears to be weak in some engines (probably due to lack of indexing of stop words). This will exclude several common NLP techniques, such as term weighting and part of speech tagging. That said, accessing full text through these interfaces is an underserved research area; our experiences will provide a valuable contribution.

Full text computational access for development
I wish to use statistical and lexical NLP techniques to derive queries that have high recall and precision. The statistical and lexical NLP analysis requires computational access to articles; access through full text portals is not sufficient for development analysis. I propose to use the following two bundles of full text for development:
• TREC Gen: The Highwire Press TREC Genomics 2006 cohort includes 162259 full text articles, of which at least 4168 include the words gene, expression, and microarray(s).
• Gold OA: The PMC Open Access subset includes 8816 articles published between 2000 and 2006, of which 3997 include the words gene, expression, and microarray(s) (there is a small amount of overlap between the two sets).

Proposed Corpora
This approach will require several corpora for query development and evaluation, as well as the final patterns study. The various corpora have differing requirements (in terms of annotation, size, and scope). I list the needed corpora in Table 2 with their requirements, then discuss related issues and decisions.

Table 2: Proposed Corpora

Aim: 2a
Nickname (legend for Figure 4): DCQ Dev
Name: Corpus for developing the microarray Dataset-Creation Query
Requirements:
• must contain a relatively large number of articles across a wide spectrum of journals
• must consist of articles that can be downloaded in machine-computable text for natural language processing (NLP) analysis
Corpora and annotations: Downloaded full text from PMC OA + TREC IR corpora; no annotations on most of it, 100 articles in the Ochsner et al review with data creation annotation

Aim: 2a
Nickname (legend for Figure 4): DCQ Eval
Name: Corpus for evaluating microarray Dataset-Creation Query Evaluation
Requirements:
• must not overlap the DCQ Dev Corpus
• must be >= 300 articles (100 to be used for evaluation during development, 200 to be used for validation of the final query)
• must be annotated with whether or not the studies produce their own gene expression microarray data
• must have full text available for query
Corpora and annotations: articles that can be retrieved by Ochsner et al (20 journals + PubMed abstract query for “microarray/s OR genomewide OR expression profile/s OR transcription profile/profiling”); annotated for data creation via inclusion within the list of articles by Ochsner et al

Aim: 2b
Nickname (legend for Figure 4): DSQ Dev
Name: Corpus for developing the microarray Dataset Sharing Query
Requirements:
• must be >= 100 articles
• must be annotated with whether or not the studies share their gene expression microarray data
• must have full text available for query
Corpora and annotations: 100 articles in the Ochsner et al review; data sharing annotation

Aim: 2b
Nickname (legend for Figure 4): DSQ Eval
Name: Corpus for evaluating the microarray Dataset Sharing Query Evaluation
Requirements:
• must not overlap the DSQ Dev Corpus
• must be >= 200 articles
• must be annotated with whether or not the studies share their gene expression microarray data
• must have full text available for query
• must contain as broad a spectrum of articles as possible
Corpora and annotations: 300 articles in the Ochsner et al review; data sharing annotation

Aim: 3
Nickname (legend for Figure 4): Patterns
Name: Corpus for estimating Data Sharing Prevalence, Patterns, and Modeling
Corpora and annotations: articles retrieved from full-text portals using the data-creation query that ARE retrieved by the data sharing query vs. those that are NOT retrieved by the data sharing query

The relationship between all published articles, those I will use for our study of prevalence and model building and those I will use for query development is shown in Figure 4. Estimates for the relative sizes of the subsets are given in Table 5 in the Appendix.

Figure 4: Relationship between all PubMed articles and those included in study. Legend for proposed corpora is given in Table 2.

D.1 Aim 1 – Does sharing have benefit for those who share?
Goal: Measure the association between an article’s citation rate and whether its authors made their gene expression datasets publicly available.

Importance: While the general research community benefits from shared data, much of the burden for sharing the data falls to the study investigator. Demonstrating a boost in citation rate would be a potentially important motivator for publication authors. To my knowledge, this is the first study to investigate a relationship between citation rate and biomedical data availability. This work also serves as preliminary work for measuring sharing prevalence and patterns.

Dataset and Methods: We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data.

Findings: The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Using linear regression, publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin.

Limitations: An important limitation is that associations do not imply causation. The research here will not be sufficient to conclude that data sharing causes increased citations. It is possible that both factors stem from a common cause, such as a high level of research funding. The study is also performed on a small, relatively homogeneous set of studies.

Status: In fulfillment of my Master's thesis requirement, I have completed and published a study addressing Aim 1: Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308 The complete paper will be included in the final dissertation.

Summary
Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available. We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. As seen in Table 3, trials published in high impact journals, prior to 2001, or with US authors were more likely to share their data.

Table 3: Characteristics of eligible trials by data sharing. Reproduced from .

The 48% of trials that shared their data received a total of 5334 citations (85% of aggregate), distributed as shown in Figure 5.

Figure 5: Distribution of 2004-2005 citation counts of 85 trials by data availability. Reproduced from .

Whether a trial's dataset was made publicly available was significantly associated with the log of its 2004–2005 citation rate (69% increase in citation count; 95% confidence interval: 18 to 143%, p = 0.006), independent of journal impact factor, date of publication, and US authorship. Detailed results of this multivariate linear regression are given in Table 4. This result held even for lower-profile publications and thus is relevant to authors of all trials.

Table 4: Multivariate regression on citation count for 85 publications. Reproduced from .
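Because the regression above models the log of citation count, a coefficient on the data-sharing indicator translates into a percent change in citations. The sketch below shows that conversion under the assumption that the natural logarithm was used (the text does not state the base); the function names are hypothetical, and the coefficients are back-calculated from the reported percentages rather than taken from the regression output.

```python
import math


def percent_change(beta):
    """Percent change in citations implied by a coefficient on ln(citations)."""
    return (math.exp(beta) - 1.0) * 100.0


def coefficient_for(percent):
    """Coefficient on ln(citations) that corresponds to a given percent change."""
    return math.log(1.0 + percent / 100.0)


# Back-calculated from the reported 69% increase and its 95% interval (18% to 143%).
beta = coefficient_for(69)
beta_low, beta_high = coefficient_for(18), coefficient_for(143)
```

On the log scale the interval endpoints are roughly symmetric around the point estimate, which is why the interval looks lopsided once converted back to percentages.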

Research consumes considerable resources from the public trust. As data sharing gets easier and benefits are demonstrated for the individual investigator, hopefully authors will become more apt to share their study data and thus maximize its usefulness to society.

D.2 Aim 2 – Can sharing and withholding be systematically measured?
Working Goal: Develop and evaluate methods for identifying biomedical research data sharing and withholding. This will involve two sub-aims, discussed separately below:
Aim 2a: identify studies that create data, and
Aim 2b: identify the subset of these studies that share their data.
For the purposes of Aim 3, articles identified as creating data (under Aim 2a) but not sharing data (under Aim 2b) will be considered to be withholding data.

Aim 2a – Identify studies that create data

Background
Although MeSH terms (“gene expression profiling” OR “oligonucleotide array sequence analysis”) provide a useful filter for identifying articles about gene expression microarray data, they are not specific enough to find only those that generate gene expression microarray data. This is true for two reasons. First, these MeSH terms are sometimes used to annotate papers about other data types, such as RT-PCR or SAGE. My preliminary attempts to develop a refined MeSH query that excludes these other datatypes have not been successful. Second, even if the studies are about gene expression microarray data, annotation with these MeSH terms does not imply that the study generated its own microarray data. The study could, for example, be reusing the shared data of other researchers. A reuse study would not be in a position to share the raw microarray data, since they are not the primary authors. I want to exclude such cases from our “dataset creation” corpus to avoid considering it a case of data withholding. The information to determine whether a study created microarray data is often only found in article full text (usually in the methods section, but also elsewhere).

Proposed Method
I propose to use statistical and lexical NLP approaches to design a query that can run against full text articles via a portal, retrieving articles that have run gene expression microarray experiments. The method is illustrated in Figure 3.

NLP Approaches
I plan to use NLP techniques such as those explored in the preliminary work described in Section C.1 to create a classifier that identifies articles that have created gene expression microarray datasets. I anticipate that “wet lab” words such as isolate, hybridize, and probe will be relevant. Statements of data generation, such as "we generated gene expression", may also have sufficient precision, though they may not be practical due to stop word exclusions within the portals. In general, I plan to investigate the development set for statistically predictive n-grams as well as develop manual rules. If this isn’t sufficient, I can explore some approaches that require more manual intervention. For example, a multi-step query approach, where the portals are queried multiple times, each time with a different string, would allow feature weighting. I could involve MeSH terms as features if necessary. I could also experiment with additional NLP techniques such as semi-supervised training, bootstrapping cue phrases, patterns, and regular expressions.

Reference Standard
As discussed in Section D.0, I will use 300 random articles from the Ochsner et al review as a reference standard for the performance of the query.

Query Evaluation
Recall and precision will be calculated for the query responses given the reference standard. The contribution of this filter will be assessed by comparing its performance to a baseline filter in which one of the following words occurs in the article’s full text: isolate*, hybridiz*, or probe*. Since there are no established performance requirements, I will consider performance adequate if precision is above 70% and recall is high enough that use of the filter in Aim 3 will result in sufficient datapoints to power the subsequent analysis. I estimate having about 30 variables in my regression (see Aim 3). Opinions differ on how many datapoints are needed to adequately power a regression analysis. A rule of thumb is that detecting medium effect sizes requires about 8 datapoints per variable plus 50, which suggests that about 300 articles are necessary to estimate coefficients for 30 variables. Some statisticians suggest that “the cases-to-Independent Variables (IVs) ratio should ideally be 20:1” or even 30:1 to limit bias. These requirements would suggest a need for between 600 and 900 data points. Alternatively, a regression power analysis (power = 80%, alpha = 0.05) suggests that about 1250 datapoints are required to detect a small effect size across 30 variables. I would like to take the conservative case and ensure that the NLP query retrieves at least 1250 articles that create microarray datasets.

Risks and Contingency Plans
The largest technical risk in the research plan is that it may be unexpectedly difficult to automatically identify dataset production with acceptable precision and recall. In this case, I plan to supplement the automated classification with manual curation, possibly resulting in a smaller cohort of articles for analysis.

Limitations and Assumptions
This approach assumes that data sharing is accompanied by mention in the text. Our preliminary work suggests this is usually true, but it may miss a small-but-important subset of circumstances where data is shared after publication.

Aim 2b – Identify studies that share their data

Background
The Gene Expression Omnibus (GEO) has emerged as the dominant centralized repository for sharing gene expression microarray data, with many journal policies requiring submission to it specifically. It is well integrated with PubMed query results and contains links from submitted datasets to primary citation reports.
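The sample-size rules of thumb cited under Query Evaluation for Aim 2a can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 1250-datapoint figure from the power analysis is not reproduced because the text does not give the effect-size value behind it.

```python
N_PREDICTORS = 30  # approximate number of regression variables planned for Aim 3


def rule_of_thumb_n(n_predictors, per_predictor=8, constant=50):
    """Datapoints suggested by the '8 per variable + 50' rule of thumb."""
    return per_predictor * n_predictors + constant


def ratio_n(n_predictors, cases_per_predictor):
    """Datapoints suggested by a cases-to-independent-variables ratio."""
    return n_predictors * cases_per_predictor


medium_effect_n = rule_of_thumb_n(N_PREDICTORS)  # basis for the "about 300" figure
ratio_20_n = ratio_n(N_PREDICTORS, 20)           # lower bound of the 600 to 900 range
ratio_30_n = ratio_n(N_PREDICTORS, 30)           # upper bound of the 600 to 900 range
```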

Our preliminary work, reported in Section C.1, suggests that database submission links have high recall for retrieving articles with data shared in centralized databases. Database submission links have the added benefits of almost-perfect precision, a wide scope without the need for access to full text, and no bias introduced through community norms in lexical statements of data sharing within full text. Relationships between articles with shared data and those that I will consider to have shared and withheld data for the purposes of this study are illustrated in Figure 6.

Figure 6: Data sharing classifications used in this study

Proposed Method
I propose to identify shared data using citation links from GEO, as identified by the PubMed filter “pubmed_gds[filter]”.

Reference Standard and Query Evaluation
It is important to evaluate the recall of GEO links to ensure it is sufficiently high that my analysis for Aim 3 is not unacceptably biased due to overlooking valid sharing mechanisms. A response rate of 70% is often considered sufficient to limit bias in surveys, so I will adopt the same acceptability criterion. My proposed method for estimating recall is summarized in Figure 3; using 300 random articles from the Ochsner et al review as a reference standard, I will calculate:
Recall = (number of articles identified by Ochsner et al as having shared data that are also linked from GEO) / (total number of articles identified by Ochsner et al as having shared data)

Risks and Contingency Plans
If the query evaluation suggests that GEO links provide a recall less than 70%, I will supplement the identification of data sharing by using article submission links from ArrayExpress and the Stanford Microarray Database. If recall is still less than 70%, I will develop and apply NLP filters such as the one described in Section C.1 to sacrifice some precision for recall.

Limitations and Assumptions
This approach includes only sharing to the predominant centralized database, and excludes sharing to GEO for which there is no citation link within the submission entry. Although this is unfortunate and will lead to underestimating the prevalence of data sharing, I don’t expect it to bias our estimates of data sharing patterns. Another limitation is that the gold standard covers only articles that refer to data sharing. It is possible that datasets may be shared without mention in the article, though my preliminary data suggests this is rare.
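The recall calculation and the 70% acceptability criterion for GEO links can be sketched as set operations over PubMed IDs. This is a minimal illustration; the function names and the example identifiers in the usage note are hypothetical and do not come from the Ochsner et al dataset.

```python
def geo_link_recall(reference_shared_pmids, geo_linked_pmids):
    """Fraction of reference-standard sharing articles that GEO links to."""
    shared = set(reference_shared_pmids)
    if not shared:
        raise ValueError("reference standard contains no sharing articles")
    return len(shared & set(geo_linked_pmids)) / len(shared)


def meets_criterion(recall, threshold=0.70):
    """Apply the acceptability criterion adopted from survey response rates."""
    return recall >= threshold
```

For example, with made-up identifiers, geo_link_recall([101, 102, 103, 104], [102, 103, 104, 999]) gives 0.75, which would meet the criterion; a value below 0.70 would trigger the contingency plan of adding ArrayExpress and Stanford Microarray Database links.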

D.3 Aim 3 – How often is data shared? What predicts sharing? How can we model sharing behavior?
Working Goal: Measure current data sharing and withholding behavior, and associate these sharing decisions with features that may predict or influence an investigator’s choice. This will be done through three sub-aims:
Aim 3a: estimate prevalence of data sharing
Aim 3b: assess individual contributions
Aim 3c: investigate multidimensional factors

Importance: Understanding the prevalence and patterns with which datasets are shared is a key step in evaluating and refining policies that encourage data sharing. To our knowledge, this will be the first extensive evaluation of observed data sharing behavior in the biomedical literature.

Dataset: As discussed in D.0, the dataset will comprise all articles reachable by full text query within PubMed Central, Highwire Press, and Scirus (NPG + Elsevier) using the query developed in Aim 2a. Articles cited from within GEO, plus those found by any supplemental methods added in Aim 2b, will be considered to have shared data; the rest of the articles will be considered to have withheld data.

Aim 3a – Estimating prevalence

Background
Although preliminary work and a recent survey have quantified the prevalence of gene expression data sharing, this has yet to be done on a large scale, across a wide range of years and journals.
Method
I will calculate the prevalence of GEO link data sharing within our full sample as described in Section D.0, then adjust this raw estimate using the relevant precision and recall values to account for over- and under-estimates in retrieval numbers due to query imprecision:
• Raw number of data sharing articles = Number of articles identified by Data Sharing Query on Patterns Cohort
• Precision-adjusted number of data sharing articles = Raw number of data sharing articles * Precision of Data Sharing Query
• Fully-adjusted number of data sharing articles = Precision-adjusted number of data sharing articles / Recall of Data Sharing Query
• Raw number of data creation articles = Number of articles identified by Data Creation Query on Patterns Cohort
• Precision-adjusted number of data creation articles = Raw number of data creation articles * Precision of Data Creation Query
• Fully-adjusted number of data creation articles = Precision-adjusted number of data creation articles / Recall of Data Creation Query

Raw prevalence = Number of articles identified by Data Sharing Query on Patterns Cohort / Number of articles identified by Data Creation Query on Patterns Cohort
Adjusted prevalence = Fully-adjusted number of data sharing articles / Fully-adjusted number of data creation articles

I will compare this estimate to the recent sample by Ochsner et al. However, because the two samples were selected with different criteria, I do not necessarily expect the prevalence rates to be identical.
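To make the adjustment arithmetic concrete, the following Python sketch applies the precision and recall corrections to both counts and then forms the two prevalence ratios. All numbers are hypothetical placeholders, not results, and the function names are mine rather than part of any existing tool.

```python
def adjust_count(raw_count: float, precision: float, recall: float) -> float:
    """Scale a raw retrieval count by precision, then divide by recall."""
    if not (0.0 < precision <= 1.0 and 0.0 < recall <= 1.0):
        raise ValueError("precision and recall must lie in (0, 1]")
    precision_adjusted = raw_count * precision  # remove expected false positives
    return precision_adjusted / recall          # add back expected misses


def prevalence(sharing_count: float, creation_count: float) -> float:
    """Share of data-creating articles that also shared their data."""
    if creation_count <= 0:
        raise ValueError("no data-creation articles to divide by")
    return sharing_count / creation_count


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
raw_sharing, sharing_precision, sharing_recall = 3000, 0.95, 0.75
raw_creation, creation_precision, creation_recall = 12000, 0.90, 0.80

raw_prevalence = prevalence(raw_sharing, raw_creation)
adjusted_prevalence = prevalence(
    adjust_count(raw_sharing, sharing_precision, sharing_recall),
    adjust_count(raw_creation, creation_precision, creation_recall),
)

print(f"raw prevalence      = {raw_prevalence:.3f}")
print(f"adjusted prevalence = {adjusted_prevalence:.3f}")
```

With these placeholder values the adjustment raises the estimate, because the sharing query is assumed to miss proportionally more articles than the creation query does.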

Aim 3b – Assessing individual contributions

Proposed features
I selected a set of features to collect and analyze, chosen based on how directly each serves as a proxy, how completely it is available, and how easily it can be collected within the scope of this project. I hypothesize that the following variables will be associated with an increased prevalence of data sharing:

Author characteristics (for both the first and last authors, separately)
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
number of prior gene expression publications | more gene expression publications | PubMed with gene expression MeSH filter | author name*, inexact filter
number of prior publications | more prior publications | PubMed | author name*
career citations in PMC | more citations | PubMed with LinkOut to PMC citations | author name*, limited to PMC citations
years since first publication | more years since first publication | PubMed | author name*
published in open access journals before? | the author has published a paper in a gold OA journal | PubMed Central open access filter | author name*
previously reused gene expression datasets from GEO | the author has published a paper in the GEO data reuse catalog | GEO data reuse catalog | author name*, low recall because set is very incomplete
published papers with shared microarray data before | the author has published papers with shared data before | PubMed with GEO datasets filter | misses datasets without GEO links
personally shared data before | the author has shared data before | GEO database submitter list | author name*, misses datasets without GEO links
NIH PI | author has a current NIH grant | NIH CRISP download | author name
gender | the author is female | given name gender database | low recall for non-Western names

*The author name issue involves missing data because the same author can have different names (or different representations via initials), and different authors can have the same name. I suspect this will occur often; however, I believe it will not affect the results, since I have no reason to believe it would occur at a different rate between articles with shared data and those without.

Study characteristics
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
organism under study | non-human research | MeSH terms |
disease under study | non-cancer research | MeSH terms |

Will also include human*cancer interaction variable.

Environmental characteristics
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
year of publication | most recent | MEDLINE |
number of coauthors | more co-authors | MEDLINE |
author country of residence | US address | MEDLINE | only captures corresponding author country
number of funding sources | more funding sources | MeSH terms for funding sources | limited to those mentioned within PubMed
journal prestige | higher impact factor | ISI Journal Citation Reports for 2007 | impact factor changes with time, some journals not indexed
open-access journal | gold open-access | PMC OA list | does not include author-archiving (preprint archiving, or green open access)
research-orientation of university | top 25 NIH-funded university | Carnegie classification reports (http://www.carnegiefoundation.org/classifications), Hendrix medical school data | limited to US
number of datasets submitted by this institution | more datasets | GEO submitter list, organization | limited to GEO
university vs. other type of institution | university | AUTM tech transfer report | limited to US
relative amount of tech transfer from this institution | less tech transfer | AUTM tech transfer report | limited to US

Policy characteristics
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
journal data sharing policy | policies that require a database accession number | Categorization | not all journals listed
funder data sharing policy | NIH grants after 2004 | MeSH terms | limited to NIH within the US

To help identify confounders and spurious results related to my measurement methods, I will include several additional variables to identify which query engine (PubMed Central, Highwire Press, and/or Scirus) was used to find each paper, and variables to flag when data for a particular variable was outside the scope of the available data (for example, articles with a non-US address are not within the scope of my institution ranking data source).

Features outside current scope
It would be valuable to measure the impact of additional variables. Some are listed below. Unfortunately, these are difficult to extract systematically and are therefore likely outside the scope of the current study.

Characteristics of all/any of the authors:
• age
• training location
• institution location
• trained in informatics, medicine, and/or biology
• received training on data sharing
• have previous positive or negative experiences with data sharing
• involved in commercial activities
• have any patents
• have a particularly high/low workload
• know how to share data
• have plans to use the data again
• have plans for IP or commercial spinoffs
• degree of social pressure for commercial activities vs. openness at their institution
• believe that data sharing benefits others
• believe that data sharing benefits themselves

Characteristics of the environment and study:
• reuses data
• funded by industry
• relative funding level
• data sharing required by other funders
• data sharing plan included in study proposal
• data sharing plan funded specifically
• article has been self-archived (green open access)
• attributes of the dataset (trial size, Affymetrix platform, …)

Statistical Analysis
I will compute the univariate odds for each of the features to assess the degree to which they are associated with sharing datasets that have been produced. Following the methods of Eysenbach, I will use the nonparametric Wilcoxon Mann-Whitney test for continuous variables that are not normally distributed, Fisher’s exact test to compare proportions for dichotomous variables, and the Freeman-Halton test for variables with more than two levels. Finally, I will use multivariate logistic regression to compute the independent association of each variable with the probability of sharing. I will report the coefficients and 95% confidence intervals.

Risks and Contingency Plans
Some of the variables may prove more difficult to extract than expected. In this case, if the particular variable is not essential, I will defer its analysis to future work.

Limitations and Assumptions
An important limitation of this proposal is that associations do not imply causation. The research here will not be sufficient to conclude, for example, that a policy change associated with increased data sharing will in fact cause increased sharing; both factors could stem from a common cause. This study has several additional limitations. Although restricting the study to microarray data allows an in-depth analysis of specific facets of data sharing, future work should apply the methodology and lessons learned to other datatypes to quantify generalizability. The study is limited by the accuracy with which I can identify dataset creation and data sharing. The study is limited to published articles with queriable full text, and thus will omit some older articles and those published in more obscure journals. I am not considering datasets as shared if they are available upon request or published online in a venue other than a major database, and may thereby discount an important and effective sharing mechanism.
I will be unable to identify authors unambiguously, and thus my estimates of previous publishing, data sharing behavior, and grant information will contain errors. This analysis assumes that the first and last authors are the main decision makers about whether or not to share datasets, which may not be true. Finally, many variables are US-centric, which erodes our ability to understand the influence of institutional factors or funding levels, for example, in the rest of the world (about 46% of microarray papers have author addresses within the US).

Aim 3c – Factor model of data sharing behavior

Background and Method
As the last component of this project, I intend to use the data collected in Aim 3b to derive a model of data sharing behavior using exploratory factor analysis. Many models have been explored for data sharing attitudes and intentions (highlighted in Section B.4), but to my knowledge this will be the first model to explain demonstrated data sharing actions from observed variables. As such, an exploratory factor analysis is appropriate. I plan to use standard techniques to assess how many factors are appropriate, based on the data analysis, and to attempt a general interpretation of the resulting factors (e.g. Kolekofsky et al’s exploratory factor analysis of information sharing within an organization).

Expected Results
I expect that the resulting model will provide some insight into data sharing behavior and a useful springboard for confirmatory analysis in a different domain, with different proxy variables, or with survey opinion data.

Risks and Contingency Plans
The analysis may fail to find a robust or interpretable model, in which case I will attempt principal component analysis and/or other exploratory clustering techniques to form homogeneous composite variables, similar to a previously published approach.

Limitations and Assumptions
This approach assumes that my list of observed variables includes proxies for many of the actual decision-making influences. It will be important to emphasize that the model does not imply causality, but only association.
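As an illustration of the univariate step in the Aim 3b Statistical Analysis, the Python sketch below computes an odds ratio and a 95% confidence interval for one dichotomous feature from a 2x2 table of counts. It uses the standard Wald interval on the log odds ratio, which is a simpler stand-in and not the Fisher’s exact test named in the analysis plan. The counts and the function name are hypothetical.

```python
import math


def odds_ratio_with_ci(shared_with: int, withheld_with: int,
                       shared_without: int, withheld_without: int,
                       z: float = 1.96) -> tuple[float, float, float]:
    """Odds ratio for sharing given a feature, with a Wald confidence interval.

    z = 1.96 corresponds to a 95% interval. All four cells must be positive;
    sparse tables would call for an exact method instead.
    """
    cells = (shared_with, withheld_with, shared_without, withheld_without)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    a, b, c, d = cells
    estimate = (a * d) / (b * c)
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(odds ratio)
    lower = math.exp(math.log(estimate) - z * std_err)
    upper = math.exp(math.log(estimate) + z * std_err)
    return estimate, lower, upper


# Hypothetical counts: articles with and without some feature,
# split by whether their data was shared.
estimate, lower, upper = odds_ratio_with_ci(
    shared_with=60, withheld_with=40,
    shared_without=30, withheld_without=70,
)
print(f"odds ratio = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 would suggest an association between the feature and sharing, subject to the caveat above that association does not imply causation.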

D.4 Future Directions
This work will provide valuable experience, tools, and results for future explorations. Possible directions include:
• Confirming validity of the data model with an independent data set
• Supplementing observed characteristics with opinions and attitudes gathered through an opinion survey of authors
• Studying the sharing of additional datatypes
• Identifying datasets shared outside of GEO databases
• Identifying and analyzing data reuse
• Implementing, evaluating, and analyzing a Data Reuse Registry
• Social network analysis of data sharing and reuse behavior
• Identifying attributes associated with which author decides whether study data will be shared, and which author does the submitting work
• Articulating and advocating best-practices, to hopefully reduce the “activation energy” of data sharing

D.5 Time Table

Bibliography and References Cited

Appendices

Table 5: Recall of articles through full-text query interfaces
Sample | Query | Number of articles in text match of abstracts (full text is expected to find many more): gene[text] AND expression[text] AND (microarray[text] OR microarrays[text]) | Number of articles linked from GEO: AND pubmed_gds[filter] | GEO as percent of text matches
Number of PubMed articles | "2000"[PDAT] : "2007"[PDAT] | 20880 | 4287 | 21%
Number of PubMed articles with links to full text | "2000"[PDAT] : "2007"[PDAT] AND "loattrfull text"[sb] | 19582 | 4221 | 22%
Number of PubMed articles with links to full text from Pitt | "2000"[PDAT] : "2007"[PDAT] AND "loprovupittlib"[Filter] | 16323 | 3776 | 23%
a) that are housed at PMC | "2000"[PDAT] : "2007"[PDAT] AND pubmed_pmc[filter] | 4738 | 969 | 20%
b) that are housed at Highwire | "2000"[PDAT] : "2007"[PDAT] AND loprovhighwire[filter] | 7239 | 2213 | 31%
c) that are housed at Elsevier Science or Nature Publishing Group (probably reachable via Scirus) | "2000"[PDAT] : "2007"[PDAT] AND ("loftextnpg"[Filter] OR "loftextes"[Filter]) | 4465 | 1604 | 36%
Any of a+b+c minus overlaps | "2000"[PDAT] : "2007"[PDAT] AND (pubmed_pmc[filter] OR loprovhighwire[filter] OR "loftextnpg"[Filter] OR "loftextes"[Filter]) | 13844 | 3719 | 27%
Reach as a percent of those available via Pitt library | | 13844/16323 = 85% | 3719/3776 = 98% |
Reach as a percent of all PubMed articles | | 13844/20880 = 66% | 3719/4287 = 87% |

Table 6: Estimated maximum number of articles that will be found by NLP data creation filter
Query engine | Query | Number of articles returned for 2000-2007
a) Highwire | All: gene expression (microarray OR microarrays) Citation Year: 2007 Highwire-hosted | 34373
b) Scirus ScienceDirect and Nature Publishing Group | gene expression (microarray OR microarrays) Infotype: Articles Source: SciDirect and NPG For 2007 | 20667
c) PubMed Central | gene[text] AND expression[text] AND (microarrays[text] OR microarray[text]) AND ("2007"[EDate] : "2007"[EDate]) | 14888
… PubMed Central not in Highwire or Elsevier or NPG | PubMed Links for PMC (Search gene[text] AND expression[text] AND (microarrays[text] OR microarray[text]) AND ("2007"[EDate] : "2007"[EDate])) NOT ("loprovhighwire"[Filter] OR "loftextnpg"[Filter] OR "loftextes"[Filter]) | (3866 of 10000) * 14888 = 5756
Totals | | 60796

Table 7: Number of articles available through full-text query interfaces across various portals
Percentages in the second and third query columns give each count as a share of the count in the column to its left.
Literature Database | "gene expression" microarray | "gene expression" microarray hybridiz* | "gene expression" microarray hybridiz* accession | + links to GEO
PubMed (title and abstract) | 2265 | 213 (9.40%) | 2 (0.94%) | 0
Google Scholar (using “hybridized”) | 21100 | 4620 (21.90%) | 1870 (40.48%) |
PubMed Central | 3148 | 1851 (58.80%) | 839 (45.33%) | 311 (58.8%)
PubMed Central Open Access | 2063 | 1203 (58.31%) | 542 (45.05%) | 188 (58.3%)
HighWire Press hosted | 7543 | 3601 (47.74%) | 1450 (40.27%) | 61 of 115 (subset of 1450) (30 of 61 in PMC), 53%
HighWire Press subset | 2028 | 1048 (51.68%) | 436 (41.60%) |
Scirus articles | 5153 | 2437 (47.29%) | 916, includes 191 Science Direct (37.59%) |

Table 8: Full text accessibility across journals
Full text availability of journals that publish gene expression microarray data frequently (rank 1-10) and less frequently (rank 41-50). The additional column indicates whether the articles will be included in the data collection set (requires full text queries through PMC, Highwire, or Scirus NPG+Elsevier).
rank | Journal | Full text queriable through centralized portal | Included in data collection
1 | Bioinformatics | Highwire | yes
2 | BMC Bioinformatics | PMC | yes
3 | Cancer Research | Highwire | yes
4 | BMC Genomics | PMC | yes
5 | PNAS | PMC | yes
6 | J Biol Chem | Highwire | yes
7 | Nucleic Acids Research | PMC | yes
8 | Oncogene | Scirus: NPG | yes
9 | Clinical Cancer Research | Highwire | yes
10 | Physiol Genomics | Highwire | yes
41 | Faseb J | Highwire | yes
42 | Gene | Scirus: Elsevier | yes
43 | Carcinogenesis | Highwire | yes
44 | Brit J Cancer | Scirus: NPG | yes
45 | J Neurochem | none (Wiley) | no
46 | Dev Biol | Scirus: Elsevier | yes
47 | Nat Genetics | Scirus: NPG | yes
48 | Plant Cell | Highwire | yes
49 | Mol Cancer Ther | Highwire | yes
50 | J Neurosci | Highwire | yes

Table 9: Availability of full text for journals used in Ochsner et al.
Journal | Full text queriable through centralized portal
Proc Natl Acad Sci U S A | PMC
Cancer Res | Highwire
J Biol Chem | Highwire
J Immunol | Highwire
Blood | Highwire
Mol Cell Biol | Highwire
Endocrinology | Highwire
Am J Pathol | PMC
Mol Endocrinol | Highwire
FASEB J | Highwire
J Endocrinol | Highwire
Mol Cell | Scirus: Elsevier
Nat Methods | Scirus: NPG
Nature | Scirus: NPG
EMBO J | PMC
Nat Genet | Scirus: NPG
Nat Med | Scirus: NPG
Nat Cell Biol | Scirus: NPG
Cell | Scirus: Elsevier
Science |

Table 10: Data sharing breakdown from Ochsner et al.
location | number | percent
not shared | 211 | 53%
GEO | 138 | 35%
ArrayExpress | 24 | 6%
SMD | 4 | 1%
journal | 11 | 3%
other | 9 | 2%
Grand Total | 397 | 100%

Table 11: Sample size required for confidence levels by population size
Population size | +/- 3% | +/- 5% | +/- 7.5% | +/- 10%
2000 | 696 | 323 | 157 | 92
3000 | 788 | 341 | 162 | 94
5000 | 880 | 357 | 165 | 95
10000 | 965 | 370 | 168 | 96
20000 | 1,014 | 377 | 169 | 96
50000 | 1,045 | 382 | 170 | 96
100000 | 1,058 | 383 | 170 | 96

From http://www.zoomerang.com/MKT/samplesize-calculator/step3.html and also see http://www.surveysystem.com/sscalc.htm
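The formula behind the online calculators cited for Table 11 is not stated in this proposal. The Python sketch below assumes the commonly used approach: Cochran’s sample size formula for a proportion at 95% confidence (z = 1.96) with maximum variance (p = 0.5), followed by a finite population correction. Under these assumptions the results land within a few units of the tabulated values; the calculators’ rounding conventions are unknown, so exact agreement should not be expected. The function name is hypothetical.

```python
def sample_size(population: int, margin: float,
                z: float = 1.96, p: float = 0.5) -> float:
    """Approximate sample size for estimating a proportion.

    Assumes Cochran's formula with a finite population correction.
    Returns an unrounded value; round up before use in practice.
    """
    if population <= 0 or not (0.0 < margin < 1.0):
        raise ValueError("population must be positive and margin in (0, 1)")
    infinite_population_n = (z ** 2) * p * (1 - p) / (margin ** 2)
    return infinite_population_n / (1 + (infinite_population_n - 1) / population)


for population in (2000, 10000, 100000):
    row = [sample_size(population, margin) for margin in (0.03, 0.05, 0.075, 0.10)]
    print(population, [round(value, 1) for value in row])
```

The correction matters most for small populations: the required sample grows slowly once the population exceeds a few tens of thousands, which is the pattern Table 11 shows.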
