This section includes relevant background text, definitions and examples, policy statements, video, and expert commentary. It should be read by those looking for a thorough understanding of data acquisition and management.

»Foundation Text«

I. What are data?

Data are a collection of facts, measurements, or observations used to make inferences about the world we live in. Data can range from material created in a wet laboratory, such as an electrophoresis gel or a DNA sequence, to that obtained in social-science research, such as a filled-out questionnaire, videotapes, and photographs. Data can be microscope slides, cell lines, climate patterns, soil samples, astronomical measurements, and spectrographic analyses. Custom software or hardware and specialized methods can be data, too.

II. Who owns data?

A. It may depend on who sponsors the research

Graduate students, postdoctoral fellows, and even some faculty members performing research in academia may believe that they own the data they collect, but in most cases they do not. As employees of a university, they are working for hire for the university, which, in most cases, owns the rights to the data. In federally sponsored research, the university owns the data but allows the principal investigator (PI) on the grant to be the steward of the data. The PI takes responsibility for the collection, recording, storage, retention, and disposal of data. When data are published, the copyright is retained by the PI, who then assigns it to the publisher of the journal. Had the faculty member undertaken a research project on behalf of the university, the university would have the copyright to the data. But since faculty members generally perform research on their own, the copyright belongs to them. Data and data books collected by undergraduates, graduate students, and postdoctoral fellows on a research project belong to the grantee institution, and students should not take their data when they leave without making appropriate arrangements. Retaining copies of data is allowed with permission, and although permission is not always sought, seeking it is certainly good practice. When faculty members leave an institution, they have to negotiate with the university to keep their grants and data. With industry-funded or privately funded research, data can belong to the sponsor, and the right to publish the data may or may not be extended to the investigator.

B. The Bayh-Dole Act

The Bayh-Dole Act of 1980 allowed universities to have control of the intellectual property, such as patents, generated from federally funded research. With a patent in hand, universities could exclusively license the patent to businesses; for the past 25 years or so, many universities, including Columbia, have benefited from the licensing revenue. Recent inventions, in the form of new drugs and computer technologies, have also helped the public. The law has encouraged new relationships between academic researchers and companies, but critics such as Derek Bok (2003) have charged that Bayh-Dole promotes universities' selling out their interests to industry rather than relying on raising money from tuition and other sources.

C. The role of research subjects in ownership of data

Since data can be defined as cell lines and DNA sequences, controversies have arisen concerning whether research subjects and patients actually own their own tissue or DNA. A case brought by John Moore against the University of California raised issues about whether a patient has ownership of his data, which in this case was tissue used in research to develop a cell line that had commercial value. In 1976, Moore had gone to UCLA Medical Center seeking treatment for hairy-cell leukemia. Research performed on cells from his spleen led to a patent in 1984. Moore sued the University of California Regents, the researcher, and the company with which the researcher was working, among other defendants, stating that the altered tissue was his own property and that he wanted to recover damages. He also said that he had not been informed about the potential use of his tissue by the researcher. The California Supreme Court held that Moore had a right to sue the research physician for failing to inform Moore of what he intended to do with his cells. However, Moore did not win the right of ownership of his cells and thereby any entitlement to the data and subsequent financial proceeds that might be generated from the research done on the cells; the Court said that if all subjects had the right to their own tissue it could hinder biomedical research.

A more recent case concerning data ownership revolves around the desire of research subjects to have control over data and involves the development of a genetic test for Canavan disease. Daniel Greenberg and his wife had a son with Canavan disease, an inherited degenerative condition that afflicts Ashkenazi Jewish people and causes the loss of body control and death, usually before children reach their teens. In 1987, Greenberg had sought out Dr. Reuben Matalon, who eventually moved to Miami Children's Hospital, to try to develop a reliable screening test for the Canavan gene mutation. At the time, no such test existed. The Greenbergs raised money, found other families in which Canavan disease was present, and asked them to donate blood and tissue samples to Dr. Matalon's effort. Dr. Matalon was able to develop a DNA-based test for the disease, and Miami Children's Hospital received a patent on the Canavan gene and its various mutations in 1997. To help offset some of the costs of paying for Dr. Matalon's research, the hospital decided to charge a $12.50 royalty fee on the test. But Greenberg and the families who had donated tissues sued, asking that the test be put in the public domain, available to laboratories free of charge and without the problem of licensing agreements. The case was settled in 2003. The parties agreed that Miami Children's Hospital would retain ownership of the gene patent and could continue to license it and collect royalty fees for clinical testing for the Canavan gene mutation. The agreement also allows license-free use of the Canavan gene in research to cure Canavan disease, including gene-therapy research, and genetic testing in pure research and in mice used to research Canavan disease.

III. What are some of the best ways to collect data?

According to Francis Macrina, surveys show that trainees receive little instruction on the principles of record-keeping in research. Therefore, for the scientific enterprise to be productive in the long run, there needs to be positive and comprehensive mentoring of students in data management. Most institutions simply do not provide the education and training resources needed to formally instruct trainees and junior researchers in good data-management practices. The reader should review the Mentoring Tutorial that is part of this RCR series for more information on the best mentoring practices.

In essence, there is no one way to keep data, but data should explain why research was done, how it was done, where primary data are kept, what happened and didn't happen, an interpretation of the data, and what's next. Data should allow another researcher the ability to repeat the experiment. The data also should be kept in a way that is easy to understand. Legally, federal sponsors of research have the right to audit data and examine records that are relevant to a grant. Data can also be important commercially, in new drug applications to the Food and Drug Administration and for patents on new technologies.
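For researchers who keep records electronically, the elements listed above can be made explicit as fields in a record template. The following is only a minimal sketch in Python; the class name `NotebookEntry`, its field names, and the `missing_fields` helper are illustrative assumptions, not a format prescribed by any sponsor or institution.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class NotebookEntry:
    """One research-record entry (hypothetical layout, for illustration only)."""
    purpose: str                 # why the research was done
    method: str                  # how it was done
    primary_data_location: str   # where primary data are kept
    observations: str            # what happened and what didn't happen
    interpretation: str          # an interpretation of the data
    next_steps: str              # what's next
    recorded_at: datetime = field(default_factory=datetime.now)

    def missing_fields(self) -> List[str]:
        """Return the names of text fields that were left blank."""
        return [name for name, value in vars(self).items()
                if isinstance(value, str) and not value.strip()]
```

A lab could call `missing_fields()` before an entry is filed to flag records that would not let another researcher repeat the experiment.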

A. Reliable methodologies in research practices

Before data collection occurs, Michael Kalichman points out, researchers need to ask themselves whether certain topics of investigation should even be pursued. Once it is decided that a project is valid, decisions then have to be made about what methods will be used to study the question at hand. Will animal or human subjects or scarce resources be necessary to perform the research in question?

Also, proper statistical methods and experimental design need to be employed because, as the expression goes, "Garbage in, garbage out." If an experiment is poorly designed, time and resources can be wasted; if too few subjects are chosen, statistical inferences can become useless. In addition, researchers need to determine what potential biases, such as any conflicts of interest, may exist. Kalichman recommends that before a study is undertaken a researcher should:

B. The practice of keeping research notebooks, paper vs. electronic

While many experts recommend collecting data in bound and paper-based notebooks with numbered pages on which the date and time of research can be clearly enumerated, many researchers employ a mixture of electronic and paper-based approaches.
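On the electronic side, the numbered pages and dated entries of a bound notebook can be approximated with an append-only log in which each entry records a hash of the one before it, so that a later change to any earlier entry becomes detectable. This is a hedged sketch in Python, not a description of any particular electronic-notebook product; the function names `append_entry` and `verify` and the entry layout are assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(entry):
    """Hash the page number, timestamp, text, and previous hash of an entry."""
    payload = {k: entry[k] for k in ("page", "timestamp", "text", "previous_hash")}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log, text, timestamp=None):
    """Append a numbered, dated entry that is chained to the previous entry."""
    entry = {
        "page": len(log) + 1,
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "text": text,
        "previous_hash": log[-1]["hash"] if log else "",
    }
    entry["hash"] = _digest(entry)
    log.append(entry)
    return entry


def verify(log):
    """Return True only if page numbers and hashes are all consistent."""
    previous_hash = ""
    for number, entry in enumerate(log, start=1):
        if entry["page"] != number or entry["previous_hash"] != previous_hash:
            return False
        if entry["hash"] != _digest(entry):
            return False
        previous_hash = entry["hash"]
    return True
```

A chain like this only reveals that something was altered; it does not prevent alteration, and it still relies on someone keeping a trusted copy of the latest hash.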

While both types of data can be manipulated by someone deciding to engage in misconduct, checks and balances in both can make it harder to do so. The federal Office of Research Integrity follows the British Medical Research Council advice on good record-keeping. The ORI states that data should be stored in such a way that it permits a complete retrospective audit, and that it is monitored regularly to ensure completeness and accuracy. Virginia Commonwealth University also provides excellent guidance about keeping good records. The following is a best-practices summary for good record-keeping:

C. Authorizations to collect data

1. Institutional Review Boards and human-subject research:

What is human-subject research? Should studies of tissues in repositories be considered human-subject research, requiring informed consent by the donors every time a sample is used? Is surveying people on the World Wide Web for their political views considered human-subject research? If in doubt, researchers should contact their Institutional Review Boards (IRBs). The IRBs at Columbia University, like IRBs elsewhere, convene to ensure the protection of the rights and welfare of human subjects participating in research at the university. The boards have the power to approve, disapprove, or modify research protocols involving human subjects. The boards operate under all federal laws, particularly 45 CFR 46, with respect to human-subject research. An IRB can monitor studies and suspend or terminate them if there is a danger to subjects or if a researcher is not complying with appropriate guidelines. Boards are made up of members of the faculty and staff and have representatives from outside of the university. Institutions can have several boards that review research protocols.

A board is responsible for determining whether the benefit of the research is sufficient to counterbalance any risks associated with the project. Boards monitor the nature of the informed consent given to the research subjects as well as the issue of whether confidentiality is maintained during the study and afterward. Boards also ensure that special protection is afforded vulnerable groups, such as children or pregnant women, the mentally ill, prisoners, and people with severe illness. They also ensure that selection of subjects is equitable and that participation in a research project is voluntary.

Some research that seems to involve human subjects is exempt from review by the IRB, but the protocol still needs to be submitted to the IRB. Exempt research does not require informed consent.

The federal government sets out eight issues that must at least be included in the informed-consent document:

  1. The purpose and procedures of the research.
  2. The risks.
  3. The benefits.
  4. Alternatives to participation.
  5. Mechanisms used to protect confidentiality.
  6. If there is greater than minimal risk, an explanation of any compensation available if injury occurs during the research project.
  7. Contact information for rights as a research subject and in the event of a problem.
  8. The right of the participant to withdraw from the research at any time.

2. HIPAA and privacy of data:

The Health Insurance Portability and Accountability Act of 1996 (HIPAA) expands the confidentiality requirements set forth under 45 CFR 46 with respect to patient medical records and information. Researchers looking at clinical data need to know whether they can certify that their investigations will not result in the disclosure of information about a patient, whether they can obtain a waiver of authorization from a Privacy Board (a committee that looks at HIPAA issues at Columbia and other institutions that don't use the IRB for privacy issues), or whether they need authorization from a patient. When a researcher is preparing to do research and needs to look at medical-records data, he or she can fill out a form - the Investigator's Certification for Reviews Preparatory to Research - certifying to the Privacy Board that he or she will not remove data. The Privacy Board does not have to review the certification. A researcher can also be exempt from HIPAA authorization if working with data that are de-identified, which means that they lack information that can be traced back to an individual. The researcher would then sign a form stating that he or she is using de-identified information. Researchers can obtain a waiver of authorization from the Privacy Board if they can make the case that the research couldn't be done without the waiver. For example, if a researcher needs to analyze 1,000 medical records for a research protocol but cannot go back and contact each individual, the Privacy Board might provide a waiver of authorization. If the researcher needs only 100 records, the Privacy Board would determine whether the researcher can contact all of those people about the nature of the information used in order to obtain authorization. Authorization provides the subject with information about the nature of the disclosures concerning the medical record and the individual's right to obtain his or her own medical record.
For more information, researchers should contact their HIPAA compliance office.
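As a rough illustration of what removing traceable information can look like in practice, the Python sketch below drops a set of directly identifying fields from a record before analysis. The field names in `DIRECT_IDENTIFIERS` and the function `strip_identifiers` are hypothetical and deliberately incomplete; dropping these fields does not by itself make a data set de-identified under HIPAA, and that determination belongs to the institution's compliance office.

```python
# Illustrative subset of identifying fields; an assumption, not the HIPAA list.
DIRECT_IDENTIFIERS = frozenset({
    "name",
    "address",
    "phone",
    "email",
    "ssn",
    "medical_record_number",
    "date_of_birth",
})


def strip_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifying fields."""
    return {key: value for key, value in record.items()
            if key.lower() not in identifiers}


patient = {
    "Name": "Jane Doe",
    "medical_record_number": "12345",
    "diagnosis": "hairy-cell leukemia",
    "age_group": "40-49",
}
research_view = strip_identifiers(patient)
```

The original record is left unchanged, so the identified copy can stay under whatever access controls the institution requires.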

3. Animal subject regulations:

Concern for animal welfare guides federal regulations about the use of animals in research. Institutions have Institutional Animal Care and Use Committees (IACUCs), which approve protocols and are subject to federal law. Columbia University has an IACUC, which, by asking the following questions, works to ensure that a research project does not put an undue burden on the animals:

IV. What are the issues in data storage and retention?

The National Institutes of Health require grant recipients to keep all data for three years after the final expenditure report for the grant is submitted, and the National Science Foundation has a similar requirement. But different agencies and academic societies have different requirements, and universities usually make it the principal investigator's responsibility to abide by the rules. The American Psychological Association, for example, expects its members to retain data for a minimum of five years, and universities require data retention for varying periods of time, usually a minimum of five years. Legally, data also need to be retained for patent protection and while any misconduct allegations based on the data are pending.
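When several retention periods apply to the same data, the longest one governs. The arithmetic can be sketched in Python as follows; the function names are hypothetical, and the sketch assumes, as a simplification, that every period is measured from the same starting date (here, the date the final expenditure report is submitted), which may not hold for every sponsor or society.

```python
from datetime import date


def retention_end(start, years):
    """Return the date `years` after `start`, mapping Feb 29 to Feb 28 if needed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is Feb 29 and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


def earliest_disposal(start, required_years):
    """Return the end of the longest applicable retention period."""
    return retention_end(start, max(required_years))


# Example: a three-year sponsor requirement and a five-year university minimum.
final_report = date(2020, 6, 30)
keep_until = earliest_disposal(final_report, [3, 5])
```

Pending patent matters or misconduct allegations would extend the period beyond whatever date such a calculation returns.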

Beyond the professional and legal obligations, researchers in practice will store data as long as they feel it is necessary. But confidential data have to be stored in such a way that unauthorized people cannot gain access to them. Audits may be necessary to determine whether data have been stored properly. Some investigators store electronic data with archival resources.

V. What are the obligations to share data?

A. Pros and cons:

Academic tradition holds that faculty members will publish significant research results and engage in the free exchange of information at, say, student seminars about research progress. Sharing is good for the research enterprise as a whole; colleagues will learn new techniques, gain insights, and not repeat experiments that are not fruitful, wasting valuable resources. Indeed, many peer-reviewed publications, such as chemistry and astronomy journals, require that researchers provide their data.

Recently, because of controversies about the secrecy of data from clinical trials for drugs, the pharmaceutical industry is beginning to offer its data in online databases. Sharing has benefits for individual researchers, in that it can lead to collaborations. But some information cannot be released because of privacy and human-subject protection concerns. Also, the release of early data before publication can jeopardize the ability of an investigator to be the first to publish a research finding, if a competitor can take the data and publish it first. Data that can lead to patents also cannot be shared prematurely. After publication, however, it is expected that scientists will share their raw data, if doing so does not create an undue financial burden.

B. Some policies and considerations

1. National Institutes of Health Data Sharing Policy:

In 2003, the NIH instituted a new policy on data sharing. The policy applies to investigator-initiated applications seeking $500,000 or more in direct costs in any single year and may have an impact on smaller grants too. The goal of the policy is to expedite the timely release and sharing of final data to enhance the research enterprise. Release of data can be complex. Intellectual-property considerations, non-governmental sponsorship issues, and human-subject confidentiality protection must be considered before data are released.

The NIH requires investigators applying at that funding level to include with their grant applications information about how they plan to share the data generated from their research. If a grant is awarded, the data-sharing plan must be carried out.

2. USA PATRIOT Act:

An important issue for researchers these days is how scientists balance the free exchange of some sensitive scientific data and information with the possibility that a terrorist or an enemy might use the material against us. Indeed, the American Society for Microbiology changed the review policies for its publications, requiring editors to be alert to sensitive information. But some scientists began withholding submissions for review, an unintended consequence of the policy. Likewise, the United Kingdom's Royal Society and Wellcome Trust recently issued a report saying that scientific societies and funding institutions should take more responsibility in vetting and preventing the release of risky technical details. But the report said that censoring basic science would not prevent terrorist attacks, and could actually make it more difficult to prevent harm, since secret data would not be subject to sufficient validation by other researchers and might not be accessible in the event of an emergency. Science, proponents of free exchange argue, can never protect against every conceivable type of attack or idea. The report called for scientists to be self-governing regarding the release of potentially dangerous information. However, some research today remains classified, and there have been times in history when important fundamental research also had to be classified. For example, the activities of the Manhattan Project, which developed the atomic bomb, were kept secret, although much of the science and technology ultimately became public. The USA PATRIOT Act attempts to balance the ability of researchers to share data with national-security interests. Among its many provisions, the act creates restrictions on the transport of potentially dangerous biological specimens. It also characterized a type of research, called "sensitive but unclassified," which requires review by the federal funding agency before publication.

3. Council on Governmental Relations:

Because of the increased mobility of researchers, scientific challenges to data, charges of failures of scientific integrity, interactions of academics with industry, potential conflicts of interest, and litigation over ownership of data, the Council on Governmental Relations, an association of research universities, issued a report in 1996 for universities to use when reviewing their policies on data access and retention. It provides a resource without being prescriptive. The report describes the definition of data as wide, at least as seen through the lens of different governmental agencies, such as the NSF, the Public Health Service, and the Environmental Protection Agency. Each has different requirements for what it considers data and, therefore, for data retention and sharing. The council also addressed data custody, pointing out that universities usually give custody of data to researchers, who then are responsible for keeping the data in trust, "not moving or destroying it without appropriate advance notice and permission from the legal owner" - which, typically, is the university. Who has access to data is also taken into account. Universities claim access to researcher data, especially in cases of scientific misconduct, but who else should have access is subject to ethical, intellectual-property, and research-based considerations. Concerns include time of access (before or after publication), level of access (raw, transformed, or summarized data), and cost of sharing the data.

4. National Science Foundation data-sharing policy:

"The NSF expects significant findings from research and activities it supports to be promptly submitted for publication, with authorship that reflects the contributions of those involved. It expects investigators to share with others at no more than incremental cost and within a reasonable time, the data, the samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages awardees to share software and inventions to make them useful and usable. Exceptions may be allowed to safeguard the rights of individuals and subjects, the validity of results or the integrity of collections."
