Research dataset


#1

This is a proposed new standard glossary term. See this post for background on this review track.

Short Definition: An organized collection of research data in a computational format, defined by a theme or category that reflects what is being measured/observed/monitored.

Extended Definition: An organized collection of research data in a computational format, defined by a theme or category that reflects what is being measured/observed/monitored. The presentation of the dataset in the application is enabled through metadata. In machine-to-machine interactions, a research dataset is a compilation of data that constitutes a programmable data unit that has been collected and organised using one process. It has a single Data Owner, a single license, one set of semantics, ontologies, vocabularies, and has a single data format and internal data convention. A research dataset must include its version.

Synonyms:

Acronym:

Related Terms: http://dictionary.casrai.org/Dataset_series

Sources:

Term Lead: Lesley Wyborn
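
As a rough illustration of the machine-to-machine reading in the Extended Definition above, the single-valued constraints could be sketched as a small record type. This is only a sketch under assumptions: the class name `ResearchDataset` and the field names are hypothetical, not part of any CASRAI schema, and semantics, ontologies and vocabularies are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchDataset:
    """Hypothetical record mirroring the single-valued constraints in the Extended Definition."""

    theme: str        # what is being measured/observed/monitored
    data_owner: str   # a single Data Owner
    license: str      # a single license
    data_format: str  # a single data format
    version: str      # the proposed definition says a version must be included

    def __post_init__(self) -> None:
        # Reject blank values so that every single-valued field is actually present.
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")
```

Under this reading, a dataset without a version cannot be constructed at all, which is the requirement the replies below discuss.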


#2

Definitions are supposed to “fit with common usage”. Often research datasets do not include a version.


#3

True, but in the current context a research dataset should have some form of version control. Some of the language can be tightened up.


#4

I tend to agree with Claire about inclusion of version control. This would fall under one of the best practices generally covered by effective data management, and put into practice through data management plans.

Before I comment on ‘research dataset’, I think it would be beneficial to first understand what we mean by a dataset. The issue that arises in the CASRAI context is that there are two different definitions:

Data Set

A series of structured observations, measurements or facts identified from the research which can be stored in a database

Dataset

Any organized collection of data in a computational format, defined by a theme or category that reflects what is being measured/observed/monitored. The presentation of the data in the application is enabled through metadata.

Before any compound terms can be made, we must first agree upon which of the data set / dataset definitions we want to use.

The current definition (Research Dataset) blends elements of other definitions. Assuming we can agree on what a dataset actually is, we then need to understand how a “Research Dataset” is different from a ‘regular’ dataset. By ‘research’ do we mean that the dataset is created in the context of a research project? Or that the information in a dataset is used to support research? My concern with the term Research dataset is that it seems as though it could apply to all datasets.

As a minor side note, I am not clear as to why the definition includes the condition ‘in a computational format’. While probably 99.99% of datasets are in a computational format, the format itself is not a vital element (or is it?). Most research datasets will be captured in computational formats. That being said, I can imagine cases where a researcher could collect data and observations by hand, capturing that data in a paper notebook (i.e. a non-computational format). These datasets are paper-based, and would therefore not fit under the definition we are proposing here. This is an unusual case, I admit. But I don’t see the need for the computational format clause, and I think removing it would actually improve the definition.


#5

There should not be two terms, data set and dataset. These definitions will have to be merged and a preferred term (dataset) identified, with the other (data set) being a synonym. Thank you for catching that, Matt. If I’m not mistaken, the scope of the glossary when this work first started a few years ago was limited to digital data. In that case, it is not necessary to specify computational format. It is also not necessary, nor perhaps even desirable, to specify in the definition that the observations are identified from the research. In this case, removing any reference to computation and research yields a definition that would also apply to paper formats. It might be best to remove metadata from the definition and identify it as a related term that is defined elsewhere.


#6

This topic was automatically closed after 0 minutes. New replies are no longer allowed.

