A Lexicon of Foundational Definitions
The Glossary for Research Software collates foundational terminology and surfaces debates over contested meanings. It is a living resource, updated on an ongoing basis, with versioning records available in the title bar. We have chosen to source, rather than compose, definitions so that readers are also directed toward places to read further on a given topic. When a keyword has multiple definitions, they are indicated by bracketed numbers; their order is incidental and reflects only the time of discovery. Because this document continues to grow, the grid structure of terms is intended to provide an overview, with each term linked to its definition below.
Facilitating shared understanding is an important means of progressing the field of Research Software. If you would like to contribute to this guide, please send your keyword and/or citable definition to Dan Rudmann.
A data repository is an archive for research data and software. A trusted digital repository is a digital archive whose mission is to store, manage, and provide reliable, long-term access to digital resources, and it has been certified by an official organisation. A well-known certification for data repositories is CoreTrustSeal.1
Describing the contribution of software to research closely relates to a number of other issues around the role of software in research. This includes publishing research software in a persistent and citable way, ensuring the availability of research software (and data, online services and other artefacts) for the long-term, promoting the recognition of software as a valuable research output in its own right, and ensuring that the developers of research software have their contributions recognised and rewarded. These are concerns which affect not only those using research software, but those who develop or modify research software, those who release research software, paper reviewers, programme committees, publishers and funders.2
Copyright is the area of law that deals with creation, ownership, sale, and use of creative and expressive works. Copyright is intended to protect the interests of authors and inventors for a limited period of time, after which the public can benefit from the use and distribution of their creative work. Exclusive right to their work in turn will encourage the continued creation and dissemination of works for the betterment of society.3
I define ‘data’ as a relational category applied to research outputs that are taken, at specific moments of inquiry, to provide evidence for knowledge claims of interest to the researchers involved. Data thus consist of a specific way of expressing and presenting information, which is produced and/or incorporated in research practices so as to be available as a source of evidence, and whose scientific significance depends on the situation in which it is used. In this view, data do not have truth-value in and of themselves, nor can they be seen as straightforward representations of given phenomena. Rather, data are essentially fungible objects, which are defined by their portability and their prospective usefulness as evidence… I propose to view data as any product of research activities, ranging from artefacts such as photographs to symbols such as letters or numbers, which is collected, stored, and disseminated in order to be used as evidence for knowledge claims. This does not mean that whoever gathers data already knows how they might be used. Rather, what matters is that observations or measurements are collected with the expectation that they may be used as evidence for claims about the world in the future. Hence, any object can be considered as a datum as long as (1) it is treated as potential evidence for one or more claims about phenomena and (2) it is possible to circulate it among individuals.4
[1] A type of persistent identifier used to uniquely identify objects. The DOI system is particularly used for electronic documents such as journal articles. The DOI system began in 2000 and is managed by the International DOI Foundation.5
[2] A DOI is a digital identifier of an object, any object — physical, digital, or abstract. DOIs solve a common problem: keeping track of things. Things can be matter, material, content, or activities.
A DOI is a unique number made up of a prefix and a suffix separated by a forward slash. This is an example of one: 10.1000/182. It is resolvable using [DOI Foundation’s] proxy server by displaying it as a link: https://doi.org/10.1000/182. Designed to be used by humans as well as machines, DOIs identify objects persistently. They allow things to be uniquely identified and accessed reliably. You know what you have, where it is, and others can track it too.6
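The prefix/suffix structure and the resolver link described above can be illustrated with a minimal Python sketch. The function names `split_doi` and `doi_url` are hypothetical helpers written for this glossary, not part of any DOI Foundation tooling, and the sketch only checks for the forward slash rather than validating a DOI in full.

```python
from urllib.parse import quote


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first forward slash."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"expected 'prefix/suffix', got {doi!r}")
    return prefix, suffix


def doi_url(doi: str) -> str:
    """Build the resolvable https://doi.org/ link for a DOI."""
    prefix, suffix = split_doi(doi)
    # Percent-encode unusual characters in the suffix, keeping slashes as-is.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


print(split_doi("10.1000/182"))  # ('10.1000', '182')
print(doi_url("10.1000/182"))    # https://doi.org/10.1000/182
```

Splitting at the first slash only is a deliberate choice here, since a suffix may itself contain further slashes.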
In contemporary settings of applied computational research, such as data science, the use of the term ‘domain’ is ubiquitous. The term serves to identify, demarcate, and characterize spheres of worldly action or knowledge, for instance, biology as the ‘domain science’ of life or geologists as the ‘domain experts’ of the earth. The use of the term implies, necessarily, that there is more than one domain, that they are in some way distinct from each other, and thus that domains are topically specific. The concept of a domain is set against a proposition that there is a more general, even universal, method or technique; so, for instance, a data analytic tool may be dubbed ‘domain independent,’ meaning that it can be of use across many, and sometimes all, domains. This general or universal quality is characterized as the feature of a kind of field or expert, recently data science, but in the past a feature of other specializations such as the computing and information sciences.7
In 2016, the ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data. The authors intended to provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The principles emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.8
The General Data Protection Regulation (GDPR) is the European regulation with rules for processing personal data. In Dutch: Algemene Verordening Gegevensbescherming (AVG).9
By far, the most widely used modern version control system in the world today is Git. Git is a mature, actively maintained open source project originally developed in 2005 by Linus Torvalds, the famous creator of the Linux operating system kernel.[…] Having a distributed architecture, Git is an example of a DVCS (hence Distributed Version Control System). Rather than have only one single place for the full version history of the software as is common in once-popular version control systems like CVS or Subversion (also known as SVN), in Git, every developer's working copy of the code is also a repository that can contain the full history of all changes.10
[1] The building blocks of the digital research infrastructure system include:
large scale compute facilities, including high-throughput, high-performance, and cloud computing
data storage facilities, repositories, stewardship and security
software and shared code libraries
mechanisms for access, such as networks and user authentication systems
people: the users, and the experts who develop and maintain these powerful resources.11
[2] Digital research infrastructure (DRI) is the collection of tools and services that allow researchers to turn big data into scientific breakthroughs.
In today's digital age, data is an essential tool for scientific progress; it underpins quality research in every discipline. As the global innovation race speeds up, only the countries that have world-class digital research infrastructure in place will be able to stay competitive. To maintain Canada's science and research excellence and make sure we can benefit from these ideas, we must coordinate our national computing power and connectivity with the best software and storage services for data.
The four key elements of a country's digital research infrastructure are:
digital network for research and education, allowing researchers to share data and collaborate across Canada and around the world
data management (DM), allowing researchers to find and access data
research software (RS), enabling researchers to access and use data
advanced research computing (ARC), involving super computers that allow researchers to analyze massive amounts of data12
A license is a document that acts as your official permission for others to do, use, or own something that you are the copyright owner for (like code you've written, or data you've gathered). It's a crucial part of scholarly communications -- your colleagues need to know exactly how they can use your materials, and you can establish boundaries that you are comfortable with. Releasing your materials without a license creates ambiguity. No one can use your work if you do not include a license, because the implication is that you have reserved every right (including to copy or modify your code) for yourself.13
Standardised structured information explaining data items like, but not limited to: purpose, origin, time references, geographic location, creator, access conditions and terms of use of a data collection. Documentation and explanation of the data.14
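As a rough illustration of what such structured information might look like in practice, the Python sketch below holds one record with the items named in the definition and reports which ones are still empty. The field names, the `missing_fields` helper, and every value in `example_record` are invented for this example and do not follow any particular metadata standard.

```python
# Field names mirror the items listed in the definition above; the exact
# spelling of each key is an assumption made for this sketch.
FIELDS = (
    "purpose",
    "origin",
    "time_reference",
    "geographic_location",
    "creator",
    "access_conditions",
    "terms_of_use",
)


def missing_fields(record: dict) -> list[str]:
    """Return the expected fields that are absent or empty in a record."""
    return [field for field in FIELDS if not record.get(field)]


# A made-up record describing a hypothetical data collection.
example_record = {
    "purpose": "Survey of software use in a research group",
    "origin": "Online questionnaire",
    "time_reference": "2023-01/2023-03",
    "geographic_location": "Netherlands",
    "creator": "Example Research Group",
    "access_conditions": "Open after registration",
    "terms_of_use": "CC BY 4.0",
}

print(missing_fields(example_record))        # []
print(missing_fields({"purpose": "demo"}))   # every field except 'purpose'
```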
Open source doesn’t just mean access to the source code. The distribution terms of open-source software must comply with the following criteria:
1. Free Redistribution
The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.
2. Source Code
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
3. Derived Works
The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.
4. Integrity of The Author’s Source Code
The license may restrict source-code from being distributed in modified form only if the license allows the distribution of “patch files” with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of software built from modified source code. The license may require derived works to carry a different name or version number from the original software.
5. No Discrimination Against Persons or Groups
The license must not discriminate against any person or group of persons.
6. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.
7. Distribution of License
The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.
8. License Must Not Be Specific to a Product
The rights attached to the program must not depend on the program’s being part of a particular software distribution. If the program is extracted from that distribution and used or distributed within the terms of the program’s license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the original software distribution.
9. License Must Not Restrict Other Software
The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be open-source software.
10. License Must Be Technology-Neutral
No provision of the license may be predicated on any individual technology or style of interface.15
An operating system is a program that acts as an interface between the computer user and computer hardware, and controls the execution of programs. The operating system (OS) manages all of the software and hardware on the computer. It performs basic tasks such as file, memory and process management, handling input and output, and controlling peripheral devices such as disk drives and printers. Most of the time, there are several different computer programs running at the same time, and they all need to access your computer’s central processing unit (CPU), memory and storage. The operating system coordinates all of this to make sure each program gets what it needs.16
A package management system organizes and simplifies the installation and maintenance of software by standardizing and organizing the production and consumption of software collections. As a software developer, you can benefit from package managers in two ways: through a rich and stable development environment and through friction-free reuse.17
A persistent identifier is a long-lasting reference to a digital resource. An identifier is a label which gives a unique name to an entity: a person, place, or thing. Unlike URLs, which may break, a persistent identifier reliably points to a digital entity.18
A computing platform or digital platform is an environment in which a piece of software is executed. It may be the hardware or the operating system (OS), even a web browser and associated application programming interfaces, or other underlying software, as long as the program code is executed with it. Computing platforms have different abstraction levels, including a computer architecture, an OS, or runtime libraries. A computing platform is the stage on which computer programs can run.19
The README is usually available even before the software is installed, exists to get a new user started, and points them towards more help... At a minimum, your README should:
Explain what the software does. There’s nothing more frustrating than downloading and installing something only to find out that it doesn’t do what you thought it did.
List required dependencies. We address dependencies in more detail in Rule 5.
Provide compilation or installation instructions.
List all input and output files, even those considered self-explanatory. Link to specifications for standard formats and list the required fields and acceptable values in other files. If there is no rigorous definition for a format, explain its parts as clearly as possible in plain English.
List a few example commands to get a user started quickly.
State attributions and licensing. Attributions are how you credit your contributors; licenses dictate how others may use and need to credit your work.20
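One way to act on the checklist above is to start every project from a skeleton that already has a placeholder for each item. The Python sketch below generates such a skeleton; the function name `readme_skeleton`, the section titles, and the use of Markdown headings are assumptions made for illustration, not something the quoted source prescribes.

```python
def readme_skeleton(name: str) -> str:
    """Return README text with one placeholder section per checklist item."""
    sections = [
        ("What it does", "Explain what the software does."),
        ("Dependencies", "List required dependencies."),
        ("Installation", "Provide compilation or installation instructions."),
        ("Input and output files", "List all files and link to format specifications."),
        ("Quick start", "List a few example commands."),
        ("Attribution and license", "Credit contributors and state the license."),
    ]
    lines = [f"# {name}", ""]
    for title, hint in sections:
        # Each section starts as a TODO so that gaps stay visible.
        lines += [f"## {title}", "", f"TODO: {hint}", ""]
    return "\n".join(lines)


print(readme_skeleton("mytool"))
```

The output can be saved as `README.md` and filled in section by section.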
[1] software that: solves complex modeling problems in a scientific context (physics, mathematics, biology, medicine, social science, neuroscience, engineering); supports the functioning of research instruments or the execution of research experiments; extracts knowledge from large data sets; offers a mathematical library, or similar.21
[2] A computer-based application that converts inputs into outputs to support the user in one or more research tasks.22
[3] Research Software includes source code files, algorithms, scripts, computational workflows and executables that were created during the research process or for a research purpose. Software components (e.g., operating systems, libraries, dependencies, packages, scripts, etc.) that are used for research but were not created during or with a clear research intent should be considered software in research and not Research Software. This differentiation may vary between disciplines. The minimal requirement for achieving computational reproducibility is that all the computational components (Research Software, software used in research, documentation and hardware) used during the research are identified, described, and made accessible to the extent that is possible.23
Software itself is the set of instructions or programs that tell a computer what to do. It is independent of hardware and makes computers programmable. There are three basic types: System software to provide core functions such as operating systems, disk management, utilities, hardware management and other operational necessities. Programming software to give programmers tools such as text editors, compilers, linkers, debuggers and other tools to create code. Application software (applications or apps) to help users perform tasks.24
Revision control is the process of managing multiple versions of a piece of information. In its simplest form, this is something that many people do by hand: every time you modify a file, save it under a new name that contains a number, each one higher than the number of the preceding version.25
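The by-hand approach described above can be sketched in a few lines of Python: each call copies the file to a new name carrying a number one higher than the highest existing copy. The `_v<number>` naming scheme and the helper names `next_version_number` and `save_version` are assumptions chosen for this illustration. Tools such as Git replace this manual routine with a full recorded history.

```python
import re
import shutil
from pathlib import Path


def next_version_number(path: Path) -> int:
    """Return one more than the highest existing '<stem>_v<N><suffix>' copy."""
    pattern = re.compile(
        rf"^{re.escape(path.stem)}_v(\d+){re.escape(path.suffix)}$"
    )
    numbers = [
        int(match.group(1))
        for candidate in path.parent.iterdir()
        if (match := pattern.match(candidate.name))
    ]
    return max(numbers, default=0) + 1


def save_version(path: Path) -> Path:
    """Copy the file next to itself under the next numbered name."""
    number = next_version_number(path)
    target = path.with_name(f"{path.stem}_v{number}{path.suffix}")
    shutil.copy2(path, target)
    return target
```

Calling `save_version(Path("notes.txt"))` twice, with an edit in between, would leave `notes_v1.txt` and `notes_v2.txt` beside the working file.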