In this section, I describe the three technological elements I use to create born-open data: 1. a shared local storage file system, 2. Git repository software together with GitHub, a web host for open-source repositories, and 3. automation through a task scheduler. I provide a brief overview of each element. There is good and bad news here for those wishing to implement the system. The good news is that all the protocols are standard, well used, and available for all major operating systems including Windows, Mac OS X, and Unix/Linux. Any skilled IT professional should know them well or be willing to learn them. The bad news is that implementation will require a bit of tinkering, with the specifics dependent on machines, networks, and operating systems. Given that archiving data is a critical enterprise that may affect the researcher for the length of her or his career, investing in these or comparable technologies seems warranted and beneficial.

Element 1: Shared local storage

In my lab, behavioral data are collected across several computers. One key problem is coordinating among these computers. Code to run the experiments must be placed on each, and the resulting data must be merged into a master set. These tasks must be done accurately to ensure the integrity of the data. In some labs, assistants move files from one machine to another, often via memory sticks. Not only is this approach labor intensive, it may not be reliable. The better approach is to use one shared drive. The lab server shares a drive where both the experimental code and the data are stored. This drive is the master drive, and the other computers that collect the data read and write to it rather than to their own internal drives. Behavioral data are therefore created only on the master drive, and there is no need to move data files. Setting up a shared drive is not too difficult, and most researchers have either the knowledge to do so in their preferred operating system or institutional support.
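To make the idea concrete, here is a minimal sketch of how an experiment program might write each trial straight to the master drive. The mount point `/mnt/labshare`, the folder layout, and the function names are assumptions for illustration only; they are not taken from my lab's setup.

```python
import csv
from pathlib import Path

# Hypothetical mount point of the shared master drive; it differs per machine.
SHARED_ROOT = Path("/mnt/labshare")


def data_file(root: Path, experiment: str, participant: int) -> Path:
    """Return the path of one participant's data file on the master drive."""
    return root / experiment / "data" / f"sub{participant:03d}.csv"


def append_trial(path: Path, row: list) -> None:
    """Append one trial to the data file, creating folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", newline="") as handle:
        csv.writer(handle).writerow(row)
```

Because every collection computer calls something like `append_trial(data_file(SHARED_ROOT, "stroop", 7), [1, "red", 512])`, the data exist in one place from the moment they are created.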

Element 2: Git and GitHub

Once data have been stored in a single location, they must be made into a repository. A repository is more than a set of files; it also contains versions, changes, and logs. Because repositories contain logs and versions, they provide a digital audit trail. For me, the easiest approach is to use a single repository for all experiments in my lab. This repository spans several folders and files. Creating a repository, versioning, and logging are performed by a dedicated software application. Over the years there have been several options, but one application now dominates: Git. In my opinion, Git is dominant because it is the most flexible and convenient. One advantage of Git is that it interfaces seamlessly with GitHub, a public website that hosts a ridiculously large number of open-source projects. Git is very easy to find and install, and the GitHub clients, such as GitHub-for-Windows and GitHub-for-Mac, include it. Git is included in, or easily installed from the package repositories of, most Linux distributions, so little or no further installation is needed for Linux servers.

At the heart of the system is GitHub, the largest web host for sharing open-source projects (http://www.github.com). GitHub may be used at no cost, and when used in this mode, the information is freely and publicly available. GitHub was originally designed for the development of open-source code, but is now used for a wide range of publicly available projects. Anyone can make a GitHub account and use GitHub software to create a local repository and link it to GitHub. GitHub is designed to be fairly user friendly and provides extensive help and services. Git repositories may be uploaded to GitHub. The GitHub copy is then available to anyone, either through Git itself or through a web browser on demand. Trained IT professionals should be familiar with Git and GitHub because many useful projects are archived on these systems.

Here is a very brief step-by-step example of how to set up Git and GitHub. I take the perspective of Kirby, my dog, who has never used Git or GitHub. Kirby will be storing cute photos of himself as data.

1. The first task is to set up a repository on the GitHub server, as follows:

Kirby goes to GitHub and signs up for a free account (last option). Once the account is set up (with user name KirbyHerby), a bedazzling screen with a lot of options for exploring GitHub appears.

To create his first repository on the server, Kirby presses the green button that says “+ New repository” on the bottom left. Figure 1A provides a screen shot of the button.

Kirby now has to make some choices about the repository as shown in Fig. 1B. He names it “data,” enters a description of the repository, makes it public, initializes it with a README and does not specify which files to ignore or a license. He then presses the green “Create repository” button on the bottom, and is given his first view of the repository. Kirby’s repository is now at http://github.com/KirbyHerby/data, and he will bark out this URL to anyone interested. The repository contains only the README.md file at this point.

2. The next task is to set up a version of the repository on Kirby’s local computer and link the local and GitHub versions:

Kirby downloads the GitHub application for his operating system (http://mac.github.com or http://windows.github.com), and on installation, chooses to install the command-line tools (this will be helpful subsequently in task scheduling).

Kirby enters his GitHub username (“KirbyHerby”) and password. He next has to create a local repository and link it to the one on the server. To do so, he chooses “Add repository” and is given a choice to “Add,” “Create,” or “Clone.” Since the repository already exists at GitHub, he presses “Clone.” A list of his repositories shows up, and in this case, it is a short list of one repository, “data.” Figure 1C shows the screen shot. Kirby then selects “data” and presses the bottom button “Clone repository.” The repository now exists on the local computer under the folder “data.” There are now two separate copies of the same repository: one on the GitHub server and one on Kirby’s local computer.
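Cloning can also be done with Git's command-line tools rather than the graphical client. The sketch below builds the `git clone` command for Kirby's repository from Python; the HTTPS address with a `.git` suffix is the usual GitHub clone form, and the helper names `clone_command` and `run_git` are my own illustrative choices rather than part of Git or GitHub.

```python
import subprocess
from typing import List, Optional


def clone_command(user: str, repo: str) -> List[str]:
    """Build the argument list for cloning a public GitHub repository."""
    return ["git", "clone", f"https://github.com/{user}/{repo}.git"]


def run_git(command: List[str], cwd: Optional[str] = None) -> None:
    """Run one Git command and raise CalledProcessError if it fails."""
    subprocess.run(command, cwd=cwd, check=True)


# Example (requires Git and network access); creates the folder "data":
#   run_git(clone_command("KirbyHerby", "data"))
```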

3. The final task is to add files so that others may see them:

Kirby adds his data files to his local data repository. Being that Kirby is a dog, his data are his favorite photos. Kirby copies the photos into the repository folder in the usual way, which for Mac OS X is by using the Finder. Figure 1D shows the Finder window in the foreground and the GitHub client window in the background. As can be seen, Kirby has added three files, and these show up in both applications. Kirby has no more need for the Finder and closes it to get a better view of the local repository in the GitHub client window.

Kirby is now going to save the updated state of the local repository, which is called committing it. Committing is a local action, and can be thought of as taking a snapshot of the repository at this point in time. Kirby turns his attention to the bottom part of the screen, which is shown in Fig. 1E. To commit, Kirby must add a log entry, which in this case is, “Added three great photos.” The log will contain not only this message, but also a description of what files were added, when, and by whom. This log message is enforced: one cannot make a commit without it. Finally, Kirby presses “Commit to master.”

Kirby now has to push his changes from the local repository to the GitHub server so that everyone may see them. He can do so by pressing the “sync” button.

Kirby’s additions are now available to everyone at http://github.com/KirbyHerby/data. Moreover, as Kirby gets new photos, he can add them by copying the files into the data directory on his local computer, committing a new version of the repository with a new message, and syncing up the local with the GitHub server version. After Kirby added his first three photos, he then added a fourth one by following these steps.
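The same add, commit, and sync cycle can be expressed as three Git commands. The sketch below only assembles them; the photo file names are hypothetical, and `git push` is a rough stand-in for the client's "sync" button, which may also fetch changes from the server.

```python
from typing import List, Sequence


def publish_commands(files: Sequence[str], message: str) -> List[List[str]]:
    """Return the Git commands that add, commit, and push the given files."""
    if not message.strip():
        # Mirrors the rule above: a commit cannot be made without a message.
        raise ValueError("a commit needs a log message")
    return [
        ["git", "add", "--", *files],        # stage the new files
        ["git", "commit", "-m", message],    # take the local snapshot
        ["git", "push"],                     # upload it to GitHub
    ]
```

Each inner list can be passed to `subprocess.run(..., cwd="data", check=True)` from inside the cloned folder, for example `publish_commands(["photo1.jpg"], "Added three great photos")`.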

There is a lot more to Git and GitHub than this. Multiple people may work on multiple parts of the same project. Git and GitHub have support for branches, tagging versions, merging files, and resolving conflicts. Help for Git and GitHub is available in the online book plainly titled “Git Book” at http://git-scm.com/book/en/v2. The system does take some time to learn, but there is a big payoff outside of data archiving. It can be used to version much of the academic process, including analysis and manuscript preparation. I find it indispensable in keeping a reliable pipeline from data collection to final manuscript submission.

Element 3: Execution and scheduling

The last step is to execute the adding of the day’s data files to the GitHub repository. The steps are as follows: First, all new files with certain filename extensions are added to the local version of the repository. Then the local version is committed, meaning that the state of all files is logged, and the commit is timestamped and labeled. Finally, the new state is uploaded to the GitHub site. I have written these steps in the following script, which is executed nightly.
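The three steps can be sketched in Python as follows. This is an illustration rather than the script used in the lab: the list of extensions, the commit message, and the function names are assumptions, and the sketch presumes Git is installed and the folder is already a clone linked to GitHub.

```python
import subprocess
from datetime import date
from pathlib import Path
from typing import Iterable, List

# Assumed data-file extensions, chosen only for illustration.
DATA_EXTENSIONS = (".dat", ".csv", ".txt")


def find_data_files(repo: Path, extensions: Iterable[str] = DATA_EXTENSIONS) -> List[str]:
    """List files under repo that have a data extension, relative to repo."""
    wanted = tuple(extensions)
    found = []
    for path in repo.rglob("*"):
        relative = path.relative_to(repo)
        if ".git" in relative.parts:  # skip Git's own bookkeeping folder
            continue
        if path.is_file() and path.suffix in wanted:
            found.append(relative.as_posix())
    return sorted(found)


def nightly_update(repo: Path, today: date) -> bool:
    """Add, commit, and push the day's data; return True if a commit was made."""
    files = find_data_files(repo)
    if not files:
        return False
    subprocess.run(["git", "add", "--", *files], cwd=repo, check=True)
    # `git diff --cached --quiet` exits with status 1 when something is staged.
    staged = subprocess.run(["git", "diff", "--cached", "--quiet"], cwd=repo)
    if staged.returncode == 0:
        return False  # nothing new since the last run
    message = f"Nightly data update {today.isoformat()}"
    subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)
    subprocess.run(["git", "push"], cwd=repo, check=True)
    return True
```

Checking for staged changes before committing keeps the nightly run from failing on days when no data were collected.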

Fig. 1 Screen shots for using Git and GitHub in creating repositories. (A) After creating an account on GitHub, Kirby creates his first repository on the server. (B) Creating the repository requires a few choices, which are shown here. (C) Setting up a version of the repository on the local machine from the server is called “cloning.” (D) Files may be added in the usual way. (E) The local repository is updated by committing changes into a new snapshot. The new snapshot can be synced (pushed) with (to) the repository on the server.

This script is executed nightly by a task scheduler. Setting up a task scheduler is the last step. I use cron tables on my Linux server. Task scheduling is built into Linux, Mac OS X, and Windows (see Footnote 2). Trained IT personnel should be capable of writing Git scripts and automating their execution.
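For readers unfamiliar with cron tables, each entry is a single line with five time fields followed by the command to run. The small helper below formats such a line for a once-a-day job; the interpreter and script paths in the example are hypothetical.

```python
def crontab_entry(hour: int, minute: int, command: str) -> str:
    """Format a crontab line that runs `command` once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Field order: minute, hour, day of month, month, day of week, command.
    return f"{minute} {hour} * * * {command}"


# Hypothetical paths; the resulting line runs the script at 2:30 every night.
EXAMPLE = crontab_entry(2, 30, "/usr/bin/python3 /home/lab/nightly.py")
```

The resulting line can be pasted into the table opened by `crontab -e` on the server.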

Concerns about Git and GitHub

There are important technical concerns:

File sizes

Git itself does not place any restriction on the size of files. Nonetheless, Git does not manage very large files efficiently, and GitHub rejects any file larger than 100 MB. Files of that size are therefore better kept out of the repository.
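A simple safeguard is to look for oversized files before they are added. The sketch below lists files above the 100 MB threshold mentioned above; the function name and the choice to skip the `.git` folder are my own assumptions.

```python
from pathlib import Path
from typing import List

LIMIT_BYTES = 100 * 1024 * 1024  # the 100 MB threshold, in bytes


def oversized_files(folder: Path, limit: int = LIMIT_BYTES) -> List[str]:
    """Return paths (relative to folder) of files larger than limit bytes."""
    too_big = []
    for path in folder.rglob("*"):
        relative = path.relative_to(folder)
        if ".git" in relative.parts:  # ignore Git's internal storage
            continue
        if path.is_file() and path.stat().st_size > limit:
            too_big.append(relative.as_posix())
    return sorted(too_big)
```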

Repository sizes

Git and GitHub do not place any restriction on the size of repositories. Nonetheless, Git slows down considerably when repositories exceed a gigabyte. If our behavioral repository becomes too large, we will start a new one. There is no limit on the number of public repositories a user may have.
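Whether a repository is approaching the gigabyte mark can be checked by summing the sizes of its files, including Git's own `.git` folder. This is a rough sketch; the function names and the exact byte value used for one gigabyte are assumptions.

```python
from pathlib import Path

ONE_GIGABYTE = 1024 ** 3  # bytes


def repository_bytes(folder: Path) -> int:
    """Total size in bytes of every file under folder, including .git."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def needs_new_repository(folder: Path, limit: int = ONE_GIGABYTE) -> bool:
    """Return True when the repository on disk has grown past the limit."""
    return repository_bytes(folder) > limit
```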

Permanence of Git & GitHub

It is reasonable to wonder about the permanence of Git and GitHub. Git is an open-source application, much like Emacs, R, Linux, and the like. It is too widely adopted, too useful, and too beloved to go away. GitHub is a different matter. GitHub is a private company, much like Dropbox or Google, and it could in theory fail. It is more likely that GitHub would include advertisements or charge a small fee rather than go away.

Curation

In most curated archives, reposited materials cannot be deleted or changed. The material is immutable, and indeed, most university and society archives work this way. Files on GitHub may be changed by the uploader. Fortunately, changes are logged and older versions of the same file are saved. GitHub should not be considered a properly curated archive. Instead, it is a useful workaround until properly curated archives are reconfigured to accept incrementally added materials by automatic processes. Because GitHub changes are logged and versioned, GitHub offers the community high confidence in the integrity of data.
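Part of the reason versioned changes cannot go unnoticed is that Git identifies each stored file version by a hash of its content. The sketch below reproduces that identifier, assuming Git's default SHA-1 object format; the function name is my own.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file with this content."""
    # Git hashes a short header ("blob", the byte length, a NUL) plus the content.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()
```

Any alteration to a data file, however small, yields a different identifier, so an edited file appears in the log as a new version rather than silently replacing the old one.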