The dataset is available on a website hosted by the Lifelong Kindergarten group (LLK) at the MIT Media Lab (https://llk.media.mit.edu/scratch-data/), with an archival copy placed in the Harvard Dataverse network (Data Citation 1). The website includes a web page for each table, which contains: an expanded version of the summary information below, describing the type of data stored in each table with additional context and explanation; the content of each codebook, including a description of every field in the table and the type of data stored in the field (e.g., integer or character string); and basic summary statistics, tables, and visualizations (e.g., visualizations of the distribution of observations in a table over time). Critically, each page also includes an accounting of every time observations were omitted from the dataset. Finally, each page includes hyperlinks to the downloadable datasets.

The datasets are included in three different formats. All datasets are included in ‘RDATA’ format. RDATA is an external representation of objects in the GNU R programming language. The objects can be loaded into R using the function ‘load()’ or ‘attach()’. RDATA is an implementation of External Data Representation (XDR), which is a standard data serialization format. Numeric data are also provided in comma-separated values (CSV) format. Textual data are provided in JavaScript Object Notation (JSON) format. Because the JSON files are very large, they are not formatted as a single JSON object. Instead, they are formatted as a ‘stream’ of JSON dictionaries in which each observation is a separate JSON dictionary and observations are separated by newlines. Newlines and carriage returns (‘\n’ and ‘\r’) within text fields are escaped. This format is very similar to the format used for other social media data, including data returned by the Twitter streaming API and GitHub event data.
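As an illustration of the stream format described above, the following minimal Python sketch reads one JSON dictionary per line. The helper name `iter_json_stream` and the field names `id` and `text` are assumptions made for this example and are not taken from the dataset's codebooks.

```python
import io
import json


def iter_json_stream(lines):
    """Yield one dict per non-empty line of a newline-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Two made-up records standing in for a real file; field names are hypothetical.
sample = io.StringIO(
    '{"id": 1, "text": "first line\\nsecond line"}\n'
    '{"id": 2, "text": "hello"}\n'
)
records = list(iter_json_stream(sample))
print(len(records))
print(records[0]["text"].splitlines())
```

Because each record is parsed independently, a real file can be processed one line at a time (for example by passing an open file handle instead of the in-memory sample) without loading the whole table into memory.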

The dataset is separated into three sections: core tables, text tables, and project analytics tables. Each of the tables is described briefly below. A summary of the number of observations, fields, and formats is given in Table 1.

Table 1 Overview of each of the tables included in the dataset including the name of the table, the number of observations included in the databases, the number of fields or columns, and the file formats provided.

Core tables

Core tables contain data and metadata documenting the major objects and relationships captured by the Scratch online community website. The data in these tables are numeric and do not include user-inputted text, which is stored in the text tables. These datasets are provided in RDATA format and comma-separated values (CSV) format. This portion of the dataset includes the following 18 tables:

users Each row in this table represents a user account that was publicly visible on the Scratch website at the time that these data were collected. Each account had a public profile web page (see Fig. 2).

Figure 2: Example profile page for the user ‘SampleProjectsTeam’ on the Scratch website. By default, the page shows a user’s most recent 15 projects and includes links to additional pages with older projects. Because the user has not ‘friended’ (i.e., followed) any others, there is a box in the left-most column explaining that they have ‘No friends yet.’

projects Each row in this table represents a project that was publicly visible on the Scratch website at the time that these data were collected. Projects were the interactive creations (e.g., video games, interactive art, simulations, and animations) that users shared onto the website. Each project had its own web page (see Fig. 3).

Figure 3: Example of a page created for a project on the Scratch website. Each project had a similar page associated with it. Visitors to the page can view and interact with the project in the upper right half of the page. The rest of the page is dedicated to presenting project metadata with a space for commenting below.

galleries Each row in this table represents a gallery that was publicly visible on the Scratch website at the time that these data were collected. Galleries were collections of projects displayed on a web page (see Fig. 4).

Figure 4: Example of a gallery page on the Scratch website. Like the user profile page shown in Fig. 2, galleries primarily include lists of projects listed in reverse order of when they were added to the gallery. A description and basic metadata are shown in the left column. Not shown in the figure, there is a space for comments on the gallery itself below the list of projects.

friends Each row in this table represents the event of a user ‘friending’ another user. The ‘friendships’ reflected in this dataset include all friendships that were publicly visible on the Scratch website at the time that these data were collected. Friendships were unidirectional relationships between users displayed in the website as ‘friends.’

downloaders Each row in this table represents the event of a user downloading a project. In order to offer the most consistent measure of downloads, this dataset counts only the first download per user. The identity of the user downloading a project was not publicly displayed, so it is not included in this table.

favoriters Each row in this table represents the event of a user adding a project to their favorites (i.e., bookmarking it). The list of favorites appeared in the user’s profile page.

lovers Each row in this table represents the event of a user clicking a heart-shaped ‘love-it’ button that appears on every project’s page. This action was socially framed as an expression of appreciation for a project. The identity of the user ‘loving’ a project was not publicly displayed and is not included.

viewers Each row in this table represents the event of a user viewing or loading the webpage of a project for the first time. The identity of the user viewing a project was not publicly displayed and is not included.

project_comments Each row in this table represents a comment left on a project. This table includes all comments that were publicly visible on every project page in the Scratch website at the time of composing this dataset. This table contains the metadata for these comments and includes who, when, and on which project each comment was posted. The text or content of each comment is in the pcomments_text text table.

gallery_comments Each row in this table represents a comment left on a gallery. This table includes all comments that were publicly visible on every gallery page in the Scratch website at the time that these data were collected. This table contains the metadata for these comments and includes who, when, and in which gallery each comment was posted. The text or content of the comment is in the gcomments_text text table.

projects_galleries Each row in this table reflects a project’s inclusion in a gallery at the point of data collection. Projects are listed multiple times in this table if they were included in multiple galleries.

tags_projects Each row in this table represents a tag placed on a project. Projects can have zero or more tags. Each tag could also be associated with additional projects or galleries. Tags were publicly displayed on project pages.

tags_galleries Each row in this table represents a tag placed on a gallery. It is similar in structure and purpose to the tags_projects table. Tags were publicly displayed on gallery pages. Like projects, galleries could have zero or more tags and each tag could be associated with additional projects and galleries.

frontpage_projects Each row in this table represents a project that was displayed on the front page for any of several reasons (see Fig. 5). For example, projects could be placed on the front page automatically for being among the most remixed or the most viewed projects on the website. They could also be added manually by an administrator or a curator.

Figure 5: Snapshot of the front page of the Scratch website that was presented to users at http://scratch.mit.edu/. The top row of projects are projects ‘featured’ by administrators. The second row includes projects selected by a ‘curator’ whose username is ‘the_hawk_arisen’. The header helped users navigate to different sections of the website.

featured_projects Each row in this table represents a project that was displayed on the front page of the Scratch website in a section called ‘Featured Projects’ that held three projects at a time (see Fig. 5). Only Scratch website administrators could add projects to this section. The process of selecting featured projects was entirely manual and was usually driven by a decision about the quality and interestingness of the project.

featured_galleries Each row in this table represents a gallery that was displayed on the front page of the Scratch website in a section called ‘Featured Galleries.’ Featured Galleries worked identically to Featured Projects and were displayed in the same way.

studio_galleries Each row in this table represents a gallery that was displayed on the front page in a section called ‘Design Studio.’ Only the website administrators could add galleries to this section. The process of adding new galleries was manual and the featured galleries were usually created by administrators.

curators Each row in this table represents a user who at some point was in charge of selecting projects for the front page of the website, in a section labeled ‘Curated By’ (see Fig. 5). This section displayed three projects selected by the curator. When the curator added a project to their list of favorites, the project would be displayed in the ‘Curated By’ section automatically. There was only one active curator at a time.

Text tables

The ‘text’ tables include textual data created by users. Each of these tables corresponds to a table included among the core tables and includes ID numbers corresponding to observations in that table. We have separated these tables both because these data are very large and because there are challenges specific to encoding and escaping user-inputted data, which make it impossible to produce CSV files that can be reliably loaded across statistical software packages and spreadsheet applications.
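Because the text tables carry the ID numbers of the corresponding core observations, the two can be recombined after loading. The following Python sketch shows one way this might be done; the column and key names (`project_id`, `viewers`, `title`) are hypothetical placeholders, and the actual field names should be taken from the codebooks.

```python
import csv
import io
import json

# Hypothetical stand-ins for a core CSV table and its matching text table.
# The column and key names ("project_id", "viewers", "title") are assumptions.
core_csv = io.StringIO("project_id,viewers\n10,5\n11,7\n")
text_stream = io.StringIO(
    '{"project_id": 10, "title": "Maze game"}\n'
    '{"project_id": 11, "title": "Dance party"}\n'
)

# Index the text records by ID so each core row can look up its text.
text_by_id = {}
for line in text_stream:
    if line.strip():
        record = json.loads(line)
        text_by_id[record["project_id"]] = record

# csv yields strings, so convert the ID before matching.
merged = []
for row in csv.DictReader(core_csv):
    pid = int(row["project_id"])
    text = text_by_id.get(pid, {})
    merged.append(
        {"project_id": pid, "viewers": int(row["viewers"]), "title": text.get("title")}
    )

print(merged)
```

This sketch holds the whole text index in memory, which is convenient for small samples; for the full tables, a database or a chunked merge is likely a better fit.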

Some of these tables include data with unknown or invalid text encoding. The Scratch website was designed to only record UTF-8 encoded Unicode text. However, errors and data corruption meant that some of the data in the MySQL database are invalid UTF-8 text. In general, we have elected to include these poorly encoded data because they are the actual data submitted by users and displayed on the website. These datasets are all provided in RDATA format and as streams of JSON dictionaries. We have included software in Python for loading JSON files that will identify and truncate invalid text records. The text portion of the dataset includes the following 8 tables:

projects_text Each row in this table represents a project that was publicly visible on the Scratch website at the time that these data were collected. This table contains the free-form and the unstructured text fields in the projects table.

galleries_text Each row in this table represents a gallery. This table contains the free-form and the unstructured text fields in the galleries table.

pcomments_text Each row in this table represents a comment left on a project. This table contains the free-form and the unstructured text fields in the pcomments table.

gcomments_text Each row in this table represents a comment left on a gallery. This table contains the free-form and the unstructured text fields in the gcomments table.

tags_text Each row in this table represents a tag used on a project and/or gallery. This table contains the free-form and the unstructured text fields in the tables tags_projects and tags_galleries.

project_block_stacks Each row in this table represents the textual representation of a code block associated with a sprite used in the most recent version of a project shared on the Scratch website at the point of data collection. The code is presented in both a human-readable form, and in the raw format as it is stored in the Squeak programming language (used to create this version of Scratch). The table only includes block stacks that are intended to be executed (see Fig. 6 for an example).

Figure 6: Example of a stack of blocks. In Scratch, the execution of code is triggered by a ‘hat’ block that determines when the stack gets executed. For example, the execution of the blocks in the image shown is triggered by the ‘hat’ block ‘when [green flag] clicked’.

project_block_stacks_disconnected Each row in this table represents Scratch blocks that are never executed (i.e., those blocks that do not have an execution trigger or ‘hat block’ on top).

project_strings Each row in this table represents a text string (i.e., free-form text typed by users as part of their code) used in the most recent version of a project shared on the Scratch website at the point of data collection. These text strings include variable names and messages that are printed on the screen.
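Since some text tables contain invalid UTF-8, a loader for the JSON streams needs to tolerate undecodable bytes. The sketch below is not the Python software distributed with the dataset, and it uses a different strategy: rather than truncating invalid records, it substitutes the Unicode replacement character and flags the affected records. The function name `read_records` and the field names are assumptions for illustration.

```python
import io
import json


def read_records(binary_stream):
    """Yield (record, was_valid_utf8) pairs from a newline-delimited JSON byte stream."""
    for raw in binary_stream:
        raw = raw.strip()
        if not raw:
            continue
        try:
            text = raw.decode("utf-8")
            valid = True
        except UnicodeDecodeError:
            # Substitute U+FFFD for undecodable bytes instead of failing.
            text = raw.decode("utf-8", errors="replace")
            valid = False
        yield json.loads(text), valid


# Made-up input: the second record contains a stray 0xFF byte.
data = io.BytesIO(
    b'{"id": 1, "text": "ok"}\n'
    b'{"id": 2, "text": "bad \xff byte"}\n'
)
results = list(read_records(data))
flagged = [rec["id"] for rec, valid in results if not valid]
print(flagged)
```

Reading the file in binary mode is the key design choice here: it lets each line be decoded separately, so one corrupt record does not prevent the rest of the table from loading.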

Project analytics tables

Although the Scratch website primarily recorded data about interactions around projects, the website software parsed projects to store basic details about the contents of all project files. This included information on project metadata like the number and type of blocks used, the number and type of media elements used, and the history of when projects were saved. The project analytics tables include these metadata. These data are missing for several thousand project files which the website software could not parse. An examination of several of these projects suggests that many of them were corrupt or created with modified versions of the Scratch Authoring Environment software that resulted in non-standard binaries.

These datasets are provided in RDATA format and in comma separated values (CSV) format. The project_media table also contains user-inputted text strings corresponding to filenames for media files and, as a result, is provided in JSON as well as CSV and RDATA. This portion of the dataset includes the following 6 tables:

project_blocks Each row in this table represents a project shared on the Scratch website at the point of data collection. Each column represents a block type, and the value of each cell is the number of times that block type is used in the project.

project_drums Each row in this table represents a drum used in the most recent version of a project shared on the Scratch website at the point of data collection. The list of drums was taken from the General MIDI Level 1 Percussion Key Map.

project_media Each row in this table represents a media item (e.g., an image or a sound) attached to a sprite in the most recent version of a project shared on the Scratch website at the point of data collection. A single project could have many pieces of media.

project_midi_instruments Each row in this table represents a musical instrument used in the most recent version of a project shared on the Scratch website at the point of data collection. Instrument types included recognizable instruments like pianos, guitars, and flutes in the General MIDI Level 1 Instrument Patch Map.

project_save_history Each row in this table represents a time when a user saved a project to their local storage device (e.g., hard drive) in the most recent version of a project shared publicly on the Scratch website at the point of data collection. These data were generated from a log of save ‘events’ that was included in each project. If a project was a remix, its log included the full log of the project (or projects) on which the current project was based.

project_sprites Each row in this table represents a sprite used in the most recent version of a project shared on the Scratch website at the point of data collection. The table includes information on the number of scripts, sounds, and images associated with each sprite.
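The project_blocks table described above is laid out with one row per project and one column per block type, whereas the other analytics tables have one row per item. For some analyses it may be convenient to reshape project_blocks into the same long form. The Python sketch below illustrates this on made-up data; the column names (`project_id`, `forward`, `turn_right`, `say`) are hypothetical and do not come from the codebook.

```python
import csv
import io

# Made-up excerpt in the wide layout described for project_blocks:
# one row per project, one column per block type. Column names are hypothetical.
wide_csv = io.StringIO(
    "project_id,forward,turn_right,say\n"
    "10,3,0,1\n"
    "11,0,2,2\n"
)

long_rows = []
for row in csv.DictReader(wide_csv):
    project_id = int(row.pop("project_id"))
    for block_type, count in row.items():
        count = int(count)
        if count > 0:  # keep only blocks the project actually uses
            long_rows.append((project_id, block_type, count))

print(long_rows)
```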