History and internal organisation

In 2012, U.K. Steiner and D. Misevic started the AgeGuess.org citizen science project. They form the core committee and are responsible for creating and updating protocols for data collection, for the overall infrastructure of the database, and for securing funding. They are supported by the web development and database expert J. Vieillefont, who created and maintains the current version of the webpage and the database. The first fully functional version of AgeGuess.org was coded by Charlotte Le Pesquer. Furthermore, a team of scientific advisors spanning both academic disciplines (e.g. public health) and industry (e.g. pension providers) helps shape the scientific directions of the project and highlights funding opportunities. Depending on the availability of funding, one or more pre- or post-doctoral fellows have worked on data analysis and outreach.

Variables and descriptions

The most up-to-date version of the data is accessible to the public (i.e. users with a user account at ageguess.org) as five csv files; for those who do not have an account and do not want to create one, a version dating from spring 2019 is available for download from the UK Data Service20. The continuously collected data are stored in a MySQL database. The five csv files contain information on guesses, photos, gamers, quality, and reports, respectively; they are named guess, photos, gamers, quality, and report, with the prefix “ag_” (for AgeGuess) and the extension .csv. In the following, we describe the variables in each of the csv files. All missing data are encoded as NA.
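As a minimal sketch of how the NA encoding could be handled when reading one of these files, the snippet below maps the string NA to a missing value. It assumes comma-separated files with a header row, which the description above does not state explicitly; the helper name read_ageguess_csv and the sample rows are hypothetical and only stand in for ag_photos.csv.

```python
import csv
import io


def read_ageguess_csv(handle):
    """Read one AgeGuess csv file into a list of dicts, mapping 'NA' to None.

    Assumes a comma-separated file with a header row (an assumption,
    not confirmed by the data description).
    """
    reader = csv.DictReader(handle)
    return [
        {key: (None if value == "NA" else value) for key, value in row.items()}
        for row in reader
    ]


# Invented rows standing in for ag_photos.csv (only a subset of the columns).
sample = io.StringIO(
    "uid,photo_id,age,birth_year,death_age\n"
    "17,301,34,1980,NA\n"
    "18,302,NA,1955,NA\n"
)
photos = read_ageguess_csv(sample)
```

All values stay as strings in this sketch; converting numeric columns such as age or birth_year is left to the analysis at hand.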

The ag_guess.csv file stores the information regarding the age guesses using the following variables: uid, guess_id, photo_id, ageG, outG, and access. The uid, guess_id, and photo_id variables contain the individual identifiers of the user who made the guess, the guess itself, and the photograph guessed on, respectively. The ageG and outG variables describe the guessed age and the deviation of the guess from the real age in years, respectively. The access variable stores the timestamp of the guess as date and time in UTC + 1:00 in the format ‘YYYY-MM-DD HH:MM:SS’. Repeated guessing by the same person on the same photograph was possible in early implementations of AgeGuess but is now prevented by the algorithm that controls which photos are displayed to users. Data on repeated guesses are available from previous versions of the database upon request.
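The timestamp format can be parsed as in the following sketch. It treats UTC + 1:00 as a fixed offset without daylight saving adjustments, which is an assumption on our part; the function name parse_timestamp and the example value are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1:00 offset as stated in the text; ignoring daylight saving
# time is an assumption of this sketch.
AGEGUESS_TZ = timezone(timedelta(hours=1))


def parse_timestamp(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' and attach the UTC+1:00 offset."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=AGEGUESS_TZ)


stamp = parse_timestamp("2019-03-15 14:30:00")  # invented example value
stamp_utc = stamp.astimezone(timezone.utc)      # same instant expressed in UTC
```

Attaching the offset explicitly makes the values comparable with timestamps from other sources that are stored in UTC.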

The ag_photos.csv file stores the information regarding the photographs using the following variables: uid, photo_id, age, relation, gender, ethnicity, birth_country, birth_year, death_age, and created. The uid and photo_id variables contain the individual identifiers of the user who uploaded the photograph and of the photograph itself. The relation variable indicates whether the photograph is of the user or of another person to whom the user has a relation (categories: user, unrelated or friend, mother/father, son/daughter, sibling, half sibling, maternal/paternal grandparent, maternal/paternal aunt/uncle, maternal/paternal cousin, grandchild). The gender, ethnicity, birth_country, birth_year, and death_age variables contain the respective basic demographic information for the person in the photograph. The created variable stores the timestamp of when the photograph was added as date and time in UTC + 1:00 in the format ‘YYYY-MM-DD HH:MM:SS’.

The ag_gamers.csv file stores the information regarding the users (also called gamers) with the following variables: uid, g, ng, points, gender, ethnicity, birth_country, birth_year, access, and created. These variables store the individual identifier of the user (uid), the number of correct guesses the user made (g), the number of other guesses (ng), and the points gained in the online game (points). Furthermore, the file contains the users’ basic demographic information regarding gender, ethnicity, birth country, and birth year, stored in variables of these names. Finally, the access and created variables store the timestamps, as date and time in UTC + 1:00, of when the user last logged in and of when the user created an account with AgeGuess, respectively.

The ag_quality.csv file contains information on quality reports that users have made on photographs. The variables are uid, quality_id, photo_id, quality, and created. The uid, quality_id, and photo_id variables contain the individual identifiers of the user who made the assessment, of the quality assessment itself, and of the photo on which the assessment was made, respectively. Quality is encoded in the quality variable as 1 = high, 2 = medium, 3 = low. The timestamp of the assessment, in the format described above, is stored in the created variable.

Finally, the ag_report.csv file contains information on any other reports made on photographs. The variables are uid, photo_id, report_id, comment, and created. The uid, photo_id, and report_id variables store the individual identifiers of the user who made the report, the photograph on which the report was made, and the report itself, respectively. Report categories are rotation needed, cropping needed, no person or more than one person, copyright infringement, offensive content, and combinations thereof. The AgeGuess team regularly edits photographs after receiving a report, for example when cropping is needed, and retains the edited photographs if suitable. Photographs and data associated with the other report categories are deleted. In addition, after internal checks the system adds reports related to missing photographs and to inaccurate data on birth year and age. The timestamp of the report, in the format described above, is stored in the created variable. The ag_quality.csv and ag_report.csv files are mostly for internal system use; they are not included in the distribution at www.ageguess.org/download but are available on request.

Data summary

After running the data cleaning protocol (see below), AgeGuess has, as of spring 2019, 4434 users from ~120 countries of origin, of whom 2339 are female and 1757 male; the gender of the remaining users is unknown (Fig. 1). Most users identified as Caucasian/White (3024), followed by Asian (299), Hispanic (265), Black (120), and Other (208); 518 users did not provide an answer. The users have uploaded 4710 photos of 2855 females and 1855 males (Fig. 2). The age of the persons displayed in the photographs ranges from 5 to 100 years. The earliest and latest corresponding birth years are 1877 and 2012, respectively. The persons in the photos were identified as Caucasian/White (3746), followed by Asian (343), Hispanic (255), Black (103), and Other (246). The data contain repeated measures on 519 individuals, of whom more than 242 have uploaded three or more pictures of themselves.

Overall, users have guessed ages 220,231 times. Each photograph has at least 10 guesses, with a maximum of 385 guesses and a median of 42 guesses. The variation in the number of guesses stems from earlier versions of the photograph-selecting algorithm, which did not account for the number of previous guesses on a photograph. The deviation of the mean age guess from real age for each photograph is normally distributed with a mean close to 0 (Fig. 3a). The relationship between mean perceived age and real age for each photograph is shown in Fig. 3b.
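The per-photograph summaries reported above (number of guesses and mean deviation from real age) could be derived from ag_guess.csv roughly as follows. This is a sketch under assumptions: the rows are invented, the function name per_photo_summary is hypothetical, and the sign convention of outG (guessed minus real age, or the reverse) is not specified in the variable description, so the code simply averages the stored values.

```python
from collections import defaultdict
from statistics import mean, median


def per_photo_summary(guesses):
    """Return {photo_id: {'n': guess count, 'mean_dev': mean of outG}}."""
    by_photo = defaultdict(list)
    for guess in guesses:
        by_photo[guess["photo_id"]].append(guess["outG"])
    return {
        photo_id: {"n": len(values), "mean_dev": mean(values)}
        for photo_id, values in by_photo.items()
    }


# Invented guesses on two photographs.
sample_guesses = [
    {"photo_id": 1, "outG": -2},
    {"photo_id": 1, "outG": 0},
    {"photo_id": 1, "outG": 2},
    {"photo_id": 2, "outG": 4},
    {"photo_id": 2, "outG": 6},
]
summary = per_photo_summary(sample_guesses)
median_count = median(entry["n"] for entry in summary.values())
```

The mean_dev values correspond to the quantity shown in Fig. 3a, and the distribution of n to the guess counts quoted in the text.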

Fig. 3 (a) Frequency distribution of the deviation of the mean guessed age from real age (n = 4710). The green line marks no deviation between mean perceived and real age. (b) Mean perceived age plotted against real age. Each data point represents the mean perceived age of one of the 4710 images. Data points belonging to different birth decades are coloured differently.

Technical validation

The data originate from citizen scientists. Such data are often met with skepticism by the scientific community, even though citizen scientists frequently perform as well as trained scientists in collecting data21. The collected data can contain both false and missing entries, made by users either by mistake or intentionally. Therefore, we perform some basic data cleaning steps before publication of the data and provide basic tests for data quality and accuracy. From the Guess data we delete all guesses that are more than two standard deviations away from the mean age guess on a photograph. We further remove all guesses on photographs that have fewer than 10 guesses, since it is known that substantial uncertainty in rating ages exists19,22,23. This uncertainty mainly arises within guessers among repeated guesses and to a much lower degree among guessers24. Simply put, a perceived age estimate based on only a few guesses is less accurate than one based on 10 or more guesses, and we therefore exclude photos with fewer than 10 guesses to improve data quality. Using the information in the Report data (see above), we delete guesses on photos with inaccurate age or birth year. Since not all inaccurate birth years are flagged by internal system checks, we replace all unrealistic birth years (<1800 or >2019) with NA in both the Photos and Gamers data. The whole, uncleaned data set can be obtained upon request.
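The cleaning rules described above can be sketched as follows. This is an illustration, not the project's actual cleaning script: the function names are hypothetical, the use of the sample standard deviation and the order of the steps (outlier removal first, then the minimum-count filter) are assumptions, and the report-based deletion is omitted because it depends on the report categories.

```python
from collections import defaultdict
from statistics import mean, stdev


def drop_outlier_guesses(guesses, n_sd=2.0):
    """Drop guesses more than n_sd sample SDs from the photo's mean guess."""
    by_photo = defaultdict(list)
    for guess in guesses:
        by_photo[guess["photo_id"]].append(guess)
    kept = []
    for rows in by_photo.values():
        ages = [row["ageG"] for row in rows]
        if len(ages) < 2:  # an SD is undefined for a single guess
            kept.extend(rows)
            continue
        centre, spread = mean(ages), stdev(ages)
        kept.extend(r for r in rows if abs(r["ageG"] - centre) <= n_sd * spread)
    return kept


def drop_sparse_photos(guesses, min_guesses=10):
    """Drop all guesses on photographs with fewer than min_guesses guesses."""
    counts = defaultdict(int)
    for guess in guesses:
        counts[guess["photo_id"]] += 1
    return [g for g in guesses if counts[g["photo_id"]] >= min_guesses]


def clean_birth_year(year, lowest=1800, highest=2019):
    """Return None (NA) for missing or unrealistic birth years."""
    if year is None or year < lowest or year > highest:
        return None
    return year


# Invented data: photo 1 has ten guesses of 30 and one of 90; photo 2 has three.
sample_guesses = [{"photo_id": 1, "ageG": 30} for _ in range(10)]
sample_guesses.append({"photo_id": 1, "ageG": 90})
sample_guesses += [{"photo_id": 2, "ageG": age} for age in (40, 41, 42)]

cleaned = drop_sparse_photos(drop_outlier_guesses(sample_guesses))
```

In this invented example the guess of 90 is removed as an outlier and photo 2 is dropped for having too few guesses. Applying the two filters in the opposite order could give different results for photographs close to the 10-guess threshold.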

Furthermore, the data quality of the AgeGuess database is subject to a trade-off common to many citizen science projects, where large quantities of data are obtained at the expense of sample representativeness and data accuracy. Neither the AgeGuess users nor the persons displayed in the photographs are representative samples of the population with respect to age, geographic location, or ethnicity, though information on both the persons displayed in the images and the users is available to account for biases. Such biases are also frequently found in classic scientific studies8,25,26. Furthermore, the uploaded photographs are not standardised with respect to posture, lighting, facial expression, clothing, background, distance to camera, hairstyle or dye, make-up, or the use of accessories such as hats, jewellery, or glasses. Some of these factors may be used to deliberately conceal age: older adults may in particular use make-up and hair dye to appear younger, while younger adults may manipulate their looks to appear older. A bias for older individuals being perceived younger and younger individuals being perceived older has been shown to be independent of such factors24,27. We discourage editing photographs to alter the age appearance and offer a report option to flag such photographs; however, some manipulated photographs may have remained unnoticed.

We have no direct means to control the accuracy of the chronological age users enter when uploading photographs. However, we can indirectly detect large mistakes or deliberately provided false information by identifying and excluding outliers. In certain citizen science projects, concerns arise about the ability of citizen scientists to perform the demanded tasks as accurately as classically trained scientists. For the database presented here this should be of little concern. Previous highly controlled scientific studies on perceived age rating showed that geriatric nurses, who were considered experts in rating ages of older women, did not perform differently in rating ages compared to two other groups: young male students, who were expected to be the worst raters, and same-aged peers, i.e. older women8. Confidence in the collected data also comes from small side-projects that allow us to assess the quality of the data. For instance, 10 students at the University of Southern Denmark aimed at outcompeting the users of AgeGuess, first by spending several weeks studying scientific literature on factors that influence perceived age to train themselves to be good at rating ages. When they rated ages on AgeGuess.org, they were disappointed to find that they had not performed any differently than the other users of AgeGuess.org (data not published). Also, when comparing the spread (standard deviation, SD) of the difference between chronological age and perceived age between highly controlled studies19,22,23,24 and the AgeGuess data, the spread in the controlled studies was comparable to that in the data generated by the citizen scientists (6–8 in classical studies, 6.9 for the AgeGuess data).

This similarity in age estimation might not be expected, since the images in the controlled studies were obtained under strictly standardized settings, such as controlled posture, lighting, facial expression, clothing, background, distance to camera, and make-up, and without accessories such as hats, jewellery, or glasses8. Such standard settings should lower variance. However, some of the classical studies included only specific age groups, e.g. only persons above 70 years of age8, and variance is increased when judging the age of older individuals23. This is also found in our data, where the variance for at least the oldest persons is slightly higher than for very young persons (Fig. 3b). Conversely, a limited age range can also reduce the variance, since the raters realize that the persons guessed on are within such an age range22.
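The spread statistic compared above is a standard deviation of differences between perceived and chronological age. A minimal sketch of that computation is given below; whether the published value of 6.9 was computed over individual guesses or over per-photograph means is not stated, so the choice of input is an assumption, and the numbers are invented.

```python
from statistics import stdev

# Invented differences between perceived and chronological age, in years
# (for example the outG values, or per-photograph mean deviations).
differences = [-8, -3, 0, 3, 8]

# Sample standard deviation (n - 1 in the denominator).
sd_difference = stdev(differences)
```

The same function can be applied to subsets of the data, for example by age group, to compare the spread for older and younger persons as discussed for Fig. 3b.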

Overall, the citizen science data agree with basic findings of controlled studies and show similar variances. Still, anyone using the data should be aware of the uncertainties that come with a citizen science approach to collecting data and that such data are prone to additional error and noise, even though we could not yet detect such increased error. We therefore consider the data collected by the citizen scientists to be largely accurate, and we consider that the quantity of the data (guesses made) vastly outweighs the potential data quality issues, such as missing data and data entered erroneously, whether by mistake or intentionally.