In Australian universities and around the world, researchers are building algorithms to spot a face in a crowd or track a person's movements.

The task requires training data — and lots of it — but the ethical origins of some of the vast image collections that such systems are trained and tested on have been called into question.

American researcher Adam Harvey is concerned that such datasets are "contributing to the growing crisis of biometric surveillance technologies", and that in some cases, they were collected without explicit consent.

He has investigated various face recognition training datasets as part of his MegaPixels project.

One popular dataset known as Duke MTMC, for example, contains more than 2 million frames of 2,000 students on campus at the US university and was built to analyse the motion of objects in video.

While the research team put up posters notifying passers-by they were being captured, Duke University professor Carlo Tomasi apologised in June, in the college newspaper, for a number of lapses, including "making the data available without protections".

The Duke MTMC dataset was captured by eight cameras observing more than 2,700 identities over 85 minutes. (ABC News: DukeMTMC screenshot)

That last point is key for Mr Harvey. In many cases, such datasets have been freely available for anyone to download online, until recent awareness and protest caused them to be taken down.

While Duke MTMC was not created as a facial recognition dataset, versions of it have been used for research supported by commercial surveillance companies and governments, Mr Harvey found, including the US Department of Homeland Security and National University of Defense Technology in China.

Datasets popular among Australian researchers

Face recognition and computer vision are two areas of machine learning where large datasets are needed to train complex models.

In many cases, researchers can't collect the data at the scale needed.

"What happens if there's a reliance on a very small number of benchmark datasets, like the Duke one?" asked Ben Rubinstein, a machine learning researcher at the University of Melbourne.

Not all experiments using such datasets are aimed at surveillance. Instead, the data can serve as what he called "a sandbox for improving machine learning".

Nevertheless, Dr Rubinstein believes academics should question where their data comes from.

"Don't use a dataset that's been collected in an unethical way, or one that would have been collected illegally if it had been collected in Australia," he said.

As the origins of some image datasets have been exposed, questions are being asked regarding their use by Australian universities, and whether current ethical guidelines provide sufficient scrutiny.

Research papers show Duke MTMC is highly popular among Australian machine learning academics, as is a dataset known as Market-1501.

One or both have been used by academics at Sydney University, the University of Technology Sydney and the University of Queensland, among many other institutions.

Market-1501 was built for "person re-identification", according to the academic paper written by its creators.

Person re-identification aims to match a person — a pedestrian, for example — captured by one camera as they appear on other cameras. It's considered a key element of intelligent surveillance systems.

To create Market-1501, six cameras were placed in front of a supermarket at Tsinghua University in Beijing, capturing passers-by. (ABC News: Market 1501 screenshot)

To create Market-1501, six cameras were placed in front of a supermarket at Tsinghua University in Beijing — one of China's top universities.

All told, 1,501 identities were collected, but the paper makes no mention of how consent was obtained from passers-by, or which ethical guidelines were followed.

The researchers, as well as Tsinghua University, did not respond to multiple requests for comment. The dataset's original GitHub page now returns a 404 error, but other projects have annotated the images for attributes such as gender and age.

Microsoft Research is also credited on the paper through one of its researchers. A spokesperson said, "Our research is guided by our principles [and] fully complies with US and local laws".

University rules in question

MegaPixels' Mr Harvey said he didn't think it was possible for computer vision researchers to scrutinise every dataset.

"I think the responsibility falls on the academic institutions to establish and communicate clear policies about this and that researchers should be barred from publishing new papers with illicit datasets," he said.

Some Australian universities said they would revisit their procedures to ensure greater scrutiny of datasets created by outside researchers.

Greg Welsh, head of external communications at UTS, pointed out that new details regarding the Duke MTMC dataset, specifically the creator's apology, only emerged recently.

"Naturally, we would expect our researchers to discontinue using such databases once these concerns around use were raised," he said.

"While UTS already has rigorous procedures and governance controls in place for the ethical creation of its own human-related datasets, it will establish new systems and processes to cover the use of datasets created by others."

Both Market-1501 and Duke MTMC datasets were used for person re-identification work published in 2018 by Sydney University in conjunction with researchers from the Chinese technology companies SenseNets Technology Limited and SenseTime Group Limited, raising additional questions.

Both companies were linked by various news reports in early 2019 to Chinese Government surveillance efforts in the Xinjiang region, focused on the Muslim Uighur minority. SenseTime and the SenseNets researcher were approached for comment.

A University of Sydney spokesperson said it was aware there have been instances where existing datasets have been used by researchers without seeking required ethics approval.

The university's Research Integrity and Ethics team is considering the issue. "We are also reviewing relevant contracts to ensure they comply with our ethical, regulatory and legal requirements," she said.

The University of Queensland is also looking into the matter.

Datasets scraped from the internet

Ethical concerns around image databases go beyond people who had their picture taken without their knowledge or consent. The use of publicly available images also raises new questions.

Abhinav Dhall was involved in creating an image database linked to the Australian National University and other institutions: HAPpy PEople Images (HAPPEI). It was built in part from images on the photo-sharing site Flickr showing people at social events.

HAPPEI is not a facial recognition database; rather, it was built to train computers in "affective cognition", or determining the overall emotion of a group of people.

While the people whose photos were gathered were not asked for their consent, Dr Dhall, who is now a lecturer in the Human-Centred Artificial Intelligence lab at Monash University, said the team were careful to ensure the photos were available under Creative Commons licenses. He added that the identities of people in Flickr images may be unclear, especially if the files were uploaded by someone else.

"If the data is from the internet, then we look into the licenses of the data. Some researchers may go ahead and contact the creators of the data as well," he said.

A sample of images from the HAPPEI database. (ABC News: HAPPEI screenshot)

A spokesperson for the Australian National University said it "consistently encourages all researchers using large datasets for artificial intelligence and machine learning work to seek ethics approval — regardless of whether it is in the public domain or not".

University of Melbourne researcher Niels Wouters has used facial recognition databases in his own work (Biometric Mirror), to examine public opinions about the ethical use of new technology.

He questions whether machine learning image collections should be built by scraping sites like Flickr or Google search results, even if the photos are under the appropriate license.

This is especially true as there is a widening gap between why photos were shared online, the ends to which they can be put, and the increasingly ambiguous licensing conditions that online platforms impose.

"The simple fact that family photos might end up in that database and be analysed for content — there are still ethical questions, even though you might be legally allowed to do it," he said.

"Once such a dataset is out there and advertised or published online, we quickly lose track of who actually accesses [it] and for what reason and for what purpose."