A mosaic of about 50,000 images from the MegaFace dataset, which includes over 3.5 million. Photo / Adam Harvey via The New York Times
An online tool targets only a small slice of what's out there, but may open some eyes to how widely artificial intelligence research fed on personal images.
When tech companies created the facial recognition systems that are rapidly remaking government surveillance and chipping away at personal privacy, they may have received help from an unexpected source: your face.
Companies, universities and government labs have used millions of images collected from a hodgepodge of online sources to develop the technology. Now researchers have built an online tool, Exposing.AI, that lets people search many of these image collections for their old photos.
The tool, which matches images from the Flickr online photo-sharing service, offers a window onto the vast amounts of data needed to build a wide variety of AI technologies, from facial recognition to online "chatbots."
"People need to realize that some of their most intimate moments have been weaponized," said one of its creators, Liz O'Sullivan, technology director at the Surveillance Technology Oversight Project, a privacy and civil rights group.
She helped create Exposing.AI with Adam Harvey, a researcher and artist in Berlin.
Systems using artificial intelligence do not magically become smart. They learn by pinpointing patterns in data generated by humans — photos, voice recordings, books, Wikipedia articles and all sorts of other material. The technology is getting better all the time, but it can learn human biases against women and minorities.
People may not know they are contributing to AI education. For some, this is a curiosity. For others, it is enormously creepy. And it can be against the law. A 2008 law in Illinois, the Biometric Information Privacy Act, imposes financial penalties if the face scans of residents are used without their consent.
In 2006, Brett Gaylor, a documentary filmmaker from Victoria, British Columbia, uploaded his honeymoon photos to Flickr, then a popular service. Nearly 15 years later, using an early version of Exposing.AI provided by Harvey, he discovered that hundreds of those photos had made their way into multiple data sets that may have been used to train facial recognition systems around the world.
Flickr, which was bought and sold by many companies over the years and is now owned by the photo-sharing service SmugMug, allowed users to share their photos under what is called a Creative Commons license. That license, common on internet sites, meant others could use the photos with certain restrictions, although these restrictions may have been ignored. In 2014, Yahoo, which owned Flickr at the time, used many of these photos in a data set meant to help with work on computer vision.
Gaylor, 43, wondered how his photos could have bounced from place to place. Then he was told that the photos may have contributed to surveillance systems in the United States and other countries and that one of these systems was used to track China's Uighur population.
"My curiosity turned to horror," he said.
How honeymoon photos helped build surveillance systems in China is, in some ways, a story of unintended — or unanticipated — consequences.
Years ago, AI researchers at leading universities and tech companies began gathering digital photos from a wide variety of sources, including photo-sharing services, social networks, dating sites like OkCupid and even cameras installed on college quads. They shared those photos with other organizations.
That was just the norm for researchers. They all needed data to feed into their new AI systems, so they shared what they had. It was usually legal.
One example was MegaFace, a data set created by professors at the University of Washington in 2015. They built it without the knowledge or consent of the people whose images they folded into its enormous pool of photos. The professors posted it to the internet so others could download it.
MegaFace has been downloaded more than 6,000 times by companies and government agencies around the world, according to a New York Times public records request. They included US defense contractor Northrop Grumman; In-Q-Tel, the investment arm of the CIA; ByteDance, the parent company of Chinese social media app TikTok; and Chinese surveillance company Megvii.
Researchers built MegaFace for use in an academic competition meant to spur the development of facial recognition systems. It was not intended for commercial use. But only a small percentage of those who downloaded MegaFace publicly participated in the competition.
"We are not in a position to discuss third-party projects," said Victor Balta, a University of Washington spokesperson. "MegaFace has been decommissioned, and MegaFace data are no longer being distributed."
Some who downloaded the data have deployed facial recognition systems. Megvii was blacklisted last year by the Commerce Department after the Chinese government used its technology to monitor the country's Uighur population.
The University of Washington took MegaFace offline in May, and other organizations have removed other data sets. But copies of these files could be anywhere, and they are likely to be feeding new research.
O'Sullivan and Harvey spent years trying to build a tool that could expose how all that data was being used. It was more difficult than they had anticipated.
They wanted to accept someone's photo and, using facial recognition, instantly tell that person how many times his or her face was included in one of these data sets. But they worried that such a tool could be used in bad ways — by stalkers, companies and nation-states.
"The potential for harm seemed too great," said O'Sullivan, who is also vice president of responsible AI with Arthur, a New York company that helps businesses manage the behavior of AI technologies.
In the end, they were forced to limit how people could search the tool and what results it delivered. The tool, as it works today, is not as effective as they would like. But the researchers worried that they could not expose the breadth of the problem without making it worse.
Exposing.AI itself does not use facial recognition. It pinpoints photos only if you already have a way of pointing to them online, with, say, an internet address. People can search only for photos that were posted to Flickr, and they need a Flickr username, tag or internet address that can identify those photos. (This provides the proper security and privacy protections, the researchers said.)
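To make that design concrete, here is a minimal sketch of an identifier-only lookup of the kind described above. It is not Exposing.AI's actual code: the record layout, the function names, the sample records and the assumed URL shape (flickr.com/photos/<user>/<photo id>/) are all illustrative assumptions. The point is simply that a search keyed on a username, tag or address never has to examine a face.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class PhotoRecord:
    """Hypothetical index entry: one Flickr photo and the data sets that contain it."""
    photo_id: str
    username: str
    tags: frozenset
    datasets: frozenset


def photo_id_from_url(url: str) -> Optional[str]:
    """Pull the numeric photo ID out of a URL shaped like
    https://www.flickr.com/photos/<user>/<photo_id>/ (assumed shape)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 3 and parts[0] == "photos" and parts[2].isdigit():
        return parts[2]
    return None


def search(records: List[PhotoRecord], *, username: Optional[str] = None,
           tag: Optional[str] = None, url: Optional[str] = None) -> List[PhotoRecord]:
    """Return records matching a username, a tag or a photo URL.
    Only text identifiers are compared; no image data is inspected."""
    photo_id = photo_id_from_url(url) if url else None
    hits = []
    for record in records:
        if username is not None and record.username == username:
            hits.append(record)
        elif tag is not None and tag in record.tags:
            hits.append(record)
        elif photo_id is not None and record.photo_id == photo_id:
            hits.append(record)
    return hits


# Made-up sample index, for illustration only.
RECORDS = [
    PhotoRecord("123", "alice", frozenset({"honeymoon"}), frozenset({"MegaFace"})),
    PhotoRecord("456", "bob", frozenset({"beach"}), frozenset()),
]

for hit in search(RECORDS, url="https://www.flickr.com/photos/alice/123/"):
    print(hit.photo_id, sorted(hit.datasets))
```

Because the lookup needs an identifier the searcher already holds, it cannot be used to find a stranger from a snapshot, which is the trade-off the researchers describe.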
Although this limits the usefulness of the tool, it is still an eye-opener. Flickr images make up a significant swath of the facial recognition data sets that have been passed around the internet, including MegaFace.
It is not hard to find photos that people have some personal connection to. Simply by searching through old emails for Flickr links, the Times turned up photos that, according to Exposing.AI, were used in MegaFace and other facial recognition data sets.
Several belonged to Parisa Tabriz, a well-known security researcher at Google. She did not respond to a request for comment.
Gaylor is particularly disturbed by what he has discovered through the tool because he once believed that the free flow of information on the internet was mostly a positive thing. He used Flickr because it gave others the right to use his photos through the Creative Commons license.
"I am now living the consequences," he said.
His hope — and the hope of O'Sullivan and Harvey — is that companies and government will develop new norms, policies and laws that prevent mass collection of personal data. He is making a documentary about the long, winding and occasionally disturbing path of his honeymoon photos to shine a light on the problem.
Harvey is adamant that something has to change.
"We need to dismantle these as soon as possible — before they do more harm," he said.