SAN FRANCISCO — In July 2014, a team of four Swedish and Polish researchers began using an automated program to better understand what people posted on Facebook.
The program, known as a “scraper,” let the researchers log every comment and interaction from 160 public Facebook pages for nearly two years. By May 2016, they had amassed enough information to track how 368 million Facebook members behaved on the social network. It is one of the largest known sets of user data ever assembled from Facebook.
“We’re concerned about how easy it was to collect this,” said Fredrik Erlandsson, one of the researchers and a lecturer at the Blekinge Institute of Technology in Sweden. Last December, he and his colleagues published a research paper in the journal Entropy detailing how their methods of trawling social media sites could be replicated.
For more than a decade, professors, doctoral candidates and researchers from academic institutions around the world have harvested information from Facebook using techniques similar to those of Dr. Erlandsson and his team. They have compiled hundreds of Facebook data sets that captured the behavior of a few thousand to hundreds of millions of individuals, according to interviews with more than a dozen scholars.
Their practices came to light in March when The New York Times and The Observer of London reported that Aleksandr Kogan, a University of Cambridge psychology professor, had obtained the data of up to 87 million Facebook users through a quiz app. Mr. Kogan sold the information to Cambridge Analytica, a political consulting firm with ties to the Trump campaign, so it could build psychographic profiles of American voters. Last week, Cambridge Analytica said it would cease operations after the uproar over its use of personal information.
But while what happened with Mr. Kogan’s Facebook data set is now known, the fate of other information hoards is murkier. In many cases, the data was used for research or scholarly articles. The information was then sometimes left unsecured and stored on open servers that offered access to anyone. Some academics said the data could have been easily copied and sold to marketers or political consulting firms.
The potential result is more leakage of Facebook users’ information through academic circles, said Rasmus Kleis Nielsen, a professor of political communication at the University of Oxford who has studied data collection from Facebook.
“The academic world is highly decentralized, and each individual, each institution, has a different way of securing their data,” Dr. Nielsen said. “Even if almost everyone in the academic community is careful and protects the data, you still can end up in a situation where someone is careless or acts in bad faith and sells access. It’s hard to imagine how Facebook stops that from happening.”
The Times reviewed half a dozen Facebook data sets compiled by academics from 2006 to 2017. One, gathered from 2015 to 2017 by researchers in Denmark and New Zealand, examined 1.3 million people in Denmark (about a quarter of the country’s population) to determine how liking one political page on Facebook could predict how someone would vote in the future. Another set, compiled in 2013 by a group of Norwegian academics, focused on the civic engagement of 21 million Facebook members on four continents.
The Danish research team did not respond to a request for comment. Petter Bae Brandtzaeg, one of the Norwegian researchers, said he understood concerns about data gathering.
“As a researcher you get immediate access to people’s behavior, attitudes, feelings and relationships, which are of course tempting for all,” he wrote in an email. He said many researchers lacked the technical expertise to properly secure data.
The Facebook data was typically amassed through scraper programs that crawled the social network to document what was posted, or through quiz apps that requested access to people’s profiles. The results included users’ locations, interests, political affiliations, Facebook interactions and even music preferences.
In most cases, researchers assigned numbers to people whose Facebook information they had obtained to maintain anonymity. But the more data there is, the easier it is to overlay one information set with another to identify someone. One 2015 paper published in the journal Science looked at credit card spending data and found that data scientists could pinpoint 90 percent of the shoppers by name with just four random pieces of information from sites like Facebook, Instagram and Twitter.
Once people are identified and their interests and interactions known, they can be targeted with advertising and mobilized for political campaigns or other causes.
For years, Facebook had no specific policies about academics’ access to user data, though it had guidelines on working with third parties. While the company has a rule that forbids the use of scrapers, it has not enforced that policy against scholars. And at times, it has assisted researchers with studies.
In 2014, though, Facebook began limiting third-party apps, like quizzes, from obtaining users’ information.
Since Mr. Kogan’s actions were revealed, fueling an outcry over data privacy, Facebook has made further changes. The company has given people more control over their privacy settings. It has said it will audit all apps that collected large amounts of Facebook data, and it temporarily stopped allowing new apps to gather information from its members.
Last month, Facebook also narrowed the number of academics it would work with, saying it would collaborate with those who wanted to research the impact of social media on elections through an “independent election research commission.” Only scholars with election-related projects can apply.
“We are taking a hard look at the information apps can use when you connect them to Facebook, as well as other data practices,” Susan Glick, a Facebook spokeswoman, said in a statement. “These other data practices include academic research.”
Before social media existed, researchers hoping to study human behavior had to painstakingly seek out groups of people to examine. Social media has let them easily find masses of subjects — as well as information like their date of birth, gender and interests — and observe some of their online behavior in real time.
“It was unprecedented,” said Christian Rudder, a founder of OkCupid, a dating and social media site, who published the book “Dataclysm” in 2014 on how much people revealed through their online lives.
One of the academic community’s earliest known Facebook data sets was collected in 2006 by Harvard University professors. It covered 1,700 people who agreed to have their Facebook information anonymously analyzed. The data was later easily traced back by other academics to Harvard freshmen.
In Britain, researchers were doing similar work through different means. In 2007, Michal Kosinski, then deputy director at the Psychometrics Center at the University of Cambridge, worked with a colleague, David Stillwell, to create My Personality, a quiz app that offered to assess people’s personalities in exchange for data about them. It was one of the first times a quiz app had been used for obtaining Facebook members’ information.
My Personality has now collected details on more than six million Facebook users, according to the academics who have gathered the data. Many researchers have since copied the quiz app method, including Mr. Kogan.
In interviews with The Times, Dr. Kosinski and Dr. Stillwell said they took great care to keep the data they procured anonymous. Dr. Stillwell added that the information had been widely shared with other researchers, but any academic who wished to use it was vetted.
Dr. Kosinski acknowledged that data is not a physical item that is easy to control. Once a data set is created, it can be copied and shared until its original source is unknown. He said collection of information from Facebook had become widespread over the years, not only by academics but also by developers, marketers, data analytics companies and others.
“What Kogan did was wrong. But what Kogan did, many others do on a much larger scale,” Dr. Kosinski said. “They just don’t get caught.”
In 2014, after Facebook announced it would restrict third-party apps from gaining access to user data, the reach of quiz apps became limited. But scrapers continued to improve, compiling information from the social network more speedily.
A group of German academics used a scraper to harvest the profiles of 60,000 Facebook users in the New Orleans area starting in late 2008. The researchers, whose goal was to study how people’s friendships change online over time, recorded over 800,000 interactions during a two-year period. They did not respond to a request for comment.
Some scholars said Facebook’s recent privacy changes may have gone too far by also cutting off academics who behaved responsibly.
“Academics would argue that we need access to primary data,” said Dr. Nielsen of Oxford. He said the changes might lead to an asymmetry, with internal Facebook researchers accumulating mounds of data while outside academics could not.
“If that happens, only Facebook will really know very much about how Facebook actually operates and how people act on Facebook,” he said.
Dr. Erlandsson said the paper he and his colleagues published last December initially made little splash. But since the Cambridge Analytica revelations, he has seen renewed interest. He said he had been contacted by companies — he declined to name them — interested in buying the data on 368 million Facebook members.
“I’m not interested in selling,” he said. “And the truth is, anyone could easily do this themselves.”
Follow Sheera Frenkel on Twitter: @sheeraf.