June 20, 2025
Access to AI Training Data Sparks Legal Questions

(MeshCube/Shutterstock)

As “vibe coding” goes mainstream, AI companies are rushing to build the biggest and most authoritative tech knowledge bases to train the next generation of AI copilots. But how will AI companies obtain these curated troves of valuable tech data? Recent moves by Stack Overflow and Reddit show how it might play out.

Vibe coding–telling a coding copilot what you want, then sitting back while the AI generates the code for you–is all the rage today. Searches for “vibe coding” are up 6,700% over the past 12 months, and even renowned technologists like Databricks CEO Ali Ghodsi rely on it.

“You’d even hear Ali himself tell you these days, ‘Look, I just mostly ask [Databricks] Assistant for what I need,’” said Databricks VP of Marketing Joel Minnick. “If the first attempt at the code doesn’t work, I just kind of give it the error code and tell it ‘try again,’ and it tries again, and now it’s right.”

The combination of huge swaths of sample code and the incredible learning power of large language models (LLMs) gives coding copilots their capabilities. What’s more, when questions arise over some technical topic, the Web’s vast array of discussion boards provides ample fodder for copilots to get even the small details correct.

The question then becomes: How do these coding copilots get access to the discussion boards to learn about the millions of tech tricks and edge cases? In some cases, the AI companies just take the data without asking.

(Mamun Sheikh/Shutterstock)

That is what Reddit, one of the most popular news aggregation and social media websites in the world, with 102 million daily active users, is accusing Anthropic of doing. On June 4, Reddit filed a lawsuit against Anthropic accusing the AI company of scraping its website for content to train its AI models, in violation of its data policy.

As Ali Azhar writes in a story on BigDATAwire’s sister publication, AIWire:

“Reddit claims that Anthropic accessed its platform more than 100,000 times since July 2024 to scrape user-generated content for AI training, in violation of Reddit’s terms of service. The platform also claims that Anthropic reportedly assured it had blocked its bots from accessing Reddit, but continued to do so anyway.”

Anthropic, which creates Claude–considered to be one of the top AI models for coding copilots–didn’t pay for the data it took from the Reddit website, Reddit claims. In comparison, Google and OpenAI have signed contracts with Reddit to gain access to user-generated data, with some restrictions to secure user privacy.

Another popular source for technical content is Stack Overflow, which is laser-focused on technical topics. Stack Overflow has about 29 million registered users and more than 100 million monthly users (most of whom are not registered). Its knowledge base, dubbed Stack Exchange, includes more than 24 million questions and about 36 million answers. If you have a specific question about how Kubernetes works–and really, who doesn’t these days?–then Stack Overflow is a great place to get an answer.

One day before the Reddit lawsuit was filed, Stack Overflow signed a deal with Snowflake to make its user-generated data available to users via the Snowflake Marketplace. Prashanth Chandrasekar, Stack Overflow’s CEO, said the move makes it easier for Snowflake users to get access to high-quality question-and-answer pairs curated by humans.

Prashanth Chandrasekar is the CEO of Stack Overflow

“You’re getting immediate access to all the data,” Chandrasekar told BigDATAwire at the Snowflake Summit. “It’s pre-indexed and the latency of that is super low. And most important, it’s licensed.”

The Snowflake agreement primarily is to use Stack Overflow’s knowledge base for retrieval augmented generation (RAG), as opposed to training AI models, Chandrasekar said, adding that Stack Overflow has different mechanisms for pure AI training. But the end goal is the same: helping customers build AI systems based on trusted, curated data.
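The RAG approach Chandrasekar describes can be pictured with a minimal sketch: retrieve the most relevant Q&A pairs from a curated knowledge base, then prepend them to the model’s prompt so the answer is grounded in licensed, human-curated data. Everything below is hypothetical illustration, not the actual Snowflake or Stack Overflow integration; a real pipeline would use embeddings and a vector store rather than the crude keyword-overlap scoring shown here.

```python
# Hypothetical sketch of retrieval-augmented generation (RAG) over a
# curated Q&A knowledge base. Names and data are illustrative only.

def score(query: str, text: str) -> int:
    """Crude relevance measure: count of shared lowercase words.
    A production system would use embedding similarity instead."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, kb: list[dict], k: int = 2) -> list[dict]:
    """Return the k entries whose questions best match the query."""
    return sorted(kb, key=lambda qa: score(query, qa["question"]), reverse=True)[:k]

def build_prompt(query: str, kb: list[dict]) -> str:
    """Prepend retrieved context so the model grounds its answer in it,
    rather than relying solely on what it memorized during training."""
    context = "\n".join(
        f"Q: {qa['question']}\nA: {qa['answer']}" for qa in retrieve(query, kb)
    )
    return f"Use the context below to answer.\n\n{context}\n\nUser question: {query}"

# Toy stand-in for a licensed Q&A knowledge base.
kb = [
    {"question": "How do I restart a Kubernetes pod?",
     "answer": "Delete the pod; its ReplicaSet recreates it."},
    {"question": "What is a Python virtual environment?",
     "answer": "An isolated interpreter with its own installed packages."},
]

prompt = build_prompt("How can I restart a pod in Kubernetes?", kb)
```

The design point is the one Chandrasekar makes: the model consults the curated data at query time, which is a different mechanism (and a different licensing conversation) than baking that data into the model’s weights during training.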

“I think removing the friction to the user to realize the dream of AI systems in a company–I think that is the name of the game,” Chandrasekar said. “Now users, while they’re using Snowflake, can get access to our data versus having to wait for that company to strike something with us.”

Reddit and Stack Overflow are opposites in many ways, with the former being a bit of a wild, anything-goes place, and the latter known more for restraint and ruthless adherence to facts. But their recent moves show they have one thing in common: unauthorized access to their content will not be tolerated.

The nature of the World Wide Web has changed since its egalitarian beginnings late in the 20th century. Over the past 15 years, giant tech firms have hoovered up vast swaths of the Internet, first for targeted analytics and more recently to train AI models. Enclaves that have yet to be fully mined, like Reddit and Stack Overflow, are now working to ensure that any monetization is done according to their terms and conditions, which puts more control back in the hands of the platforms and their users.

Stack Overflow has taken steps to not only prevent its data from being scraped for AI purposes, but also to prevent AI from infiltrating the knowledge base. For instance, it utilizes Cloudflare to authenticate that users are human. It also has a strict policy against allowing AI-generated answers on the site. Human curation is essential to Stack Overflow’s process.

(Dennis Diatel/Shutterstock)

Signing deals with companies like Snowflake could be a boon for Stack Overflow, which has seen its website traffic decline and the number of questions asked on Stack Exchange decrease in recent years. About three-quarters of Stack Overflow’s revenue is from hosting private knowledge bases for enterprises, while only one-quarter is from advertising on the public Stack Exchange site, Chandrasekar said.

“I think the nature of the Internet has changed in the past couple of years, the social contract of people building websites, monetizing off ads based on traffic on the website,” he said. “We want to have relationships with everyone and be exposed in a way that we will go wherever the developer is, wherever the user is, wherever they want to be.”

The message to AI model builders and users is clear: If high quality, human-sourced data is important to your endeavor, then you should be willing to pay the provider a fair sum, while simultaneously ensuring user privacy is maintained at all times. After all, it’s only money.

Related Items:

Rethinking ‘Open’ for AI

Self-Regulation Is the Standard in AI, for Now

Regs Needed for High-Risk AI, ACM Says–‘It’s the Wild West’
