July 15, 2025
Scaling the Knowledge Graph Behind Wikipedia

(Image courtesy Wikipedia)

Wikipedia is the fifth most popular website on the Internet, and keeping it running smoothly is no small feat. The free encyclopedia hosts more than 65 million articles in 340 different languages and serves 1.5 billion unique device visits per month. Behind the site’s front-end Web servers is a host of databases serving up data, including the massive knowledge graph maintained by Wikipedia’s sister project, Wikidata.

As an open encyclopedia, Wikipedia relies on teams of editors to keep it accurate and up to date. The encyclopedia, which was founded in 2001 by Jimmy Wales and Larry Sanger, has established processes to ensure that changes are checked and that the data is accurate. (Even with those processes, some people complain about the accuracy of Wikipedia’s information.)

If Wikipedia editors strive to maintain the accuracy of facts in Wikipedia articles, then the goal of the Wikidata knowledge graph is to document where those facts came from and to make those facts easy to share and consume outside of Wikipedia. That sharing includes allowing developers to access Wikipedia facts as machine-readable data that can be used in outside applications, says Lydia Pintscher, the portfolio lead for Wikidata.

“It’s this basic stock of information that a lot of developers need for their applications,” Pintscher says. “We want to make that available to Wikipedia, but also really to anyone else out there. There are a large number of applications that people build with that data that are not Wikipedia.”

For instance, data from Wikidata is piped directly into the digital travel assistant KDE Itinerary, which is developed by the free software community KDE (where Pintscher sits on the board). If a user is travelling to a certain country, KDE Itinerary can inform them what side of the road they drive on, or what type of electrical adapter they will need.

(Image courtesy Wikidata)

“You can also say ‘Give me an image of the current mayor of Berlin’ and you will be able to get that, or ‘Give me the Facebook profile of this famous person,’” Pintscher tells BigDATAwire. “You will be able to get that with a simple API call.”
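
For a sense of what such a call can look like, here is a minimal sketch in Python against Wikidata’s public entity-data endpoint. Q64 (Berlin) and P6 (head of government) are real Wikidata identifiers, but the script is illustrative rather than an official Wikidata client, and it naively reads the first P6 statement instead of filtering for the current officeholder.

    # Minimal sketch: fetch the Wikidata item for Berlin (Q64) as JSON and read
    # its "head of government" (P6) claim. A real client would filter out
    # historical statements and follow the returned Q-id to fetch the person's
    # label and image (P18).
    import requests

    url = "https://www.wikidata.org/wiki/Special:EntityData/Q64.json"
    entity = requests.get(url, headers={"User-Agent": "example-script/0.1"}).json()

    claims = entity["entities"]["Q64"]["claims"]
    mayor_id = claims["P6"][0]["mainsnak"]["datavalue"]["value"]["id"]
    print("Head of government of Berlin:", mayor_id)  # prints a Q-identifier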

It is certainly a noble goal to gather the facts of the world into one place and then make them available via API. However, actually building such a system requires more than good intentions. It also requires infrastructure and software that can scale to meet the sizable digital demand.

When Wikidata started in 2012, the organization selected a semantic graph database called Blazegraph to house the Wikipedia knowledge base. Blazegraph stores data as sets of Resource Description Framework (RDF) statements called triples, each of which expresses a subject-predicate-object relationship. Blazegraph allows users to query these RDF statements using the SPARQL query language.
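
As a concrete illustration of that structure (the identifiers are real Wikidata values; the snippet itself is only a sketch), the statement “Douglas Adams (Q42) is an instance of (P31) human (Q5)” is stored as one such triple, and a SPARQL query matches it by leaving one position of the pattern open as a variable.

    # One Wikidata statement expressed as a subject-predicate-object triple,
    # using real identifiers: Q42 = Douglas Adams, P31 = "instance of", Q5 = human.
    triple = ("wd:Q42", "wdt:P31", "wd:Q5")

    # A SPARQL query matches triples by leaving positions open as variables;
    # this one asks what Q42 is an instance of, and would return wd:Q5.
    query = "SELECT ?class WHERE { wd:Q42 wdt:P31 ?class . }"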

The Wikidata database started out small, but it has grown in leaps and bounds over the years. The size of the database increased substantially in the late 2010s when the team imported large amounts of data related to articles in scientific journals. For the past six years or so, it has grown more modestly. Today, the database encompasses about 116 million items, which corresponds to about 16 billion triples.

That data growth is putting stress on the underlying data store. “It’s beyond what it was built for,” Pintscher says. “We’re stretching the limits there.”

Semantic knowledge graphs store data in RDF triples

Blazegraph is not a natively distributed database, but Wikidata’s dataset has grown so large that the team has had to manually shard the data across multiple servers. The organization runs its own computing infrastructure, supported by about 20 to 30 paid employees of the Wikimedia Foundation.

Recently, the Wikidata team split the knowledge graph into two, one for the data from the scientific journals and another holding everything else. That doubles the maintenance effort for the Wikidata team, and it also creates more work for developers who want to use data from both databases.
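
To give a feel for that extra developer work, here is a hedged sketch of the kind of federated SPARQL query that combining the two graphs can require. The identifiers (P31 = instance of, Q11344 = chemical element, P921 = main subject) are real Wikidata values, but the scholarly endpoint URL below is a placeholder rather than a confirmed address, and the query is illustrative only.

    # Illustrative federated query: match items in the main graph, then hop to
    # the split-off scholarly graph for papers about them. The SERVICE URL is a
    # placeholder, not the real scholarly endpoint; each SERVICE clause is an
    # extra round trip and an extra system for developers to deal with.
    federated_query = """
    SELECT ?element ?elementLabel ?paper WHERE {
      ?element wdt:P31 wd:Q11344 .       # main graph: instances of "chemical element"
      SERVICE <https://scholarly.example.org/sparql> {
        ?paper wdt:P921 ?element .       # scholarly graph: papers with that main subject
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """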

“What we’re struggling with is really the combination of the size of the data and the pace of change of that data,” Pintscher says. “So there are a lot of edits happening every day on Wikidata, and the amount of queries that people are sending, since it’s a public resource with people building applications on top of it.”

But the biggest issue facing Wikidata is that Blazegraph has reached its end of life (EOL). In 2017, Amazon launched its own graph database, called Neptune, atop the open source Blazegraph database, and a year later it acquired the company behind it. Blazegraph has not been updated since.

Pintscher and the Wikidata team are looking at alternatives to Blazegraph. The software must be open source and actively maintained. The organization would prefer a semantic graph database, and it has looked closely at QLever and MillenniumDB, among others. It is also considering property graph databases, such as Neo4j.

“We haven’t made the final decision,” Pintscher says. “But so much of what Wikidata is about is related to RDF and being able to access it in SPARQL, so that is definitely a big factor.”

Lydia Pintscher is the Portfolio Lead for Wikidata

In the meantime, development work continues. The organization is looking at ways it can provide companies with access to Wikimedia content with certain service level guarantees. It’s also working on building a vector embedding of Wikidata data that can be used in retrieval-augmented generation (RAG) workflows for AI applications.
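
What that RAG work could look like in practice is sketched below under loud assumptions: the embedding model, the idea of flattening statements into short sentences, and the retrieval step are all illustrative choices, not Wikidata’s announced design.

    # Hedged sketch of a RAG-style lookup over Wikidata facts: flatten a few
    # statements into text, embed them, and retrieve the closest one to a
    # question so it can be pasted into an LLM prompt as context.
    import numpy as np
    from sentence_transformers import SentenceTransformer  # assumed model choice

    facts = [
        "Japan (Q17): traffic drives on the left side of the road.",
        "Germany (Q183): electrical plug types C and F are used.",
        "Douglas Adams (Q42) is an instance of human (Q5).",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    fact_vecs = model.encode(facts, normalize_embeddings=True)

    question = "What kind of power adapter do I need in Germany?"
    q_vec = model.encode([question], normalize_embeddings=True)[0]

    best = facts[int(np.argmax(fact_vecs @ q_vec))]
    print("Context to hand to the LLM:", best)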

Building a free and open knowledge base that encompasses a sizable swath of human knowledge is a noble endeavor. Developers are building interesting and useful applications with that data, and in some cases, such as the Organized Crime and Corruption Reporting Project, the data is going to help bring people to justice. That keeps Pintscher and her team motivated to continue pushing to find a new home for what might be the biggest repository of open data on the planet.

“As someone who spent the last 13 years of her life working on open data, I truly do believe in open data and what it enables, especially because opening up that data allows other people to do things with it that you have not thought of,” Pintscher says. “There’s a ton of stuff that people are using the data for. That’s always great to see, because the work our community is putting into that every single day is paying off.”

Related Items:

Groups Step Up to Rescue At-Risk Public Data

NSF-Funded Data Fabric Takes Flight

Prolific Puts People, Ethics at Center of Data Curation Platform
