
The Semantic Web, Linked Data, and Knowledge Graphs

23 years ago Tim Berners-Lee, James Hendler, and Ora Lassila published a seminal article introducing the concept of the Semantic Web. To set the context, in 2001 the World Wide Web was still very much in its infancy, at a stage we now refer to as “Web 1.0”. The web was primarily static: website content was fixed and changed infrequently, and there was limited user interaction. More importantly, the web was mainly designed for users to be consumers of content. Information on a website was easy for humans to understand, but to machines it was just chunks of text.

The goal of the Semantic Web was to make the information on the web machine-readable, so that software agents could understand the relationships between different pieces of data and browse the web as a vast distributed data structure [1]. Instead of being a collection of unstructured documents, the web would become a massive knowledge base that machines could leverage to conduct automated reasoning, improve search engine information retrieval, and aid knowledge discovery.

These promises all sound appealing, but what did people do to start making them a reality? The general plan was to add metadata with well-defined semantics to the data, and to provide query languages that could then manipulate and retrieve it. The Resource Description Framework (RDF), first standardized by the W3C in 1999 and revised in 2004, was published as a data model for metadata and as a standard for data exchange on the web. Its core purpose was to enable different systems to understand and use data seamlessly. RDF represents data as “triples”, where each triple is a statement with three components:

[ subject ] --- predicate ---> [ object ]

The subject and object are nodes, connected by the predicate (the relation). In RDF every component can be expressed as a URI (Uniform Resource Identifier). A set of triples forms a directed graph. Take the following figure [2] as an example:

[Figure: an example RDF graph]

We see that “Helsinki” (subject) has the relation “is birthPlace of” (predicate) to “Tarja Halonen” (object). Note that a node can be both a subject and an object, depending on the direction of the relation. In addition to RDF, the Web Ontology Language (OWL) was developed. OWL is based on description logics and is designed for creating and defining ontologies, which are formal representations of knowledge within a domain. The figure above uses an ontology from DBpedia. By representing data using RDF and defining its structure and semantics using OWL, we can perform more sophisticated querying and automated reasoning on the data.
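
To make the triple model a bit more concrete, here is a minimal sketch in plain Python rather than a real RDF library. The `http://example.org/` namespace, the predicate names, the second triple, and the helpers `match` and `add_inverses` are all assumptions made up for this illustration; they are not DBpedia's actual vocabulary. The sketch stores triples as tuples, answers simple pattern queries, and applies one toy inference rule in the spirit of OWL's `owl:inverseOf`.

```python
from typing import Iterator, Optional, Set, Tuple

# Hypothetical namespace for illustration only (not DBpedia URIs).
EX = "http://example.org/"

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# A tiny graph: each triple is one directed, labelled edge.
graph: Set[Triple] = {
    (EX + "Helsinki", EX + "isBirthPlaceOf", EX + "Tarja_Halonen"),
    (EX + "Tarja_Halonen", EX + "nationality", EX + "Finland"),
}


def match(
    triples: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def add_inverses(triples: Set[Triple], predicate: str, inverse: str) -> Set[Triple]:
    """Toy inference: for each (s, predicate, o), also assert (o, inverse, s)."""
    inferred = {(o, inverse, s) for s, _, o in match(triples, p=predicate)}
    return set(triples) | inferred


tarja = EX + "Tarja_Halonen"

# The same node appears as an object in one triple and a subject in another.
as_object = list(match(graph, o=tarja))
as_subject = list(match(graph, s=tarja))

# Declaring "birthPlace" as the inverse of "isBirthPlaceOf" yields a new fact.
enriched = add_inverses(graph, EX + "isBirthPlaceOf", EX + "birthPlace")
```

After the last line, `enriched` also contains the triple with Tarja Halonen as subject and Helsinki as object, even though nobody wrote it down explicitly. Real systems would use an RDF store with a query language such as SPARQL and an OWL reasoner instead of these hand-rolled helpers.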

From 2006 onwards the concept of “linked (open) data” grew in popularity [3]. Linked data consists of multiple RDF graphs that are connected to each other by sharing the same URIs. During this time several practical use cases of linked RDF graphs started to emerge:

  • One of the largest linked data projects was DBpedia, a community-driven project that aimed to extract data from Wikipedia and make it accessible by transforming it into RDF. Because it was one of the first linked datasets available, it served as a hub to which many other datasets linked.

  • Another project that is prominently used today is schema.org, a collaborative initiative launched by major search engines that encouraged website providers to annotate entities on their websites using a common ontology. In turn, the search providers promised improved search results by using the annotations as metadata.
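
The idea of graphs being linked through shared URIs can be sketched in a few lines of plain Python. Everything here is hypothetical and chosen for illustration: the two datasets, the `http://example.org/` URIs, and the helper `reachable`. The point is only that when two independently published sets of triples use the same identifier for the same thing, taking their union produces a single graph that can be traversed across the original dataset boundary.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical URIs; HELSINKI is the identifier both datasets agree on.
TARJA = "http://example.org/resource/Tarja_Halonen"
HELSINKI = "http://example.org/resource/Helsinki"
FINLAND = "http://example.org/resource/Finland"

# Dataset A knows about people, dataset B knows about places.
dataset_a: Set[Triple] = {
    (TARJA, "http://example.org/a/birthPlace", HELSINKI),
}
dataset_b: Set[Triple] = {
    (HELSINKI, "http://example.org/b/country", FINLAND),
}

# Because both use the same URI for Helsinki, linking is just a set union.
linked = dataset_a | dataset_b


def reachable(triples: Set[Triple], start: str) -> Set[str]:
    """Return every node reachable from start by following edges forward."""
    seen: Set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, _, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


only_a = reachable(dataset_a, TARJA)    # stops at Helsinki
across_both = reachable(linked, TARJA)  # continues on to Finland
```

With dataset A alone, a traversal from the person ends at the city; in the linked graph it carries on into facts that only dataset B published. If the two datasets had minted different URIs for Helsinki, the union would stay disconnected, which is why agreeing on identifiers matters so much for linked data.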

Despite these applications, the linked data movement arguably never took off. I think one reason was the industry’s reluctance to fully adopt and invest in linked data solutions. Not only did the technology require a significant upfront investment of time, money, and resources, but its value proposition was also unclear: it was challenging to map the theoretical benefits to tangible outcomes.

While the industry steered away from linked data, it didn’t completely abandon the concept and instead started developing knowledge graphs (KGs). Unlike linked data, which anyone could access and add to, knowledge graphs were mainly proprietary and used within individual companies. This in turn resulted in more tightly controlled schemas and higher-quality content. We see KGs being used in a range of applications, for example in recommender systems and fraud detection systems. Knowledge graphs do not need to follow the semantic web standards; they can be built using either RDF or Labelled Property Graphs (LPG) - I’ll save the technical discussion of the differences for another time. A prominent example is Google’s Knowledge Graph, which serves the information you find in the infobox next to the search results.

While the semantic web vision did not become the new paradigm for the web, it nevertheless pushed the development of improved knowledge representation and data integration. What’s more, given the rapid rise of LLMs in the past months and years, there’s been a surprising renewal of interest in knowledge graphs. Current LLMs have several flaws: by themselves they can’t acquire knowledge outside of what they were trained on, and some people argue that the implicit knowledge LLMs gain by training on large amounts of data is merely statistical pattern matching rather than true understanding and reasoning [4]. In response, there have been efforts to use knowledge graphs as external sources of knowledge that keep the LLM up to date and, more importantly, ground and verify its outputs (we already see this in current RAG applications). The relationship also works in the other direction: LLMs can help construct knowledge graphs, and we already see them being used to extract structured information from unstructured data. The synergy between the two is still in its early stages; who knows what it will bring about in the future?

Footnotes

  1. https://www.lassila.org/publications/2001/SciAm.pdf

  2. Figure from this thesis: https://theses.hal.science/tel-02524361

  3. https://cacm.acm.org/research/a-review-of-the-semantic-web-field

  4. https://arxiv.org/abs/2308.06374