How Does the World Wide Web Look Like?

Mon, Jan 04, 21

The landing page of this website has a little confusing line : Welcome ... to my own slice of the bowtie that is the World Wide Web. If I were you, the first thing that would come to my mind is “Bowtie? What is this guy talking about?”. Well hopefully by the end of this blogpost we will be on the same page and understand the cleverness of this line.

To begin with, we shall explore the idea that the Web, meaning web pages on the World Wide Web, can be thought of as a graph. It is evident, I hope, that the underlying infrastructure of what we know as the Internet can be easily represented as a graph, where nodes represent hosts (s.a. switches/bridges, servers, firewalls, etc.), and edges are represented by links between these hosts (s.a. ethernet, wireless, etc.). But how would one represent the Web? Well, the intuitive approach would be to have any page that is hosted on the Web as a node, and hyperlinks between pages as directed edges. In doing so, we can start to build a graph between Web pages. To demonstrate this idea, the image below shows what this website would look like as a graph.

Rough look at graph structure of this website.
Graph representation of website with random blogpost.

Now it is evident that the image above is just a rough representation, although the message is brought across quite clearly. Now, I personally know that my github page links to my website, and therefore, we would have an incoming edge from my github profile to my website. Following this outlined procedure, we would be able to construct a graph representation of the entire World Wide Web. Sadly, it would be way to complex for anyone to understand and be able to retain any useful information. So what does this have anything to do with a bowtie? To get to this part, we need to define - or review - a couple basic graph theory definitions.

Firstly, we will define what is a directed graph. To put it simply, a directed graph is a graph where the edges have direction. We can also describe this by saying, for any given node v, In(v) = {w | w can reach v} and Out(v) = {w | v can reach w}. This way, we can described to whom a certain node is connected to and note that whilst node v can reach w, w may not be able to reach v. The following image shows an example of a directed graph where Out(A)={B, C, D}, but In(A)={B, C}.

Example of a directed graph.
Directed graph example.

Evidently, the World Wide Web graph representation, as we previously described it, is a directed graph. In order to further simplify the mess that is our current representation, we need to define two types of directed graphs: Strongly Connected and Directed Acyclic Graph or DAG for short. A strongly connected directed graph can be easily described as a graph such that for all nodes v within the graph, In(v)=Out(v). Whereas, a directed acyclic graph is described as a graph that has no cycles. This means that for any node v, if v can reach node w, then node w cannot reach node v. The following two images show examples of a strongly connected directed graph and a directed acyclic graph.

Example of a strongly connected directed graph.
Example of a strongly connected directed graph.
Example of a directed acyclic graph.
Example of a directed acyclic graph.

Now that we have identified what a strongly connected directed graph is, we can point out that a directed graph may have certain nodes that are themselves strongly connected - meaning that there is some subset V* that is an element of V that fits the criteria of a strongly connected directed graph, but V* != V. This means that some area of our graph is strongly connected, but other areas are acyclic. We call such a phenomenon, where there exists some subset V* of V, where V* != V, that is strongly connected a Strongly Connected Component or SCC for short. Although, there is an important aspect of SCCs that must be held true, that there is no subset V** that is a superset of V* . In otherwords, V* must not be contained within some bigger strongly connected component. The image below shows SCCs within a directed graph. A simple way to split a graph into it’s SCCs is to take random node v that is not a part of any SCC, and compute the intersection of Out(v) and In(v); said intersection is a SCC.

Example of SCCs in a directed graph.
Example of SCCs in a directed graph.

Now we will move onto an important fact. We will not prove it, but you could do it on your free time if you so wish. Every directed graph is a DAG on it’s SCCs. This is a very powerful fact if you know how to use it. This fact helps us to simplify very complicated graph structures into simpler, more human readable graphs. When simplifying, we can represent a SCC as a single node, with any links to nodes that are in other SCCs as one link between the two respective SCCs.

Now, back to the original question of how is this at all related to a bowtie and the World Wide Web? Well, a study was done in 2000 by Broder et Al. of the graph structure of the World Wide Web. They found something extremely facinating. The image below shows their findings that utilized the fact that we’ve just stated. They discovered that the WWW can be described as a bowtie tentacle graph. They showed that there exists a giant SCC at the center of the WWW with a inbound SCC and and outbound SCC that connect in and out - respectively - of this main SCC. There also exist ‘Tendrils’ which may be though of as tangents - similarly to how someone can go on a tangent whilst talking - going in, or out, of the inbound/outbound SCCs. Lastly, they also found there exist bypasses from the inbound SCC to the outbound SCC that never touches the main SCC. Lastly, there also exist isolated SCC that are not related to any other SCC whatsoever, these can be seen at the bottom of the image.

Structure of the Web as per Broder et Al..
Structure of the Web as per Broder et Al..

So where would this website be located within this graph structure? Well, a thought experiment can be quickly completed to determine it’s location. I know that there is a direct link from my github profile to this website. Github does have the subdomain of WWW, and since it is so highly rooted in our modern internet, we can conclude that Github is within the main SCC. My website does not have a WWW subdomain (yet) but it does have outgoing links back to my github profile, therefore it cant be in the outgoing SCC as the outgoing SCC doesn’t lead back into the main SCC. Therefore, I can conclude that this website is found somewhere in the main SCC.

References

  1. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., … & Wiener, J. (2000). Graph structure in the web. Computer networks, 33(1-6), 309-320.