
It’s happened to all of us while surfing the Internet: somewhere along the way you click on a broken link and get a message saying that the web page you’re looking for doesn’t exist.
It’s both common and frustrating, and it seems like the problem will only get worse as the Internet continues to expand and old web pages are migrated or abandoned.
There’s even a name for the problem. It’s called link rot, a term that dates back to the 1990s, when the Internet came to prominence.
Earlier this month, the Pew Research Center released a report that dug deep into the issue, showing that a third of web pages that existed in 2013 are no longer accessible.

Here are some other insights Pew discovered:
- “23% of news web pages contain at least one broken link, as do 21% of government site web pages.”
- “54% of Wikipedia pages contain at least one link in the ‘References’ section that points to a page that no longer exists.”
- “Nearly one in five tweets are no longer publicly visible on the site just months after they were posted.”

Joseph Reagle, associate professor of communication studies at Northeastern University, says the problem starts with the infrastructure of URLs, short for Uniform Resource Locators.
URLs serve as address points for web pages on the Internet, similar to addresses for physical places like your home or work. URLs are great because they make it easy for people to find websites, but the problem is that they break easily, he says.
In the 1990s, Reagle worked as a policy analyst at the World Wide Web Consortium alongside Tim Berners-Lee, who is widely credited with inventing the World Wide Web. The problems surrounding URLs were discussed extensively there.
“For example, we knew that URLs are not very well maintained. If you are an organization or a company and you decide to reorganize or decide to change platforms, all URLs generally stop working.”
In the early days of the Web, Internet technologists explored alternatives to the URL system. One proposal was to use URNs, short for Uniform Resource Names, which would work similarly to the ISBN system used to catalog books, Reagle says.
The catch is that such a system needs a central organization to manage it. The ISBN system, for example, is managed by the International ISBN Agency, an entity appointed by the International Organization for Standardization.
“So you have two problems,” says Reagle. “Either you let everyone create their own URLs and manage their resources, and they often get very bad at that over time, or you create centralized repositories with persistent identities, but setting them up is expensive and difficult to maintain.”
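The tradeoff Reagle describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `NameRegistry` class, the `urn:example:...` name, and the `example.org` addresses are all hypothetical and do not correspond to any real service. A saved URL breaks when a site reorganizes, while a persistent name keeps working only because someone maintains the central record.

```python
class NameRegistry:
    """Hypothetical central registry mapping persistent names to current locations."""

    def __init__(self):
        self.locations = {}

    def register(self, name, url):
        self.locations[name] = url

    def move(self, name, new_url):
        # The registry operator has to update the record whenever a resource moves.
        if name not in self.locations:
            raise KeyError(name)
        self.locations[name] = new_url

    def resolve(self, name):
        return self.locations[name]


# Addresses a hypothetical site currently serves.
live_pages = {"https://example.org/reports/2013/link-rot.html"}


def url_works(url):
    """Return True if the address still points at a live page."""
    return url in live_pages


# A reader bookmarks the page twice: once by location, once by name.
saved_url = "https://example.org/reports/2013/link-rot.html"
saved_name = "urn:example:report:link-rot"  # made-up name for illustration

registry = NameRegistry()
registry.register(saved_name, saved_url)

# The site reorganizes: the page moves and the old address disappears.
new_url = "https://example.org/archive/link-rot"
live_pages.remove(saved_url)
live_pages.add(new_url)
registry.move(saved_name, new_url)  # only works if someone maintains the registry

print(url_works(saved_url))                     # the bookmarked URL has rotted
print(url_works(registry.resolve(saved_name)))  # the name still resolves
```

The `registry.move` call is the expensive part in practice: it stands in for the ongoing upkeep that a centralized system would require.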
So the URL system has become the primary way people interact with the Internet, he notes, and problems surrounding link rot remain.
“People bring up the subject every now and then. It gets a little attention, and then the world moves on,” says Reagle. “Attempts have been made to find solutions, but the problems remain.”
Archival organizations have come out of the woodwork to help solve these problems. A few notable projects include the Wayback Machine, archive.today, and perma.cc, which allow people to access old versions of web pages that are no longer active and archive new web pages themselves.
But these services exist precariously and in the shadows, Reagle notes, run largely by small groups of people with a deep interest in online conservation.
These efforts also rely on individual users to help build out their databases, which is a tall order and not enough to adequately archive large parts of the Internet.
“They’re all a little different, and they’re not all perfect,” he says. “Perma.cc and other programs like it require people to proactively say, ‘Hey, make a copy of this page.’ and not everyone will do that. There are large parts of the internet that are not on Perma.cc.”
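For readers who want to check whether a dead page was archived, here is a minimal Python sketch built around the Wayback Machine's public availability endpoint. The helper names `availability_query` and `closest_snapshot` are made up for this example, the sample response is invented to show the general shape of a reply as we understand it, and the code deliberately makes no network request.

```python
import json
from urllib.parse import urlencode

# Public lookup endpoint offered by the Internet Archive.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the lookup address for a page, optionally near a YYYYMMDD date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived copy's address from a JSON reply, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Invented sample reply, used here instead of a live request.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

print(availability_query("example.com", "20130101"))
print(closest_snapshot(sample_response))
print(closest_snapshot('{"archived_snapshots": {}}'))  # no copy was ever saved
```

The last line reflects Reagle's point: if nobody asked for a copy, there is nothing to find.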
These issues go beyond infrastructure and human collaboration challenges. There are also questions surrounding copyright and what legal protections individuals have in maintaining the Internet, Reagle adds.
That’s where the federal government could play a role.
“I can imagine [Congress] passing a law that, for example, provides safe harbor provisions for people archiving content for educational or research purposes,” he says.