500 billion pages and counting: How Google rules the web

Daisuke Wakabayashi

New York Times

17 Dec, 2020 04:00 PM (8 mins to read)

The Google headquarters in Mountain View, California. Photo / Laura Morton, The New York Times

In 2000, just two years after it was founded, Google reached a milestone that would lay the foundation for its dominance over the next 20 years: it became the world's largest search engine, with an index of more than 1 billion web pages.

The rest of the internet never caught up.

Now, as regulators around the world examine ways to curb Google's power, including a search monopoly case expected from US state attorneys-general and the antitrust lawsuit the Justice Department filed in October, they are wrestling with a company whose sheer size has allowed it to squash competitors. And those competitors are pointing investigators toward that enormous index, the gravitational center of the company.

"If people are on a search engine with a smaller index, they're not always going to get the results they want. And then they go to Google and stay at Google," said Matt Wells, who started Gigablast, a search engine with an index of around 5 billion web pages, about 20 years ago. "A little guy like me can't compete."

Understanding how Google's search works is a key to figuring out why so many companies find it nearly impossible to compete and, in fact, go out of their way to cater to its needs.

Every search request provides Google with more data to make its search algorithm smarter. Google has performed so many more searches than any other search engine that it has established a huge advantage over rivals in understanding what consumers are looking for. That lead only continues to widen, since Google has a market share of about 90 per cent.

Google directs billions of users to locations across the internet, and websites, hungry for that traffic, create a different set of rules for the company. Websites often provide greater and more frequent access to Google's so-called web crawlers — computers that automatically scour the internet and scan web pages — allowing the company to offer a more extensive and up-to-date index of what is available on the internet.

When he was working at the music site Bandcamp, Zack Maril, a software engineer, became concerned about how Google's dominance had made it so essential to websites.

In 2018, when Google said its crawler, Googlebot, was having trouble with one of Bandcamp's pages, Maril made fixing the problem a priority because Google was critical to the site's traffic. When other crawlers encountered problems, Bandcamp would usually block them.

Maril continued to research the different ways that websites opened doors for Google and closed them for others. Last year, he sent a 20-page report, "Understanding Google," to a House antitrust subcommittee and then met with investigators to explain why other companies could not recreate Google's index.

Software engineer Zack Maril has explained to investigators how Google's index gives it so much power. Photo / Jared Soares, The New York Times

"It's largely an unchecked source of power for its monopoly," said Maril, 29, who works at another technology company that does not compete directly with Google. He asked that The New York Times not identify his employer since he was not speaking for it.

A report this year by the House subcommittee cited Maril's research on Google's efforts to create a real-time map of the internet and how this had "locked in its dominance". While the Justice Department is looking to unwind Google's business deals that put its search engine front and center on billions of smartphones and computers, Maril is urging the Government to intervene and regulate Google's index. A Google spokesperson declined to comment.

Websites and search engines are symbiotic. Websites rely on search engines for traffic, while search engines need access to crawl the sites to provide relevant results for users. But each crawler puts a strain on a website's resources in server and bandwidth costs, and some aggressive crawlers resemble security risks that can take down a site.

Since having their pages crawled costs money, websites have an incentive to let it be done only by search engines that direct enough traffic to them. In the current world of search, that leaves Google and — in some cases — Microsoft's Bing.

Google and Microsoft are the only search engines that spend hundreds of millions of dollars annually to maintain a real-time map of the English-language internet. That's in addition to the billions they have spent over the years to build out their indexes, according to a report this summer from Britain's Competition and Markets Authority.

Google holds a significant leg up on Microsoft in more than market share. British competition authorities said Google's index included about 500 billion to 600 billion web pages, compared with 100 billion to 200 billion for Microsoft.

Other large tech companies deploy crawlers for other purposes. Facebook has a crawler for links that appear on its site or services. Amazon says its crawler helps improve its voice-based assistant, Alexa. Apple has its own crawler, Applebot, which has fuelled speculation that it might be looking to build its own search engine.

But indexing has always been a challenge for companies without deep pockets. The privacy-minded search engine DuckDuckGo decided to stop crawling the entire web more than a decade ago and now syndicates results from Microsoft. It still crawls sites like Wikipedia to provide results for answer boxes that appear in its results, but maintaining its own index does not usually make financial sense for the company.

"It costs more money than we can afford," said Gabriel Weinberg, chief executive of DuckDuckGo. In a written statement for the House antitrust subcommittee last year, the company said that "an aspiring search engine startup today (and in the foreseeable future) cannot avoid the need" to turn to Microsoft or Google for its search results.

When FindX started to develop an alternative to Google in 2015, the Danish company set out to create its own index and offered a build-your-own algorithm to provide individualised results.

FindX quickly ran into problems. Large website operators, such as Yelp and LinkedIn, did not allow the fledgling search engine to crawl their sites. Because of a bug in its code, FindX's computers that crawled the internet were flagged as a security risk and blocked by a group of the internet's largest infrastructure providers. What pages they did collect were frequently spam or malicious web pages.

"If you have to do the indexing, that's the hardest thing to do," said Brian Schildt Laursen, one of the founders of FindX, which shut down in 2018.

Schildt Laursen launched a new search engine last year, Givero, which offered users the option to donate a portion of the company's revenue to charitable causes. When he started Givero, he syndicated search results from Microsoft.

Most large websites are judicious about who can crawl their pages. In general, Google and Microsoft get more access because they have more users, while smaller search engines have to ask for permission.

"You need the traffic to convince the websites to allow you to copy and crawl, but you also need the content to grow your index and pull up your traffic," said Marc Al-Hames, a co-chief executive of Cliqz, a German search engine that closed this year after seven years of operation. "It's a chicken-and-egg problem."

In Europe, a group called the Open Search Foundation has proposed a plan to create a common internet index that can underpin many European search engines. It is essential to have a diversity of options for search results, said Stefan Voigt, the group's chairman and founder, because it is not good for only a handful of companies to determine what links people are shown and not shown.

"We just can't leave this to one or two companies," Voigt said.

When Maril started researching how sites treated Google's crawler, he downloaded 17 million so-called robots.txt files — essentially rules of the road posted by nearly every website laying out where crawlers can go — and found many examples where Google had greater access than competitors.

ScienceDirect, a site for peer-reviewed papers, permits only Google's crawler to have access to links containing PDF documents. Only Google's computers get access to listings on PBS Kids. On Alibaba.com, the US site of the Chinese e-commerce giant Alibaba, only Google's crawler is given access to pages that list products.
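To make the mechanism concrete, here is a minimal sketch of how a robots.txt file can favour one crawler over the rest. The file contents, the domain shop.example and the crawler name ExampleBot are hypothetical and are not taken from any of the sites named above; the check uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for illustration only.
# It is NOT copied from ScienceDirect, PBS Kids, Alibaba or any real site.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /products/
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if the sample rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    product_page = "https://shop.example/products/widget"  # hypothetical URL
    for bot in ("Googlebot", "ExampleBot"):  # ExampleBot is a made-up crawler
        print(bot, can_crawl(bot, product_page))
```

Under these sample rules Googlebot may fetch the product page, while any other well-behaved crawler is told to stay out of /products/ but may still read the rest of the site. Compliance with robots.txt is voluntary, which is why sites can also block unwelcome crawlers outright.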

This year, Maril started an organisation, the Knuckleheads' Club ("because only a knucklehead would take on Google"), and a website to raise awareness about Google's web-crawling monopoly.

"Google has all this power in society," Maril said. "But I think there should be democratic — small d — control of that power."

Written by: Daisuke Wakabayashi

Photographs by: Jared Soares and Laura Morton
