How do search engines work
How do search engines work? While typing in some keywords and hitting “search” may seem simple from the user’s side of things, search engines themselves are fairly complex. They have undergone considerable development and change over the last 20 years, growing from simple plain-text lists into rich content displays governed by sophisticated, SEO-thwarting ranking algorithms. They now shape global commerce and even how web users conceptualize information. In this article, we explore how search engines came to be, how they work, and where they’re going.
A Brief History of Search Engines
The development of the World Wide Web gave rise to a body of information that was suddenly too vast for ordinary search methods. One of the first search tools appeared in 1990: “Archie,” short for “archive,” was created by McGill University student Alan Emtage, and returned a list of file names on public FTP archives that matched the search terms entered by the user. This relatively crude tool was soon joined by Veronica, which searched the menu titles of Gopher servers rather than just file names. In 1992, Tim Berners-Lee, the man widely credited with inventing the World Wide Web itself, created a virtual library on his CERN web server which allowed users to find other web servers.
The ability to search for file names and individual links would eventually be combined by later engines to produce richer search results. In 1993, there was a virtual explosion of new engines: Excite, World Wide Web Wanderer, Aliweb, and Primitive Web Search burst onto the scene, each slightly more sophisticated than its predecessors. Both World Wide Web Wanderer and Primitive Web Search utilized “bots” or “spiders.” Although the two terms are often used interchangeably, bots generally try to categorize pages by context, while spiders collect existing data, such as addresses, links, and keywords. The bots in the World Wide Web Wanderer were initially designed to count active web servers, but eventually began recording URLs. They became so active that they slowed the load times of the pages they visited, and the engine was eventually shut down.
In 1994, WebCrawler, Yahoo Search, Lycos, AltaVista, EINet Galaxy, and Infoseek launched. These engines made it easier for webmasters to submit or update their URLs and include a brief description of the site along with them. Some of the engines, such as AltaVista, also provided searchers with the ability to ask for “tips.”
Google, the engine that has now become synonymous with search, was invented in 1996 but did not launch commercially until 1998. It conceptualized information in an entirely new way. Instead of simply indexing pages or curating a list of “favorite” web sites, Google analyzed the link structure of the web: a page was judged by the number and importance of the pages linking to it, with each link treated as a kind of endorsement. This displaced raw keyword matching as the primary basis of search. The “PageRank” methodology would not become dominant until several years later, when the amount of information online became so voluminous that a more advanced way to compare pages became necessary.
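The core of the PageRank idea can be shown in a few lines. The sketch below uses a hypothetical four-page link graph (not Google’s production algorithm, whose details are far more elaborate): each page spreads its score evenly across its outgoing links, and the scores are refined by repeated “power iteration” until they settle.

```python
# Hypothetical link graph: each page maps to the pages it links out to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
damping = 0.85          # chance a surfer follows a link vs. jumps randomly
n = len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):     # iterate until the ranks converge
    new_rank = {}
    for page in links:
        # Sum contributions from every page that links here; each source
        # splits its current rank evenly among its outgoing links.
        incoming = sum(rank[src] / len(outs)
                       for src, outs in links.items() if page in outs)
        new_rank[page] = (1 - damping) / n + damping * incoming
    rank = new_rank

# Page "c" ends up ranked highest: three of the four pages link to it.
```

Note that popularity here is recursive: a link from a highly ranked page is worth more than a link from an obscure one, which is exactly what made the scheme hard to game with keywords alone.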
Google’s introduction of AdWords in 2000 would dramatically change how search engines operate. By coupling commerce with search results, Google opened the door for engines to become more than amusing index tools.
How Indexes Work
Because the Internet is so large, the engines do not search it live. Instead, they compile an “index” in advance. When you enter a term and click “search,” the engine consults this index and returns your results.
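At its simplest, the index is an “inverted” one: rather than storing each page’s words, it maps each word to the pages containing it, so a query never has to scan every document. A toy sketch (the page contents and URLs are invented for illustration):

```python
# Hypothetical crawled pages: URL -> text content.
pages = {
    "page1": "search engines crawl the web",
    "page2": "spiders crawl and index pages",
    "page3": "the web keeps growing",
}

# Build the inverted index: word -> set of pages containing it.
index = {}
for url, text in pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(url)

def search(query):
    """Return pages containing every query word (a simple AND search)."""
    results = None
    for word in query.lower().split():
        matches = index.get(word, set())
        results = matches if results is None else results & matches
    return results or set()
```

A query like `search("crawl web")` intersects the postings for “crawl” and “web” and returns only page1, without ever re-reading the other pages.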
To stay relevant, the engines must continually update their index. They do this by using bots or spiders, which check for context, keywords, and specific links on websites. Depending on the size of the search engine, the spiders “crawl” the entire known Internet every few days or weeks and then update the index. In the past, webmasters had to submit their URLs to the engines in order to gain notice. Now, if a new website is linked from other websites, the spiders will detect those links and find the URL on their own.
The spiders have recorded more and more information as the engines have grown in complexity. They not only note how many times a word appears on a page, but also attempt to assign values, or “weights,” to those appearances. The algorithms the engines use to determine the importance of a particular keyword on a site are proprietary, to prevent webmasters from artificially stuffing their sites to gain better placement in search results. As a result, different engines rate the same website differently in terms of its relevance to a given search term.
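One classic public weighting scheme that captures the spirit of this idea is TF-IDF; the engines’ real scoring formulas are proprietary and far more involved, so the following is only an illustrative sketch over invented pages. A word carries more weight when it is frequent on a page but rare across the collection as a whole:

```python
import math

# Hypothetical crawled pages: URL -> text content.
docs = {
    "page1": "cheap flights cheap hotels cheap deals",
    "page2": "flights to paris and hotels in paris",
    "page3": "travel guide for paris",
}

def tf_idf(word, url):
    """Weight of `word` on page `url`: term frequency x rarity."""
    words = docs[url].lower().split()
    tf = words.count(word) / len(words)              # frequency on this page
    containing = sum(word in d.lower().split() for d in docs.values())
    idf = math.log(len(docs) / containing)           # rarity across all pages
    return tf * idf

# "cheap" is repeated on page1 but appears nowhere else, so it scores high
# there; "paris" appears on two of the three pages, so it carries less weight.
```

This also hints at why engines keep their exact formulas secret: anyone who knows the weighting function knows precisely which words to repeat, and how often, to climb the rankings.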
Paid vs. Organic Results
Google’s AdWords is an example of a “paid” search result. When a user enters a particular keyword that an advertiser has paid for, the advertiser’s URL and a brief description of the site appear in a box at the top of the search results page.
The results that display beneath this paid box are referred to as “organic” results. These results have earned their place in the rankings because the search engine’s spiders and bots have determined that they are the most relevant to the user’s search terms.
Oddly enough, studies have shown that most people prefer clicking on the “organic” search results, and not the paid results. Because of this, the field of “SEO,” or search engine optimization, has become enormously popular among commercial websites. Companies pay SEO Internet marketers to design their websites so that their sites will “organically” appear in the search listings.
The Internet has become a major source of revenue for many companies. However, no search engine wants to compromise the integrity of its results and become a mere bullhorn for commercial interests. Demand Media, a content farm that produced thousands of articles composed mainly of meaningless keywords in an attempt to artificially bolster its page visits, prompted Google to extensively reprogram its ranking algorithms so that these keyword-stuffed articles would no longer clog up the search rankings.
Increased Complexity and Customization
In addition to classifying websites according to their perceived relevance or authority, certain engines now also tailor searches to the particular user executing the search. Many engines now access a user’s “search history” to determine precisely what kind of information the user is likely to want.
As an example, a search term such as “guns” may turn up different results for two different users based on their search history. A user who performs searches on hunting or outdoor sports is likely to see sites that either sell guns or describe how they can be used to hunt. A user who performs searches on the prevention of violence is likely to be shown sites that advocate gun control.
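The mechanics of such personalization can be sketched very simply. In this hypothetical example (the URLs, topic tags, and ranking rule are all invented for illustration, not any engine’s actual method), results are re-ordered by how much their topics overlap with the user’s past queries:

```python
# Hypothetical candidate results for the query "guns", each tagged by topic.
results = [
    {"url": "gunshop.example", "tags": {"hunting", "retail", "firearms"}},
    {"url": "gunlaws.example", "tags": {"policy", "violence", "firearms"}},
]

def personalize(results, history):
    """Re-rank results by overlap between their tags and past queries."""
    terms = set()
    for query in history:
        terms.update(query.lower().split())
    return sorted(results, key=lambda r: len(r["tags"] & terms), reverse=True)

# A user with a hunting-oriented history sees the retail site first;
# a user who has searched about violence prevention sees the policy site.
hunter_view = personalize(results, ["deer hunting season", "rifle scopes"])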
Many analysts predict that search will increasingly be defined by concepts rather than by individual keywords or links. The amount of information uploaded to the Internet shows no signs of slowing down, and ICANN (the Internet Corporation for Assigned Names and Numbers), which governs the domain name system, recently proposed introducing over 200 new top-level domains.
One thing is certain: to remain relevant, the engines must constantly reinvent how they compile their indexes. Google wants to keep its position as a trusted search tool, meaning that users receive the most relevant answers to their search queries. The latest challenge to relevant results has come from the content farms, which Google has countered with its algorithm change called Panda.