What Is Web Scraping? How to Collect Data From Sites
Web scrapers automatically collect data that would otherwise only be accessible by visiting a website in a browser. By automating that process, web scrapers open up a world of possibilities in data mining, data analysis, statistics, and more.
To understand web scraping, we first need to understand how the web works. To get to this website, you either typed "makeuseof.com" into your web browser or you clicked a link from another web page (tell us where, seriously we want to know). Either way, the next couple of steps are the same.
First, your browser will take the URL you entered or clicked on (Pro-tip: hover over the link to see the URL at the bottom of your browser before clicking it to avoid getting punk’d) and form a “request” to send to a server. The server will then process the request and send a response back.
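This request-then-response exchange is something you can reproduce outside the browser. As a sketch, here is roughly what "forming a request" looks like using Python's standard library (the `User-Agent` value is just an example; real browsers send their own):

```python
from urllib.request import Request

# Build (but don't send) the kind of HTTP request a browser would
# form when you navigate to a page.
req = Request(
    "https://www.makeuseof.com/",
    headers={"User-Agent": "Mozilla/5.0 (example)"},
)

print(req.get_method())  # GET: the default for a simple page load
print(req.full_url)      # the URL the request targets
```

Passing `req` to `urllib.request.urlopen` would actually send it and return the server's response, including the status code and page content.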
Modern browsers let us see some of the details of this process. In Google Chrome on Windows, you can press Ctrl + Shift + I or right-click and select Inspect. A tabbed list of options lines the top of the window. Of interest right now is the Network tab. Underneath the status code is the remote address, which is the public-facing IP address of the makeuseof.com server. The client gets this address via the DNS protocol.
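That DNS step is easy to reproduce yourself. A minimal sketch using Python's standard `socket` module performs the same hostname-to-IP lookup the browser does:

```python
import socket

def resolve(hostname: str) -> str:
    """Return the IP address a client would connect to for hostname,
    using the same DNS lookup a browser performs behind the scenes."""
    return socket.gethostbyname(hostname)

# "localhost" resolves without touching the network; resolving a real
# site like makeuseof.com would need a live DNS server.
print(resolve("localhost"))
```

Calling `resolve("makeuseof.com")` on a connected machine should return the same remote address Chrome shows in the Network tab.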
Scaling Up Scraping
One way to explore web scraping is to use tools that have already been built. Web Scraper (great name!) has 200,000 users and is simple to use. ParseHub lets users export scraped data into Excel and Google Sheets.
Additionally, Web Scraper offers a Chrome extension that helps visualize how a website is built. Best of all, judging by the name, is OctoParse, a powerful scraper with an intuitive interface.
Finally, now that you know the background of web scraping, raising your own little web scraper that can crawl and run on its own is a fun endeavor.
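To give a taste of what that first step looks like, here is a minimal sketch of the core of a crawler: extracting the links from a page so they can be followed next. It uses only Python's standard library, and the sample HTML stands in for what an HTTP response body would contain:

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collects the href of every <a> tag -- the first step of a
    crawler, which would then fetch each discovered URL in turn."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real crawler, this HTML would come from an HTTP response.
page = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'
scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # ['/page1', '/page2']
```

A full crawler would loop: fetch a page, collect its links, and add any unvisited ones to a queue of pages to fetch next.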