I am experimenting with building a minimal web crawler. I understand the whole process at a very high level. Getting into the next layer of detail, how does a program 'connect' to different websites to extract the HTML?
Am I using sockets to connect to servers and sending HTTP requests? Or am I giving commands to the terminal to run telnet or ssh?
Also, is C++ a good language of choice for a web crawler?
Thanks!
Short answer: no, C++ is not a good choice here. I prefer coding in C++, but this task calls for Java. Java has plenty of HTML parsers available, plus built-in socket and HTTP support in its standard library. This project would be a pain in C++. I wrote a crawler in Java once and it was fairly painless.
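To make the "sockets and HTTP requests" part concrete, here is a rough sketch of what happens underneath: open a TCP socket to the server, write a plain-text HTTP GET request, and read back whatever the server sends. No telnet or ssh is involved. The class name `MiniFetcher` and its helper methods are made up for illustration, and `example.com` is only a placeholder host. This assumes plain HTTP on port 80 and does not handle HTTPS, redirects, or chunked transfer encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: fetches one page over a raw TCP socket.
public class MiniFetcher {

    // Builds a minimal HTTP/1.1 GET request. "Connection: close" asks the
    // server to close the socket after responding, so we can read until EOF.
    public static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Returns everything after the blank line that ends the response headers,
    // or an empty string if no such blank line is found.
    public static String bodyOf(String rawResponse) {
        int split = rawResponse.indexOf("\r\n\r\n");
        return split < 0 ? "" : rawResponse.substring(split + 4);
    }

    // Connects, sends the request, and returns the raw response (headers + body).
    public static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Assumes the page is UTF-8; a real crawler would check the charset.
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "example.com"; // placeholder host
        String raw = fetch(host, 80, "/");
        System.out.println(bodyOf(raw));
    }
}
```

In practice you would probably not go this low-level. The standard library's `java.net.HttpURLConnection` (or `java.net.http.HttpClient` on Java 11+) handles the protocol details for you, and a third-party parser such as jsoup can pull the links out of the HTML you get back.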
BTW, there are many web crawlers out there but I assume you have custom needs 🙂