New Go Project: Distributed Crawler October 4, 2016
My distributed crawler covers a lot of what Go offers: network communication using Gob, channels, concurrent programming, goroutines. The crawl farm was fun to write, and I still break it out on occasion.
At work, we created a site for a client. The development work was pretty straightforward (although I did do a few neat things), but the site was extremely content heavy. We're talking 5000 or more pages. Development was ongoing right up until the end, and doing that while content entry is going on proved to be a bit of a challenge. If the way something works changes and the content isn't there yet, things break.
There are tools, like that yelling amphibian (you know the one), that crawl sites and report on errors: 404s, 500s, etc. However, it takes a long time to run, costs money, doesn't do exactly what I need, and did I mention it takes forever?
One of the reasons it takes so long to run is that it's probably doing more in there than it needs to. I don't know, since I didn't write it. So, the benefit of being a software developer, besides getting to do awesome things all the time, is that I can write code to replace software when I don't know how it works or what it's doing that makes it slow, and it's free* to replace (* it takes my time, but I learn stuff, so it pays for itself :)
So I wrote a crawler. A parallel crawler with goroutines: go and get the main page, parse out any URLs (image sources and anchor hrefs), add any that are unique to a map, crawl the next one. It was a fun exercise in getting familiar with Go and concurrent processing again.
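In sketch form, the core loop looked roughly like this. This isn't the real code (that's on the other computer); names like fetchAndParse and the worker cap are mine:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAndParse is a stand-in for the real work: GET the page, pull out
// anchor hrefs and image sources, and return them as URLs.
func fetchAndParse(url string) []string {
	// ... http.Get, walk the HTML, collect href/src values ...
	return nil
}

func crawl(start string, maxParallel int) {
	var (
		mu   sync.Mutex
		seen = map[string]bool{start: true}     // only ever crawl a URL once
		wg   sync.WaitGroup
		sem  = make(chan struct{}, maxParallel) // cap concurrent fetches
	)

	var visit func(url string)
	visit = func(url string) {
		defer wg.Done()
		sem <- struct{}{} // take a slot
		links := fetchAndParse(url)
		<-sem // give it back

		mu.Lock()
		defer mu.Unlock()
		for _, l := range links {
			if !seen[l] {
				seen[l] = true
				wg.Add(1)
				go visit(l) // crawl the next one
			}
		}
	}

	wg.Add(1)
	go visit(start)
	wg.Wait()

	fmt.Println("crawled", len(seen), "unique URLs")
}

func main() {
	crawl("https://example.com/", 8)
}
```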
But it takes 24 minutes for those 5000 pages. Unacceptable! So my idea was to build a distributed crawler using the same core code, modified to spread the workload across multiple machines. However many you want: just start up the client on a new computer and it will dial the server and begin getting URLs that haven't been processed yet. It's done, and it's a pretty awesome thing. I'm not going to post the code here right now, because it's on a different computer and I haven't figured out a good project structure and put it in git yet.
But here's the general idea. You can just imagine the code :)
The server has a "job" config with the site it wants to crawl. It starts up and immediately goes into listen mode, just waiting for clients. It does no crawling of its own (although, the way the code is shared, the crawling code is compiled into the server executable as well).
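Roughly like this, with the Job struct, the port, and the handleClient name being my placeholders rather than the real code:

```go
package main

import (
	"log"
	"net"
)

// Job is the crawl config the server starts with. The field name is a
// guess; the real config just needs the site to crawl.
type Job struct {
	StartUrl string
}

func main() {
	job := Job{StartUrl: "https://example.com/"}
	log.Println("job loaded, will crawl", job.StartUrl)

	// The server does no crawling itself: it just listens and waits.
	ln, err := net.Listen("tcp", ":4000")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("waiting for clients on", ln.Addr())

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println("accept:", err)
			continue
		}
		go handleClient(conn) // one goroutine per connected client
	}
}

// handleClient feeds a client links and records its results; the body is
// sketched a little further down.
func handleClient(conn net.Conn) {
	defer conn.Close()
}
```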
A client connects. To the first client that connects, the server sends the first URL. The client does the fetching and parsing, then sends back the original URL, its status code, and the list of URLs it found.
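A hedged sketch of the client side, with the Link and Result shapes and the server address standing in for whatever the real wire format is:

```go
package main

import (
	"encoding/gob"
	"log"
	"net"
	"net/http"
)

// Link and Result are my stand-ins for the wire format described above.
type Link struct {
	Url      string
	Referrer string
}

type Result struct {
	Url        string // the URL the client was asked to crawl
	StatusCode int    // e.g. 200, 404, 500
	Found      []Link // every URL parsed out of the page
}

func main() {
	// "crawl-server" is a placeholder address for wherever the server runs.
	conn, err := net.Dial("tcp", "crawl-server:4000")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	dec := gob.NewDecoder(conn)
	enc := gob.NewEncoder(conn)

	for {
		// Wait for the server to hand us a URL.
		var link Link
		if err := dec.Decode(&link); err != nil {
			log.Println("server gone:", err)
			return
		}

		// Fetch it and parse out the links.
		res := Result{Url: link.Url}
		resp, err := http.Get(link.Url)
		if err == nil {
			res.StatusCode = resp.StatusCode
			res.Found = parseLinks(resp, link.Url)
			resp.Body.Close()
		}

		// Report back: original URL, status, and everything we found.
		if err := enc.Encode(res); err != nil {
			log.Println("send failed:", err)
			return
		}
	}
}

// parseLinks stands in for the HTML parsing from the parallel crawler:
// anchor hrefs and image sources, with the fetched page as the referrer.
func parseLinks(resp *http.Response, referrer string) []Link {
	// ... walk the document, collect href/src values ...
	return nil
}
```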
The server records the status along with the URL, and adds any unique URLs found by the client. These go into the "UnprocessedLinks" channel of type Link (Url string, Referrer string). Referrer is kept so we know which page led to the 404 or whatever.
The server then keeps sending any links on the UnprocessedLinks channel to whichever clients are connected. (The first link isn't a special case: it just gets added to the UnprocessedLinks channel when the server starts, so it's sent to the first client that connects.)
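Putting the server side together, this fills in the handleClient placeholder from the earlier sketch (the types and listen loop are repeated so the snippet stands alone, and all the names are still my stand-ins):

```go
package main

import (
	"encoding/gob"
	"log"
	"net"
	"sync"
)

// Link is a URL plus the page that referenced it.
type Link struct {
	Url      string
	Referrer string
}

// Result is what a client reports back for one crawled URL.
type Result struct {
	Url        string
	StatusCode int
	Found      []Link
}

var (
	// UnprocessedLinks holds every link waiting to be handed to a client.
	UnprocessedLinks = make(chan Link, 100000)

	mu     sync.Mutex
	seen   = map[string]bool{} // every URL that has ever been queued
	status = map[string]int{}  // URL -> status code reported by a client
)

func handleClient(conn net.Conn) {
	defer conn.Close()
	enc := gob.NewEncoder(conn)
	dec := gob.NewDecoder(conn)

	// Sender: whichever client's sender pulls a link off the shared
	// channel first gets it, so each URL goes to exactly one client.
	go func() {
		for link := range UnprocessedLinks {
			if enc.Encode(link) != nil {
				return // client gone; this link is lost for now
			}
		}
	}()

	// Receiver: record the status against the URL, and queue any links
	// this client found that we haven't seen before.
	for {
		var res Result
		if dec.Decode(&res) != nil {
			log.Println("client disconnected")
			return
		}

		mu.Lock()
		status[res.Url] = res.StatusCode
		var fresh []Link
		for _, l := range res.Found {
			if !seen[l.Url] {
				seen[l.Url] = true
				fresh = append(fresh, l)
			}
		}
		mu.Unlock()

		for _, l := range fresh {
			UnprocessedLinks <- l
		}
	}
}

func main() {
	// The first link is just dropped onto the channel at startup, so the
	// first client to connect is the one that receives it.
	start := Link{Url: "https://example.com/"}
	seen[start.Url] = true
	UnprocessedLinks <- start

	ln, err := net.Listen("tcp", ":4000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handleClient(conn)
	}
}
```

Because the sender doesn't wait for each result before pulling the next link, a client can build up a queue of URLs it hasn't reported on yet, which is exactly the situation described a little further down.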
There were a few challenges I had to overcome while writing this. First, learning Gobs. Next, ensuring that packets that should be sent and received first are, indeed, sent and received first. The other was determining, in Go, when a client disconnects or the server shuts down. That was really all; the rest of the logic was written when I wrote the parallel crawler, so it was already figured out.
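For the disconnect part, the usual signal with gob over a TCP connection is simply that Decode starts returning errors: io.EOF when the other side closes cleanly, or some other network error when the connection drops. Ordering mostly falls out of using a single Encoder/Decoder pair per connection, since one TCP connection delivers bytes in order. A small illustration, with the message type and port as placeholders:

```go
package main

import (
	"encoding/gob"
	"io"
	"log"
	"net"
)

// readLoop decodes values off the connection until the other end goes away.
// Gob has no "disconnected" event of its own: you find out when Decode
// returns io.EOF (the peer closed cleanly) or some other network error
// (the connection dropped mid-stream).
func readLoop(conn net.Conn) {
	dec := gob.NewDecoder(conn)
	for {
		var msg string // stand-in for whatever struct is on the wire
		err := dec.Decode(&msg)
		if err == io.EOF {
			log.Println("peer closed the connection")
			return
		}
		if err != nil {
			log.Println("connection lost:", err)
			return
		}
		log.Println("received:", msg)
	}
}

func main() {
	ln, err := net.Listen("tcp", ":4001")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	readLoop(conn)
}
```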
There's another thing I still have to figure out. At any one time, a client can have a few hundred URLs in its processing queue, because the server just spreads links across all the clients as it gets them (giving each URL to exactly one client so no work is duplicated). However, if a client disconnects with loads of URLs left in its queue, those URLs are currently lost forever. So on the server, I have to keep track of which URLs were sent and which ones came back as processed. Then if a client disconnects, I can pump the remaining sent-but-unprocessed URLs back into the UnprocessedLinks channel, for current or future clients to pick up.
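The fix will probably look something like this on the server: a small per-client record of what was sent but not yet processed, drained back into the shared channel when the connection drops. None of this exists yet; the names are made up:

```go
package main

import (
	"fmt"
	"sync"
)

type Link struct {
	Url      string
	Referrer string
}

// inFlight tracks, for one client, the links that were sent out but whose
// results haven't come back yet.
type inFlight struct {
	mu      sync.Mutex
	pending map[string]Link // keyed by URL
}

func newInFlight() *inFlight {
	return &inFlight{pending: map[string]Link{}}
}

// sent records that a link was handed to the client.
func (f *inFlight) sent(l Link) {
	f.mu.Lock()
	f.pending[l.Url] = l
	f.mu.Unlock()
}

// processed clears a link once the client reports its result.
func (f *inFlight) processed(url string) {
	f.mu.Lock()
	delete(f.pending, url)
	f.mu.Unlock()
}

// requeue pushes everything still pending back onto the shared channel,
// for when the client's connection drops.
func (f *inFlight) requeue(unprocessed chan<- Link) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, l := range f.pending {
		unprocessed <- l
	}
	f.pending = map[string]Link{}
}

func main() {
	unprocessed := make(chan Link, 100)
	f := newInFlight()

	f.sent(Link{Url: "https://example.com/a", Referrer: "https://example.com/"})
	f.sent(Link{Url: "https://example.com/b", Referrer: "https://example.com/"})
	f.processed("https://example.com/a")

	// Pretend the client disconnected: only /b goes back on the queue.
	f.requeue(unprocessed)
	fmt.Println(len(unprocessed), "link(s) back in the queue") // prints 1
}
```

The sender would call sent right after a successful Encode, the receiver would call processed when a Result comes back, and a deferred requeue on disconnect would put the leftovers back.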
That's it. I haven't run it across multiple computers yet, but it does a pretty good job of doing what it's supposed to do. I can't wait to test it soon :)