Build a simple web crawler with Scala and GridGain
Recently, as a proof-of-concept, I had to build a crawler. Of course, I cannot share many details about that project other than to state that it’s an absolute privilege to be part of. :-)
I set out to build this crawler. Prior experience had made me aware of distributed computing technologies such as Hadoop and GridGain, so I knew where to start. I immediately picked GridGain over Hadoop, for fairly obvious reasons: more examples, better support, and so on.
My next choice was a programming language. Java was the obvious choice, but I took a risk and chose Scala. GridGain’s support for Scala and the abundance of examples made this choice a bit easier. A quick, unofficial definition for those unaware: Scala is a hybrid object-oriented/functional programming language that is very attractive to programmers and has proved itself in high-scalability situations (Twitter, LinkedIn, Foursquare etc.)
Note – I am new to Scala, and my Scala code may look more Java-like than functional. I’m still learning, and future examples should be better. “Awesomeness of Scala code” is not a valid parameter by which to judge this blog post!
Professional etiquette (and NDAs + lawyers) will not allow me to share the exact details of this crawler. After all, it is not my intellectual property. For the sake of this example, I will instead consider my target to be a simple web crawler of the kind a search engine would use to index content on the internet.
What would our web crawler do?
1. Start at some base URL
2. Index the content of this URL
3. Search for more URLs to index
4. Repeat steps 2 & 3 for these new URLs
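Before distributing anything, the loop above can be sketched in plain, single-threaded Scala. This is only an illustration: `fetchPage` is a hypothetical stand-in for a real HTTP fetch, and "indexing" is reduced to remembering the URL.

```scala
import scala.collection.mutable

object CrawlerSketch {
  // Stand-in for a real HTTP fetch; maps a URL to its "page content".
  val fetchPage: Map[String, String] = Map(
    "http://a" -> "scala grid http://b http://c",
    "http://b" -> "crawler http://c",
    "http://c" -> "gridgain"
  )

  // Naive URL extraction: any whitespace-separated token starting with http://
  def extractUrls(content: String): List[String] =
    content.split("\\s+").filter(_.startsWith("http://")).toList

  // Steps 1-4 as a queue-driven loop with a visited set.
  def crawl(base: String): Set[String] = {
    val visited = mutable.Set[String]()
    val queue   = mutable.Queue(base)
    while (queue.nonEmpty) {
      val url = queue.dequeue()
      if (!visited(url)) {
        visited += url                                   // step 2: "index" this URL
        val content = fetchPage.getOrElse(url, "")
        extractUrls(content).foreach(queue.enqueue(_))   // steps 3 & 4
      }
    }
    visited.toSet
  }
}
```

The rest of the post is about replacing this sequential loop with jobs that run across a grid.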
This blog post will not get into the operational logic of loading a URL, extracting keywords, adding to an index, extracting URLs and so on. That, I believe, has been done to death. Instead, I will look at how to scale up the crawling process using Scala and GridGain.
If you are already familiar with GridGain, for the sake of this example I would request you to merge the concepts of a GridTask and a GridJob: here we will create custom GridTasks, each with one corresponding, unique custom GridJob.
Our GridTask-GridJob Pairs will be:
- LoadUrlDataTask, LoadUrlDataJob
- IndexKeywordsTask, IndexKeywordsJob
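The one-task-one-job pairing can be modelled with a couple of plain Scala traits. To be clear, these are not GridGain's actual `GridTask`/`GridJob` interfaces (which carry richer lifecycles, splitting, and reduction); this is just a minimal sketch of the simplified relationship used in this post.

```scala
// Hypothetical stand-ins for the GridGain types, modelling only
// "each task produces exactly one corresponding job".
trait Job { def execute(): String }

trait Task[A] { def createJob(arg: A): Job }

class IndexKeywordsJob(data: String) extends Job {
  // Real indexing logic would live here; we just record what happened.
  def execute(): String = s"indexed: $data"
}

class IndexKeywordsTask extends Task[String] {
  // Returns exactly one job, per the simplified model above.
  def createJob(arg: String): Job = new IndexKeywordsJob(arg)
}
```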
Much of the game is played in LoadUrlDataJob. Its role is envisioned as follows:
- Make HTTP request to URL
- Gather response data from URL
- Trigger IndexKeywordsTask for URL data
- Fetch new URLs from URL data
- Trigger LoadUrlDataTask for new URLs
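The five steps above can be sketched like this. Again a sketch under stated assumptions: `fetch` is a hypothetical stand-in for an HTTP GET, and instead of actually triggering grid tasks, `execute` returns what it *would* hand to IndexKeywordsTask and to the new LoadUrlDataTasks.

```scala
object LoadUrlDataSketch {
  // Stand-in fetch; a real job would perform an HTTP request here.
  def fetch(url: String): String =
    Map("http://a" -> "scala http://b http://c").getOrElse(url, "")

  // Returns (data for IndexKeywordsTask, urls for new LoadUrlDataTasks);
  // in the real job these would be task executions on the grid,
  // not return values.
  def execute(url: String): (String, List[String]) = {
    val data = fetch(url)                                         // steps 1 & 2
    val newUrls =
      data.split("\\s+").filter(_.startsWith("http://")).toList   // step 4
    (data, newUrls)                                               // steps 3 & 5
  }
}
```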
While the rest have simple roles:
- LoadUrlDataTask = Return one LoadUrlDataJob
- IndexKeywordsTask = Return one IndexKeywordsJob
- IndexKeywordsJob = Parse data and index keywords
In other words, an IndexKeywords job indexes keywords and dies. In contrast, a LoadUrlData job triggers exactly one IndexKeywords job and potentially multiple further LoadUrlData jobs.
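What might the IndexKeywords job actually do with the data? A minimal sketch, assuming "parsing" is just whitespace tokenization (real code would strip HTML first) and the index is an in-memory inverted map from keyword to URLs:

```scala
object KeywordIndexSketch {
  // Fold each token of the page into a keyword -> URLs inverted index.
  def indexKeywords(url: String, content: String,
                    index: Map[String, Set[String]]): Map[String, Set[String]] =
    content.toLowerCase.split("\\W+").filter(_.nonEmpty).foldLeft(index) {
      (idx, word) => idx.updated(word, idx.getOrElse(word, Set.empty) + url)
    }
}
```

In the grid version, each IndexKeywordsJob would apply something like this to its page and then finish.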
Let’s look at the sources:
A quick look at the role of LoadUrlDataJob tells us that this needs to scale, and scale big. Here is a visualization showing three levels of LoadUrlData, wherein each LoadUrlDataJob spawns three other LoadUrlDataJobs and one IndexKeywordsJob.
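The growth is exponential, which is exactly why it needs to scale: with a fan-out of three, level n of the tree holds 3^n load jobs, so three levels already mean 1 + 3 + 9 = 13 LoadUrlDataJobs (each also spawning an IndexKeywordsJob). A quick sanity check:

```scala
object FanOutMath {
  // Number of LoadUrlDataJobs at a given level of the spawn tree
  // (level 0 is the root job).
  def jobsAtLevel(fanOut: Int, level: Int): Int =
    math.pow(fanOut, level).toInt

  // Total load jobs across the first `levels` levels:
  // for fanOut = 3 and three levels, 1 + 3 + 9 = 13.
  def totalJobs(fanOut: Int, levels: Int): Int =
    (0 until levels).map(jobsAtLevel(fanOut, _)).sum
}
```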
GridGain takes care of this seamlessly, dividing the jobs among available nodes without any configuration or instruction. Here are screenshots showing three GridGain nodes: one inside my IDE while the other two run on the console.
Is this a perfect web crawler? No, far from it. For one, you need to control its spawn-rate, else your machine will die. :-)
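Two simple brakes on the spawn-rate would be a visited set (never crawl the same URL twice) and a depth cap (stop fanning out past a maximum depth). A hypothetical guard a LoadUrlDataJob could check before spawning:

```scala
object ThrottledCrawl {
  // Spawn a new LoadUrlDataTask only if we are within the depth budget
  // and the URL has not already been crawled.
  def shouldSpawn(url: String, depth: Int, maxDepth: Int,
                  visited: Set[String]): Boolean =
    depth <= maxDepth && !visited.contains(url)
}
```

Real throttling on a grid also needs shared state (the visited set must be visible to every node), which is beyond this post.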
But it is an example that does showcase the power of GridGain and the ease with which Scala / Scalar can leverage it.