Crawling a website with Python's scrapy

Think of a website as a web; how do we crawl that web? Chances are you went to the navigation menu, found a link that looked interesting, clicked on it, and landed on the page with the information you were looking for. Or perhaps your favourite search engine did it for you. How did your search engine do that, and how can you make that traversal happen automatically? That is exactly where a crawler comes into business. Chances are your search engine started crawling your website from a link you shared somewhere. We will create one such crawler using Python's crawling framework, scrapy. I have been using it for the last couple of months, so it felt wrong not to have a blog about it.

It is always better to have a Python virtual environment, so let's set it up:

$ virtualenv .env
$ source .env/bin/activate

Now that we have a virtual environment running, we will install scrapy.
$ pip install scrapy

It has some dependencies: lxml, which is basically used for HTML parsing with selectors, plus cryptography and SSL-related Python libraries that will also be installed. Pip takes care of everything, but once we start writing code these names will show up in error messages very often, so it is always a good idea to have some idea about the dependencies.
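
A quick way to see exactly which versions of these dependencies ended up in your environment is scrapy's own version command:

$ scrapy version -v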

Now that we have it installed, we have access to a few new commands. Using these commands we can create our own scrapy project. That is not strictly necessary, but I personally like to have everything bootstrapped the way the creator wanted me to have it; that way my project follows the same code standard the author of scrapy had in mind while writing the framework.

$ scrapy startproject blog_search_engine
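
If you are curious about what got generated, the layout typically looks something like this (the exact set of files can vary between scrapy versions; newer ones also add a middlewares.py):

blog_search_engine/
    scrapy.cfg                  # project configuration file
    blog_search_engine/         # the project's python package
        __init__.py
        items.py                # item (model) definitions
        pipelines.py            # item pipelines
        settings.py             # project settings
        spiders/                # our spiders live here
            __init__.py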

As you can see, it creates a bunch of necessary and unnecessary files; you can read about all of them in the documentation. The interesting part here is the configuration file called scrapy.cfg, which empowers you with a few extra commands. Your spiders reside inside the inner package folder. Spiders are basically the bots; each spider class contains the characteristics definition of that bot. You can usually create a spider using the following command as a solid start:

$ scrapy genspider wordpress wordpress.com

It will generate a spider called wordpress inside your blog_search_engine/blog_search_engine/spiders/ directory. The generated file contains 4 or 5 lines of code which do nothing yet. Let's give it some functionality, shall we? But we don't know yet what we are automating. We will visit wordpress.com, find the links to articles, and then follow each link and fetch the article. So before we write our spider we need to define what we are looking for, right? Let's define our model. Models are usually stored inside items.py. A possible Article might have the following fields.

import scrapy


class Article(scrapy.Item):
    title = scrapy.Field()
    body = scrapy.Field()
    link = scrapy.Field()

Now we will define our spider.

import scrapy
from bs4 import BeautifulSoup   # from the beautifulsoup4 package

from blog_search_engine.items import Article


class WordPressSpider(scrapy.Spider):
    name = 'wordpress'
    start_urls = ['https://wordpress.com/']

    def parse(self, response):
        # collect every link inside the #post-river section
        article_links = response.css("#post-river").xpath(".//a/@href").extract()

        for link in article_links:
            if "https://en.blog.wordpress.com/" in link:
                yield scrapy.Request(link, self.extract_article)

    def extract_article(self, response):
        article = Article()
        css = lambda s: response.css(s).extract()

        article['title'] = css(".post-title::text")[0]
        article['link'] = response.url

        # strip the HTML tags from the article body
        body_html = " ".join(css('.entrytext'))
        body_soup = BeautifulSoup(body_html, "html.parser")
        article['body'] = ''.join(body_soup.find_all(text=True))

        yield article
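
If you want to experiment with the CSS and XPath selectors used above before putting them into the spider, scrapy ships with an interactive shell (these are the same selectors the spider uses; whether they still match wordpress.com's current markup is another matter):

$ scrapy shell 'https://wordpress.com/'
>>> response.css("#post-river").xpath(".//a/@href").extract()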

As configured in our scrapy settings, yield-ing an item from the spider hands it over to the pipeline, and as it looks, the pipeline could be a great place for database operations. That is possibly out of the scope of this particular blog, but you can still get an outline of what you might need to do if you are using sqlalchemy as the database layer. Although sqlalchemy won't be particularly helpful for what we intend to do here, I still felt it would be useful to have it sketched out.

class BlogSearchEnginePipeline(object):
    def process_item(self, item, spider):
        # with sqlalchemy this could be something like:
        # a = Article(title=item['title'], body=item['body'])
        # db.session.add(a)
        # db.session.commit()
        print('article found:', item['title'], item['body'])

        return item
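
The "as configured in our scrapy settings" part refers to settings.py: the pipeline only receives items if it is enabled there. A minimal sketch, assuming the default project layout we generated earlier (the number is just the pipeline's priority):

# settings.py
ITEM_PIPELINES = {
    'blog_search_engine.pipelines.BlogSearchEnginePipeline': 300,
}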

Now we have a spider defined. But how do we run it? It's actually easy, but remember that you need to be inside your scrapy project for this command to work!

$ scrapy crawl wordpress

On a side note, scrapy actually provides us the option to pass arguments from the command line to the spider; we just need to define an initializer parameter:

class WordPressSpider(scrapy.Spider):
    name = "wordpress"
    ...
    def __init__(self, param=None, *args, **kwargs):
        super(WordPressSpider, self).__init__(*args, **kwargs)
        self.param = param
    ...

Now we could call:

$ scrapy crawl wordpress -a param=helloworld
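 
Inside the spider that argument becomes self.param. As a hypothetical example (our wordpress spider does not actually need this), it could be used to filter which links get followed:

    def parse(self, response):
        for link in response.css("a::attr(href)").extract():
            # self.param is whatever was passed via -a param=... on the command line
            if self.param and self.param in link:
                yield scrapy.Request(link, self.extract_article)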

In this blog I tried to give you an outline of crawling with scrapy. So far we have a spider, but on its own it has no great use yet; we will try to build a search engine around it in my next blog. The databases that sqlalchemy deals with are not particularly good at text search, so elasticsearch could be a great option if we are looking to implement search. In my next blog I will write about a basic search engine implementation using elasticsearch. That's on my todo list for this weekend.