Crawling the web with Python

You can use two libraries to examine webpages in Python. The first, requests, fetches pages over HTTP. It is a third-party package rather than part of the standard library, and is widely regarded as friendlier to use than the built-in urllib. The second, BeautifulSoup, parses the HTML. Both can be installed with

pip install requests beautifulsoup4

Then it’s really easy to fetch a page and list its links:

import requests
from bs4 import BeautifulSoup

def main():
    url = "http://www.bbc.co.uk"

    # Get webpage
    r  = requests.get(url)

    # What status code was returned? (200 means success)
    print(r.status_code)

    # Get text data (this is the source HTML)
    data = r.text

    # Parse the HTML with BeautifulSoup, naming the parser explicitly
    soup = BeautifulSoup(data, 'html.parser')

    # Then do some interesting stuff!
    # For example, get all the linked URLs in the page
    for link in soup.find_all('a'):
        print(link.get('href'))

if __name__ == '__main__':
    main()
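
The href values printed by the script are raw attribute values: many are relative (such as /news), some are missing entirely (link.get('href') returns None), and some are not web pages at all (mailto: links, for example). To actually crawl, you would need absolute URLs. Here is a minimal sketch of that step, assuming Python 3 and using only the standard library; the function name resolve_links and the sample hrefs are my own, for illustration only.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs, dropping duplicates.

    resolve_links is a hypothetical helper, not part of requests or bs4.
    """
    seen = set()
    resolved = []
    for href in hrefs:
        if not href:
            # <a> tags without an href attribute yield None
            continue
        # Resolve relative paths against the URL of the page they came from
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            # Skip mailto:, javascript: and similar non-web links
            continue
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


if __name__ == "__main__":
    # Made-up sample of the kind of values find_all('a') can produce
    sample = ["/news", None, "sport/football",
              "mailto:someone@example.com", "/news",
              "https://example.com/page"]
    for url in resolve_links("http://www.bbc.co.uk/", sample):
        print(url)
```

In the script above, you could feed it [link.get('href') for link in soup.find_all('a')] together with the url you fetched, then request each result in turn.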