Simple web scraping with Python BeautifulSoup

Mathi Maheswaran
Mar 22, 2021

The internet is a massive source of data, but most of that data is not structured for further analysis. For example, if you want to analyze weather information for one year, you first have to collect a year's worth of data and then analyze it. Doing this by hand takes a lot of effort.

To avoid this manual process, people use web scraping to collect data from the web automatically.

Python is a popular language for web scraping, and many additional packages are available for the task. Below are step-by-step instructions to extract data from a website.

The following Python libraries are required for web scraping. If you have not installed them yet, please install them.

1. Requests

The requests library is used to send a request to the website and retrieve the HTML data.
pip install requests

2. Beautifulsoup4

The beautifulsoup4 library is used to navigate the HTML tree structure and extract what you need from the raw HTML data.
pip install beautifulsoup4

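3. lxml

BeautifulSoup also relies on a parser to read the HTML. This tutorial uses lxml, which is installed separately; Python's built-in html.parser is an alternative that needs no installation.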
pip install lxml
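
As a quick check that the installation worked, the sketch below parses a small inline HTML string. It tries lxml first and falls back to Python's built-in html.parser if lxml is missing. The helper name make_soup and the sample HTML are made up for illustration; they are not part of either library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml if it is installed, else with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(html, "html.parser")


# Hypothetical sample markup, used only to confirm that parsing works
sample_html = "<html><body><p>Hello, scraper!</p></body></html>"
soup = make_soup(sample_html)
print(soup.p.text)  # prints: Hello, scraper!
```

If this prints the paragraph text without an error, the libraries are ready to use.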

To begin, import BeautifulSoup and requests, then grab the source data:

from bs4 import BeautifulSoup
import requests

Make the request to get the data:

url = "https://www.cricbuzz.com/cricket-series/3362/england-tour-of-india-2021/stats"
html = requests.get(url).text  # raw HTML of the page
soup = BeautifulSoup(html, "lxml")
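
A request can fail (a timeout, a 404, a blocked client), so it may be worth wrapping the call. The following is a minimal sketch, not the only way to do it: the function name fetch_html, the 10-second timeout, and the User-Agent value are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails."""
    # Some sites reject the default client string; this header value is an example only
    headers = {"User-Agent": "Mozilla/5.0 (compatible; simple-scraper-example)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

You would call fetch_html with the page address and pass the result to BeautifulSoup only when it is not None.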

The request returns the raw HTML text, which BeautifulSoup parses and represents as a tree-based structure. We need to identify the HTML DOM element that holds the data. You can use Chrome Developer Tools to inspect the DOM.

Chrome Developer Tool View
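
To make the tree structure more concrete, here is a small sketch that parses an inline HTML string and walks through it. The markup is hypothetical and only loosely shaped like a stats table row; the real page is more complex. It uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a stats table row
html = """
<html>
  <head><title>Series stats</title></head>
  <body>
    <table>
      <tr>
        <td><a class="cb-text-link" href="/profiles/1/player-one">Player One</a></td>
        <td>250</td>
      </tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)   # text of the <title> tag: Series stats
link = soup.find("a")    # first <a> tag in the document
print(link.name)         # tag name: a
print(link["class"])     # class attribute, returned as a list: ['cb-text-link']
print(link["href"])      # href attribute: /profiles/1/player-one
print(link.parent.name)  # the tag that contains the link: td
```

The tag name and class shown in the developer tools are the same values you pass to BeautifulSoup when searching the tree.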

As the developer tools view shows, the players’ names are in “a” tags with the class name “cb-text-link”.

playerNames = soup.find_all('a', attrs={'class': 'cb-text-link'})

All the matching tags are now in the “playerNames” list. We can display the players’ names using a for loop.

for player in playerNames:
    print(player.text)

This gives the result below.

Python Output Screen
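
The same lookup can be tried offline on a small snippet. The sketch below uses hypothetical markup (two player links and one unrelated link) and shows two equivalent ways to filter by class, plus get_text(strip=True) to trim stray whitespace around each name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two player links and one unrelated link
html = """
<div>
  <a class="cb-text-link" href="/profiles/1/player-one"> Player One </a>
  <a class="cb-text-link" href="/profiles/2/player-two">Player Two</a>
  <a class="nav-link" href="/home">Home</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form of the class filter (class_ avoids the reserved word "class")
links = soup.find_all("a", class_="cb-text-link")

# Equivalent CSS selector form
same_links = soup.select("a.cb-text-link")

# get_text(strip=True) trims surrounding whitespace from each name
names = [link.get_text(strip=True) for link in links]
print(names)  # prints: ['Player One', 'Player Two']
```

Collecting the names into a list, rather than only printing them, makes it easier to reuse them later for analysis.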

Complete code.

from bs4 import BeautifulSoup
import requests

# Stats page for the England tour of India 2021 series
url = "https://www.cricbuzz.com/cricket-series/3362/england-tour-of-india-2021/stats"
html = requests.get(url).text  # raw HTML of the page
soup = BeautifulSoup(html, "lxml")

# Player names are in <a> tags with the class "cb-text-link"
playerNames = soup.find_all('a', attrs={'class': 'cb-text-link'})

for player in playerNames:
    print(player.text)

Now we are able to scrape data from the website.

Happy scraping!!

Reference site: https://mcubedata.com/simple-web-scraping-with-python-beautifulsoup/
