Web scraping is the method of extracting, or harvesting, data from websites. The extracted data is then processed and stored in a more structured form.

Most of us have copied information from a website at some point. Web scraping does the same thing, but in large volumes, using a script or specialized software.

When we do a web search, search engines present us with tons of information from millions of websites. How do they do that? They crawl through the websites and fetch the information to build their database.

This article shows how to scrape information from a simple demo web page using Python.

As for legality, please do your own research!

Note: This article is only for the purpose of knowledge sharing.

The demo web page

HTML code of the website

<!DOCTYPE html>
<html lang="en-US">

Python code to scrape the information from the website

import requests
import bs4
import csv

#class definition
#Objects of this class hold the data of one table row
class Data_Class:
  col_data_1 = None
  col_data_2 = None
  col_data_3 = None

  def __init__(self, col_data_1, col_data_2, col_data_3):
    self.col_data_1 = col_data_1
    self.col_data_2 = col_data_2
    self.col_data_3 = col_data_3

  def __iter__(self):
    return iter([self.col_data_1, self.col_data_2, self.col_data_3])

#This function writes the list of objects to a csv file
def write_csv(data_list, file_path):
  with open(file_path, 'w', newline='') as file:
    writer = csv.writer(file)
    #each object is iterable, so it is written as one csv row
    writer.writerows(data_list)

#This function will fetch the website data and pass the data to the write_csv function
def fetch_webdata(filepath, weburl):
  data_list = list()
  row_count = 0
  #get the website data and parse using beautifulsoup
  response = requests.get(weburl)
  soup = bs4.BeautifulSoup(response.content, 'html.parser')
  #find all the table rows and loop through the items
  allrows = soup.find_all('tr')
  for row in allrows:
    #Fetching the heading
    if (row_count == 0):
      row_count += 1
      allheadings = row.find_all('th')
      head_count = 0
      head_name_1 = ""
      head_name_2 = ""
      head_name_3 = ""
      for heading in allheadings:
        if (head_count == 0):
          head_name_1 = heading.get_text()
        elif (head_count == 1):
          head_name_2 = heading.get_text()
        elif (head_count == 2):
          head_name_3 = heading.get_text()
        head_count = head_count + 1
      #store the headings as the first row of the list
      data_class_obj = Data_Class(head_name_1, head_name_2, head_name_3)
      data_list.append(data_class_obj)
      #the heading row is done, move on to the next row
      continue
    #Fetching the rows
    #in each row, get the columns and loop through the values
    allcols = row.find_all('td')
    col_count = 0
    col_data_1 = ""
    col_data_2 = 0
    col_data_3 = 0
    #loop through the values in the sequence as they appear in the html table
    for col in allcols:
      if (col_count == 0):
        col_data_1 = col.get_text()
      elif (col_count == 1):
        col_data_2 = col.get_text()
      elif (col_count == 2):
        col_data_3 = col.get_text()
      col_count = col_count + 1
    #assign the values to an object of the declared class "Data_Class"
    data_class_obj = Data_Class(col_data_1, col_data_2, col_data_3)
    #append the object to the list
    data_list.append(data_class_obj)
  #calling the write csv function to write the data to a csv file
  write_csv(data_list, filepath) 

filepath = r'C:\Users\weapon-x\Python\File.csv'
weburl = 'https://tekcookie.com/samplepage.html'
fetch_webdata(filepath, weburl)

The above code will save the table data, including the headings, as a CSV file.
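The row objects can be handed to the csv module directly because the class defines __iter__, and csv.writer accepts any iterable as a row. The sketch below illustrates this with an in-memory buffer instead of a file. The Row class, the rows_to_csv_text helper, and the sample values are made up for illustration; Row only stands in for Data_Class.

```python
import csv
import io

# Hypothetical stand-in for Data_Class: __iter__ lets csv.writer treat
# each object as a row of three values.
class Row:
    def __init__(self, col_1, col_2, col_3):
        self.col_1 = col_1
        self.col_2 = col_2
        self.col_3 = col_3

    def __iter__(self):
        return iter([self.col_1, self.col_2, self.col_3])

def rows_to_csv_text(rows):
    """Write the rows to an in-memory buffer and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # writerows iterates over the list, then over each object in it
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample data: a heading row followed by one data row.
sample_rows = [Row('Item', 'Price', 'Stock'), Row('Pen', '10', '25')]
csv_text = rows_to_csv_text(sample_rows)
print(csv_text)
```

Each object becomes one line of comma-separated values, in the order that __iter__ yields them.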

When the website has multiple tables or more complex content, the code has to be modified to select the required table by its id, class, or other attributes.
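One possible way to pick a single table out of several is sketched below. It is only an illustration: the HTML string, the table ids "menu" and "prices", and the extract_table helper are assumptions made up for this example, not the content of the demo page.

```python
import bs4

# Made-up page with two tables; only the one with id="prices" is wanted.
SAMPLE_HTML = """
<html><body>
<table id="menu"><tr><td>Home</td><td>About</td></tr></table>
<table id="prices" class="data">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>10</td></tr>
  <tr><td>Book</td><td>50</td></tr>
</table>
</body></html>
"""

def extract_table(html, table_id):
    """Return the rows of the table with the given id as lists of cell text."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # find returns the first matching table, or None if there is no match
    table = soup.find('table', id=table_id)
    if table is None:
        return []
    rows = []
    for row in table.find_all('tr'):
        # collect heading and data cells in the order they appear
        cells = row.find_all(['th', 'td'])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

rows = extract_table(SAMPLE_HTML, 'prices')
print(rows)
```

To select by class instead of id, the same call can be written as soup.find('table', class_='data'). Because each row is collected as a plain list, this variant also works for tables with any number of columns.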

I hope this article was informative. Thank you for reading.