Web scraping is a method for extracting or harvesting data from websites. The extracted data is then processed and stored in a more structured form.
Most of us have copied information from websites at some point. Web scraping also extracts information from websites, but in huge volumes, by means of scripts or specialized software.
When we do a web search, search engines present us with results drawn from millions of websites. How do they do that? They crawl those websites and fetch the information to build their databases.
This article shows how to scrape information from a simple demo web page using Python.
Regarding legality, please do your own research!
Note: This article is for knowledge-sharing purposes only.
The demo web page
HTML code of the website
<!DOCTYPE html>
<html lang="en-US">
<head>
</head>
<body>
  <table>
    <tr>
      <th>Name</th>
      <th>SerialNo</th>
      <th>Price</th>
    </tr>
    <tr>
      <td>Butter</td>
      <td>22315</td>
      <td>12</td>
    </tr>
    <tr>
      <td>Gum</td>
      <td>11452</td>
      <td>5</td>
    </tr>
    <tr>
      <td>Milk</td>
      <td>55462</td>
      <td>23</td>
    </tr>
    <tr>
      <td>Sugar</td>
      <td>55411</td>
      <td>18</td>
    </tr>
  </table>
</body>
</html>
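If the demo URL used later in this article is ever unavailable, one option (my own suggestion, not part of the original setup) is to save the markup above as samplepage.html and serve it locally with Python's built-in web server:

# serve the current directory (run from the folder containing samplepage.html)
python -m http.server 8000

The page is then reachable at http://localhost:8000/samplepage.html, which can be used as the weburl in the script below.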
Python code to scrape the information from the website
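The script uses the third-party requests and beautifulsoup4 packages (imported as bs4); if they are not already installed, they can be added with pip:

pip install requests beautifulsoup4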
import requests
import bs4
import csv

#class definition
#objects of this class hold the data of each item
class Data_Class:
    col_data_1 = None
    col_data_2 = None
    col_data_3 = None

    def __init__(self, col_data_1, col_data_2, col_data_3):
        self.col_data_1 = col_data_1
        self.col_data_2 = col_data_2
        self.col_data_3 = col_data_3

    def __iter__(self):
        return iter([self.col_data_1, self.col_data_2, self.col_data_3])

#Functions
#This function writes the list of objects to a csv file
def write_csv(data_list, file_path):
    with open(file_path, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data_list)

#This function fetches the website data and passes the data to the write_csv function
def fetch_webdata(filepath, weburl):
    data_list = list()
    row_count = 0
    #get the website data and parse it using BeautifulSoup
    response = requests.get(weburl)
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    #find all the table rows and loop through them
    allrows = soup.find_all('tr')
    for row in allrows:
        #Fetching the heading
        if row_count == 0:
            row_count += 1
            allheadings = row.find_all('th')
            head_count = 0
            head_name_1 = ""
            head_name_2 = ""
            head_name_3 = ""
            for heading in allheadings:
                if head_count == 0:
                    head_name_1 = heading.get_text()
                elif head_count == 1:
                    head_name_2 = heading.get_text()
                else:
                    head_name_3 = heading.get_text()
                head_count = head_count + 1
            data_class_obj = Data_Class(head_name_1, head_name_2, head_name_3)
            data_list.append(data_class_obj)
            continue
        #Fetching the data rows
        #in each row, get the columns and loop through the values
        allcols = row.find_all('td')
        col_count = 0
        col_data_1 = ""
        col_data_2 = 0
        col_data_3 = 0
        #loop through the values in the sequence they appear in the html table
        for col in allcols:
            if col_count == 0:
                col_data_1 = col.get_text()
            elif col_count == 1:
                col_data_2 = col.get_text()
            else:
                col_data_3 = col.get_text()
            col_count = col_count + 1
        #assign the values to an object of the declared class "Data_Class"
        data_class_obj = Data_Class(col_data_1, col_data_2, col_data_3)
        #append the object to the list
        data_list.append(data_class_obj)
    #call the write_csv function to write the data to a csv file
    write_csv(data_list, filepath)

filepath = r'C:\Users\weapon-x\Python\File.csv'
weburl = 'https://tekcookie.com/samplepage.html'
fetch_webdata(filepath, weburl)
The above code saves the data as a CSV file.
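For the demo table above, the generated File.csv should contain one header row followed by one row per item:

Name,SerialNo,Price
Butter,22315,12
Gum,11452,5
Milk,55462,23
Sugar,55411,18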
When a website has multiple tables and more complex content, the code has to be modified to select the required table by its id, class, or other attributes, as sketched below.
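Here is a minimal sketch of selecting one specific table before reading its rows; the id and class values ('price-table', 'product-prices') are hypothetical, used only for illustration:

#select a single table by a hypothetical id, then read only its rows
target_table = soup.find('table', id='price-table')
allrows = target_table.find_all('tr')

#alternatively, select the table by a hypothetical CSS class
target_table = soup.find('table', class_='product-prices')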
I hope this article is informative. Thank you for reading.