Parallelizing web scraping to improve performance and scalability
Web scraping is the process of extracting data from websites, usually for analysis or other purposes. Web scraping is increasingly important in e-commerce because it enables businesses to extract valuable data from competitors' websites, especially product prices. This data can be used to gain...
Main Author: | Na, Yi Chun |
---|---|
Format: | Final Year Project / Dissertation / Thesis |
Published: | 2023 |
Subjects: | T Technology (General) TD Environmental technology. Sanitary engineering |
Online Access: | http://eprints.utar.edu.my/6039/1/fyp_CS_2023_NYC.pdf http://eprints.utar.edu.my/6039/ |
id |
my-utar-eprints.6039 |
---|---|
record_format |
eprints |
spelling |
my-utar-eprints.6039 2024-01-02T14:52:25Z Parallelizing web scraping to improve performance and scalability Na, Yi Chun T Technology (General) TD Environmental technology. Sanitary engineering 2023-06 Final Year Project / Dissertation / Thesis NonPeerReviewed application/pdf http://eprints.utar.edu.my/6039/1/fyp_CS_2023_NYC.pdf Na, Yi Chun (2023) Parallelizing web scraping to improve performance and scalability. Final Year Project, UTAR.
http://eprints.utar.edu.my/6039/ |
institution |
Universiti Tunku Abdul Rahman |
building |
UTAR Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Tunku Abdul Rahman |
content_source |
UTAR Institutional Repository |
url_provider |
http://eprints.utar.edu.my |
topic |
T Technology (General) TD Environmental technology. Sanitary engineering |
description |
Web scraping is the process of extracting data from websites, usually for analysis or other purposes. It is increasingly important in e-commerce because it enables businesses to extract valuable data from competitors' websites, especially product prices. This data can be used to gain insights into market trends, optimize pricing strategies, and improve product offerings. However, monitoring product prices across numerous e-commerce platforms is challenging because of factors such as the sheer volume of data available and limited network bandwidth. The main purpose of this project is to reduce the long waiting times users face when scraping, so that the desired information can be collected from the internet much faster, whether the data is small-scale or large-scale, by integrating web scraping technologies with a distributed computing system. The proposed solution must be configured and initialized by the user. It employs a message queuing algorithm to divide scraping tasks into smaller units and uses multiple worker nodes for concurrent web data extraction. During scraping, Selenium WebDriver interacts with specified web elements based on user-defined selectors or XPath, allowing asynchronous HTTP requests and responses. Performance metrics such as response times, bandwidth data, and task status are monitored for benchmarking and error handling. When scraping finishes, the results are stored in a CSV file for further analysis. |
format |
Final Year Project / Dissertation / Thesis |
author |
Na, Yi Chun |
title |
Parallelizing web scraping to improve performance and scalability |
publishDate |
2023 |
url |
http://eprints.utar.edu.my/6039/1/fyp_CS_2023_NYC.pdf http://eprints.utar.edu.my/6039/ |
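The description in this record outlines a pipeline: a message queue divides scraping tasks into smaller units, several workers process them concurrently, per-task metrics are recorded, and results end up in a CSV file. The sketch below illustrates that flow under loud assumptions and is not the thesis's implementation. It uses threads in one process instead of distributed worker nodes, and `fake_scrape` is a hypothetical stand-in for a Selenium WebDriver page visit. The names `chunk`, `worker`, `run`, and `to_csv`, the column names, and the example URLs are all invented for illustration.

```python
import csv
import io
import queue
import threading
import time


def fake_scrape(url):
    """Hypothetical stand-in for a Selenium page visit (no network access)."""
    time.sleep(0.01)  # simulate latency
    return {"url": url, "price": float(len(url))}


def chunk(items, size):
    """Split a list of URLs into task units of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def worker(tasks, results, scrape):
    """Consume task units from the queue until a None sentinel arrives."""
    while True:
        unit = tasks.get()
        if unit is None:
            break
        for url in unit:
            start = time.perf_counter()
            try:
                row = dict(scrape(url), status="ok")
            except Exception as exc:  # record the failure instead of crashing
                row = {"url": url, "price": "", "status": f"error: {exc}"}
            row["response_time_s"] = round(time.perf_counter() - start, 4)
            results.put(row)


def run(urls, n_workers=4, unit_size=2, scrape=fake_scrape):
    """Queue the task units, run the workers, and collect one row per URL."""
    tasks, results = queue.Queue(), queue.Queue()
    for unit in chunk(urls, unit_size):
        tasks.put(unit)
    for _ in range(n_workers):
        tasks.put(None)  # one sentinel per worker, queued after all units
    threads = [
        threading.Thread(target=worker, args=(tasks, results, scrape))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    rows = []
    while not results.empty():
        rows.append(results.get())
    return sorted(rows, key=lambda r: r["url"])


def to_csv(rows):
    """Serialize the result rows as CSV text."""
    buf = io.StringIO()
    fields = ["url", "price", "status", "response_time_s"]
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `run([...], n_workers=3)` and passing the rows to `to_csv` yields one CSV line per URL plus a header. In a distributed setup, the in-process `queue.Queue` would presumably be replaced by a message broker shared between machines, which this sketch does not attempt to model.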