There are several frameworks and libraries you will need to learn as you pick up the basics of web scraping. A good grasp of HTTP methods like GET and POST, combined with Selenium web scraping, makes the data extraction process much easier.
Selenium is a widely known tool for automating browser interactions. Combining it with other technologies like **BeautifulSoup** gives you even better results when you perform web scraping. Selenium works by automating the actions in your script, so there is no need for human intervention such as clicking or scrolling to drive the interaction between the script and the browser.
Even though Selenium is usually described as the perfect tool for testing web applications, its uses go well beyond that. In this guide, we will walk through Selenium web scraping using Python 3.x.
Set Up Selenium
First, install the Selenium package by executing this pip command in your terminal:
pip install selenium
After this, you also need to install a Selenium driver, which allows Python to control and interact with the web browser at the operating-system level. If you install the driver manually, make sure its executable is available via the PATH variable. Drivers for Chrome, Firefox, and Edge can be downloaded from each browser vendor's site.
Starting Selenium
Let us begin by starting up your web browser:
- Open a new browser window
- Load a page of your choice; in this example, we will load the Limeproxies homepage
from selenium import webdriver

# Launch Firefox and load the page.
browser = webdriver.Firefox()
browser.get('http://limeproxies.com/')
Doing this launches the browser in headful mode (with a visible UI). If you want to run the browser in headless mode on a server instead, the script would look like this:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Configure Firefox to run without a visible UI.
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

# DRIVER_PATH is a placeholder for the path to your geckodriver executable.
driver = webdriver.Firefox(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.oxylabs.io/")
print(driver.page_source)
driver.quit()
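Once the driver is running, Selenium pairs naturally with BeautifulSoup for parsing the rendered page. Here is a minimal sketch of that combination (assuming BeautifulSoup 4 is installed via pip install beautifulsoup4):
from selenium import webdriver
from bs4 import BeautifulSoup

# Selenium renders the page, including any JavaScript-generated content.
browser = webdriver.Firefox()
browser.get('http://limeproxies.com/')

# BeautifulSoup then parses the rendered HTML for convenient extraction.
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.title.string)

browser.quit()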
Selenium vs Real-Time Crawler
If you want to learn web scraping, Selenium is a great option. It is best used together with BeautifulSoup while you learn the HTTP protocol, how data is exchanged between server and browser, and how cookies and headers work. If you are looking for an easier way to perform web scraping, a number of tools can help. Depending on how much data you need to collect and the targets involved, a dedicated web scraping tool can save you both time and resources.
Real-Time Crawler is a tool that can be used for an easier web scraping process. Its two main functionalities are:
- HTML Crawler API: lets you scrape most websites and returns the raw HTML
- Data API: mainly for e-commerce and search engine websites; it returns the data in structured JSON format
Real-Time Crawler is easy to integrate; here is an example in Python:
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://stackoverflow.com/questions/tagged/python',
    'user_agent_type': 'desktop',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# This will return the JSON response with results.
pprint(response.json())
Real-Time Crawler and Selenium offer a number of advantages, including:
- Easy scraping
- A guaranteed 100% success rate on delivered results
- No need for extra coding
- Automated web scraping processes
- Built-in proxy rotation
Selenium Web Scraping by Locating Elements
find_element
Selenium offers several functions for locating elements on a page:
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text (matches the full text of a link)
- find_element_by_partial_link_text (matches part of a link's text)
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector (uses a CSS selector such as an id or class)
For example, let's locate the h1 tag on the Limeproxies homepage using Selenium:
<html>
<head>
... something
</head>
<body>
<h1 class="someclass" id="greatID"> Partner Up With Proxy Experts</h1>
</body>
</html>
h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')
You can also use the find_elements_* functions to return a list of matching elements:
all_links = driver.find_elements_by_tag_name('a')
This returns every anchor on the page.
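For example, this small sketch (reusing the driver from above) prints each link's text and destination URL:
all_links = driver.find_elements_by_tag_name('a')
for link in all_links:
    # link.text is the visible anchor text; href is its destination.
    print(link.text, link.get_attribute('href'))
Some elements, however, are not easy to access using an ID or class name; for those you need XPath.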
WebElement
In Selenium, a WebElement represents an HTML element. The following are some of the most common actions:
- element.text (access the element's text)
- element.click() (click the element)
- element.get_attribute('class') (read one of its attributes)
- element.send_keys('mypassword') (send text to an input)
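For instance, reusing the h1 located earlier, a short sketch (the output comments reflect the sample HTML above):
heading = driver.find_element_by_id('greatID')
print(heading.text)                    # Partner Up With Proxy Experts
print(heading.get_attribute('class'))  # someclass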
XPath
XPath is a query language that helps you find a node in the DOM. It locates the node starting from the root element, using either a relative or an absolute path. For example:
- / : selects from the root node. /html/body/div[1] will find the first div
- // : selects matching nodes anywhere in the document, regardless of location. //form[1] will find the first form element
- [@attributename='value'] : a predicate. It finds a specific node or a node with a specific attribute value
//input[@name='email'] will find the first input element with the name "email".
<html>
<body>
<div class="content-login">
<form id="loginForm">
<div>
<input type="text" name="email" value="Email Address:">
<input type="password" name="password"value="Password:">
</div>
<button type="submit">Submit</button>
</form>
</div>
</body>
</html>
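Combining these XPath selectors with the WebElement actions shown earlier, a script could fill in and submit this form. This is only a sketch; the credentials are placeholders:
# Locate the form fields from the HTML above by XPath.
email = driver.find_element_by_xpath("//input[@name='email']")
password = driver.find_element_by_xpath("//input[@name='password']")

email.send_keys('user@example.com')   # placeholder credentials
password.send_keys('mypassword')

# Submit the form through its button.
driver.find_element_by_xpath("//form[@id='loginForm']//button[@type='submit']").click()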
Rendering Solutions for Slow Websites
Some websites rely heavily on JavaScript to render their content. These can be tricky to handle, as such pages also tend to make many AJAX calls. The issue can be solved in either of the following ways:
- time.sleep(ARBITRARY_TIME)
- WebDriverWait()
Example:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # Wait up to 10 seconds for the element to be present in the DOM.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    driver.quit()
This waits up to 10 seconds for the element with the given ID to appear, and raises a TimeoutException if it never does.
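By contrast, time.sleep() pauses the script for a fixed period whether or not the content has actually arrived, which is why WebDriverWait is usually the better choice. A minimal sketch, assuming an active driver session:
import time

driver.get('http://limeproxies.com/')
time.sleep(10)  # always waits the full 10 seconds, even if the page loads sooner
html = driver.page_source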
About the author
Rachael Chapman
A Complete Gamer and a Tech Geek. Brings out all her thoughts and Love in Writing Techie Blogs.