5 Tips When Using a Bot To Avoid Proxy Blocks

Reduce the Risk of Getting Your Proxies Blocked

If you scrape the web regularly, you know there are two things to consider before you start: legal considerations and IP blocks. Even though extracting data from a website without the owner’s consent isn’t outright illegal, the practice is frowned upon, which is why IP blocks are so common.

Website owners have two concerns: the extracted data can give a competitor the upper hand in business, and heavy bot activity can degrade a site’s performance and ultimately crash it.

So if you want to get into web scraping, make sure you can complete the process before you begin, so that you don’t waste resources. One way to improve your odds is to keep your IPs from getting blocked, and in this article we discuss ways to reduce the risk of getting your proxies blocked.

Your choice of proxies is another factor to consider carefully before you start scraping. While all proxies offer some anonymity, some are more difficult to detect than others.

Premium providers like Limeproxies offer dedicated IPs that are harder for a website’s anti-bot software to detect, which makes them a good source for your IPs.

Apart from the guide on reducing the risk of getting your proxies blocked, we also offer advice on configuring scraping software to improve your chances of completing your web scraping jobs.

Identification and Blocking of Bots

Before you go into preventing your proxies from getting blocked by websites when you scrape, you need to first understand how these websites identify your bot and distinguish it from a real user.

It’s safe to say that websites and scrapers are playing a cat-and-mouse game. As websites keep improving their methods of detecting and blocking automated activity, scrapers keep looking for ways to hide their presence.

You may think it’s straightforward to distinguish a real user from a bot, but it isn’t. Suspicious activity has to be detected and flagged, and usually only after further tracking does the website block it.

The most common methods used by websites to identify web scraping bots are as follows:

  • When a large number of requests is sent from a single IP to a URL in a short time, the traffic is treated as coming from a bot, since humans can’t operate that fast.
  • Websites can also unmask you if WebRTC leaks your real IP address to the website’s servers from behind the proxy.
  • When the attributes of a request don’t correlate with each other. Make sure the location of the IP you send requests from matches the time zone and language you present, to avoid getting flagged.
  • When suspicious browser configurations are detected, a website can link them to bot use and block the IP. Disabled JavaScript is one example. Browsers also differ in the JavaScript features they support, and a website can use these and other criteria to recognize your browser.
  • Connecting to a website without cookies is suspicious and points to bot use. Having cookies doesn’t make your bot undetectable, however, because cookies are also used to track you.
  • Websites also notice non-human behavior on the web page. Mouse and keyboard actions are hard for a bot to simulate convincingly, and bots, unlike humans, tend to be predictable. When a visitor’s actions follow a regular pattern, the website’s security features become suspicious.

Identifying bot activity is only the website’s first step. Once it suspects you, it can respond in various ways: tracking you, showing you an error page, or serving you false data. Ultimately, you may get blocked from accessing the site.

Websites frown on bot use partly because of spam in reviews and comment sections. A high volume of requests also drags on the website’s performance and can slow it down.

This can crash the site or lead to a poor user experience. There is also competition to consider: no one likes to lose, and many sites, especially eCommerce sites, compete with other brands.

Data extraction is one way a competitor can get ahead of you, as they can capitalize on your weaknesses and copy your strengths. This is another reason why websites block bots.

Web Scraping Scripts Using Python: Reduce the Risk of Your Scraper Getting Blocked

User Agent

When building a web scraper, you should set the user agent so that the target site serves you as it would a regular browser. The User-Agent header tells the website about the client making the request, so the site can return the requested data in a suitable form.

You can find your own user agent string by searching for ‘what is my user agent?’ on Google. If you are using the requests library, you can set your user agent as follows:

    import requests

    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    }

    # requests needs a full URL, including the scheme
    r = requests.get('https://example.com', headers=headers)

Make your user agent look real by picking a random one from a database or text file. You can get a list of user agents for various browsers here.

The following function returns a random user agent that you can use when sending requests, reducing your chances of getting blocked:

    import numpy as np

    def get_random_ua():
        """Return a random user agent from ua_file.txt, or '' on failure."""
        random_ua = ''
        ua_file = 'ua_file.txt'
        try:
            with open(ua_file) as f:
                # One user agent per line; skip blanks and strip newlines
                lines = [line.strip() for line in f if line.strip()]
            if lines:
                prng = np.random.RandomState()
                # randint's upper bound is exclusive, so every line can be picked
                random_ua = lines[prng.randint(len(lines))]
        except Exception as ex:
            print('Exception in random_ua')
            print(str(ex))
        return random_ua

Each call to get_random_ua returns a random user agent from the text file.

Proxy

One of the major reasons bots get blocked is the use of static proxies, or of proxies that are cycled in a fixed sequence. Use multiple proxies and pick them at random so the website can’t figure out any pattern.

If you can’t get rotating proxies, use an IP rotator so that you use one proxy at a time, possibly for as long as a day. Also keep track of any proxies that get blocked by the target site.
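
As a rough illustration of picking proxies at random and keeping track of blocked ones, here is a minimal sketch. The ProxyPool class, the as_requests_proxies helper, and the proxy addresses are all hypothetical names and placeholders introduced for this example, not part of any library.

```python
import random


class ProxyPool:
    """Hands out proxies at random and remembers which ones were blocked."""

    def __init__(self, proxies):
        self.available = list(proxies)
        self.blocked = set()

    def pick(self):
        """Return a random proxy that has not been marked as blocked."""
        if not self.available:
            raise RuntimeError('no usable proxies left')
        return random.choice(self.available)

    def mark_blocked(self, proxy):
        """Record a blocked proxy so it is not picked again."""
        if proxy in self.available:
            self.available.remove(proxy)
        self.blocked.add(proxy)


def as_requests_proxies(proxy_url):
    """Build the mapping that requests accepts through its proxies= argument."""
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder addresses from a documentation range, not real proxies.
pool = ProxyPool([
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
    'http://203.0.113.12:8080',
])
proxy = pool.pick()
# e.g. requests.get(url, headers=headers, proxies=as_requests_proxies(proxy))
```

When a request through a proxy comes back blocked, calling pool.mark_blocked(proxy) keeps that proxy out of later picks.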

Delay

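If your requests arrive in rapid succession, the website is likely to block your bot, because that isn’t organic, human-like behavior. You can delay requests by a random interval using numpy.random.choice():

    import time
    import numpy as np

    # Candidate pauses in seconds; one is picked at random before each request
    delays = [7, 4, 6, 2, 10, 19]
    delay = np.random.choice(delays)
    time.sleep(delay)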

Referers

Setting the referer is another important part of preventing scraper blocks. Generally, if you’re scraping a listing page or a home page, set the referer to Google’s main page for that country. If you’re scraping individual product pages, you can either set the relevant category URL or find a backlink to the domain you are crawling and use that.
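
The rule of thumb above could be sketched roughly as follows. The pick_referer function, its page_type labels, and the example URLs are assumptions made for illustration; the resulting dictionary can be merged into the headers you pass to requests.

```python
from urllib.parse import urlsplit


def pick_referer(url, page_type, category_url=None,
                 search_home='https://www.google.com/'):
    """Choose a Referer value based on the kind of page being requested.

    page_type is a hypothetical label: 'listing', 'home' or 'product'.
    """
    if page_type == 'product' and category_url:
        # A product page is usually reached from its category page.
        return category_url
    if page_type in ('listing', 'home'):
        # Listing and home pages are often reached from a search engine.
        return search_home
    # Otherwise fall back to the root of the site being crawled.
    parts = urlsplit(url)
    return f'{parts.scheme}://{parts.netloc}/'


headers = {
    'referer': pick_referer(
        'https://example.com/shoes/item-42', 'product',
        category_url='https://example.com/shoes/'),
}
```

For a country-specific Google page, pass it as search_home, for example search_home='https://www.google.co.uk/'.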

Request Headers

Some websites are more sophisticated than others and require more effort if you want to scrape them without getting blocked. Such websites look for certain request header entries as part of their bot detection, and if those headers are missing, they either block the request or serve fake content instead of the data you are looking for.

By inspecting the requests your browser makes to the page, you can tell which headers are sent. Implement all of them at once, or add them one at a time, testing after each.
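
One possible way to add headers one at a time is sketched below. The header values are illustrative assumptions rather than what any particular site requires, and incremental_header_sets is a hypothetical helper; in practice you would copy the names and values from your browser’s network inspector.

```python
# Headers copied from a browser's network inspector would go here.
# These values are illustrative assumptions, not requirements of any site.
BROWSER_HEADERS = [
    ('user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'),
    ('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('accept-language', 'en-US,en;q=0.9'),
    ('accept-encoding', 'gzip, deflate'),
    ('upgrade-insecure-requests', '1'),
]


def incremental_header_sets(header_items):
    """Yield header dicts that grow by one entry at a time."""
    current = {}
    for name, value in header_items:
        current[name] = value
        yield dict(current)


header_sets = list(incremental_header_sets(BROWSER_HEADERS))
# Each set could be tried in turn, e.g. requests.get(url, headers=header_sets[i]),
# until the site returns the real page.
```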

Reduce the Risk of Your Proxy Getting Blocked: Web Scraping Tips

The following are guidelines on how you can prevent your proxies from getting blocked by the website.

Adhere to The Website’s Policy

Before crawling a website, find out what its crawling policies are. You generally get the best results by being polite and following the website’s crawling policies.

Most websites store a robots.txt file in their root directory that details what can and can’t be scraped. It may also specify how frequently you can send requests.
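
Python’s standard library can read these rules for you. The sketch below parses a made-up robots.txt from a string so that it runs without network access; the rules, the 'my-scraper' agent name, and the URLs are assumptions for illustration. Against a live site you would call set_url() with the robots.txt address and then read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Check individual URLs before requesting them.
allowed = parser.can_fetch('my-scraper', 'https://example.com/products/')
forbidden = parser.can_fetch('my-scraper', 'https://example.com/private/data')

# Seconds to wait between requests, or None if the file does not say.
delay = parser.crawl_delay('my-scraper')
```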

You should also read the website’s Terms of Service, where you will find information about the data on the site. You will learn whether the data is public or copyrighted, and how best to access the target server and the data you need.

Scrape At a Slower Speed and Rate

Bots are used because they are faster and more efficient than any human, but this speed is also their undoing, as it’s one of the ways websites detect and block them. Like any other action that doesn’t look natural or human, it is suspicious, so to avoid drawing attention to yourself you need to limit the number of requests you send in a given time. Too many requests also overload the target server and make it slow and unresponsive.

Reconfigure your scraper to slow it down by making it sleep for a random interval between requests. Also give it longer breaks of varying duration after a varying number of pages have been scraped. It’s a good idea to be as random as possible to avoid raising red flags.
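
A schedule of short random pauses with occasional longer breaks might look like the sketch below. The pause_schedule function and all of its ranges are assumptions chosen for illustration; tune them to the target site’s limits.

```python
import random


def pause_schedule(num_pages, short_range=(2.0, 8.0), long_range=(30.0, 90.0),
                   burst_range=(5, 15), rng=random):
    """Yield one pause length in seconds per page.

    Most pauses are short. After a randomly sized burst of pages,
    a longer break is yielded instead. All ranges are assumptions.
    """
    pages_until_break = rng.randint(*burst_range)
    for _ in range(num_pages):
        pages_until_break -= 1
        if pages_until_break == 0:
            # Burst finished: take a long break and draw the next burst size.
            pages_until_break = rng.randint(*burst_range)
            yield rng.uniform(*long_range)
        else:
            yield rng.uniform(*short_range)


# A seeded generator keeps this demonstration repeatable.
pauses = list(pause_schedule(40, rng=random.Random(0)))
```

In a scraper, you would pass each yielded value to time.sleep() after fetching a page.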

Rotating IPs

Anyone who has done web scraping is familiar with the warning against sending too many requests from the same IP address. Doing so will almost certainly get you blocked, so before you begin scraping, you need multiple proxies. To extract data, you will need to send many requests to the web server, and the number depends on the amount of data you need. A human can only send so many requests in a given time, so anything beyond that is regarded as bot activity.

To use multiple proxies for web scraping, you will need an IP rotator. This software picks an IP for each session, or for a specified time, and sends requests through it. This makes the target server believe the requests come from different devices, which helps prevent you from getting blocked.
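
To make the idea of holding an IP for a specified time concrete, here is a minimal sketch of a time-based rotator. TimedRotator is a hypothetical class written for this example, not a real product or library, and the clock parameter exists only so the behavior can be checked without waiting.

```python
import random
import time


class TimedRotator:
    """Keeps one proxy for hold_seconds, then switches to a different one."""

    def __init__(self, proxies, hold_seconds=600, clock=time.monotonic):
        if not proxies:
            raise ValueError('at least one proxy is required')
        self.proxies = list(proxies)
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.current = random.choice(self.proxies)
        self.since = clock()

    def get(self):
        """Return the current proxy, rotating first if its time is up."""
        if self.clock() - self.since >= self.hold_seconds:
            # Prefer a different proxy; with a single proxy, keep using it.
            others = [p for p in self.proxies if p != self.current]
            self.current = random.choice(others or self.proxies)
            self.since = self.clock()
        return self.current
```

Calling get() before every request returns the same proxy until hold_seconds have elapsed and a different one afterwards.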

Randomize Your Crawling Pattern

Anti-bot features can detect bots by monitoring their activity and finding patterns in their actions and in the way they move from one page to the next. This is especially likely if you work with a fixed pattern, which is why being random is good.

To reduce the risk of your proxies getting blocked, configure your bot to perform actions such as mouse movements, clicks, or scrolls at random.

Humans are unpredictable in these ways, and human-like behavior is what you are aiming for. The more random your bot’s behavior, the more human it appears.
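
Simulating mouse actions requires a browser automation tool and is not shown here. A simpler part of the same idea, visiting pages in an unpredictable order instead of page 1, 2, 3 and so on, can be sketched as follows. The randomized_order helper and the URLs are assumptions for illustration.

```python
import random


def randomized_order(urls, rng=random):
    """Return the URLs in a shuffled order, leaving the input untouched."""
    shuffled = list(urls)
    rng.shuffle(shuffled)
    return shuffled


pages = [f'https://example.com/catalog?page={n}' for n in range(1, 11)]
crawl_order = randomized_order(pages, rng=random.Random(7))
```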

User Agents

Your User-Agent HTTP request header shares information such as the type of application, the operating system, and the software and its version with the target server. It also lets the target server decide which HTML layout to send: desktop or mobile.

If the user agent is empty or unusual, it could be a red flag, as the server may treat the request as coming from a bot. To avoid this, use common configurations so you won’t be suspected.

Common user agent strings for various browsers and tools are as follows:

  • Apple iPad

Mozilla/5.0 (iPad; CPU OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12H321 Safari/600.1.4

  • Apple iPhone

Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1

  • Bing Bot (Bing Search Engine Bot)

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

  • Curl

curl/7.35.0

  • Google Chrome

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36

  • Google Nexus

Mozilla/5.0 (Linux; U; Android-4.0.3; en-us; Galaxy Nexus Build/IML74K) AppleWebKit/535.7 (KHTML, like Gecko) CrMo/16.0.912.75 Mobile Safari/535.7

  • HTC

Mozilla/5.0 (Linux; Android 7.0; HTC 10 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.83 Mobile Safari/537.36

  • Googlebot (Google Search Engine Bot)

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

  • Lynx

Lynx/2.8.8pre.4 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/2.12.23

  • Mozilla Firefox

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0

  • Microsoft Edge

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393

  • Microsoft Internet Explorer 6 / IE 6

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

  • Microsoft Internet Explorer 7 / IE 7

Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)

  • Microsoft Internet Explorer 8 / IE 8

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

  • Microsoft Internet Explorer 9 / IE 9

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)

  • Microsoft Internet Explorer 10 / IE 10

Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0; MDDCJS)

  • Microsoft Internet Explorer 11 / IE 11

Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko

  • Samsung Phone

Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-G570Y Build/MMB29K) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 Chrome/44.0.2403.133 Mobile Safari/537.36

  • Samsung Galaxy Note 3

Mozilla/5.0 (Linux; Android 5.0; SAMSUNG SM-N900 Build/LRX21V) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/2.1 Chrome/34.0.1847.76 Mobile Safari/537.36

  • Samsung Galaxy Note 4

Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-N910F Build/MMB29M) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 Chrome/44.0.2403.133 Mobile Safari/537.36

  • Wget

Wget/1.15 (linux-gnu)

Sending too many requests to the server with a single user agent is bad, as it doesn’t look like human activity. Instead, simulate real human use by switching between user agents for different requests.

FAQs

Bots are necessary when large amounts of data need to be extracted from a target website. The process can be done manually, but that is very tedious, hence the need for automation. Bots can send multiple requests rapidly to the site’s servers so you can get as much data as you need within a short time.

This speed, however, can be damaging to the website, and combined with the risk that the data ends up with a rival, it is why website owners frown on data extraction. They therefore put anti-bot measures in place to detect and block proxies.

To reduce the risk of your proxies getting blocked, aim for human-like behavior: be random and reduce the rate at which you send requests. You will need multiple proxies and an IP rotator. The type of proxy you use matters, which is why we recommend dedicated proxies from Limeproxies.

About the author

Rachael Chapman

A Complete Gamer and a Tech Geek. Brings out all her thoughts and Love in Writing Techie Blogs.
