Crawl data from Amazon using Python

Beribey
5 min read · Aug 21, 2020

Here is a simple example of getting product data from Amazon using Python.

Photo by Christian Wiediger on Unsplash

Table of contents

  1. Install the environment
  2. Writing the crawler
  3. Conclusion

Crawling data from web pages is a not-so-unfamiliar concept to web programmers. Still, after more than two years working as a programmer, I only recently had my first experience crawling data, and from a famous shopping site at that: Amazon.

Many of you have asked me: as a programmer, what would I do to crawl data from the Amazon website? It started a few months ago, when my best friend got drunk, rode his motorbike into an electric pole, broke his leg, and still has not recovered enough to go back to work.

A few days ago, over iced tea, I heard that while he was stuck at home he had started doing drop shipping or something like it. I looked it up online, and it seemed quite good. But the painful part, as he described it, is that keeping track of product prices on Amazon is complicated: you have to open each product link and write the price down again. So I thought, why not crawl the data from the Amazon site to make it fast?

And after a day of fumbling around and copying all kinds of code, I was able to get what my friend wanted from the Amazon website. This article will show you how to crawl data from the Amazon site using Python.

Install the environment

We will need Python and some packages to download the web page and parse the HTML.

  • Python: you can download the latest version here.
  • Python pip to install packages.
  • Python Requests to send HTTP requests.
  • Python lxml for HTML parsing.

If your PC already has pip, installing the packages is a breeze. Just run the following command in the terminal:

pip install requests lxml
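
To make sure everything is importable, you can run a quick check (this one-liner is just an illustration; any import test will do):

python -c "import requests, lxml.html; print('environment OK')"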

Writing the crawler

When you go to a product link, for example, https://www.amazon.com/Upgraded-Dimmable-Spectrum-Adjustable-Gooseneck/dp/B07PXP7DW5

You can see information such as the product's name and price.

My job is to get only the name and price of the product. For some URLs, however, Amazon requires cookies before it will display the product price.

Getting cookies from the browser is very simple, so I'll leave that for you to figure out.

We will have:

cookies = { ... }  # cookies copied from your browser

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
}
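
If you copy the raw Cookie header from your browser's developer tools, a small helper like the sketch below can turn it into the dict that Requests expects. The helper name and the cookie values here are made up for illustration:

# Hypothetical helper: convert a raw "Cookie:" header string,
# copied from the browser's developer tools, into a dict for requests.
def cookies_from_header(raw_header):
    cookies = {}
    for pair in raw_header.split('; '):
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

# Example with made-up values:
# cookies = cookies_from_header('session-id=111-2223334; ubid-main=444-5556667')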

We will load the HTML page, first need to request to the web page to return the response:

response = requests.get(url, headers=headers, verify=False, cookies=cookies)

The URL here is https://www.amazon.com/Upgraded-Dimmable-Spectrum-Adjustable-Gooseneck/dp/B07PXP7DW5.
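
Before going further, you can quickly confirm the request succeeded; this minimal check is just for illustration:

print(response.status_code)  # 200 means the page downloaded OK
print(response.text[:200])   # peek at the start of the returned HTML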

The content of the response is then the HTML page you need. Next, we have to find out how the product's name and price appear in that HTML.

The easiest way is to inspect the elements in the browser's developer tools.

As you can see, the product price lives in an element whose id is priceblock_ourprice.

The product's name is in the h1 tag whose id is title.
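
Before writing the full function, you can sanity-check these two selectors in an interactive session, reusing the response from the requests.get call above; this is just a quick sketch:

from lxml import html

# Quick check of the two XPath expressions against the downloaded page
doc = html.fromstring(response.content)
print(doc.xpath('//h1[@id="title"]//text()'))
print(doc.xpath('//span[contains(@id, "priceblock_ourprice")]/text()'))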

We need to write a function to get the values above after downloading the Html page from the URL:

from lxml import html
import csv
import requests
from time import sleep
from random import randint


def parse(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
    }
    # Fill this in with the cookies copied from your browser
    cookies = {}

    try:
        # Retrying for failed requests
        for i in range(20):
            # Generating random delays between attempts
            sleep(randint(1, 3))
            # Adding verify=False to avoid SSL-related issues
            response = requests.get(url, headers=headers, verify=False, cookies=cookies)
            if response.status_code == 200:
                doc = html.fromstring(response.content)
                XPATH_NAME = '//h1[@id="title"]//text()'
                XPATH_PRICE = '//span[contains(@id, "priceblock_ourprice")]/text()'
                raw_name = doc.xpath(XPATH_NAME)
                raw_price = doc.xpath(XPATH_PRICE)
                # Join the extracted text fragments and normalize whitespace
                name = ' '.join(''.join(raw_name).split()) if raw_name else None
                price = ' '.join(''.join(raw_price).split()).strip() if raw_price else None
                data = {
                    'NAME': name,
                    'PRICE': price,
                    'URL': url,
                }
                return data
            elif response.status_code == 404:
                break
    except Exception as e:
        print(e)
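
To try the function on a single product, you can call it directly; the actual values printed depend on what Amazon returns:

# Quick test of the parser on one product URL
print(parse('https://www.amazon.com/Upgraded-Dimmable-Spectrum-Adjustable-Gooseneck/dp/B07PXP7DW5'))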

After getting the data, we can save it to a CSV file. For example, with two URLs to get the price and product name from, the code looks like this:

def ReadUrl():
    UrlList = [
        'https://www.amazon.com/Upgraded-Dimmable-Spectrum-Adjustable-Gooseneck/dp/B07PXP7DW5',
        'https://www.amazon.com/Autel-MS906-Automotive-Diagnostic-Adaptations/dp/B01CQNNBA4?ref_=Oct_DLandingS_PC_b8ca5425_2&smid=A3MNQOSQ336D3K',
    ]
    extracted_data = []
    for url in UrlList:
        print("Processing: " + url)
        # Calling the parser
        parsed_data = parse(url)
        if parsed_data:
            extracted_data.append(parsed_data)
    # Writing the scraped data to a csv file
    with open('scraped_data.csv', 'w', newline='') as csvfile:
        fieldnames = ['NAME', 'PRICE', 'URL']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for data in extracted_data:
            writer.writerow(data)


if __name__ == "__main__":
    ReadUrl()

Save the file as product_amazon.py and run python product_amazon.py to see the results.
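
If both URLs parse successfully, scraped_data.csv should look something like this; the name and price values below are placeholders, not real output:

"NAME","PRICE","URL"
"<product name>","<price>","https://www.amazon.com/Upgraded-Dimmable-Spectrum-Adjustable-Gooseneck/dp/B07PXP7DW5"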

Conclusion

This was a simple example of getting product data from Amazon using Python. There are many other ways to crawl data.

Hopefully, the article will be useful to you, especially those of you doing drop shipping as a side job.

Thank you for reading my article!
