Scraping Amazon Product Data using GoLang

The GoLang colly module can be used to scrape web pages and retrieve useful information based on HTML tags. In this tutorial, we will learn how to scrape Amazon product data using GoLang.

Use Case

If you are building a system around Amazon products, or, let’s say, you want to track the price changes of a product regularly, web scraping is a practical way to collect that data. Doing this manually is not feasible, but we can write a program that downloads the web page, parses it based on its HTML tags, and extracts the information.

Scraping Amazon Data using GoLang Colly Module

The first step is to import the colly module into our Go program.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

After that, in the main function, we have to initialize the colly collector.

c := colly.NewCollector(colly.AllowedDomains("www.amazon.in"))

We have to call the Visit() method with the target URL to download the web page for further processing. Note that Visit() should be called only after the callbacks are registered, because it is the call that triggers the request.

c.Visit("https://www.amazon.in/s?k=keyboard")

Here, I am searching for “keyboard”, and then I will parse all the products on the page and print their name, rating, and price to the console.

When we call the Visit() function, the OnRequest() callback is invoked first, where we can do some pre-processing, such as logging the request URL.

c.OnRequest(func(r *colly.Request){
        fmt.Println("Link of the page:", r.URL)
    })

The most important callback function is OnHTML(), where we process the HTML response we receive. We have to identify the elements that hold the required information. Inspecting the page in the browser, I can see that the search results are wrapped in a “div.s-result-list.s-search-results.sg-row” element and each item sits in a “div.a-section.a-spacing-base” element. The following CSS selectors match the fields we need for this example.

  • name – span.a-size-base-plus.a-color-base.a-text-normal
  • stars – span.a-icon-alt
  • price – span.a-price-whole

Final Code

package main

import (
	"fmt" // formatted I/O

	"github.com/gocolly/colly" // scraping framework
)


func main() {
	c := colly.NewCollector(colly.AllowedDomains("www.amazon.in"))

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Link of the page:", r.URL)
	})

	c.OnHTML("div.s-result-list.s-search-results.sg-row", func(h *colly.HTMLElement) {
		h.ForEach("div.a-section.a-spacing-base", func(_ int, h *colly.HTMLElement) {
			name := h.ChildText("span.a-size-base-plus.a-color-base.a-text-normal")
			stars := h.ChildText("span.a-icon-alt")
			price := h.ChildText("span.a-price-whole")

			fmt.Println("ProductName: ", name)
			fmt.Println("Ratings: ", stars)
			fmt.Println("Price: ", price)
		})
	})

	c.Visit("https://www.amazon.in/s?k=keyboard")
}

Running the Code from Terminal

First of all, we have to initialize the Go module. I have created a new directory and saved the above code in the “scraper.go” file.

% go mod init scraper
go: creating new go.mod: module scraper
go: to add module requirements and sums:
	go mod tidy
%

It creates the go.mod file; the go.sum file will be generated in the next step, when the dependencies are downloaded.

The next step is to get the colly module for our project.

 % go get github.com/gocolly/colly
go get: added github.com/PuerkitoBio/goquery v1.8.0
go get: added github.com/andybalholm/cascadia v1.3.1
go get: added github.com/antchfx/htmlquery v1.2.5
go get: added github.com/antchfx/xmlquery v1.3.12
go get: added github.com/antchfx/xpath v1.2.1
go get: added github.com/gobwas/glob v0.2.3
go get: added github.com/gocolly/colly v1.2.0
go get: added github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e
go get: added github.com/golang/protobuf v1.3.1
go get: added github.com/kennygrant/sanitize v1.2.4
go get: added github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca
go get: added github.com/temoto/robotstxt v1.1.2
go get: added golang.org/x/net v0.0.0-20220826154423-83b083e8dc8b
go get: added golang.org/x/text v0.3.7
go get: added google.golang.org/appengine v1.6.7
%

Now, we can use the following command to run our go program.

% go run scraper.go              
Link of the page: https://www.amazon.in/s?k=keyboard
ProductName:  Logitech K380 Multi-Device Bluetooth Wireless Keyboard with Easy-Switch for Upto 3 Devices, Slim, 2 Year Battery for PC, Laptop, Windows, Mac, Chrome OS, Android, iPad OS, Apple TV (Dark Grey)
Ratings:  4.5 out of 5 stars
Price:  2,994
ProductName:  HP 100 Wired Keyboard with USB Compatibility,Numeric keypad, Full Range of 109 Key(Including 12 Function Keys and 3 Hotkeys),Adjustable Height and Contoured Design.3-Years Warranty (2UN30AA)
Ratings:  4.3 out of 5 stars
Price:  449
ProductName:  iClever BK10 Bluetooth Keyboard for Mac, Multi Device Wireless Keyboard Rechargeable Bluetooth 5.1 Stable Connection with Number Pad Ergonomic Design Keyboard for iPad, iPhone, Tablet, iOS, Android, Windows, Sliver/White
Ratings:  4.2 out of 5 stars
Price:  2,699
ProductName:  HP 230 Wireless Black Keyboard with 2.4GHz connectivity up to 10m, 12 Function Keys and 16-Month Long Battery Life. 3-Years Warranty.(3L1E7AA)
Ratings:  4.1 out of 5 stars
Price:  1,299
ProductName:  Logitech K580 Slim Multi-Device Wireless Keyboard – Bluetooth/Receiver, Compact, Easy Switch, 24 Month Battery, Win/Mac, Desktop, Tablet, Smartphone, Laptop Compatible - Graphite
Ratings:  4.5 out of 5 stars
Price:  3,495
ProductName:  Zebronics ZEB-KM2100 Multimedia USB Keyboard Comes with 114 Keys Including 12 Dedicated Multimedia Keys & with Rupee Key
Ratings:  3.6 out of 5 stars
Price:  329
ProductName:  Dell Km117 Wireless Keyboard Mouse-
Ratings:  4.2 out of 5 stars
Price:  
%

What’s Next?

  • The above is a very basic program. As you can see, no price was found for some items; we can add validation to skip items with missing pricing information.
  • We can extend this code to save the information to a CSV file or a database to track the price changes of a product over time.
  • The biggest issue arises when the HTML tags in the response change, which breaks our program. To mitigate this, we can read the selectors from a properties file that can be updated on the fly without changing the code.

Amazon Blocking Requests and the Use of Dedicated Proxies

If you run the Amazon scraping code continuously, Amazon may block your IP address. In that case, you won’t be able to access Amazon for your regular usage either, so it’s best to run these programs from a server. If you are building a business application, it’s recommended to use dedicated proxies to mitigate these issues.
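colly lets you route requests through proxies via c.SetProxyFunc(), which accepts any function with the signature func(*http.Request) (*url.URL, error). As a minimal sketch, the standard-library code below builds a round-robin selector over a list of proxy addresses; the proxy URLs are placeholders for your own dedicated proxies, and the roundRobinProxy name is illustrative. (colly also ships a ready-made switcher in the github.com/gocolly/colly/proxy package.)

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy builds a proxy-selection function that cycles through the
// given proxy URLs on every request. Its signature matches colly's ProxyFunc,
// so the result can be passed to c.SetProxyFunc(...).
func roundRobinProxy(proxyURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	parsed := make([]*url.URL, len(proxyURLs))
	for i, raw := range proxyURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		parsed[i] = u
	}
	var counter uint32
	return func(_ *http.Request) (*url.URL, error) {
		// atomic counter keeps the rotation safe if colly runs async
		i := atomic.AddUint32(&counter, 1) - 1
		return parsed[int(i)%len(parsed)], nil
	}, nil
}

func main() {
	// Placeholder proxy addresses -- replace with your dedicated proxies.
	pf, err := roundRobinProxy("http://proxy1.example.com:8080", "http://proxy2.example.com:8080")
	if err != nil {
		panic(err)
	}
	// In the scraper you would register it with: c.SetProxyFunc(pf)
	for i := 0; i < 4; i++ {
		u, _ := pf(nil)
		fmt.Println(u.Host)
	}
}
```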

Conclusion

Web scraping is a very common way to regularly retrieve useful information from a web page. The GoLang colly module is a good choice for building these programs, and you can use dedicated proxies to avoid request-blocking issues.

References: Colly Official Docs