Web Data Scraping Using XPath and CSS Selectors

sarikazdubey

5 years ago

Web scraping is one of the must-have data science skills. Because the internet is vast and grows larger every day, the demand for web scraping has increased as well. Extracted data has helped many businesses in their decision-making processes. At the same time, a web scraper always has to make sure it scrapes politely.

This article helps beginners understand XPath and CSS selectors, the locator strategies that form the foundation of scraping.

A Brief Overview of The Steps in The Web Scraping Process

Once you understand how locator strategies apply to web elements, you can choose among the scraping technologies. As mentioned earlier, web scraping should be performed politely: respect the robots.txt file associated with the website, make sure the site’s performance is never degraded, and have the crawler identify itself and provide contact information.
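As a rough illustration of the robots.txt check described above, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the crawler name are all made up for illustration; a real crawler would download the file from the target site instead of inlining it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Hypothetical crawler name; a real crawler should also publish contact details.
USER_AGENT = "example-bot"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str) -> bool:
    """Return True if the parsed robots.txt rules allow USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


print(may_fetch("https://example.com/quotes/"))       # path not covered by a Disallow rule
print(may_fetch("https://example.com/private/data"))  # path under /private/
print(parser.crawl_delay(USER_AGENT))                 # delay requested by the site, if any
```

Scrapy has its own settings for this kind of politeness, so the sketch only shows the idea and is not how Scrapy implements it.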

Scraping can be performed by coding in Python, Java, and other languages. For example, the Python-based Scrapy and Beautiful Soup, as well as Selenium, are some of the recommended scraping technologies. There are also ready-made tools on the market that let you scrape websites without writing code.

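Beautiful Soup is a popular library for parsing HTML; however, it does not fetch or crawl pages by itself and is generally slower than Scrapy. Selenium drives a real browser, which helps with JavaScript-heavy pages but adds overhead. For crawling, Scrapy is the more powerful and flexible option of the three.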

In this article, we refer to Scrapy, an efficient web crawling framework written in Python. Once it is installed, you can extract data from the desired web page with XPath expressions or CSS selectors, which can be obtained in several ways. This article focuses on deriving the XPath/CSS selector with automated testing tools. Once derived, these expressions are passed to Scrapy, which extracts the data.

How to Explore Web Page Elements

Now, what are the elements that we see on a web page? What needs to be selected? Try this out: right-click on any web page and choose ‘Inspect’.

As a result, on the right-hand side of the page, you will be able to view the elements of that page. You can choose the element hovering tool to select the element to be inspected.

Once that tool is active, hover over the element you want to inspect, and the corresponding HTML code is displayed in the inspection pane.

CSS and XPath – How Can They Be Viewed?

While using Scrapy, we need to know the CSS selector and the XPath expression to use. XPath (XML Path Language) is a query language that describes the path to an element in the page’s XML/HTML tree. Similarly, CSS selectors are patterns, originally meant for styling, that let you find or select HTML elements.
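To make the idea of a path expression concrete, here is a small sketch that runs XPath queries with Python's standard xml.etree.ElementTree module, which supports only a limited XPath subset. The markup is a made-up, well-formed snippet, not the real source of any website, and the matching CSS selectors appear only as comments because the standard library has no CSS selector engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for illustration.
SNIPPET = """
<div>
  <div class="quote">
    <span class="text">First quote</span>
    <small class="author">Author One</small>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <small class="author">Author Two</small>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)

# XPath: every <span> whose class attribute equals "text", anywhere below the root.
# A comparable CSS selector would be: span.text
quotes = [el.text for el in root.findall(".//span[@class='text']")]

# XPath: every <small> whose class attribute equals "author".
# A comparable CSS selector would be: small.author
authors = [el.text for el in root.findall(".//small[@class='author']")]

print(quotes)
print(authors)
```

Note that the XPath predicate compares the whole class attribute, while a CSS class selector matches any one of several space-separated class names, so the two are equivalent here only because each element has a single class.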

We Can Derive Them in One of the Following Ways

Manually – Deriving them manually is cumbersome. Even with the Inspect developer tool, it is time-consuming. Hence, tools that find the XPath and CSS selectors for the elements of a web page are a real help to a web scraper.

Using web browser add-ons that can be installed in the browser. For example, ChroPath is a Chrome add-on that shows the XPath, CSS selector, className, text, id, etc., of the selected element. Once it is installed, you can view these details for a web element when you inspect the web page.

For example, using ChroPath, the XPath and CSS selector can be viewed in the following way:

Using web automation testing tools that have built-in web element locators. The locator gives all the information a web scraper may need to select and extract data from the web element in question. This article explains how web automation tools can be used to extract the CSS selector and XPath.

A Scrapy Example – Using Web Automation Testing Tools to Derive CSS/XPath

We will be scraping the quotes.toscrape.com website in this article. Note that the Scrapy tutorial also uses this website for its examples.

We will extract the author names and the list of quotes on the website using Scrapy. The web page is as follows:

Launching Scrapy with the URL to be scraped

To start Scrapy and associate it with the URL that we wish to scrape, we run the following command, which starts the Scrapy bot.

As mentioned earlier, there are several ways to derive the XPath and CSS selector. In this example, I show how to use any web automation test tool that has a web UI test recorder and locators to derive them. To do so, launch the test recorder, select the web item, and look up the associated XPath and CSS selector.

I have used the TestProject tool to demonstrate how the XPath and CSS selector can be found. Once that is done, you can use the response.xpath() and response.css() methods to query the response using XPath and CSS, respectively.

XPath derivation using a web test automation tool, and then scraping

For example, to extract the text based on the XPath, we issue the following command in Scrapy to scrape the quotes on the website. The result is as follows:

CSS derivation using a web test automation tool, and then scraping

Similarly, you can right-click the element to derive its CSS value.

With that CSS value, you can pass it as the argument to the response.css command as follows, which extracts the list of authors. The result is as follows:

Conclusion

Both XPath and CSS selectors are syntaxes that help to target elements within a web page’s DOM. It is a good idea to understand how XPath and CSS function internally so that you can decide which of them to choose. It is important to know that XPath is primarily a language for selecting nodes in XML documents, while CSS is a language for applying styles to HTML documents.

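CSS selectors are generally regarded as more efficient and faster than XPath in browsers, although Scrapy translates CSS selectors into XPath internally, so the difference there is small. Thanks to today’s tools, which readily give us the XPath and CSS details, the job of a web scraper is much easier.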