Web scraping is one of the must-have data science skills. The internet is vast and grows larger every day, so the demand for web scraping has increased as well, and the extracted data has helped many businesses in their decision-making. At the same time, a web scraper always has to scrape politely, for example by honoring robots.txt and not overloading the site's servers.
This article helps beginners understand XPath and CSS selectors, two locator strategies that are essential building blocks of scraping.
A Brief Overview of The Steps in The Web Scraping Process
Scraping can be performed by coding in Python, Java, and other languages. Python-based Scrapy, Beautiful Soup, and Selenium are some of the recommended scraping technologies. There are also ready-made tools on the market that let you scrape websites without writing code.
Beautiful Soup is a popular web scraping library; however, it is generally slower than Scrapy. Scrapy is also a more powerful and flexible web crawler than either Selenium or Beautiful Soup, since it is a complete crawling framework rather than an HTML parsing library or a browser automation tool.
In this article, we will refer to Scrapy, an efficient web crawling framework written in Python. Once it is installed, the developer can extract data from the desired web page with the help of XPath or CSS selectors, which can be derived in several ways. This article focuses on how to derive the XPath or CSS selector using automated testing tools. Once derived, these selector expressions are passed to Scrapy, and the data gets extracted.
How to Explore Web Page Elements
Now, what are the elements that we see on a web page? What needs to be selected? Try this out yourself: right-click on any web page and click ‘Inspect’ to open the browser's developer tools.
CSS and XPath – How Can They Be Viewed?
While using Scrapy, we need to know the CSS selector or XPath expression to use. XPath (XML Path Language) is a query language that describes the path to an element within the page's document tree. Similarly, CSS selectors are patterns that find or select the HTML elements that you want to style, and scrapers reuse them to pick the elements to extract.
We Can Derive Them in the Following Ways
- Manually. Deriving them by hand is cumbersome; even with the browser's Inspect developer tool, it is time-consuming. Hence, to support a web scraper, tools are needed that help find the XPath and CSS selectors associated with the elements on the web page.
- Using web browser add-ons that can be installed in the browser. For example, ChroPath is one such Chrome add-on that gives information about the XPath, CSS selector, className, text, id, etc., of the selected element. Once it is installed, you can view these extensive details about a web element when you inspect the web page.
For example, using ChroPath, the XPath and CSS selector can be viewed in the following way:
- Using web automation testing tools that have built-in web element locators. The locator gives all the information a web scraper may require to select and extract data from the web element in question. This article explains how web automation tools can be used to extract the CSS selector and XPath.
A Scrapy Example – Using Web Automation Testing Tools to Derive CSS/XPath
We will be extracting the author names and the list of quotes on the website using Scrapy. The web page is as follows:
To start Scrapy and associate it with the URL that we wish to scrape, we run the `scrapy shell` command followed by the page URL, which starts an interactive Scrapy session with the fetched page available as `response`.
As mentioned earlier, there are several ways to derive the XPath and CSS selector. In this example, I show how to use any web automation test tool that has web UI test recorders and locators to derive them. To do so, launch the test recorder, select the web item, and note the XPath and CSS selector associated with it.
I have used the TestProject tool to demonstrate how the XPath and CSS selector can be found. Once that is done, you can use the response.xpath() and response.css() methods to query the response using XPath and CSS, respectively.
For example, to extract text based on the XPath, we issue the following command in Scrapy to scrape the quotes on the website. The result is as follows:
CSS Derivation Using a Web Test Automation Tool, and Then Scraping
Similarly, you can derive the CSS selector by right-clicking the element to obtain its CSS value.
You can then pass that CSS value as the argument of response.css(), as follows, which extracts the list of authors. The result is as follows:
Both XPath and CSS are syntaxes that help to target elements within a web page’s DOM. It is a good idea to understand how XPath and CSS function internally so that you can decide which of them to choose. It is important to know that XPath is primarily a language for selecting nodes in XML documents, while CSS is a language for applying styles to HTML documents.
CSS selectors are generally concise and are often considered faster than XPath. Thanks to today's tools, which readily give us XPath and CSS details, the job of a web scraper is much easier.