
Python for Marketers: Using starts-with() and contains() in Selenium web scraper

  • What this is for: Isolating elements you want to scrape with Selenium
  • Requirements: Python Anaconda distribution, basic knowledge of HTML structure and the Chrome Inspector tool
  • Concepts covered: Selenium, XPath

Occasionally, when you’re testing or scraping a web page with Selenium, you need to select an element or group of elements for which you know only part of an attribute value.

For instance, suppose we needed to gather the text of multiple elements where the id changes or we may only know a portion of the id.

<span id="Content_1">Text 1</span>
<span id="Content_2">Text 2</span>
<span id="Content_3">Text 3</span>
<span id="Completely different span">We Don’t Want This!</span>

In this case, we don’t want to capture all span elements because we’re only interested in the first three. However, the span id changes in each instance.

Or suppose you needed to interact with a link element, but part of the id is loaded dynamically and changes each time the page loads. For example:

<a id="unique_number_1234_next">Next Story</a>
<a id="unique_number_1234_previous">Previous Story</a>

If we only wanted to capture information related to the first link, we couldn’t simply capture all a elements. We know its id contains “next”, but the leading part changes on every page load.

XPath lets us select exactly the elements we are interested in. Much as regular expressions match on patterns rather than exact strings, the XPath functions starts-with() and contains() match on part of an attribute value, so we can isolate those elements even when the full value is unknown.
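To make the matching rules concrete, here is a small pure-Python sketch (no browser needed) that mimics what the two XPath functions test for, using the ids from the examples above. Python's str.startswith() and the in operator are stand-ins chosen for illustration only; in a real scraper the XPath is evaluated by the browser, not by Python.

```python
# ids copied from the example markup above
span_ids = ["Content_1", "Content_2", "Content_3", "Completely different span"]
link_ids = ["unique_number_1234_next", "unique_number_1234_previous"]

# starts-with(@id, 'Content_') keeps ids that begin with the given prefix
starts_with_matches = [i for i in span_ids if i.startswith("Content_")]

# contains(@id, 'next') keeps ids that have the substring anywhere inside
contains_matches = [i for i in link_ids if "next" in i]

print(starts_with_matches)
print(contains_matches)
```

Like their Python counterparts here, both XPath functions are case-sensitive, so 'content_' would not match "Content_1".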

Read more on working with XPath in Selenium for Python.

Starts With

For the first example, we wanted to interact only with the span elements with an id starting with “Content_”, so our XPath would look like this:

//span[starts-with(@id, 'Content_')]

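To find all of these in the document, we would use the find_elements() method with the By.XPATH locator. (Older Selenium releases exposed this as find_elements_by_xpath(), which was removed in Selenium 4.3.)

from selenium.webdriver.common.by import By

my_list = driver.find_elements(By.XPATH, "//span[starts-with(@id, 'Content_')]")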

The matching elements are now stored in a list called my_list, and each item can be handled like any other WebElement. For example, to get the text inside the first span tag, we would use:

my_list[0].text

which returns “Text 1”.
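Because the lookup returns a list, a list comprehension can gather the text of every match in one go. The sketch below wraps that in a hypothetical helper, texts_with_id_prefix(), which is not part of Selenium. It assumes driver is an already-started Selenium 4 WebDriver and that the prefix contains no single quote. The locator strategy is passed as the plain string "xpath", which is the value behind Selenium's By.XPATH constant.

```python
def texts_with_id_prefix(driver, prefix, tag="span"):
    """Return the text of every <tag> element whose id starts with prefix.

    Hypothetical helper for illustration; assumes prefix has no single quote.
    """
    xpath = f"//{tag}[starts-with(@id, '{prefix}')]"
    # "xpath" is the string value of selenium's By.XPATH constant
    elements = driver.find_elements("xpath", xpath)
    return [element.text for element in elements]
```

With the example markup loaded in the browser, calling texts_with_id_prefix(driver, "Content_") should give the text of the three Content_ spans and skip the fourth span.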

Contains

But what if we can’t filter ids based on how they start? In the second example, both ids start with “unique_number_1234_”, but suppose we only want to capture the next page link.

In this case, we could use the contains() function. Our XPath would look like this:

//a[contains(@id, 'next')]

Similarly, if we wanted to capture the text in the link, we could use:

from selenium.webdriver.common.by import By

next_link = driver.find_element(By.XPATH, "//a[contains(@id, 'next')]")
next_link.text
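One thing to keep in mind is that a single-element lookup raises NoSuchElementException when nothing on the page matches. A minimal sketch of a more forgiving lookup follows: it uses the plural find_elements, which returns an empty list on no match, and falls back to None. The helper name first_text_containing() is hypothetical, and the sketch assumes driver is a started Selenium 4 WebDriver and that the fragment contains no single quote.

```python
def first_text_containing(driver, fragment, tag="a"):
    """Return the text of the first <tag> whose id contains fragment, or None.

    Hypothetical helper for illustration; assumes fragment has no single quote.
    """
    xpath = f"//{tag}[contains(@id, '{fragment}')]"
    # "xpath" is the string value of selenium's By.XPATH constant
    matches = driver.find_elements("xpath", xpath)  # empty list if nothing matches
    return matches[0].text if matches else None
```

This keeps the scraper running on pages where the link is missing, at the cost of having to check for None afterwards.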
