PyConFr 2015 | Presentation: XPath for web scraping

Saturday 17:20:00–17:45:00

XPath for web scraping

Paul TREMBERTH

Audience level: Novice

Description

All you need to know about XPath 1.0 in a web scraping project: the different axes, attribute matching, string functions, and EXSLT extensions, plus a few other handy patterns such as CSS selectors and JavaScript parsing.

Abstract

When you need to extract data from web pages, you usually parse HTML documents into a DOM tree and then query that tree with libraries like BeautifulSoup or the ElementTree API. Some libraries also support XPath expressions, which can express more complex traversal and search patterns.
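As a rough illustration of the parse-then-query workflow, here is a minimal sketch using only the standard library's ElementTree API. The markup, the `talks` id and the URLs are made up for the example, and the snippet assumes well-formed input; real-world HTML usually needs a more lenient parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this example.
PAGE = """
<html>
  <body>
    <ul id="talks">
      <li class="talk"><a href="/talks/1">XPath for web scraping</a></li>
      <li class="talk"><a href="/talks/2">Packaging basics</a></li>
      <li class="break">Coffee</li>
    </ul>
  </body>
</html>
""".strip()

# Parse the document into a tree of elements.
root = ET.fromstring(PAGE)

# ElementTree supports only a limited subset of XPath 1.0:
# descendant search (.//), child steps and simple attribute predicates.
links = root.findall(".//li[@class='talk']/a")

titles = [a.text for a in links]
hrefs = [a.get("href") for a in links]

print(titles)
print(hrefs)
```

The single path expression replaces a nested loop over `ul`, `li` and `a` elements with an explicit class check, which is the kind of traversal XPath is meant to shorten.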

Everything about XPath 1.0 is defined in the W3C's lengthy specification, but it can be obscure on a first read. The basics are quite simple to grasp, though, and this talk will go over the most useful syntax patterns you need to get started.

What we'll cover:

- axes and how to look for specific tags, parent elements, children or sibling nodes
- predicates and selecting nodes based on attribute or content values
- built-in string functions that you should know about
- EXSLT extensions supported by lxml and how they can solve tricky lookups
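To give a flavour of those topics, the sketch below runs one query per item with lxml. It is not taken from the talk: the HTML fragment, class names and URLs are invented for the example, and it assumes the lxml package is installed.

```python
from lxml import html

# Hypothetical fragment invented for this example.
PAGE = """
<div id="schedule">
  <h2>Saturday</h2>
  <p class="talk novice">XPath for web scraping</p>
  <p class="talk advanced">Async internals</p>
  <h2>Sunday</h2>
  <p class="talk novice">Intro to testing</p>
  <a href="/slides/xpath.pdf">slides</a>
  <a href="/video/42">video</a>
</div>
"""

doc = html.fromstring(PAGE)

# Axes: first <p> sibling that follows the "Saturday" heading,
# and the parent <div> of the "video" link.
first_saturday = doc.xpath("//h2[.='Saturday']/following-sibling::p[1]/text()")
parent_id = doc.xpath("//a[.='video']/parent::div/@id")

# Predicate on an attribute value.
video_label = doc.xpath("//a[@href='/video/42']/text()")

# Built-in string functions: contains() and starts-with().
novice_talks = doc.xpath("//p[contains(@class, 'novice')]/text()")
slide_links = doc.xpath("//a[starts-with(@href, '/slides/')]/@href")

# EXSLT regular expressions, enabled by binding the EXSLT namespace.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}
pdf_links = doc.xpath(r"//a[re:test(@href, '\.pdf$')]/@href", namespaces=EXSLT_RE)

print(first_saturday, parent_id, video_label)
print(novice_talks, slide_links, pdf_links)
```

XPath 1.0 has no `ends-with()` function, which is one case where the EXSLT `re:test()` extension is convenient.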

We'll end the talk with a few handy tips:

- how to use CSS selectors to do some of the above
- how to parse JavaScript code with XPath