Parsing HTML at the command line with CSS selectors
pup is a command-line tool for processing and filtering HTML with CSS selectors. It reads HTML from stdin and writes the filtered result to stdout, so you can extract specific data from web pages directly in the terminal. It is inspired by jq and aims to offer for HTML the kind of filtering that jq offers for JSON.
The tool supports CSS selectors including classes, IDs, attributes, pseudo-classes, and combinators. Beyond filtering, pup provides display functions that extract text content (text{}), extract attribute values (attr{name}), or convert the matched HTML to JSON (json{}). It also cleans and indents HTML output, with optional color coding for readability.
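The selectors and display functions described above can be sketched as follows. The HTML snippet and the `item` class name are made up for illustration, and the script assumes pup is already on your PATH; it prints a notice instead of failing if it is not.

```shell
#!/bin/sh
# Sample document used only for illustration (not from any real page).
html='<ul><li class="item"><a href="/a">First</a></li><li class="item"><a href="/b">Second</a></li></ul>'

if command -v pup >/dev/null 2>&1; then
  # Class selector plus descendant combinator: print the matched elements as indented HTML.
  printf '%s\n' "$html" | pup 'li.item a'

  # text{} prints only the text content of each match.
  printf '%s\n' "$html" | pup 'li.item a text{}'

  # attr{href} prints the value of the href attribute of each match.
  printf '%s\n' "$html" | pup 'li.item a attr{href}'

  # json{} converts the matched elements to a JSON array.
  printf '%s\n' "$html" | pup 'li.item a json{}'
else
  echo "pup is not installed; skipping the examples" >&2
fi
```

The display function is always the last part of the selector string, so the same selector can be reused with a different output form.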
pup is aimed at developers, system administrators, and data analysts who handle HTML in scripts or need to pull information out of web pages quickly. Because it reads stdin and writes stdout, it fits into shell scripts, data processing pipelines, and one-off parsing tasks. It is well suited to web scraping where you need a few specific elements from an HTML document and do not want to write dedicated parsing code.
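As a rough sketch of the pipe-friendly use described above, pup can be wrapped in a small shell function and combined with standard tools. The function name `extract_links` and the sample markup are hypothetical, and the script assumes pup is installed.

```shell
#!/bin/sh
# extract_links is a hypothetical helper: it reads HTML on stdin and
# prints each distinct href value found on <a> elements, sorted.
extract_links() {
  pup 'a attr{href}' | sort -u
}

# Made-up input standing in for a downloaded page.
page='<p><a href="/docs">Docs</a> <a href="/blog">Blog</a> <a href="/docs">Docs again</a></p>'

if command -v pup >/dev/null 2>&1; then
  printf '%s\n' "$page" | extract_links
else
  echo "pup not found on PATH; install it first" >&2
fi
```

In a real script the input would typically come from a download command such as `curl -s` followed by a URL, piped into the same function.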
# via Homebrew
brew install pup
# via Go
go get github.com/ericchiang/pup