Lambda Soup – Functional HTML Scraping for OCaml


Lambda Soup

module Soup: sig .. end
Easy functional HTML scraping and manipulation.

Lambda Soup is an HTML data extraction and analysis library. It supports CSS selectors, DOM traversals, mutation, and HTML output. This very documentation page was generated by ocamldoc and then rewritten by Lambda Soup!

Here are some usage examples:

open Soup

let soup = read_channel stdin |> parse in

(* Print the page title. *)
soup $ "title" |> R.leaf_text |> print_endline;

(* Print the targets of all links. *)
soup $$ "a[href]"
|> iter (fun a -> print_endline (R.attribute "href" a));

(* Find the first unordered list. *)
let ul = soup $ "ul" in

(* Print the contents of all its items. *)
ul $$ "li"
|> iter (fun li ->
  trimmed_texts li |> String.concat "" |> print_endline);

(* Find all subsequent sibling elements of the same list. *)
let _ = ul $$ "~ *" in

(* Find all previous sibling elements instead. *)
let _ = ul |> previous_siblings |> elements in

(* ... *)

Lambda Soup is based around two kinds of values: nodes, which represent HTML elements, text content, and so on, and traversals, which are lazy sequences of nodes. The top-level node is the soup node (a.k.a. document node), which you typically get by calling parse on a string containing HTML.

Once you have a node, you call select on it to traverse to other nodes using CSS. There are also specialized functions, such as ancestors and previous_siblings, which allow you to traverse in directions that CSS cannot express.
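As a minimal sketch of this flow, the following parses a small document, selects with CSS, and then walks upward with ancestors. The HTML string is made up for illustration, and to_list and name are assumed to be available from the same module (they are not described in this part of the page).

```ocaml
open Soup

(* A made-up document, used only for illustration. *)
let soup = parse "<div id=\"box\"><p>one</p><p>two</p></div>"

(* CSS traversal: all p elements that are descendants of the document. *)
let paragraphs = soup |> select "p" |> to_list

(* ancestors walks upward from a node, a direction CSS cannot express. *)
let ancestor_names =
  List.hd paragraphs |> ancestors |> to_list |> List.map name

let () = List.iter print_endline ancestor_names
```

The parser may add elements such as html and body around the fragment, so the exact ancestor list depends on how the input is normalized.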

Traversals can be manipulated with familiar combinators such as map, fold, and filter. They can also be terminated early.
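A short sketch of the combinators, assuming filter and fold take the traversal as their last argument and that attribute and to_list exist in this version of the library (neither is described in this part of the page). The list markup is invented for the example.

```ocaml
open Soup

let soup =
  parse "<ul><li class=\"done\">a</li><li>b</li><li class=\"done\">c</li></ul>"

(* filter keeps only the nodes that satisfy a predicate. *)
let done_items =
  soup $$ "li"
  |> filter (fun li -> attribute "class" li = Some "done")
  |> to_list

(* fold accumulates a value over the traversal, here the total text length. *)
let total_length =
  soup $$ "li"
  |> fold (fun acc li -> acc + String.length (R.leaf_text li)) 0

let () =
  Printf.printf "%d done, %d characters\n"
    (List.length done_items) total_length
```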

Once you have traversed to a node you are interested in, you can extract its content or attributes, mutate it, cause other side effects, begin another traversal, or do anything else your application requires. Enjoy!

Lambda Soup is developed on GitHub and distributed under the BSD license.

This documentation page is for version 0.6.1 of the library. Documentation for other versions can be downloaded from the releases page.

Types

type element

type general

type soup

"Phantom" types for use with 'a node. See explanation below.


type 'a node

HTML nodes. These come in three varieties: element node represents a node that is known to be an element, soup node represents an entire document, and general node represents a node that might be anything, including an element, a document, or text. There is no phantom type specifically for text nodes.

Throughout Lambda Soup, if a function can operate on any kind of node, the argument is typed as 'a node. If an element node or the entire document is required, the argument type is element node or soup node, respectively. general node is the result of a function that can't guarantee that it evaluates to only elements or only documents.
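The type annotations in the sketch below illustrate how the phantom types show up in practice. It assumes children (returning general nodes) and count are available in this version; the variable names and the HTML are hypothetical.

```ocaml
open Soup

let document : soup node = parse "<p>Hello, <b>world</b>!</p>"

(* ($) evaluates to a node that is known to be an element. *)
let p : element node = document $ "p"

(* Children may include text, so the traversal is typed as general nodes. *)
let kids : general nodes = children p

(* elements narrows a traversal to the nodes that are elements. *)
let kid_elements : element nodes = elements kids

let () =
  Printf.printf "%d children, %d of them elements\n"
    (count kids) (count kid_elements)
```

Passing kids to a function that requires element nodes would be rejected at compile time, which is the point of the phantom parameter.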

type 'a nodes

Sequence of nodes. This is always instantiated as either element nodes or general nodes. The sequence is lazy in the sense that only as many elements as needed are evaluated. This can be used with with_stop to traverse part of a document until some condition is met.
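The following sketch shows one way early termination might look. It assumes with_stop passes its callback a record with a polymorphic throw field, as documented in the Early termination section of the library; that section is not part of this excerpt, so treat the exact shape as an assumption.

```ocaml
open Soup

let soup = parse "<p>a</p><p>b</p><p>stop</p><p>c</p>"

(* Collect paragraph texts until one reads "stop", then abandon the fold. *)
let before_stop =
  with_stop (fun stop ->
    soup $$ "p"
    |> fold (fun acc p ->
         let text = R.leaf_text p in
         (* throw escapes the traversal with the given result. *)
         if text = "stop" then stop.throw (List.rev acc)
         else text :: acc)
       []
    |> List.rev)

let () = List.iter print_endline before_stop
```

Because the traversal is lazy, the paragraph after "stop" is never visited.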

High-level interface

val parse : string -> soup node
Parses the given HTML and produces a document node. Entity references are resolved. The character encoding is detected automatically.

If you need to parse XML, want finer control over parsing, or want to feed Lambda Soup something other than bytes, see Parsing signals.
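A small sketch of entity resolution during parse, using an invented input string:

```ocaml
open Soup

(* The source contains the entity reference &amp;. *)
let soup = parse "<p>Fish &amp; chips</p>"

(* After parsing, the text content holds the resolved character. *)
let text = soup $ "p" |> R.leaf_text

let () = print_endline text
```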

val select : string -> 'a node -> element nodes
select selector node is all the descendants of node matching CSS selector selector. All CSS3 selectors are supported, except those which imply layout or a user interface:

:link, :visited, :hover, :active, :focus, :target, :lang, :enabled, :disabled, :checked, :indeterminate, ::first-line, ::first-letter, ::selection, ::before, ::after

XML namespace selectors are not supported. Lambda Soup supports the canceled :contains("foo") pseudo-class.
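As an illustration of :contains, the sketch below selects only the paragraph whose text includes a given substring. The document is made up, and to_list is assumed to be available.

```ocaml
open Soup

let soup = parse "<p>foo bar</p><p>baz</p>"

(* Only paragraphs whose text contains "foo" are selected. *)
let matching =
  soup $$ "p:contains(\"foo\")" |> to_list |> List.map R.leaf_text

let () = List.iter print_endline matching
```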

In regular CSS, a selector cannot start with a combinator such as >. Lambda Soup allows selectors such as > p, + p, and ~ p, which select immediate children of node, adjacent next siblings, and all next siblings, respectively.

In addition, you can use the empty selector to select node itself. In this case, note that if node is not an element (for example, it is often the soup node), select will result in nothing: select always results in sequences of element nodes only.
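The sketch below contrasts selectors that start with a combinator with an ordinary descendant selector, all evaluated relative to the same starting element. The markup and names are hypothetical, and count is assumed to be available.

```ocaml
open Soup

let soup =
  parse
    "<div id=\"a\"><p>1</p><section><p>2</p></section></div><p>3</p><p>4</p>"

let div = soup $ "#a"

(* Ordinary selector: every p descendant of the div. *)
let descendants_count = div $$ "p" |> count

(* "> p": only p elements that are immediate children of the div. *)
let children_count = div $$ "> p" |> count

(* "+ p": the p immediately following the div. *)
let adjacent_count = div $$ "+ p" |> count

(* "~ p": every p sibling that follows the div. *)
let following_count = div $$ "~ p" |> count

let () =
  Printf.printf "%d %d %d %d\n"
    descendants_count children_count adjacent_count following_count
```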

val select_one : string -> 'a node -> element node option
Like select, but evaluates to at most one element. Note that there is also R.select_one if you don't want an optional result, which is explained at require.

val ($$) : 'a node -> string -> element nodes
node $$ selector is the same as select...
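A sketch of the difference between the two variants, assuming R.select_one raises Failure when nothing matches (the behavior described for require). The document is invented for the example.

```ocaml
open Soup

let soup = parse "<h1>Title</h1>"

(* select_one wraps the result in an option. *)
let found = soup |> select_one "h1"
let missing = soup |> select_one "h2"

(* R.select_one returns the element directly. *)
let title = soup |> R.select_one "h1" |> R.leaf_text

(* With no match, the R variant is assumed to raise Failure. *)
let raised =
  try
    ignore (soup |> R.select_one "h2");
    false
  with Failure _ -> true

let () = print_endline title
```

The option-returning form suits pages whose structure may vary; the R form keeps pipelines short when a missing element should be treated as an error.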
