Lambda Soup - Functional HTML Scraping for OCaml
Lambda Soup
module Soup: sig .. end<br>Easy functional HTML scraping and manipulation.
Lambda Soup is an HTML data extraction and analysis library. It supports CSS<br>selectors, DOM traversals, mutation, and HTML output. This very documentation<br>page was generated by ocamldoc and then rewritten by Lambda Soup!
Here are some usage examples:
open Soup
let soup = read_channel stdin |> parse in
(* Print the page title. *)<br>soup $ "title" |> R.leaf_text |> print_endline;
(* Print the targets of all links. *)<br>soup $$ "a[href]"<br>|> iter (fun a -> print_endline (R.attribute "href" a));
(* Find the first unordered list. *)<br>let ul = soup $ "ul" in
(* Print the contents of all its items. *)<br>ul $$ "li"<br>|> iter (fun li -><br>trimmed_texts li |> String.concat "" |> print_endline);
(* Find all subsequent sibling elements of the same list. *)<br>let _ = ul $$ "~ *" in
(* Find all previous sibling elements instead. *)<br>let _ = ul |> previous_siblings |> elements in
(* ... *)
Lambda Soup is based around two kinds of values: nodes, which represent<br>HTML elements, text content, and so on, and traversals, which are lazy<br>sequences of nodes. The top-level node is the soup node (a.k.a. document<br>node), which you typically get by calling parse on a string containing<br>HTML.
Once you have a node, you call select on it to traverse to other nodes<br>using CSS. There are also specialized functions, such as ancestors and<br>previous_siblings, which allow you to traverse in directions that CSS<br>cannot express.
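As a minimal sketch of the two traversal styles described above, the following selects links with CSS and then walks upward with ancestors. The HTML string and the names html, hrefs, and enclosing are made up for illustration.

```ocaml
open Soup

(* Hypothetical input document, used only for this sketch. *)
let html =
  {|<ul id="menu"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>|}

let soup = parse html

(* CSS traversal: the href of every link that has one. *)
let hrefs =
  soup $$ "a[href]" |> to_list |> List.map (R.attribute "href")

(* Upward traversal, which CSS cannot express: tag names of the elements
   enclosing the first link, closest first. *)
let enclosing =
  soup $ "a" |> ancestors |> to_list |> List.map name

let () =
  List.iter print_endline hrefs;
  print_endline (String.concat " < " enclosing)
```

The exact tail of enclosing depends on whether the parser wraps the input in html and body elements, so only the closest ancestors should be relied upon here.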
Traversals can be manipulated with familiar combinators such as map,<br>fold, and filter. They can also be terminated early.
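A small sketch of the combinators in use, assuming a made-up document with three paragraphs, two of which carry a hypothetical class x. The helper marked is not part of the library.

```ocaml
open Soup

(* Hypothetical input, used only for this sketch. *)
let soup = parse {|<p class="x">one</p><p>two</p><p class="x">three</p>|}

(* filter keeps only the nodes that satisfy the predicate. *)
let marked () =
  soup $$ "p" |> filter (fun p -> List.mem "x" (classes p))

(* fold accumulates a value over the traversal; here it counts nodes. *)
let marked_count = marked () |> fold (fun n _ -> n + 1) 0

(* to_list forces the traversal into an ordinary list. *)
let marked_texts = marked () |> to_list |> List.map R.leaf_text

let () =
  Printf.printf "%d marked: %s\n" marked_count (String.concat ", " marked_texts)
```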
Once you have traversed to a node you are interested in, you can extract its<br>content or attributes, mutate it, cause other side effects, begin another<br>traversal, or do anything else your application requires. Enjoy!
Lambda Soup is developed on<br>GitHub and distributed under the<br>BSD<br>license.
This documentation page is for version 0.6.1 of the library. Documentation<br>for other versions can be downloaded from the<br>releases page.
Types
type element
type general
type soup
"Phantom" types for use with 'a node. See explanation below.
type 'a node
HTML nodes. These come in three varieties: element node represents a node<br>that is known to be an element, soup node represents an entire document,<br>and general node represents a node that might be anything, including an<br>element, a document, or text. There is no phantom type specifically for text<br>nodes.
Throughout Lambda Soup, if a function can operate on any kind of node, the<br>argument is typed at 'a node. If an element node or the entire document is<br>required, the argument type is element node or soup node,<br>respectively. general node is the result of a function that can't<br>guarantee that it evaluates to only elements or only documents.
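The typing discipline above might be illustrated as follows. The helpers text_of and tag_and_text are hypothetical, written only to show where each phantom type appears, and the type annotations are there for the reader rather than the compiler.

```ocaml
open Soup

(* Works on any kind of node, so the argument is typed 'a node. *)
let text_of (node : 'a node) : string = texts node |> String.concat ""

(* Needs a tag name, so it requires a node known to be an element. *)
let tag_and_text (e : element node) : string = name e ^ ": " ^ text_of e

let soup : soup node = parse "<p>Hello, <b>world</b>!</p>"
let p : element node = soup $ "p"

(* children cannot promise elements (a child may be text), so it yields
   general nodes. *)
let first_child : general node option = children p |> first

(* element narrows a general node to an element node when that is valid.
   Here the first child is the text "Hello, ", so the result is None. *)
let narrowed : element node option =
  match first_child with
  | Some child -> element child
  | None -> None

let () = print_endline (tag_and_text p)
```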
type 'a nodes
Sequence of nodes. This is always instantiated as either element nodes or<br>general nodes. The sequence is lazy in the sense that only as many<br>elements as needed are evaluated. This can be used with with_stop to<br>traverse part of a document until some condition is met.
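To make the laziness concrete, here is a sketch that counts how many list items are actually visited before with_stop ends the traversal. The input and the visited counter are assumptions made for illustration.

```ocaml
open Soup

(* Hypothetical input, used only for this sketch. *)
let soup = parse "<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>"

(* Counts the items the traversal touches. *)
let visited = ref 0

(* Stop as soon as the item whose text is "b" is found. The items after
   it are never visited. *)
let first_b =
  with_stop (fun stop ->
    soup $$ "li"
    |> iter (fun li ->
      incr visited;
      if leaf_text li = Some "b" then stop.throw (Some li));
    None)

let () = Printf.printf "visited %d of 4 items\n" !visited
```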
High-level interface
val parse : string -> soup node<br>Parses the given HTML and produces a document node. Entity references are<br>resolved. The character encoding is detected automatically.
If you need to parse XML, want finer control over parsing, or want to feed<br>Lambda Soup something other than bytes, see Parsing<br>signals.
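A brief sketch of parse resolving entity references, using a made-up input string:

```ocaml
open Soup

(* The source contains the entity references &amp; and &lt;. *)
let soup = parse "<p>Fish &amp; chips &lt;3</p>"

(* The text content carries the resolved characters, not the references. *)
let text = soup $ "p" |> texts |> String.concat ""

let () = print_endline text
```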
val select : string -> 'a node -> element nodes<br>select selector node is all the descendants of node matching CSS<br>selector selector. All<br>CSS3 selectors are<br>supported, except those which imply layout or a user interface:
:link, :visited, :hover, :active, :focus, :target, :lang, :enabled,<br>:disabled, :checked, :indeterminate, ::first-line, ::first-letter,<br>::selection, ::before, ::after
XML namespace selectors are not supported. Lambda Soup supports the canceled :contains("foo") pseudo-class.
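For instance, :contains might be used like this to pick elements by their text. The document below is invented for the sketch.

```ocaml
open Soup

(* Hypothetical input, used only for this sketch. *)
let soup = parse "<p>apple pie</p><p>banana split</p>"

(* Keep only the paragraphs whose text contains "banana". *)
let matches =
  soup $$ {|p:contains("banana")|} |> to_list |> List.map R.leaf_text

let () = List.iter print_endline matches
```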
In regular CSS, a selector cannot start with a combinator such as >.<br>Lambda Soup allows selectors such as > p, + p, and ~ p, which select<br>immediate children of node, adjacent next siblings, and all next siblings,<br>respectively.
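The three leading combinators can be sketched against a small made-up document. The id root and the helper texts_of are assumptions for illustration.

```ocaml
open Soup

(* Hypothetical input: a div with one direct and one nested paragraph,
   followed by two sibling paragraphs. *)
let soup =
  parse
    {|<div id="root"><p>child</p><section><p>nested</p></section></div><p>next</p><p>later</p>|}

let root = soup $ "#root"

(* Hypothetical helper: run a selector on a node and collect leaf texts. *)
let texts_of selector node =
  node $$ selector |> to_list |> List.map R.leaf_text

let direct_children = texts_of "> p" root   (* immediate children only *)
let adjacent = texts_of "+ p" root          (* the adjacent next sibling *)
let following = texts_of "~ p" root         (* all next siblings *)
let descendants_p = texts_of "p" root       (* ordinary descendant match *)

let () =
  List.iter
    (fun (label, ts) -> Printf.printf "%s: %s\n" label (String.concat ", " ts))
    [ "> p", direct_children; "+ p", adjacent;
      "~ p", following; "p", descendants_p ]
```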
In addition, you can use the empty selector to select node itself. In this<br>case, note that if node is not an element (for example, it is often the<br>soup node), select will result in nothing: select always results in<br>sequences of element nodes only.
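A sketch of the empty selector, following the behavior described above: it yields the node itself for an element, and nothing for the soup node.

```ocaml
open Soup

let soup = parse "<p>hi</p>"
let p = soup $ "p"

(* On an element, the empty selector selects that element itself. *)
let on_element = p $$ "" |> count

(* The soup node is not an element, so nothing is selected. *)
let on_soup = soup $$ "" |> count

let () = Printf.printf "element: %d, soup: %d\n" on_element on_soup
```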
val select_one : string -> 'a node -> element node option<br>Like select, but evaluates to at most one element. Note that there is also<br>R.select_one if you don't want an optional result, which is explained at<br>require.
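The difference between the optional and the requiring variants might look as follows. The input is made up, and the sketch assumes that the R variants signal a missing result by raising Failure, which is how require is understood to behave.

```ocaml
open Soup

let soup = parse "<h1>Title</h1>"

(* select_one yields an option, so a missing element is an ordinary value. *)
let present = select_one "h1" soup
let missing = select_one "h2" soup

(* R.select_one yields the element directly. *)
let heading = soup |> R.select_one "h1" |> R.leaf_text

(* Assumption: a missing element makes R.select_one raise Failure. *)
let missing_raises =
  try ignore (R.select_one "h2" soup); false
  with Failure _ -> true

let () = print_endline heading
```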
val ($$) : 'a node -> string -> element nodes<br>node $$ selector is the same as select...