Lambda Soup – Functional HTML Scraping for OCaml

Lambda Soup

module Soup: sig .. end

Easy functional HTML scraping and manipulation.

Lambda Soup is an HTML data extraction and analysis library. It supports CSS selectors, DOM traversals, mutation, and HTML output. This very documentation page was generated by ocamldoc and then rewritten by Lambda Soup!

Here are some usage examples:

open Soup

let soup = read_channel stdin |> parse in

(* Print the page title. *)
soup $ "title" |> R.leaf_text |> print_endline;

(* Print the targets of all links. *)
soup $$ "a[href]"
|> iter (fun a -> print_endline (R.attribute "href" a));

(* Find the first unordered list. *)
let ul = soup $ "ul" in

(* Print the contents of all its items. *)
ul $$ "li"
|> iter (fun li ->
  trimmed_texts li |> String.concat "" |> print_endline);

(* Find all subsequent sibling elements of the same list. *)
let _ = ul $$ "~ *" in

(* Find all previous sibling elements instead. *)
let _ = ul |> previous_siblings |> elements in

(* ... *)

Lambda Soup is based around two kinds of values: nodes, which represent HTML elements, text content, and so on, and traversals, which are lazy sequences of nodes. The top-level node is the soup node (a.k.a. document node), which you typically get by calling parse on a string containing HTML.

Once you have a node, you call select on it to traverse to other nodes using CSS. There are also specialized functions, such as ancestors and previous_siblings, which allow you to traverse in directions that CSS cannot express.
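As a minimal sketch of the two styles of traversal described above, the following selects with CSS and then walks upward and backward from a node. The HTML string and the binding names are made up for illustration; a complete document is used so that the parser does not have to guess a fragment context.

```ocaml
open Soup

(* Illustrative input; not taken from the library's documentation. *)
let soup =
  parse
    "<html><body><div><p>one</p><p class=\"x\">two</p><p>three</p></div></body></html>"

(* CSS traversal: every <p> descendant of the document. *)
let paragraphs = soup |> select "p" |> to_list

(* Pick the middle paragraph, then move in directions CSS cannot express. *)
let middle = soup $ "p.x"

(* Ancestor element names, starting from the parent. *)
let ancestor_names = middle |> ancestors |> to_list |> List.map name

(* Text of the element siblings that precede the middle paragraph. *)
let before =
  middle |> previous_siblings |> elements |> to_list |> List.map R.leaf_text
```

Here `elements` narrows the general nodes produced by `previous_siblings` to element nodes, which is what `R.leaf_text` and `name` are applied to.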

Traversals can be manipulated with familiar combinators such as map, fold, and filter. They can also be terminated early.
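A short sketch of chaining combinators on a traversal, assuming a small made-up list as input. It filters out items carrying a hypothetical "skip" class and folds the remaining text into one string.

```ocaml
open Soup

(* Illustrative input. *)
let soup =
  parse
    "<html><body><ul><li>a</li><li class=\"skip\">b</li><li>c</li><li>d</li></ul></body></html>"

(* filter keeps the items without the "skip" class;
   fold concatenates the text of what remains. *)
let kept =
  soup $$ "li"
  |> filter (fun li -> not (List.mem "skip" (classes li)))
  |> fold (fun acc li -> acc ^ R.leaf_text li) ""
```

Because the traversal is lazy, nothing is selected until `fold` consumes the sequence.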

Once you have traversed to a node you are interested in, you can extract its content or attributes, mutate it, cause other side effects, begin another traversal, or do anything else your application requires. Enjoy!
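The following sketch illustrates extracting content and attributes from a node and then mutating the tree in place. The document, the attribute values, and the binding names are invented for the example.

```ocaml
open Soup

(* Illustrative input. *)
let soup =
  parse "<html><body><a href=\"/old\">link</a><span>ad</span></body></html>"

let a = soup $ "a"

(* Extract content and an attribute. *)
let label = R.leaf_text a
let old_href = attribute "href" a

(* Mutate in place: retarget the link and remove the <span>. *)
let () =
  set_attribute "href" "/new" a;
  delete (soup $ "span")

let new_href = R.attribute "href" a

let span_gone =
  match select_one "span" soup with
  | None -> true
  | Some _ -> false
```

After the mutation, `to_string soup` can be used to serialize the modified document.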

Lambda Soup is developed on GitHub and distributed under the BSD license.

This documentation page is for version 0.6.1 of the library. Documentation for other versions can be downloaded from the releases page.

Types

type element

type general

type soup

"Phantom" types for use with 'a node. See explanation below.

type 'a node

HTML nodes. These come in three varieties: element node represents a node that is known to be an element, soup node represents an entire document, and general node represents a node that might be anything, including an element, a document, or text. There is no phantom type specifically for text nodes.

Throughout Lambda Soup, if a function can operate on any kind of node, the argument is typed at 'a node. If an element node or the entire document is required, the argument type is element node or soup node, respectively. general node is the result of a function that can't guarantee that it evaluates to only elements or only documents.
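To make the typing convention concrete, here is a small sketch with two hypothetical helpers, text_of and tag_of, which are not part of the library. The first accepts any kind of node, the second only elements.

```ocaml
open Soup

(* Hypothetical helper: works on any node kind, so it is inferred
   at 'a node -> string. *)
let text_of node = String.concat "" (texts node)

(* Hypothetical helper: requires an element, since only elements have a name. *)
let tag_of (e : element node) : string = name e

let soup : soup node = parse "<html><body><p>hi <b>there</b></p></body></html>"
let p : element node = soup $ "p"

(* text_of accepts both the document and an element. *)
let doc_text = text_of soup
let p_text = text_of p

(* children cannot promise elements only, so it yields general nodes. *)
let kids : general node list = p |> children |> to_list

(* elements narrows general nodes to element nodes, which tag_of accepts. *)
let element_kids = p |> children |> elements |> to_list |> List.map tag_of
```

Passing `soup` to `tag_of` would be rejected by the type checker, which is the point of the phantom types.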

type 'a nodes

Sequence of nodes. This is always instantiated as either element nodes or general nodes. The sequence is lazy in the sense that only as many elements as needed are evaluated. This can be used with with_stop to traverse part of a document until some condition is met.
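A sketch of partial traversal, assuming with_stop passes its callback a record whose throw field aborts the traversal and becomes the result of with_stop. The input and the "target" id are made up.

```ocaml
open Soup

(* Illustrative input. *)
let soup =
  parse
    "<html><body><p>1</p><p>2</p><p id=\"target\">3</p><p>4</p></body></html>"

(* Count the paragraphs that come before the one with id "target".
   stop.throw ends the fold early, so later paragraphs are never visited. *)
let before_target =
  with_stop (fun stop ->
    soup $$ "p"
    |> fold
         (fun n p ->
           if attribute "id" p = Some "target" then stop.throw n else n + 1)
         0)
```

If no paragraph had the id, the fold would run to the end and return the total count.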

High-level interface

val parse : string -> soup node

Parses the given HTML and produces a document node. Entity references are resolved. The character encoding is detected automatically.

If you need to parse XML, want finer control over parsing, or want to feed Lambda Soup something other than bytes, see Parsing signals.

val select : string -> 'a node -> element nodes

select selector node is all the descendants of node matching CSS selector selector. All CSS3 selectors are supported, except those which imply layout or a user interface:

:link, :visited, :hover, :active, :focus, :target, :lang, :enabled, :disabled, :checked, :indeterminate, ::first-line, ::first-letter, ::selection, ::before, ::after

XML namespace selectors are not supported. Lambda Soup supports the canceled :contains("foo") pseudo-class.

In regular CSS, a selector cannot start with a combinator such as >. Lambda Soup allows selectors such as > p, + p, and ~ p, which select immediate children of node, adjacent next siblings, and all next siblings, respectively.

In addition, you can use the empty selector to select node itself. In this case, note that if node is not an element (for example, it is often the soup node), select will result in nothing: select always results in sequences of element nodes only.
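The selector extensions described above can be sketched as follows. The markup is invented for the example, and the results are collected as lists of text so they are easy to inspect.

```ocaml
open Soup

(* Illustrative input. *)
let soup =
  parse
    "<html><body><div id=\"d\"><p>a</p><section><p>b</p></section></div><p>c foo</p><p>d</p></body></html>"

let div = soup $ "#d"

(* Hypothetical helper: run a selector and collect the text of each match. *)
let texts_of node selector =
  node $$ selector |> to_list |> List.map R.leaf_text

(* Leading combinators are resolved relative to the starting node. *)
let direct = texts_of div "> p"     (* immediate children of div *)
let adjacent = texts_of div "+ p"   (* the sibling right after div *)
let following = texts_of div "~ p"  (* all later siblings of div *)

(* The :contains pseudo-class matches on text content. *)
let containing = texts_of soup "p:contains(\"foo\")"

(* The empty selector yields the node itself, but only if it is an element. *)
let self_count = div |> select "" |> to_list |> List.length
let soup_count = soup |> select "" |> to_list |> List.length
```

Note that `"> p"` skips the paragraph nested inside the section, since that paragraph is not an immediate child of the div.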

val select_one : string -> 'a node -> element node option

Like select, but evaluates to at most one element. Note that there is also R.select_one if you don't want an optional result, which is explained at require.
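A brief sketch contrasting the two variants, assuming R.select_one raises Failure when nothing matches, as require does for None. The document and the fallback string are made up.

```ocaml
open Soup

(* Illustrative input. *)
let soup =
  parse "<html><head><title>Demo</title></head><body><p>text</p></body></html>"

(* Optional result: handle the missing case explicitly. *)
let title =
  match select_one "title" soup with
  | Some t -> R.leaf_text t
  | None -> "(no title)"

(* Required result: a missing match is reported as an exception instead. *)
let table_missing =
  try
    ignore (R.select_one "table" soup);
    false
  with Failure _ -> true
```

The optional form suits pages whose structure varies, while the R form keeps scripts short when a missing element should abort the run.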

val ($$) : 'a node -> string -> element nodes

node $$ selector is the same as select...
