Ferret v0.8

More features, better API

July 23, 2019 by Tim Voronov


Hooray, Ferret v0.8 has been released!

It’s been a while since the last release, but we worked hard to bring you a new and better Ferret. This release has many exciting new features, but unfortunately, there are also some breaking changes.

You can find the full changelog here.

Let’s go!

What’s added

iframe

Ferret finally supports iframe elements.
When a page gets loaded, Ferret finds all available frames and provides access to them via the .frames property of a page object.

Here’s an example of how to find a target frame:

LET page = DOCUMENT("https://www.w3schools.com/html/html_iframe.asp", { driver: "cdp" })

LET content = (
    FOR f IN page.frames
        FILTER f.URL == "https://www.w3schools.com/html/default.asp"
        RETURN f.head.innerHTML
)

RETURN FIRST(content)

Alternatively, you can filter frames by URL or access a target iframe directly by index, if you know its position.

Due to CORS security policies, you may still have issues with iframes that point to another domain.

Namespaces

With this release, we introduce a new language feature: namespaces.

Namespaces allow library authors (and us) to isolate functions into dedicated subsections.

Here is an example:

LET blob = DOWNLOAD("https://raw.githubusercontent.com/MontFerret/ferret/master/assets/logo.png")

RETURN IO::FS::WRITE("logo.png", blob)

To namespace a function, use the new Namespace method. The Namespace method is chainable:

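package main

import (
	"context"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/runtime/core"
	"github.com/MontFerret/ferret/pkg/runtime/values"
)

// read is a placeholder standing in for fs.Read from the original snippet;
// a real implementation would read a file and return its contents.
func read(ctx context.Context, args ...core.Value) (core.Value, error) {
	return values.None, nil
}

func main() {
	c := compiler.New()

	// Namespace calls can be chained to build nested namespaces.
	if err := c.Namespace("IO").Namespace("FS").RegisterFunction("Read", read); err != nil {
		panic(err)
	}
}

In future releases, we will move HTML-related functions into the HTML:: namespace.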
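
To make the chaining idea concrete, here is a minimal, self-contained Go sketch of a namespaced function registry. It is not Ferret's implementation: the Registry type, its methods, and the upper-casing of names are assumptions made purely for illustration. Only the "::" separator and the chained Namespace calls come from the examples above.

```go
package main

import (
	"fmt"
	"strings"
)

// Registry is a hypothetical, simplified function registry.
// It is NOT Ferret's implementation; it only illustrates chaining.
type Registry struct {
	prefix string            // qualified prefix, e.g. "IO::FS::"
	funcs  map[string]func() // shared by the root and all nested namespaces
}

// NewRegistry creates an empty root registry.
func NewRegistry() *Registry {
	return &Registry{funcs: map[string]func(){}}
}

// Namespace returns a view over the same function table with a longer
// prefix. Returning a Registry is what makes the call chainable.
func (r *Registry) Namespace(name string) *Registry {
	return &Registry{
		prefix: r.prefix + strings.ToUpper(name) + "::",
		funcs:  r.funcs,
	}
}

// RegisterFunction stores fn under its fully qualified name.
// Upper-casing the name is an assumption of this sketch.
func (r *Registry) RegisterFunction(name string, fn func()) error {
	full := r.prefix + strings.ToUpper(name)
	if _, exists := r.funcs[full]; exists {
		return fmt.Errorf("function already registered: %s", full)
	}
	r.funcs[full] = fn
	return nil
}

// Has reports whether a fully qualified name is registered.
func (r *Registry) Has(full string) bool {
	_, ok := r.funcs[full]
	return ok
}

func main() {
	root := NewRegistry()

	if err := root.Namespace("IO").Namespace("FS").RegisterFunction("Read", func() {}); err != nil {
		panic(err)
	}

	fmt.Println(root.Has("IO::FS::READ"))
}
```

Sharing one function table between the root and every nested view keeps lookups flat: a qualified name such as IO::FS::READ is just a key, so nesting depth adds no lookup cost.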

XPath

A good web scraping tool needs XPath support, and Ferret finally has it!
Ferret provides a simple interface to the XPath engine for both drivers, CDP and HTTP.
It automatically detects the type of the output value and deserializes it accordingly.

RETURN XPATH(page, "//div[contains(@class, 'form-group')]")
RETURN XPATH(page, "count(//div)")

These two queries return two different types:

  1. Returns an array of serialized elements (their inner HTML)
  2. Returns a number indicating how many “div” elements are on the page.

Regular expression operator

This release provides a shorthand for using regexp assertions:

LET result = "foo" =~ "^f[o].$" // returns "true"
LET result = "foo" !~ "[a-z]+bar$" // returns "true"
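
For readers who embed Ferret in Go, the two operators can be read as "matches" and "does not match". The sketch below mirrors the two examples with Go's standard regexp package. That Ferret evaluates patterns with the same RE2 syntax is an assumption here, and matches is a hypothetical helper, not part of Ferret.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical helper mirroring `value =~ pattern`.
// Negating its result mirrors `value !~ pattern`.
func matches(value, pattern string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(value), nil
}

func main() {
	// "foo" =~ "^f[o].$"
	ok, err := matches("foo", "^f[o].$")
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // true

	// "foo" !~ "[a-z]+bar$"
	ok, err = matches("foo", "[a-z]+bar$")
	if err != nil {
		panic(err)
	}
	fmt.Println(!ok) // true
}
```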

New functions to manipulate DOM

There are some cases when you might need to change the existing DOM. To help with that, we added the INNER_HTML_SET and INNER_TEXT_SET functions.

// Using a document and a selector
INNER_HTML_SET(doc, "body", "Hello")
INNER_TEXT_SET(doc, "body", "Hello")

// Or an element directly
INNER_HTML_SET(doc.body, "Hello")
INNER_TEXT_SET(doc.body, "Hello")

Viewport settings

In this release, you can override the default viewport values in headless mode.

LET doc = DOCUMENT(@url, { driver: 'cdp', viewport: { width: 1920, height: 1080 } })

Better emulation of user interaction

This is a big change in how Ferret handles page interactions.

Now Ferret interacts with pages in a more advanced way: it scrolls up or down to an element, moves the mouse, focuses, and types, all with random delays. Just like a real person!
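
As a rough illustration of what typing with random delays means, here is a minimal Go sketch. It is not Ferret's code: randomDelay, typeText, and the 50 to 150 ms bounds are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomDelay returns a duration in [min, max).
// If max <= min, it returns min. Hypothetical helper.
func randomDelay(min, max time.Duration) time.Duration {
	if max <= min {
		return min
	}
	return min + time.Duration(rand.Int63n(int64(max-min)))
}

// typeText calls send once per character and pauses for a random
// duration after each keystroke, instead of sending the text at once.
func typeText(text string, min, max time.Duration, send func(r rune)) {
	for _, r := range text {
		send(r)
		time.Sleep(randomDelay(min, max))
	}
}

func main() {
	typeText("ferret", 50*time.Millisecond, 150*time.Millisecond, func(r rune) {
		fmt.Printf("%c", r)
	})
	fmt.Println()
}
```

Passing the keystroke handler as a callback keeps the timing logic separate from whatever actually delivers input to the page.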

Other

There are many other small changes here and there, like adding the FOCUS, ESCAPE_HTML, UNESCAPE_HTML and DECODE_URI_COMPONENT functions, improving performance, and changing the internal design of some parts of the system.

What’s broken

We try to maintain backwards compatibility, but some of the new features required serious design changes that led to breaking compatibility with previous versions. As we approach the v1.0 release, the API is becoming more stable and will require fewer dramatic changes.

Most of the breaking changes affect only embedded solutions, in particular those that use HTML drivers. There are no changes to the syntax, so existing scripts do not need to change.

Virtual DOM structure

Work on iframe support required us to redesign the structure of the virtual DOM by introducing a top-level entity called HTMLPage:

type HTMLPage interface {
	core.Value
	core.Iterable
	core.Getter
	core.Setter
	collections.Measurable
	io.Closer

	IsClosed() values.Boolean

	GetURL() values.String

	GetMainFrame() HTMLDocument

	GetFrames(ctx context.Context) (*values.Array, error)

	GetFrame(ctx context.Context, idx values.Int) (core.Value, error)

	GetCookies(ctx context.Context) (*values.Array, error)

	SetCookies(ctx context.Context, cookies ...HTTPCookie) error

	DeleteCookies(ctx context.Context, cookies ...HTTPCookie) error

	PrintToPDF(ctx context.Context, params PDFParams) (values.Binary, error)

	CaptureScreenshot(ctx context.Context, params ScreenshotParams) (values.Binary, error)

	WaitForNavigation(ctx context.Context) error

	Navigate(ctx context.Context, url values.String) error

	NavigateBack(ctx context.Context, skip values.Int) (values.Boolean, error)

	NavigateForward(ctx context.Context, skip values.Int) (values.Boolean, error)
}

Previously, HTMLDocument represented the open page, but iframe nodes require multiple documents, one for each frame. This led to a new entity in the structure.
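
The relationship between a page and its documents can be sketched in a few lines of Go. The Page and Document types below are simplified, hypothetical stand-ins for HTMLPage and HTMLDocument, not Ferret's types. Treating the first frame as the main document is also an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Document is a simplified stand-in for HTMLDocument.
type Document struct {
	URL string
}

// Page is a simplified stand-in for HTMLPage: one page owns
// several documents, one per frame.
type Page struct {
	Frames []*Document // assumption: Frames[0] is the main document
}

// MainFrame returns the top-level document, or nil for an empty page.
func (p *Page) MainFrame() *Document {
	if len(p.Frames) == 0 {
		return nil
	}
	return p.Frames[0]
}

// Frame returns the document at the given index.
func (p *Page) Frame(idx int) (*Document, error) {
	if idx < 0 || idx >= len(p.Frames) {
		return nil, errors.New("frame index out of range")
	}
	return p.Frames[idx], nil
}

// FindFrameByURL returns the first document whose URL matches exactly.
func (p *Page) FindFrameByURL(url string) (*Document, bool) {
	for _, f := range p.Frames {
		if f.URL == url {
			return f, true
		}
	}
	return nil, false
}

func main() {
	page := &Page{Frames: []*Document{
		{URL: "https://www.w3schools.com/html/html_iframe.asp"},
		{URL: "https://www.w3schools.com/html/default.asp"},
	}}

	if f, ok := page.FindFrameByURL("https://www.w3schools.com/html/default.asp"); ok {
		fmt.Println(f.URL)
	}
}
```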

Driver API

Because of the changes in the Virtual DOM structure, the driver API has changed as well to stay consistent.

Driver.LoadDocument and LoadDocumentParams have been renamed to Driver.Open and Params, respectively.

type Driver interface {
	io.Closer
	Name() string
	Open(ctx context.Context, params Params) (HTMLPage, error)
}
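
For embedders updating a custom driver, the following self-contained sketch shows the shape of an implementation after the rename. The Driver interface is copied from the snippet above. Params and HTMLPage are heavily reduced local stand-ins (the URL field and the string return type of GetURL are assumptions), and StubDriver and openURL are hypothetical names used only here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// Params is a reduced stand-in for the real Params type.
type Params struct {
	URL string // assumed field for this sketch
}

// HTMLPage is a reduced stand-in for the real HTMLPage interface.
type HTMLPage interface {
	io.Closer
	GetURL() string // the real method returns values.String
}

// Driver has the same shape as the interface shown above.
type Driver interface {
	io.Closer
	Name() string
	Open(ctx context.Context, params Params) (HTMLPage, error)
}

// stubPage is an in-memory page that only remembers its URL.
type stubPage struct{ url string }

func (p *stubPage) GetURL() string { return p.url }
func (p *stubPage) Close() error   { return nil }

// StubDriver is a hypothetical driver that opens in-memory pages.
type StubDriver struct{}

func (d *StubDriver) Name() string { return "stub" }
func (d *StubDriver) Close() error { return nil }

// Open replaces the former LoadDocument method.
func (d *StubDriver) Open(ctx context.Context, params Params) (HTMLPage, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	if params.URL == "" {
		return nil, errors.New("url is required")
	}
	return &stubPage{url: params.URL}, nil
}

// openURL is a small convenience wrapper around Driver.Open.
func openURL(d Driver, url string) (HTMLPage, error) {
	return d.Open(context.Background(), Params{URL: url})
}

func main() {
	var d Driver = &StubDriver{}
	defer d.Close()

	page, err := openURL(d, "https://example.com")
	if err != nil {
		panic(err)
	}
	defer page.Close()

	fmt.Println(d.Name(), page.GetURL())
}
```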

Other

As part of the API stabilization and consistency work, there are some other minor changes to vDOM elements, such as an extra return value (usually an error) or a Get prefix on some methods.