Ferret v0.14.0

Happy new year!

March 6, 2021 by Tim Voronov


Hello friends,

A belated happy new year 2021! I hope things get better and we can finally get back to normal life.

Meanwhile, in between work, personal lives and anxiety, we’ve managed to bring you some great features in a new release: Ferret v0.14.0.

This release contains some syntax updates, DOM API fixes and extra flexibility. Let’s dive in!

What’s added

Support of History API

Before v0.14.0, the Page object from the CDP driver returned only the URL that was set during a full page load or redirect, and ignored all in-page navigation done through the History API (e.g. by react-router).

Starting with this release, the behavior has changed. The Page object now always returns the URL shown in your browser’s address bar. If you need the previous behavior, access the url property of the root document.

In other words, page.url and document.url may point to different locations if the History API is used on your page.

    LET page = DOCUMENT("https://soundcloud.com", { driver: "cdp" })
    LET doc = page.mainFrame

    WAIT_ELEMENT(doc, ".trendingTracks")
    SCROLL_ELEMENT(doc, ".trendingTracks")
    WAIT_ELEMENT(doc, ".trendingTracks .badgeList__item")

    LET song = ELEMENT(doc, ".trendingTracks .badgeList__item")
    CLICK(song)

    WAIT_ELEMENT(doc, ".l-listen-hero")

    RETURN {
        page: page.url,
        doc: doc.url
    }

Support of custom HTTP transport in the HTTP driver

Now it’s possible to provide a custom HTTP transport to the underlying HTTP client (pester) in the HTTP driver:

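    package main

    import (
        h "net/http"

        "github.com/MontFerret/ferret/pkg/drivers/http"
    )

    func main() {
        // Pass a custom transport to the HTTP client used by the driver.
        httpDriver := http.NewDriver(
            http.WithCustomTransport(&h.Transport{}),
        )

        _ = httpDriver // register the driver with your Ferret instance here
    }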
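An empty transport is rarely what you want in practice. Here is a minimal sketch, using only the Go standard library, of a transport with explicit timeouts. The helper name newScrapingTransport and all the concrete values are assumptions made for illustration, not Ferret defaults.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newScrapingTransport builds an *http.Transport with explicit timeouts.
// The name and every value below are illustrative assumptions.
func newScrapingTransport() *http.Transport {
	return &http.Transport{
		// Honor HTTP_PROXY / HTTPS_PROXY / NO_PROXY from the environment.
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 15 * time.Second,
		MaxIdleConns:          20,
		IdleConnTimeout:       60 * time.Second,
	}
}

func main() {
	transport := newScrapingTransport()

	// In a Ferret program this value would be passed to
	// http.WithCustomTransport(transport) instead of &h.Transport{}.
	fmt.Println(transport.ResponseHeaderTimeout)
}
```

The resulting value can replace &h.Transport{} in the snippet above.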

Added LIKE operator

It’s been a part of the ArangoDB Query Language since the beginning, but it was not ported to FQL in its early days. Now the LIKE operator has finally landed!

    LET values = (
        FOR str IN ["foo", "bar", "qaz"]
            FILTER str LIKE "*a*"
            RETURN str
    )

    RETURN FIRST(values) NOT LIKE "b**" ? 'failure' : 'success'

As you might have noticed, FQL’s implementation deviates from its ArangoDB counterpart in the pattern-matching syntax: Ferret uses standard Unix wildcards (such as *) rather than ArangoDB’s % and _ placeholders.
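To get a feel for how these wildcards behave, here is a rough Go sketch that replays the query above with path.Match from the standard library. It is only a stand-in: Ferret’s own matcher may differ in details (path.Match, for instance, never lets * cross a "/"), and the helper names like and filterLike are made up for this example.

```go
package main

import (
	"fmt"
	"path"
)

// like reports whether value matches a Unix-style wildcard pattern.
// It is a stand-in built on path.Match, not Ferret's own matcher.
func like(value, pattern string) bool {
	ok, err := path.Match(pattern, value)
	return err == nil && ok
}

// filterLike mirrors: FOR str IN values FILTER str LIKE pattern RETURN str.
func filterLike(values []string, pattern string) []string {
	var out []string
	for _, v := range values {
		if like(v, pattern) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	values := filterLike([]string{"foo", "bar", "qaz"}, "*a*")
	fmt.Println(values) // [bar qaz]

	// Mirrors: FIRST(values) NOT LIKE "b**" ? 'failure' : 'success'
	if !like(values[0], "b**") {
		fmt.Println("failure")
	} else {
		fmt.Println("success") // "bar" matches "b**"
	}
}
```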

Support of ignoring page resources

Now you can ignore all the “noise” loaded with your web pages and speed up your scraping by preventing particular resources from loading:

    LET p = DOCUMENT("https://www.gettyimages.com/", {
        driver: "cdp",
        ignore: {
            resources: [
                {
                    url: "*",
                    type: "image"
                }
            ]
        }
    })

    RETURN NONE

You can either specify just the type of the resource (stylesheet, script, image, font, etc.) or add a URL pattern to be more selective.
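To make the "type only" versus "type plus URL pattern" distinction concrete, here is a small Go sketch of how such rules could be evaluated. It is a guess at the semantics, not Ferret’s actual implementation: the ignoreRule type, the globToRegexp helper and the example URLs are all hypothetical, and an empty field is assumed to mean "match anything".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ignoreRule is a hypothetical mirror of one entry in ignore.resources.
type ignoreRule struct {
	URL  string // wildcard pattern; empty is assumed to mean "any URL"
	Type string // resource type; empty is assumed to mean "any type"
}

// globToRegexp turns a wildcard pattern into an anchored regular
// expression in which * matches any run of characters.
func globToRegexp(pattern string) *regexp.Regexp {
	quoted := regexp.QuoteMeta(pattern)
	return regexp.MustCompile("^" + strings.ReplaceAll(quoted, `\*`, ".*") + "$")
}

// shouldIgnore reports whether any rule matches the given resource.
func shouldIgnore(rules []ignoreRule, url, resourceType string) bool {
	for _, r := range rules {
		if r.Type != "" && r.Type != resourceType {
			continue
		}
		if r.URL != "" && !globToRegexp(r.URL).MatchString(url) {
			continue
		}
		return true
	}
	return false
}

func main() {
	rules := []ignoreRule{
		{URL: "*", Type: "image"},              // every image
		{URL: "*/analytics/*", Type: "script"}, // only some scripts
	}

	fmt.Println(shouldIgnore(rules, "https://example.com/logo.png", "image"))
	fmt.Println(shouldIgnore(rules, "https://example.com/app.js", "script"))
	fmt.Println(shouldIgnore(rules, "https://example.com/analytics/t.js", "script"))
}
```

With these rules, the image and the analytics script are skipped while the regular script still loads.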

Support of handling non-200 HTTP status codes in the HTTP driver

The HTTP driver can now handle the situation where a target website responds with an HTTP status code other than 200 but still returns content.

    LET p = DOCUMENT("https://www.gettyimages.com/", {
        ignore: {
            statusCodes: [
                {
                    url: "*",
                    code: 418
                }
            ]
        }
    })

    RETURN NONE

As with ignoring resources, you can either specify just a code or add a URL pattern to be more selective.

DOCUMENT_EXISTS function

Now you can check if a webpage exists before navigating to it.

RETURN DOCUMENT_EXISTS("www.asdadasda.sds")

What’s fixed

OK, now let’s take a look at which bugs have been fixed!

  • RAND(0,100) always returned the same result
  • Element.children always returned an empty array
  • Passing parameters with a nested nil structure led to a panic

Summary

Thanks to everyone who contributed to the project by providing new features, fixing bugs, or just asking questions!