@@ -19,10 +19,18 @@ module Agents

`url` can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape)

+ The WebsiteAgent can also scrape based on incoming events. It will scrape the url contained in the `url` key of the incoming event payload. If you specify `merge` as the `mode`, it will retain the old payload and update it with the new values.
+
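+ For illustration, suppose an incoming event carries the payload below (the URL and the `source` key are hypothetical):
+
+     { "url": "http://example.com/page", "source": "some existing value" }
+
+ The Agent will fetch and scrape `http://example.com/page`; with `"mode": "merge"`, the `source` key is carried over into the resulting events alongside the newly extracted values.
+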
+ # Supported Document Types
+
The `type` value can be `xml`, `html`, `json`, or `text`.

To tell the Agent how to parse the content, specify `extract` as a hash with keys naming the extractions and values of hashes.

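+ As a minimal sketch of how these options fit together (the URL, extraction key, and CSS selector are hypothetical):
+
+     "url": "http://example.com/items.html",
+     "type": "html",
+     "extract": {
+       "title": { "css": ".item h2", "value": ".//text()" }
+     }
+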
+ Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor. E.g., if you're extracting rows, all extractors must match all rows. For generating CSS selectors, something like [SelectorGadget](http://selectorgadget.com) may be helpful.
+
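+ For instance (hypothetical keys), if a page lists five rows and you extract both a `name` and a `price` from each row, both extractors must yield exactly five matches so that they can be combined index-by-index into five events.
+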
+ # Scraping HTML and XML
+
When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` on each node in the node set, converting the result into a string. Here's an example:

"extract": {
@@ -37,6 +45,8 @@ module Agents

Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions, all namespaces are stripped from the document unless the top-level option `use_namespaces` is set to true.

+ # Scraping JSON
+
When parsing JSON, these sub-hashes specify [JSONPaths](http://goessner.net/articles/JsonPath/) to the values that you care about. For example:

"extract": {
@@ -44,6 +54,8 @@ module Agents
"description": { "path": "results.data[*].description" }
}

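+ For illustration, against a hypothetical response body such as
+
+     { "results": { "data": [ { "description": "first" }, { "description": "second" } ] } }
+
+ the `description` extractor above would match twice, producing two events.
+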
+ # Scraping Text
+
When parsing text, each sub-hash should contain a `regexp` and `index`. Output text is matched against the regular expression repeatedly from the beginning through to the end, collecting a captured group specified by `index` in each match. Each index should be either an integer or a string name which corresponds to <code>(?<<em>name</em>>...)</code>. For example, to parse lines of <code><em>word</em>: <em>definition</em></code>, the following should work:

"extract": {
@@ -66,7 +78,7 @@ module Agents

Beware that `.` does not match the newline character (LF) unless the `m` flag is in effect, and `^`/`$` match at the beginning/end of every line. See [this document](http://ruby-doc.org/core-#{RUBY_VERSION}/doc/regexp_rdoc.html) to learn the regular expression variant used in this service.

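+ For example (an illustrative pattern, not one of this Agent's defaults), `a.c` will not match "a\nc", while `(?m)a.c` will, because the inline `m` flag lets `.` match a newline.
+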
- Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor. E.g., if you're extracting rows, all extractors must match all rows. For generating CSS selectors, something like [SelectorGadget](http://selectorgadget.com) may be helpful.
+ # General Options

The Agent can be configured to use HTTP basic auth by including the `basic_auth` parameter with `"username:password"` or `["username", "password"]`.
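+ For example (illustrative credentials): `"basic_auth": "myuser:mypassword"`, or equivalently `"basic_auth": ["myuser", "mypassword"]`.
+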

@@ -84,7 +96,7 @@ module Agents

Set `unzip` to `gzip` to inflate the resource using gzip.

- The WebsiteAgent can also scrape based on incoming events. It will scrape the url contained in the `url` key of the incoming event payload. If you specify `merge` as the mode, it will retain the old payload and update it with the new values.
+ # Liquid Templating

In Liquid templating, the following variable is available:
